How to Use Word Embedding Layers for Deep Learning with Keras

Word embeddings provide a dense representation of words and their relative meanings.

They are an improvement over the sparse representations used in simpler bag-of-words models.

Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.

In this tutorial, you will discover how to use word embeddings for deep learning in Python with Keras.

After completing this tutorial, you will know:

  • About word embeddings and that Keras supports word embeddings via the Embedding layer.
  • How to learn a word embedding while fitting a neural network.
  • How to use a pre-trained word embedding in a neural network.

Let’s get started.

  • Updated Feb/2018: Fixed a bug due to a change in the underlying APIs.
  • Updated Oct/2019: Updated for Keras 2.3 and TensorFlow 2.0.

Photo by thisguy, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Word Embedding
  2. Keras Embedding Layer
  3. Example of Learning an Embedding
  4. Example of Using Pre-Trained GloVe Embedding

1. Word Embedding

A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the more traditional bag-of-words model encoding schemes, where large sparse vectors were used to represent each word or to score each word within a vector to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector composed mostly of zero values.

Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

The position of a word in the learned vector space is referred to as its embedding.
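To make the contrast concrete, here is a minimal plain-Python sketch. The toy vocabulary and the two-dimensional vector values are made up for illustration; a real embedding would be learned from text and would have many more dimensions.

```python
# Toy vocabulary; the words and the vector values are made up for illustration.
vocab = ["good", "great", "poor", "weak", "work"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_vector(word):
    """Sparse encoding: a vector as long as the vocabulary with a single 1."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

# Dense encoding: every word maps to a short vector of real values.
# In practice these values are learned from text; here they are hand-picked.
embedding = {
    "good":  [0.8, 0.1],
    "great": [0.9, 0.2],
    "poor":  [-0.7, 0.1],
    "weak":  [-0.8, 0.2],
    "work":  [0.0, 0.9],
}

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(one_hot_vector("great"))  # length grows with the vocabulary
print(embedding["great"])       # length is fixed by the embedding size
print(squared_distance(embedding["good"], embedding["great"]))
print(squared_distance(embedding["good"], embedding["poor"]))
```

In the sparse encoding every pair of distinct words is equally far apart, whereas in the dense space words used in similar ways can sit close together.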

Two popular examples of methods of learning word embeddings from text include:

  • Word2Vec.
  • GloVe.

In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but tailors the model to a specific training dataset.

2. Keras Embedding Layer

Keras offers an Embedding layer that can be used for neural networks on text data.

It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a variety of ways, such as:

  • It can be used alone to learn a word embedding that can be saved and used in another model later.
  • It can be used as part of a deep learning model where the embedding is learned along with the model itself.
  • It can be used to load a pre-trained word embedding model, a type of transfer learning.

The Embedding layer is defined as the first hidden layer of a network. It must specify 3 arguments:


  • input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
  • output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
  • input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.

For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.
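The original listing is not reproduced here. Following the call pattern quoted later in the comments, the Keras definition would presumably look something like `Embedding(200, 32, input_length=50)`. As a rough stand-in that runs without Keras, the sketch below mimics what such a layer holds and returns: a randomly initialized table with one row per vocabulary entry, and a lookup that swaps each integer for its row. The names `weights` and `embed` and the initialization range are assumptions for illustration only.

```python
import random

vocab_size = 200      # input_dim: integer encoded words from 0 to 199
embedding_dim = 32    # output_dim: size of each word vector
sequence_length = 50  # input_length: words per document

random.seed(1)
# One row of weights per vocabulary entry, randomly initialized.
weights = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
           for _ in range(vocab_size)]

def embed(sequence):
    """Replace each integer in the sequence with its row of weights."""
    return [weights[index] for index in sequence]

# A made-up document of 50 integer encoded words.
document = [random.randrange(vocab_size) for _ in range(sequence_length)]
output = embed(document)
print(len(output), len(output[0]))  # 50 32
```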

The Embedding layer has weights that are learned. If you save your model to file, this will include weights for the Embedding layer.

The output of the Embedding layer is a 2D matrix with one embedding vector for each word in the input sequence of words (input document).

If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
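The flattening step itself is simple. As a small plain-Python illustration (the helper `flatten` and the numbers are made up and are not the Keras Flatten layer), the rows of the 2D output are laid end to end:

```python
def flatten(matrix):
    """Concatenate the rows of a 2D list into one 1D list, row by row."""
    return [value for row in matrix for value in row]

# A made-up embedding output for a 3-word document with 2-dimensional vectors.
embedded_document = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
flat = flatten(embedded_document)
print(flat)  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```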

Now, let’s see how we can use an Embedding layer in practice.

3. Example of Learning an Embedding

In this section, we will look at how we can learn a word embedding while fitting a neural network on a text classification problem.

We will define a small problem where we have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem.

First, we will define the documents and their class labels.

Next, we can integer encode each document. This means that the Embedding layer will receive sequences of integers as input. We could also experiment with more sophisticated bag-of-words model encodings like counts or TF-IDF.

Keras provides the one_hot() function that creates a hash of each word as an efficient integer encoding. We will estimate a vocabulary size of 50, which is much larger than needed, in order to reduce the probability of collisions from the hash function.
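As a rough picture of what a hashing-based integer encoding does, here is a plain-Python sketch. It is not the Keras implementation: the helper `hash_encode`, the use of MD5 as a stable hash, and the sample documents are all assumptions made for illustration. The sketch keeps 0 free so it can later be used for padding.

```python
import hashlib

vocab_size = 50

def hash_encode(text, n):
    """Map each lowercased word to an integer in 1..n-1 using a stable hash."""
    encoded = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        encoded.append(int(digest, 16) % (n - 1) + 1)
    return encoded

# Hypothetical documents in the spirit of the problem described above.
docs = ["well done", "good work", "poor effort", "not good"]
encoded_docs = [hash_encode(d, vocab_size) for d in docs]
print(encoded_docs)
```

The same word always maps to the same integer, but two different words can collide, which is why the vocabulary size is set generously.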

The sequences have different lengths and Keras prefers inputs to be vectorized and all inputs to have the same length. We will pad all input sequences to have the length of 4. Again, we can do this with a built in Keras function, in this case the pad_sequences() function.
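The padding step can be sketched in plain Python as follows. The helper `pad_post` and the integer encodings are made up for illustration. Note that Keras pad_sequences() pads at the front unless told otherwise; the padded row quoted in the comments below (`[ 6 2 0 0]`) suggests that zeros were added at the end, so that is what this sketch does.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate or right-pad each sequence so that all have length maxlen."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

encoded_docs = [[6, 16], [42, 24], [2, 17, 23, 43]]  # made-up integer encodings
padded_docs = pad_post(encoded_docs, maxlen=4)
print(padded_docs)  # [[6, 16, 0, 0], [42, 24, 0, 0], [2, 17, 23, 43]]
```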

We are now ready to define our Embedding layer as part of our neural network model.

The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.

The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this to one 32-element vector to pass on to the Dense output layer.
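The sizes involved can be checked with a little arithmetic. The sketch below only counts parameters; it does not build a model. The totals agree with the model summary that a reader quotes in the comments below (400 parameters for the Embedding layer, 33 for the Dense layer).

```python
vocab_size = 50
embedding_dim = 8
input_length = 4

# One 8-element vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# 4 word vectors of 8 values each, laid end to end.
flattened_size = input_length * embedding_dim
# A single output unit: one weight per flattened input plus a bias.
dense_params = flattened_size * 1 + 1

print(embedding_params, flattened_size, dense_params, embedding_params + dense_params)
# 400 32 33 433
```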

Finally, we can fit and evaluate the classification model.

The complete code listing is provided below.

Running the example first prints the integer encoded documents.

Then the padded versions of each document are printed, making them all uniform length.

After the network is defined, a summary of the layers is printed. We can see that as expected, the output of the Embedding layer is a 4×8 matrix and this is squashed to a 32-element vector by the Flatten layer.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Finally, the accuracy of the trained model is printed, showing that it learned the training dataset perfectly (which is not surprising).

You could save the learned weights from the Embedding layer to file for later use in other models.

You could also use this model generally to classify other documents that have the same kind of vocabulary seen in the training dataset.

Next, let’s look at loading a pre-trained word embedding in Keras.

4. Example of Using Pre-Trained GloVe Embedding

The Keras Embedding layer can also use a word embedding learned elsewhere.

It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.

For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website released under a public domain license. See:

The smallest package of embeddings is 822 MB, called ““. It was trained on a dataset of six billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes, including 50, 100, 200 and 300 dimensions.

You can download this collection of embeddings and we can seed the Keras Embedding layer with weights from the pre-trained embedding for the words in your training dataset.

This example is inspired by an example in the Keras project:

After downloading and unzipping, you will see a few files, one of which is “glove.6B.100d.txt“, which contains a 100-dimensional version of the embedding.

If you peek inside the file, you will see a token (word) followed by the weights (100 numbers) on each line. For example, below is the first line of the embedding ASCII text file showing the embedding for “the“.

As in the previous section, the first step is to define the examples, encode them as integers, then pad the sequences to be the same length.

In this case, we need to be able to map words to integers as well as integers to words.

Keras provides a Tokenizer class that can be fit on the training data, can convert text to sequences consistently by calling the texts_to_sequences() method on the Tokenizer class, and provides access to the dictionary mapping of words to integers in a word_index attribute.
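To show what these mappings look like, here is a plain-Python stand-in. It is not the Keras Tokenizer: the helpers `fit_word_index` and `texts_to_sequences` are written for illustration, ties between equally frequent words are broken alphabetically (an assumption), and the documents are hypothetical. The `+ 1` in the vocabulary size follows the `vocab_size = len(t.word_index) + 1` line quoted in the comments, where 0 is reserved.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integers starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Encode each text as a list of integers, skipping unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

docs = ["good work", "poor work", "not good"]  # hypothetical documents
word_index = fit_word_index(docs)                   # word -> integer
index_word = {i: w for w, i in word_index.items()}  # integer -> word
vocab_size = len(word_index) + 1                    # index 0 is reserved

print(word_index)
print(texts_to_sequences(docs, word_index))
print(vocab_size)
```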

Next, we need to load the entire GloVe word embedding file into memory as a dictionary of word to embedding array.

This is pretty slow. It might be better to filter the embedding for the unique words in your training data.
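A minimal sketch of such a loader is shown below, assuming the file format described above (a token followed by its weights, separated by spaces). The function name `load_embeddings` and its `keep` parameter are hypothetical, and the three-dimensional values in the in-memory stand-in file are made up so that the example runs without the download.

```python
import io

def load_embeddings(lines, keep=None):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats.

    If keep is given, only words in that set are stored.
    """
    embeddings = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], parts[1:]
        if keep is not None and word not in keep:
            continue  # filter out words we will never look up
        embeddings[word] = [float(v) for v in values]
    return embeddings

# Stand-in for an open file; the 3-dimensional values are made up.
fake_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "good 0.4 0.5 0.6\n"
    "work 0.7 0.8 0.9\n"
)
embeddings_index = load_embeddings(fake_file, keep={"good", "work"})
print(sorted(embeddings_index))  # ['good', 'work']
```

With the real file, the same function could be given an open file handle, for example `open("glove.6B.100d.txt", encoding="utf8")`, with the keys of the Tokenizer's word_index passed as `keep`.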

Next, we need to create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.

The result is a matrix of weights only for words we will see during training.
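The construction of this matrix can be sketched in plain Python as follows. The helper `build_embedding_matrix`, the tiny word index, and the three-dimensional vectors are made up for illustration. Words that are missing from the pre-trained embedding are left as rows of zeros here, which is one possible choice rather than the only one.

```python
def build_embedding_matrix(word_index, embeddings_index, dim):
    """Return a (len(word_index) + 1) x dim matrix; row i holds the vector for index i."""
    # Row 0 stays all zeros because index 0 is reserved.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix

# Made-up inputs: "zzyzx" has no pre-trained vector.
word_index = {"good": 1, "work": 2, "zzyzx": 3}
embeddings_index = {"good": [0.4, 0.5, 0.6], "work": [0.7, 0.8, 0.9]}
embedding_matrix = build_embedding_matrix(word_index, embeddings_index, dim=3)
for row in embedding_matrix:
    print(row)
```

Because row i corresponds to integer i, the layer can find the right vector for each word by position alone.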

Now we can define our model, fit, and evaluate it as before.

The key difference is that the embedding layer can be seeded with the GloVe word embedding weights. We chose the 100-dimensional version, therefore the Embedding layer must be defined with output_dim set to 100. Finally, we do not want to update the learned word weights in this model, therefore we will set the trainable attribute of the Embedding layer to False.

The complete worked example is listed below.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example may take a bit longer, but then demonstrates that it is just as capable of fitting this simple problem.

In practice, I would encourage you to experiment with three options: learning a word embedding from scratch, using a pre-trained embedding that is fixed, and performing further learning on top of a pre-trained embedding.

See what works best for your specific problem.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


In this tutorial, you discovered how to use word embeddings for deep learning in Python with Keras.

Specifically, you learned:

  • About word embeddings and that Keras supports word embeddings via the Embedding layer.
  • How to learn a word embedding while fitting a neural network.
  • How to use a pre-trained word embedding in a neural network.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

    Alex April 19, 2019 at 12:20 pm #

    Hello, Jason,

    thanks for the post. I have a question about the embedding data you actually fit. I print the padded_docs after the model compile. It seems to me that the printed matrix is not an embedding matrix. It’s a integer matrix. So I think what you fit in the CNN is not embedding but the integer matrix you define. Could you please help me explain it? Thanks a lot.

      Jason Brownlee April 19, 2019 at 3:05 pm #

      Yes, the padded_docs is integers that are fed to the embedding layer that maps each integer to an 8-element vector.

      The values of these vectors are then defined by training the network.

    Oscar April 21, 2019 at 4:25 am #

    Hi Jason,

    I am working on character embedding. My dataset consists of raw HTTP traffic both normal and malicious. I have used the Tokenizer API to integer encode my data with each character having an index assigned to it.

    Please let me know if I understood this correctly:

    My data is integer encoded to values between 1-55, therefore my input_dim is 55.

    I will start with an output_dim of 32 and modify this value as needed.

    Now for the input_length, I am a bit confused how to set this value.
    I have different lengths for the numerical strings in my dataset the longest is 666. Do I set input-length to 666? And if I do this what will happen to the sequences with shorter length?

    Thank you for your help!

      Oscar April 21, 2019 at 6:39 am #

      Also, should I set the input dim to a value higher than 55?

      Jason Brownlee April 21, 2019 at 8:27 am #

      Do you mean word embedding instead of char embedding?

      I don’t have any examples of embedding char’s – I’m not sure it would be effective.

        Oscar April 21, 2019 at 5:34 pm #

        I meant character embedding. I used tokenizer and set the character level to True.
        I am not sure how to use word embedding for query strings of http traffic when they are not made of real words and just strings of characters.
        I am designing a character level neural network for detecting parameter injection in http requests.
        The result would be in a binary format 0 if request is normal and 1 if it’s malicious.
        So you don’t think character embedding is helpful here?

          Jason Brownlee April 22, 2019 at 6:21 am #

          Sorry, I don’t have an example of a character embedding.

          Nevertheless, you should be able to provide strings of integer encoded chars to the embedding in your model. It will look much like an embedding for words, just lower cardinality (<100 chars perhaps). Also I don't expect good results.

          What problem are you having exactly?

    Mario May 2, 2019 at 6:09 pm #

    I have found vocab_size = len(t.word_index) + 1 to be wrong. This index not only ignores the Tokenizer(num_words=X parameter, but also stores more words than are actually ever going to be encoded.

    I fit my text without word limit, and then encode the same text using the tokenizer, and the length of the word_index is larger than the max(max(encoded_texts)).

    lamesa May 17, 2019 at 6:46 pm #

    hello jason, how are you?
    I am doing my masters thesis on text summarization using word embeddings, and now i am in the middle of many questions, how could I use these features and which neural network alg is best. please could you give some guide….

    lamesa May 20, 2019 at 9:36 pm #

    thanks jason…it is really helpful..

    Siddhartha June 9, 2019 at 6:06 pm #

    Hi Jason,

    This entire article is very useful. It helped me in writing my initial implementation.
    I have one question related to Input() and Embedding() in Keras.
    If I already have a pretrained word embedding. In that case, should I use Embedding or Input ?

      Jason Brownlee June 10, 2019 at 7:36 am #

      Yes, the embedding vectors are loaded into the Embedding layer and the layer may be marked as not trainable.

    alphabeta June 23, 2019 at 7:44 pm #

    what if in test we have a new words which were not there in the training text ?
    It will not proceed from the embedding layer correct?

    Zeinab June 29, 2019 at 10:53 pm #

    Hi, Jason
    I want to ask you about how can i save my learned word embedding?

    Zeinab July 2, 2019 at 4:03 am #

    Can I construct a network for learning the embedding matrix only?

    zeinab July 10, 2019 at 6:28 pm #

    I have a text similarity application where I measure Pearson correlation coefficient as keras metrics.
    In many epochs, I noticed that the correlation value is nan.
    Is this is normal or there is a problem in the model?

      Jason Brownlee July 11, 2019 at 9:46 am #

      You may have to debug the model to see what the cause of the NAN is.

      Perhaps an exploding gradient or vanishing gradient?

        Zeinab July 12, 2019 at 2:55 pm #

        Do you mean that I have to adjust the activation function?

        I use elu activation function and Adam optimization function,

        Do you mean that I have to change any of them and see the results?

          Jason Brownlee July 13, 2019 at 6:52 am #


          Try relu.
          Try batch norm.
          Try smaller learning rate.

        Zeinab July 13, 2019 at 10:19 pm #

        Can I know what do you mean by debuging the model?

          Jason Brownlee July 14, 2019 at 8:10 am #

          Yes, here are some ideas:

          – Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
          – Consider cutting the problem back to just one or a few simple examples.
          – Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
          – Consider posting your question and code to StackOverflow.

    Leon July 20, 2019 at 12:19 pm #

    vocab_size = len(t.word_index) + 1

    why do we need to increase the vocabulary size by 1???

      Jason Brownlee July 21, 2019 at 6:23 am #

      This is so that we can reserve 0 for unknown words and start known words at 1.

    Zineb_Morocco July 26, 2019 at 5:48 am #

    Thanks a lot for this wonderful work .. we’re in July 2019 and still taking advantage of it.

      Jason Brownlee July 26, 2019 at 8:33 am #

      Thanks. I try to write evergreen tutorials that remain useful.

    Christiane July 29, 2019 at 5:24 am #

    Dear Jason,

    thanks a lot for your detailed explanations and the code examples.

    What I am still wondering about, however, is how I would combine embeddings with other variables, i.e. having a few numeric or categorical variables and then one or two text variables. As I see you input the padded docs (only text) in But how would I add the other variables? It doesn’t seem realistic to always only have one text variable as input.

    Zineb_Morocco July 30, 2019 at 11:43 pm #

    Thank you Jason for this wonderful work and examples. That really help.

    Youssef MELLAH August 4, 2019 at 6:19 am #

    Thank u Jason Brownlee, that’s very interesting and clear.

    And if i have two inputs?

    for example i am working on Text-to-SQL task and that necessit 2 inputs : user question and table schema (columns names).

    how can i process? how can i do embeddig? with 2 embeddings layers?

    Thank u for help.

    Youssef MELLAH August 4, 2019 at 7:30 am #

    Ah okay, that’s interesting too, thanks!!

    Can you please confirm me the architecture above to encode both user Questions and table Schema in the same model?

    (1) userQuestion ==> One-hote-endoding ==> Embedding(GloVe) ==> LSTM
    (2) tableShema ==> One-hote-endoding ==> Embedding(GloVe) ==> LSTM

    (1) concatenate (merge) (2) ==> rest of model layers….

    thakns Jason.

      Jason Brownlee August 5, 2019 at 6:42 am #

      No need for the one hot encoding, the text can be encoded as integers, and the integers mapped to vectors via the embedding.

        Youssef Mellah August 5, 2019 at 7:58 am #

        ok that’s clear, thanks.

        The attention mechanism can be done only with the merge?

          Jason Brownlee August 5, 2019 at 1:59 pm #

          No, you can use attention on each input if you wish.

            Youssef MELLAH August 5, 2019 at 6:42 pm #

            Should i do attention on both inputs (user Question & table Schema) separately or can i do it after merging the 2 inputs?

            Jason Brownlee August 6, 2019 at 6:31 am #

            Test many different model types and see what works well for your specific dataset.

            Youssef MELLAH August 15, 2019 at 12:41 am #

            okay thank you Jason!!

            Jason Brownlee August 15, 2019 at 8:12 am #

            No problem.

    Elizabeth August 5, 2019 at 12:06 am #

    I wonder about pre-trained word2vec, there is no such good tutorial for that. I am looking to implement pre-trained word2vec but I do not know should I follow the same steps of Glove or look for another source for that?
    thanks MR.Jason I am very inspired by you in machine learning

    Dean August 7, 2019 at 1:06 am #

    Why don’t you use mask_zero=True in your embedding layers? It seems necessary since you are padding sentences with 0’s.

      Jason Brownlee August 7, 2019 at 8:01 am #

      Great suggestion, not sure if that argument existed back when I wrote these tutorials. I was using masking input layers instead.

    Nathalia August 9, 2019 at 8:04 am #

    hi, can you help me with a question?

    Im working with a dataset that has a city column as a feature and thats has a lot of different cities. So, I create a embeddinglayer for this feature. First, I used this command :
    data[‘city’]= data[‘city’].astype(‘category’)
    data[‘city’]= data[‘city’]
    After that, for each different city a value was assigned starting at 0

    So, Im confused about how this embedded layer works when the test data has a input that was not training. I saw that you said that when this occurs, we had to put 0 as input, but 0
    it’s related with some city. Should i start assigning this values ​​to the city from 1?

      Jason Brownlee August 9, 2019 at 8:20 am #

      Excellent question.

      Typically, unseen categories are assigned 0 for “unknown”.

      0 should be reserved and real numbering should start at 1.

        Nathalia August 9, 2019 at 8:25 am #

        thank you, you always help me a lot!

    ravi August 20, 2019 at 1:38 pm #

    Hi Jason..

    Thanks for such a great tutorial. I am confused on when we talk about learned word embeddings , do we consider weighs of the embedding layer or output of embedding layer.

    let me ask in other way as well, when we utilize pretrained embedding let us say “glove.6B.50d.txt” . those word embeddings are weights or the output of the layer?

      Jason Brownlee August 20, 2019 at 2:14 pm #

      They are the same thing. Integers are mapped to vectors, those vectors are the output of the embedding.

    Ralph August 23, 2019 at 7:39 pm #

    Hi Jason,

    I am new to ML, trying out different things, and your posts are the most helpful I encountered, it helps me a lot to understand, thank you!

    Here I think I understood the procedure, but I still have a deeper question, on the point of embeddings. If I understand correctly, this embedding kind of maps a set of words as points onto another dimensionnal space. The surprising fact in your example is that we pass from a space of dimension 4 to a space of dimension 8, so it might not be seen as an improvement at first.

    Still I imagine that the embedding makes it so that points in the new space are more equally placed, am I right? Then I don’t understand several things:
    -How does the context where one words appear come into play? Other words which are often close by will also be represented by closer points in the new space?
    -Why does it have to be integers? And why is it more applied to word encodings? I mean we could imagine the same process could be helpful for images as well. Or is it just a dimension reduction technique tailored for words documents?

    Thank you for your insights anyway

      Jason Brownlee August 24, 2019 at 7:50 am #

      Not equally spaced, but spaced in a way that preserves or best captures their relationships.

      Context defines the relationships captured in the embedding, e.g. what words appear with what other words. Their closeness.

      Each word gets one vector. The simplest way is to map words to integers and integers to the index of vectors in a matrix. No other reason.

      Great questions!

    Anand August 24, 2019 at 4:43 pm #

    Jason,Thank you so much for your time and effort!

    My question is related to the line-
    “e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)”

    Here you are using-weights=[embedding_matrix], but there is no relation telling for which word which vector. Then, it produces for each document one 4*100 matrix(example for [ 6 2 0 0]).How it will extract the vectors related to 6,2,0,0 accurately?

      Jason Brownlee August 25, 2019 at 6:34 am #

      The vectors are ordered, with an array index you can retrieve the vector for each word, e.g. word 0, word 1, word 2, word 3, etc.

    Youssef MELLAH September 2, 2019 at 12:36 am #

    Hello Jason,

    I m searching for interesting formation on Python (numpy pandas … and tools for ML DL and NLP) and formation on Keras !!

    Some suggestions please?

    Emre Calisir September 3, 2019 at 6:32 pm #

    Thanks for article, I will run it for the Italian language documents. Is there any GoogleNews pretrained word2vec covering Italian vocabulary?

      Jason Brownlee September 4, 2019 at 5:56 am #

      Good question, I’m not sure off the cuff sorry.

    Muhammad Usgan September 7, 2019 at 5:59 pm #

    Hallo jason, Can I put Dropout into this model ?

    ASHWARYA ASHIYA September 19, 2019 at 6:49 pm #

    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)

    What does the parameter – weights=[embedding_matrix] – stand for ? weights or inputs for the Embedding Layer ?

      Jason Brownlee September 20, 2019 at 5:37 am #

      The weights are the vectors, e.g. one vector for each word in your vocab.

    Munisha October 3, 2019 at 7:30 pm #

    Layer (type) Output Shape Param #
    embedding_1 (Embedding) (None, 4, 8) 400
    flatten_1 (Flatten) (None, 32) 0
    dense_1 (Dense) (None, 1) 33
    Total params: 433
    Trainable params: 433
    Non-trainable params: 0

    Could you please clarify no. of weights learnt for embedding layer. Now we have 10 documents and embedding layer from 4 to 8. How many weights parameters will actually be learnt here. My understanding was that there should only be (4+1)*8 = 40 weights to be learnt, including the bias term. Why is it learning weights for all the documents separately (10*(4+1)*8 = 400) ?

      Jason Brownlee October 4, 2019 at 5:40 am #

      The number of weights in an embedding is vector length times number of vectors (words).

    martin October 12, 2019 at 4:29 pm #

    Hi, Jason:

    In this example, what type of neural architecture it is? It is not LSTM, not CNN. Is it a Multi-Layer Perceptron model?

      Jason Brownlee October 13, 2019 at 8:26 am #

      We are just working with embeddings. I guess with an output layer, you can call it an MLP.

    Zafari October 13, 2019 at 5:52 am #

    Hi, Thanks for this excellent article. I tried to use a pre-trained word embedding instead of random number in a Keras-based classifier. However, after constructing the embedding matrix and adding it to the embedding layer as follow, during training epochs all of the accuracy values are the same and no learning happens. However, after removing the “weights=[embedding_matrix]” it works well and reached the accuracy of 90%.

    layers.Embedding(input_dim=vocab_size, weights=[embedding_matrix],

    What can be the reason of this strange behavior?

      Jason Brownlee October 13, 2019 at 8:35 am #

      An embedding specialized on your own data is often better than something generic.

        Hussain July 28, 2020 at 8:46 am #

        Quick question. While using pre-trained embeddings (MUSE) in an Embeddings layer, is it okay to set trainable=True?

        Note: The model doesn’t overfit when i set trainable=True. The model doesn’t predict well if i set trainable=False.

          Jason Brownlee July 28, 2020 at 8:52 am #

          Yes, although you might want to use a small learning rate to ensure you don’t wash away the weights.

            Hussain July 28, 2020 at 9:24 am #

            Thank you very much for your reply. Currently I am using a learning rate of 0.0001.


            Jason Brownlee July 28, 2020 at 10:54 am #

            Perhaps use SGD instead of Adam as Adam will change the learning rate for each model parameter and could get quite large.

            Or at least compare results with adam vs sgd.

    Xuesong Wang October 13, 2019 at 8:32 pm #

    Hi Jason,
    Thank you for your post. I have an issue. The data I used include both categorical and numerical features. Say, Some features are cost and time while others are post code. How should I write the code? Build separate models and concatenate them together? Thank you

      Jason Brownlee October 14, 2019 at 8:07 am #

      Great question!

      The features must be prepared separately then aggregated.

      The two ways I like to do this are:

      1. Manually. Prepare each feature type separately then concatenate into input vectors.
      2. Multi-Input Model. Prepare features separately and feed different features into different inputs of a model, and let the model concatenate the features.

      Does that help?

        martin October 15, 2019 at 4:39 pm #

        Should the categorical be converted into numerical using one-hot encoding?

        Franz Götz-Hahn October 28, 2019 at 9:04 pm #

        If you have multivariate time series of which you know they are meaningfully connected (say trajectories in x and y), does it make sense to put a Conv layer before feeding them into the embedding?

        Could you explain what you mean by “preparing” the features?

          Jason Brownlee October 29, 2019 at 5:22 am #

          No, an embedding is typically the first step, e.g. an interpretation of input.

          Prepared means transformed in whatever way you want, such as encoding or scaling.

            Franz Götz-Hahn October 29, 2019 at 5:03 pm #

            Thanks for the answer! If I may ask a follow up: is embedding of multivariate numerical data uncommon? I have seen fairly little work that uses it.

            Jason Brownlee October 30, 2019 at 5:57 am #

            Embedding is used for categorical data, not numerical data.

    Zineb_Morocco October 16, 2019 at 1:28 am #

    hi Jason,

    I use an example where a put the vocabulary size = 200 and the training sample contain about 20 different words.
    When I check the embeddings ( the vectors) using ** layers[0].get_weights()[0]** I obtain an array with 200 rows.

    1/ how can I know the vector corresponding to each word (from the 20 words I ‘ve got)?
    2/ where the 180 (200 – 20) vectors come from since I use only 20 words?

    Thanks in advance.

      Jason Brownlee October 16, 2019 at 8:08 am #

      The vocab size and number of words are the same thing.

      I think you might be confusing the size of the embedding and the vocab?

      Each word is assigned a number (0, 1, 2, etc), the index of vectors maps to the words, vector 0 is word 0, etc.

    Zineb_Morocco October 16, 2019 at 11:38 pm #

    Thanks for your answer Jason,
    I ‘ll clarify my question:
    the vocab size is 200 that means that the number of words is 200.
    But effectively i’m working with 20 words only ( the words of my training sample) : let say word[0] to word[19].
    So, after the embedding, the vector[0] corresponds to word[0] and so on. but vector[20].. vector [30] … what do they match ?
    I have no word[20] or word[30] .

    • Avatar
      Jason Brownlee October 17, 2019 at 6:37 am #

      If you define the vocab with 200 words but only have 20 in the training set, then the words not in the training set will have random vectors.
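The index-to-word mapping discussed in this thread can be sketched in a few lines. This is a minimal illustration, assuming a `word_index` dictionary like the one a fitted tokenizer exposes and a weight matrix shaped like the result of `layers[0].get_weights()[0]`; the toy words and values below are made up.

```python
# Hypothetical toy data: 3 known words, vocabulary size 5, embedding size 2.
word_index = {"good": 1, "work": 2, "poor": 3}
weights = [
    [0.00, 0.00],   # row 0: index not assigned to any word
    [0.10, 0.20],   # row 1: "good"
    [0.30, 0.40],   # row 2: "work"
    [0.50, 0.60],   # row 3: "poor"
    [0.70, 0.80],   # row 4: unused index, never seen in the training data
]

def vector_for(word):
    """Return the embedding row for a word, or None if the word is unknown."""
    idx = word_index.get(word)
    return None if idx is None else weights[idx]

# Rows whose index is not assigned to any word are never looked up in training.
used = set(word_index.values())
unused_rows = [i for i in range(len(weights)) if i not in used]

print(vector_for("work"))
print(unused_rows)
```

Rows that no training word points to are never selected by the lookup, so they keep whatever values they were initialized with.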

      • Avatar
        Zineb_Morocco October 18, 2019 at 12:22 am #

  138. Avatar
    Elizabeth October 27, 2019 at 7:28 am #

    I want to save my own pre-trained model the same way GloVe saved theirs: as a txt file with each word followed by its vector. How would I do that?
    thank you

    • Avatar
      Jason Brownlee October 28, 2019 at 6:00 am #

      You could extract the weights from the embedding layer via layer.get_weights() then enumerate the vectors and save to a file in the format you prefer.

    Elizabeth October 28, 2019 at 7:47 am #

    As a beginner in Python, I did not understand what you mean by enumerating. And which layer should I get the weights from?

    • Avatar
      Jason Brownlee October 28, 2019 at 1:16 pm #

      You can get the vectors from the embedding layer.

      You can either hold a reference to the embedding layer from when you constructed the model, or retrieve the layer by index (e.g. model.layers[0]) or by name, if you name it.

      Enumerating means looping.
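A rough sketch of that loop, writing one word per line followed by its vector values (the layout of the GloVe text files). The `word_index` and `weights` values here are hypothetical stand-ins; in practice they would come from the fitted tokenizer and from the embedding layer's `get_weights()[0]`. The file name is also just an example.

```python
import os
import tempfile

# Hypothetical stand-ins for the tokenizer's word_index and the layer weights.
word_index = {"good": 1, "work": 2}
weights = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

def save_glove_style(path, word_index, weights):
    """Write one line per word: the word, then its vector values, space-separated."""
    with open(path, "w", encoding="utf-8") as f:
        for word, idx in word_index.items():
            values = " ".join(f"{v:.6f}" for v in weights[idx])
            f.write(f"{word} {values}\n")

path = os.path.join(tempfile.gettempdir(), "my_embedding.txt")
save_glove_style(path, word_index, weights)

# Read the file back to check the layout.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines[0])
```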

    Michael November 19, 2019 at 4:47 am #

    Hello, Jason!

    Thanks for the article!
    I have been wondering about the input_dim of the learnable embedding layer.
    You set it to vocab_size, that in your case is 50 (the hashing trick upper limit), which is much larger than the actual vocabulary size of 15.

    The documentation of Embedding in keras says:
    “Size of the vocabulary, i.e. maximum integer index + 1.”
    Which is ambiguous.

    I have experimented with some numbers for vocab_size, and cannot see any systematic difference.

    Would it actually matter for more realistically sized examples?

    Could you say a couple of words about it?
    Thanks again

      Jason Brownlee November 19, 2019 at 7:50 am #

      Smaller vocabs means you will have fewer words/word vectors and in turn a simpler model which is faster/easier to learn. The cost is it might perform worse.

      This is the trade-off large/slow but good, small/fast but less good.

      • Avatar
        Michael November 19, 2019 at 7:48 pm #

        Thanks, Jason!

        I may have not explained myself properly:
        The *actual* number of words in the vocabulary is the same (14).
        The difference is the value of input_dim to Embedding().

        In the example, you chose 50 as high enough to prevent collisions in encoding, but also
        used it as an input_dim in one of the cases.


        • Avatar
          Jason Brownlee November 20, 2019 at 6:11 am #

          I see.

          • Avatar
            martin November 22, 2019 at 6:28 pm #

            I thought the question is “Size of the vocabulary, i.e. maximum integer index + 1.”. Since there are 14 words in this example, why vocab size isn’t 15, instead of 50?

          • Avatar
            Jason Brownlee November 23, 2019 at 6:49 am #

            There is the size of the vocab, there is also the size of the embedding space. They are different – in case that is causing confusion.

  141. Avatar
    martin November 22, 2019 at 5:52 pm #

    Jason: In this example, ‘one_hot’ function instead of ‘to_categorical’ function is used. The 2nd is the real one-hot representation, and the 1st is simply creating an integer for each word. Why isn’t to_categorical used here? They are different, right?

  142. Avatar
    moSaber November 24, 2019 at 11:11 am #

    Thanks a lot Jason! in 3. Example of Learning an Embedding section, could you please elaborate what is 400 params that are being trained in the embedding layer? Thnx

    • Avatar
      Jason Brownlee November 25, 2019 at 6:19 am #

      Yes, each word is mapped to an 8 element vector, and the vocab is 50 words. Therefore 50*8 = 400.

        Mohit September 26, 2020 at 6:54 pm #

        Jason, why the output shape of embedding layer is: (4,8)?
        It should be (50,8) as the vocab size is 50 and we are creating the embeddings of all words in our vocabulary.

        • Avatar
          Jason Brownlee September 27, 2020 at 6:51 am #

          vocab size is the total vectors in the layer – the number of words supported, not the output.

          The output is the number of input words (8) where each word has the same vector length (4).

          • Avatar
            Sean February 11, 2023 at 8:57 pm #

            is it typo? The output is the number of input words (4) where each word has the same vector length (8).

          • Avatar
            James Carmichael February 12, 2023 at 9:26 am #

            Hi Sean…We do not believe it is a typo. What occurs when you execute the code?
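The shapes discussed in this thread can be checked with a small lookup sketch. It uses plain Python lists as a stand-in for the Keras layer and assumes the tutorial's settings: a vocabulary of 50, 8-dimensional vectors, and documents padded to 4 words. The table then holds 50 × 8 = 400 weights, and the output for one document has 4 rows (one per input word) of 8 values each.

```python
import random

vocab_size = 50      # rows in the embedding table (input_dim)
embedding_dim = 8    # length of each word vector (output_dim)
max_length = 4       # words per padded document (input_length)

# The embedding table: one trainable vector per supported word index.
table = [[random.random() for _ in range(embedding_dim)] for _ in range(vocab_size)]
n_params = vocab_size * embedding_dim

# One padded document of 4 integer-encoded words (values are arbitrary examples).
doc = [6, 16, 0, 0]
output = [table[i] for i in doc]   # a lookup: one row per input word

print(n_params)
print((len(output), len(output[0])))
```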

  143. Avatar
    criz December 30, 2019 at 3:13 am #

    Hi i need some help when running the file.

    (wordembedding) C:\Users\Criz Lee\Desktop\Python Projects\wordembedding>
    Traceback (most recent call last):
    File “C:\Users\Criz Lee\Desktop\Python Projects\wordembedding\”, line 1, in
    from numpy import array
    File “C:\Users\Criz Lee\Anaconda3\lib\site-packages\numpy\”, line 140, in
    from . import _distributor_init
    File “C:\Users\Criz Lee\Anaconda3\lib\site-packages\numpy\”, line 34, in
    from . import _mklinit
    ImportError: DLL load failed: The specified module could not be found.

    • Avatar
      Jason Brownlee December 30, 2019 at 6:02 am #

      Looks like there is a problem with your development environment.

      This tutorial may help:

      • Avatar
        criz December 31, 2019 at 3:20 am #

        Hi jason, i’ve tried the url u provided but still didnt manage to solve it.

        basically i typed
        1. conda create -n wordembedding
        2. activate wordembedding
        3. pip install numpy (installed ver 1.16)
        4. ran

        error shows
        File “C:\Users\Criz Lee\Desktop\Python Projects\wordembedding\”, line 2, in
        from numpy import array
        ModuleNotFoundError: No module named ‘numpy’

        pls advise.. thanks

        • Avatar
          Jason Brownlee December 31, 2019 at 7:35 am #

          Sorry to hear that. I am not an expert in debugging workstations, perhaps try posting to stackoverflow.

    martin December 30, 2019 at 8:23 am #

    Hi, Jason: How to encode a new document using the Tokenizer object fit on training data? It seems there is no function to return an encoder from the tokenizer object.

    • Avatar
      Jason Brownlee December 31, 2019 at 7:24 am #

      You can save the tokenizer and use it to prepare new data in an identical manner as you did the training data after the tokenizer was fit.

  145. Avatar
    congmin min December 31, 2019 at 11:59 am #

    What do you mean by ‘save the tokenizer’? Tokenizer is an object, not a model.

    • Avatar
      Jason Brownlee January 1, 2020 at 6:30 am #

      It is as important as the model, in that sense it is part of the model.

      You can save Python objects to file using pickle.
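A minimal sketch of saving and restoring with pickle. A plain word-to-index dictionary stands in for the fitted tokenizer here so the example has no dependencies; the file name and vocabulary are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted tokenizer: a plain word -> index mapping.
fitted_vocab = {"well": 1, "done": 2, "good": 3, "work": 4}

# Save the fitted object next to the model.
path = os.path.join(tempfile.gettempdir(), "tokenizer.pkl")
with open(path, "wb") as f:
    pickle.dump(fitted_vocab, f)

# Later (e.g. in another script): load it and encode new text the same way.
with open(path, "rb") as f:
    restored = pickle.load(f)

new_doc = "good work"
encoded = [restored[w] for w in new_doc.split() if w in restored]
print(encoded)
```

Only load pickle files you created yourself, since unpickling can execute arbitrary code.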

    Sintayehu January 3, 2020 at 5:08 pm #

  147. Avatar
    Rishang January 25, 2020 at 4:56 pm #

    Hello Sir,

    I am not able to understand the significance of vector space?
    You have given 8 for the first problem, glove vectors has 100 dimension for each word.
    What is the idea behind these vector spaces and what does each value of the dimension tells us?

    Thankyou 🙂

    • Avatar
      Jason Brownlee January 26, 2020 at 5:15 am #

      The size of the vector space does not matter too much.

      More importantly, the model learns a representation where similar words will have a similar representation (coordinate) in the vector space. We don’t have to specify these relationships, they are learned automatically.

      • Avatar
        Rishang February 3, 2020 at 1:07 am #

        Thankyou Sir for your answer. I clearly understood what is vector space.
        I have one more question- If I declare the vocabulary size as 50 and if there are more than 50 words in my training data, what happens to those extra words?

        For the same reason I could not understand this line of glove vectors-
        “The smallest package of embeddings is 822Mb, called ““. It was trained on a dataset of one billion tokens (words) with a vocabulary of 400 thousand words.”
        What about the 600 thousand words ?

        • Avatar
          Jason Brownlee February 3, 2020 at 5:46 am #

  148. Avatar
    asma January 31, 2020 at 4:22 pm #

    I have a list of words as my dataset/training data. So i run your code for glove as follows:

    —-Error is——
    ValueError Traceback (most recent call last)
    in ()
    8 values = line.split()
    9 words = values[0]
    —> 10 coefs = asarray(values[1:], dtype=’float32′)
    11 embeddings_index[words] = coefs
    12 f.close()

    /usr/local/lib/python3.6/dist-packages/numpy/core/ in asarray(a, dtype, order)
    84 “””
    —> 85 return array(a, dtype, copy=False, order=order)

    ValueError: could not convert string to float: ‘0.1076.097748’

    could you please help
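An error like this suggests that at least one line in the file is not in the expected "word v1 v2 ... vN" form; one possible cause is a corrupted or partially downloaded file, though that is only a guess from the message. A hedged sketch of a more defensive parser that skips malformed lines instead of failing (the function name, the `dim` parameter and the sample lines are illustrative assumptions):

```python
def parse_glove_line(line, dim):
    """Return (word, vector), or None if the line does not end in `dim` floats."""
    parts = line.rstrip().split(" ")
    if len(parts) < dim + 1:
        return None
    # Take the last `dim` fields as numbers; whatever precedes them is the token.
    word = " ".join(parts[:-dim])
    try:
        vector = [float(v) for v in parts[-dim:]]
    except ValueError:
        return None
    return word, vector

sample = [
    "the 0.1 0.2 0.3",
    "bad 0.1076.097748 0.2 0.3",   # corrupted number, as in the error above
    "short 0.1",                   # truncated line
]

embeddings_index = {}
skipped = 0
for line in sample:
    parsed = parse_glove_line(line, dim=3)
    if parsed is None:
        skipped += 1
        continue
    word, vector = parsed
    embeddings_index[word] = vector

print(len(embeddings_index), skipped)
```

If many lines are skipped, re-downloading the embedding file is probably a better fix than ignoring them.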

  149. Avatar
    Sanpreet Singh February 5, 2020 at 5:46 pm #

    Hello Jason Brownlee

    I hope you are fine and thanks for writing such an article. I would like to ask you a question about where I am stuck.

    I had created keras deep learning model with glove word embedding on imdb movie set. I had used 50 dimension but during validation there is lot of gap between accuracy of training and validation data. I have done pre-processing very carefully using custom approach but still I am getting overfitting.

    During prediction phase instead of testing dataset, I have taken real time text and goes for prediction but if I do prediction again and again for same input data, results varies. I am trying my best why my results varies.

    I have 8 classes which depicts the score of each review and data is categorical. Softmax layer is predicting different result every time same data in fed to trained model

  150. Avatar
    Saad Farooq February 7, 2020 at 2:40 am #

    I have a dataset in which the training data and test data are in different files. How do I pass the test data to the model for evaluation?

    • Avatar
      Jason Brownlee February 7, 2020 at 8:22 am #

      Load them into memory separately then use one for train and one for test/val.

  151. Avatar
    sachin February 24, 2020 at 10:58 pm #

    How to give text input to a pre-trained LSTM model with input layer shape (200, ) .
    I want to know how convert the text input and give it as an input to the model.

  152. Avatar
    Nati February 26, 2020 at 10:07 pm #

    Thanks you so much for this detailed tutorial. I think that I missed something… When using a Pre-Trained GloVe Embedding, in case you want to do a train test split, when is the right time to perform it?
    Should I need to do it after creating the weight matrix?

    • Avatar
      Jason Brownlee February 27, 2020 at 5:47 am #

      Sorry, I don’t follow your question – how are the embedding and train/test split related? Can you please elaborate?

      • Avatar
        Nati February 27, 2020 at 9:27 pm #

        Sure. I would like to use the embedding to create a model that will predict a class out of 30 different classes. The dataframe that I’m using contains a text column which I want to use to predict the class. In order to do so, I thought about using the pre-trained embedding (same as you did but with more classes). Now, in order to test the model I want to do a split, so my question is when do you recommend to do it? I tried to create a weight matrix for the all data and then split it to train and test but it gave me a very poor results on the test set.

        Any idea what am I doing wrong?

        • Avatar
          Jason Brownlee February 28, 2020 at 6:07 am #

          If you train your own embedding, it is prepared on the training dataset.

          If you use a pre-trained embedding, no split makes sense, as it is already trained on different data.

          • Avatar
            Nati February 28, 2020 at 10:23 am #

            Thanks I think that I understand (sorry I’m kind of newbie) . And assuming that I want to use the trained model to predict the labels on an unseen data, what should be the input? Should it be a padded docs with the same shape?

          • Avatar
            Jason Brownlee February 28, 2020 at 1:27 pm #

            New input must be prepared in an identical way to the training data.

    Despoina M March 19, 2020 at 9:01 am #

    Hello, another great post!

    I would like to ask, regarding the Example of Using Pre-Trained Embedding, is it possible to reduce the vocabulary size? I have a pre-trained embedding with 300 dimensions. My vocabulary size is 7000.
    When I put vocabulary size=300 I have this error:

    embedding_matrix[i] = embedding_vector

    IndexError: index 300 is out of bounds for axis 0 with size 300

    Thanks in advance

    • Avatar
      Jason Brownlee March 19, 2020 at 10:14 am #

      Yes, but you must set the vocab size in the Tokenizer.
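The IndexError above arises when the loop over the word index reaches an index that is larger than the number of rows in the weight matrix. A minimal sketch of one way to guard against that, assuming words are indexed from 1 with the most frequent first; all names and values are hypothetical, and plain lists stand in for the NumPy matrix used in the tutorial.

```python
vocab_size = 4       # keep indices 0..3 only (hypothetical limit)
embedding_dim = 2

# Hypothetical word -> index mapping, most frequent words first (index 1 upward).
word_index = {"the": 1, "good": 2, "work": 3, "rare": 4, "rarer": 5}
# Hypothetical pre-trained vectors.
pretrained = {"the": [0.1, 0.1], "good": [0.2, 0.2], "rare": [0.4, 0.4]}

embedding_matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]
for word, i in word_index.items():
    if i >= vocab_size:
        continue                      # index would fall outside the matrix
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # words without a vector stay all-zero

print(embedding_matrix)
```

Note that the Keras Tokenizer keeps every word in `word_index` even when `num_words` is set, so a bounds check like this is still needed when filling the matrix.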

      • Avatar
        Despoina M March 20, 2020 at 1:59 am #

        Thank you for the quick response. I did it.

    Tyler March 20, 2020 at 6:57 am #

    Thanks for the tutorials! I’m using the 300d embedding for my image to caption model, but filter out rare words in my vocabulary. Say I have 10k words, should I just:

    tokenizer.word_index.get(word, vocab_size + 1) <= vocab_size

    to filter out the words I don't want?

    Also, do you think it's worth retraining the embedding weights at a later point in training to fine-tune? I'm thinking of it in the same context as freezing a pre-trained encoder, then gradually unfreezing layers as the decoder reaches a certain level of validity.

  155. Avatar
    Aman Savaria March 22, 2020 at 6:23 am #

    Hi Jason,
    I have a question regarding the code. You have taken the vocab_size to be 1 greater than the number of unique words. I am unable to understand why did you do that, can you please tell me.
    I am a total newbie and I’m not from a programming background so I’m sorry if this was a silly question.
    Thank You

    • Avatar
      Jason Brownlee March 22, 2020 at 6:59 am #

      Yes, we use a 1-offset for words and save index 0 for “unknown” words.
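The effect of the offset can be seen in a small sketch. It uses a toy indexing function rather than the Keras Tokenizer (which orders words by frequency), so the exact indices are an assumption; the point is only that indices start at 1 and index 0 is not assigned to any word. Keras also uses 0 as the default padding value in `pad_sequences`.

```python
docs = ["well done", "good work", "poor work"]

# Assign indices starting at 1, in order of first appearance (a simplification).
word_index = {}
for doc in docs:
    for word in doc.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

vocab_size = len(word_index) + 1   # +1 so that the largest index is a valid row
largest_index = max(word_index.values())

print(word_index)
print(vocab_size, largest_index)
```

With 5 distinct words the largest index is 5, so the embedding needs 6 rows (indices 0 to 5) for every lookup to be in range.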

  156. Avatar
    Aman Savaria March 24, 2020 at 6:41 am #

  157. Avatar
    Ahmed B. March 27, 2020 at 2:43 am #

    Hi Jason,
    I work on a multivariate time series with categoricals and numericals features. and I use a data of 3 years : 20 stocks during 30 days as window of training and 20 stocks during 7 days as target X_train.shape = (N_samples, 20*30, N_features), y_train.shape = (N_samples, 20*7, N_features), my question is, how I can apply an embedding layer for 3 categorical variables for this 3D arrays ?
    I tried to use this part of code but it doesn’t work :

    cat1_input = Input(shape=(1,), name=’cat1′)
    cat2_input = Input(shape=(1,), name=’cat2′)
    cat3_input = Input(shape=(1,), name=’cat3′)

    cat1_emb = Flatten()(Embedding(33, 1)(cat1_input))
    cat2_emb = Flatten()(Embedding(18, 1)(cat2_input))
    cat3_emb = Flatten()(Embedding( 2, 1)(cat3_input))

  158. Avatar
    Ahmed B. March 27, 2020 at 7:24 am #

    Thank you for your prompt reply Jason, this tutorial is about Embedding for 2D array but in my case I need to build a model that take 3D array (N_samples, time_steps, N_features) as input and (time_steps,N_stocks) as output.

    • Avatar
      Jason Brownlee March 27, 2020 at 8:04 am #

      Perhaps you can adapt the example for your needs.

  159. Avatar
    Despina M April 12, 2020 at 1:50 am #

    Hello again,

    I am wondering this time, if it’s worthwhile to compare two embedding matrices from two different languages. I want to find the similarity about two matrices from different languages.

  160. Avatar
    sangam April 16, 2020 at 12:07 pm #

    Hello, Jason, This post really helped me, I think the parameters have changed,
    the weights parameter is now “embeddings_initializer” in the Keras documentation.

  161. Avatar
    Charles April 20, 2020 at 6:08 am #

    Hi Jason, I am facing a memory issue since my vocab_size is around 400,000. I have around 195847 training examples and each sentence output has maxLen of 5 words.

    So when I try to create a one_hot for the decoder_outputs because this tries to create a 195847, 400000, 5 matrix it obviously runs out of memory.

    How do i get around this problem?

    • Avatar
      Jason Brownlee April 20, 2020 at 7:35 am #

      Good question.

      Try reducing the size of the vocab.
      Try reducing the size of the dataset.
      Try reducing the size of the model.
      Try an alternate representation, such as hashing.
      Try running on an ec2 instance with more RAM.
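To put numbers on the problem, here is a back-of-the-envelope calculation using the sizes from the question and assuming 4 bytes per value (float32). It also shows the size of integer-encoded targets, which is an additional option not mentioned in the list above: Keras can train against integer class indices directly with the `sparse_categorical_crossentropy` loss, which avoids materializing the one-hot array.

```python
n_samples = 195_847
max_len = 5
vocab_size = 400_000
bytes_per_value = 4   # assuming float32

# Full one-hot targets: one value per (sample, position, vocabulary word).
one_hot_values = n_samples * max_len * vocab_size
one_hot_bytes = one_hot_values * bytes_per_value

# Alternative: keep targets as integer indices (one 4-byte int per output word).
integer_bytes = n_samples * max_len * 4

print(f"one-hot targets: {one_hot_bytes / 1e12:.2f} TB")
print(f"integer targets: {integer_bytes / 1e6:.2f} MB")
```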

      • Avatar
        Charles April 22, 2020 at 10:56 pm #

        An additional point could be to run it in batch using fit_generator of keras.

        There is one more query that I have and i know it may sound silly but:

        Why do you need the inference model or decoding model for testing? Then how does the training model help? The only thing that I can see is that it helps in deciding the encoder inputs. Is there something I am missing. It seems to me that the entire purpose of training is lost.

        • Avatar
          Jason Brownlee April 23, 2020 at 6:06 am #

          I don’t understand. You train a model and then use it for inference. Without training it is useless. Without inference, you trained for no reason.

          Perhaps you can elaborate?

  162. Avatar
    Mohamed April 21, 2020 at 11:02 am #


    I would like to ask if I have label not only positive and negative but also neutral how I can add that to y_training?

  163. Avatar
    Charlène April 25, 2020 at 5:38 am #


    Thank you for this very well explained article.

    I don’t understand how the embeddings are learnt in the Embedding layer. Is this common backpropagation ?

  164. Avatar
    Rahim May 5, 2020 at 1:28 am #

    Dear Jason
    Thanks for your wonderful and helpful guides. When I run your suggested code in part 4 of this page (Example of Using Pre-Trained GloVe Embedding), I get the following error, while I have downloaded glove.6B.100d file from the URL you provided above. Can you please help me find the cause of error?

    Traceback (most recent call last):
    File “C:/Users/Dehkhargani/PycharmProjects/test2/”, line 40, in
    coefs = asarray(values[1:], dtype=’float32′)
    File “C:\Users\Dehkhargani\Anaconda\lib\site-packages\numpy\core\”, line 85, in asarray
    return array(a, dtype, copy=False, order=order)
    ValueError: could not convert string to float: ‘ng’

    Mohammad May 17, 2020 at 9:50 pm #

    Hi Jason, thanks for your helpful post.

    Do pre-trained word embedding like Glove or Google news still work fine on non-English texts?


  166. Avatar
    Roshan Nayak May 18, 2020 at 3:36 am #

    Could anyone please tell me why are we actually adding 1 to the vocab size.
    vocab_size = len(t.word_index) + 1
    Please let me know. Thank you!!

    • Avatar
      Jason Brownlee May 18, 2020 at 6:21 am #

      To make space for “0” which is “unknown” words.

    Tanay Gupta May 19, 2020 at 5:00 pm #

    “The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word.”

    How did you know output will be 4 vectors ?

    • Avatar
      Jason Brownlee May 20, 2020 at 6:20 am #

      We can inspect the shape of the model by calling model.summary()

  168. Avatar
    JG June 22, 2020 at 9:46 pm #

    Hi Jason:

    Great tutorial !. It is my first time I approach NLP and particularly “text classification/sentiment analysis”. Then I am little confused about tech definitions, procedures, meaning of them, etc.

    Let me summarize them for everybody, using my own ideas, in order to share and expect corrections/comments from you.

    1º) From a Dataset, constituted of a list of texts docs, that we are going to train for a texts classification problem (e.g.), we read them and get a words list of each doc (using e.g. keras API “text_to_word_sequence()”.

    2º) We can then apply some sort of word cleaning process, using string object methods to get only alphabetical words, or suppressing word punctuation, or eliminating words below a minimum length or word repetitions, etc.
    Using a keras “Tokenizer” object (e.g.) , we can define a Vocabulary (or the minimum common number of different words that are constituted the whole dataset docs), besides a words counter, through the tokenizer method “fit_on_texts (apply to all docs words).

    3º) Now that we have all words (cleanups) for all docs, we must perform the most important issue in MachineLearning. That is, to convert words (tokens) to be trained by a ML model into Numbers. How? by Coding them !.
    And this numbers must be arranged in Tensors (or arrays), consisting on samples (rows of tensor), with columns (different features per doc or numbers words or number of vocabulary word ) and even a third dimensions of tensor (e.g. deep axis?), by embedding them into a word vector space (of an arbitrary dimension).

    To perform this encoded work (number conversion) we can use keras tokenizer methods such as “texts_to_matrix()”, “texts_to_sequences()”or others keras methods such as “one_hot() “…helping always with keras “pad_sequences()” method to put the same length for all words(features) for any docs, using e.g. max length of words per doc or simply because we want to use the full vocabulary size representation as our feature length for any doc.

    As a result of all this process we must have a clean Dataset input, X tensor of numbers, with 2D dimensions (docs, features) or 3D dimensions ( docs, features, an extra embedding dimension) if we use techniques of Deep Learning (e.g. embedding).

    4) Also we can take advantage of a sort of “transfer learning”, when we apply embedding DL layers, if we load previous trained weights (trained elsewhere) of embedding layers such as GloVe dictionary (e.g. consisting of 400,000 different words with their own 100 coordinates weights) that we must adapt to our vocabulary used in our own docs (e.g. our own arbitrary selected 100 different words, and 100 coordinates as a vector word space for each of them).

    5º) So now that we have converted a list of docs (of a list of words) into a single unique Tensor of Numbers X, besides the Y labels associated, we must define the Model with an input layer, besides embedding(), Conv1D(), and MaxPool1D() layers (e.g.) if we are using DL techniques, plus a final flatten(), in order to present the feature word extraction to the head fully connected layers ( comprising mainly on Dense() layers, that are going finally to learn the number pattern extraction (representing the set of words of each doc) associated to each document label.

    This is my summed up understanding of NLP text classification issues . thanks

    • Avatar
      Jason Brownlee June 23, 2020 at 6:21 am #

      Not a bad summary, generally accurate from my skim.

  169. Avatar
    JG June 23, 2020 at 3:21 am #

    Sorry for the extensive comment Jason!.

    -I would like to share also my code results, based on your great tutorial. Because I like the way you teach us…learning not theoretically but doing and experiment with final and operative codes…! thanks one more time!.

    – My first main concern it is that even we have string methods, “tokenizer” objects (class), etc… still we have to write down many code lines… I can figure out , that a more high level APIs text processing and coding, may already exist, in order to give them appropriate arguments parameters such as (docs to be read, filters of ways you can clean ups your docs words, types of coding you want to apply to your words, including if you want to use pre-trained weights) order to get , the main output dataset preparation in terms of a “X” input tensor …instead of going to this tedious code lines!

    – Anyway, I went through it I get the following results:

    – I got 100% accuracy on evaluation using a basic ML model, not deep learning, consisting on a word codification by the keras “texts_to_matrix()” but using a X dataset input of 2D (10 different docs samples rows and 2 different options of vocab_size 50 (yours) and even less (15) associated to less words use it on them). I use a fully connected Model that comprise 2 dense layer (the output and previous one of 10 nodes).

    – I got also 100% acc but now using the DL embedding layer with the “one_hot()” encoding (you suggest), and not GloVe pre-trained weights, and in addition to the embedding layer I add a Conv1D layer and MaxPool1D and Flatten layers (CNN part !). I use a 2D X tensor with two option as features cols, with a max_length of 4 and even applying later a 50 vocabulary cols features (using “pad-sequences()” methods for it). I train later the embedding layer with Glove pre-trained weights using trainable false and true. and the same fully connected head model (with 2 dense layers)

    – As a third main option I also got 100% accuracy, but using the DL embedding layer with the “texts_to_sequences()” encoding method of keras, and for two options (max-length and applying the 50 vocab-size, after applying pad_sequences()). I also trained the embedding with GloVe (trainable = false and trainable = true) and without GloVe weights, with the same fully connected model as a head.

    My conclusion is that the most stable and quickest way to reach the maximum accuracy is to apply the embedding layer with pre-trained GloVe (trainable = true) and the “texts_to_sequences()” encoding. I judge this the best performer because I can reduce the 50 epochs, or use fewer units in my second Dense layer, and still get the same 100% accuracy.


    I expect this could help Jason!

      Jason Brownlee June 23, 2020 at 6:30 am #

      Great write-up!

      Yes, it is possible that text data prep has come a long way over the last 3-4 years since I wrote this stuff up.

  170. Avatar
    JG June 24, 2020 at 9:43 pm #

  171. Avatar
    Vivek June 25, 2020 at 3:06 am #

    Hi Jason

    Good read ! I would like to have more clarity on this :

    Why do we set trainable = False ?

    Should we not set it to ‘True’ if we want to use the GloVe 100-D weights present in the embeddings matrix ?

    I don’t quite understand the difference between setting it to ‘True’ and ‘False’


    • Avatar
      Jason Brownlee June 25, 2020 at 6:28 am #

      It means the weights in the embedding cannot change during training if set to “False”. As in, they cannot be trained.

    Murat Karakaya July 13, 2020 at 9:41 pm #

    I think there is a typo here: “For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive),…..” Because, since 0 (zero) is used for padding, words are encoded words from 1 (one not zero!) to 199, inclusive. Regards,

  173. Avatar
    Fa July 21, 2020 at 3:32 am #

    For a multilabel classifier where each text could be linked to zero or more topics, what changes to the above example would be needed?

    For example, say I have a list of topics like [“politics”, “sports”, “entertainment”, “finance”, “health”].

    A document mentioning healthcare and legislation would be labeled [1,0,0,0,1].

    Would it be enough to encode the “y” labels in a binary array format above?

    • Avatar
      Jason Brownlee July 21, 2020 at 6:12 am #

      Typically for multi-class classification, we integer encode the labels, then one hot encode them, and use a softmax activation function in the output layer.
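For the multi-label case described in the question (zero or more topics per document), the binary array format can be produced with a few lines. This is a sketch with a hypothetical helper name. As general practice, and not something stated in the reply above, such targets are usually paired with one sigmoid output per topic and a binary cross-entropy loss rather than softmax.

```python
topics = ["politics", "sports", "entertainment", "finance", "health"]

def multi_hot(labels, topics):
    """Return a 0/1 list with a 1 at the position of every topic in `labels`."""
    present = set(labels)
    return [1 if t in present else 0 for t in topics]

print(multi_hot(["politics", "health"], topics))
print(multi_hot([], topics))
```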

  174. Avatar
    Ramesh Ravula July 27, 2020 at 9:09 pm #

    In “Example of Learning an Embedding” section why 8 dimensions?

  175. Avatar
    Keshav Rawat August 1, 2020 at 6:11 pm #


    In the pretrained Glove embedding section, you have used vocab_size = len(t.word_index) + 1.

    Why 1 is added? I can’t understand the reason.

    • Avatar
      Jason Brownlee August 2, 2020 at 5:40 am #

      To make room for index 0==unknown, so the vocab starts at index 1.

  176. Avatar
    dvl August 4, 2020 at 11:10 pm #

    Hi Jason Brownlee,
    Which one is better according to you?

  177. Avatar
    Ramki August 10, 2020 at 1:06 pm #

    Are there any good word or token embedding models for log files (syslogs, application logs etc.)

    Can you please suggest some links on machine learning applications for log files? e.g.
    – Does a log file have sensitive data? e.g. personal data, topology data like IP address etc,
    – What types of sensitive data are in a log file

      Jason Brownlee August 10, 2020 at 1:38 pm #

      I would recommend training one for your specific dataset.

  178. Avatar
    SJ August 11, 2020 at 3:11 am #

    Where can I get this entire code?

  179. Avatar
    Andre August 27, 2020 at 1:14 pm #

    Hi jason,

    Could you give me some insight with this paper, please?

    I mean, I’ve some confusion when they would use the output of the model to the NCRF++, they said “For converting the instances into embeddings, two features were used namely, sigmoid output from MNN (fe1), character embedding (fe2) of size 30”. What they mean and how?

    Maybe if you have some tutorial sequence tagging using NCRF++ it would be very useful. Thankyou

      Jason Brownlee August 27, 2020 at 1:38 pm #

      This is a common question that I answer here:

      • Avatar
        Andre August 27, 2020 at 4:55 pm #

        Ahh sorry for that, so let me explain it.

        tldr, the author using deep learning model which output a sigmoid probabilities.

        Then the author feed it to the NCRF++, author said “For converting the instances into embeddings, two features were used namely, sigmoid output from MNN (fe1), character embedding (fe2) of size 30”.

        Basically what I want to ask, do you have any articles about how NCRF++ works? Thank you

  180. Avatar
    Elvin Aghammadzada September 13, 2020 at 7:49 am #

    guys, plz consider adding a dark mode (theme) to the website

    MATHAV RAJ J September 17, 2020 at 6:37 pm #

    hi i am trying to concatenate two embedding layers in keras, one will be fixed and other one will be trained,
    model = Sequential()
    fixed_weights = np.array([word2glove[w] for w in words_in_glove])

    e1 = Embedding(input_dim = fixed_weights.shape[0], output_dim = EMBEDDING_DIM, weights = [fixed_weights], input_length = MAX_NEWS_LENGTH, trainable = False)

    e2 = Embedding(input_dim = len(word2index) – fixed_weights.shape[0],output_dim = EMBEDDING_DIM, input_length = MAX_NEWS_LENGTH)
    # model.add()
    model.add(concatenate([e1, e2]))

    but this seems unsuccessful and i get the error, (‘NoneType’ object is not subscriptable).. could you guide me where i am wrong?

    • Avatar
      Jason Brownlee September 18, 2020 at 6:42 am #

      Perhaps have 2 inputs to the model, 1 embedding and LSTM for each input, then concat the outputs of the lstms.

  182. Avatar
    singularityISnear September 25, 2020 at 7:44 pm #

    Hi, I want to write a common Python function for model building which works for a custom embedding layer AND also with GloVe’s pre-trained embeddings

  183. Avatar
    Par September 30, 2020 at 2:05 pm #

    Hi Jason,

    Thanks for the amazing tutorials. Always helpful.

    I have a question. I am trying to extract features from some annotated symptoms (from dialogue). So my sequences are ‘headache extent heavy headache frequency sometimes’ with different lengths each assigned with a time point.

    I want to extract features to later use them for time-series classification (so after flattening I would add a dense layer of size 100, for example). The problem is that I do not want to use the same labels here for the embedding: I’m afraid this will cause bias when I use these features to train the main classifier with the same labels assigned to the time points containing these samples.

    Is there any way to do the embedding without assigning labels to each sample?

    Thank you,

    • Avatar
      Jason Brownlee September 30, 2020 at 2:16 pm #

      Perhaps you can use a standalone doc2vec technique, e.g. an equivalent of standalone word2vec methods.

  184. Avatar
    Cre October 28, 2020 at 9:01 am #

    Hi Jason,
    Short question regarding accuracy and the predict method. When I use:
    model.predict(pad_sequences([one_hot('poor work', vocab_size)], maxlen=max_length, padding='post'))
    I get an output of 0.5076578 {array([[0.5076578]], dtype=float32)}, which by my understanding means that it is neither good nor bad. And if I run model.evaluate I get an accuracy of 100%, but this can’t be true, as ‘poor work’ is in my test set and I get a score of 0.50… and not 0.
    How can this be?

    Thank you very much

  185. Avatar
    Mahesh Chauhan October 29, 2020 at 3:56 pm #

    Thanks Jason,
  186. Avatar
    Tanmay November 6, 2020 at 8:53 pm #

    Hi Jason,

    I am trying to predict the output after training with this code. When I look at the outputs, it always predicts [[0.72645265]]. Why is it doing so?

    I am predicting using the following model call:
    output = model.predict(padded_docs)

    The following is the code for getting padded_docs:
    t = Tokenizer()
    t.fit_on_texts(["good job"])
    vocab_size = len(t.word_index) + 1
    encoded_docs = t.texts_to_sequences(docs)
    max_length = 4
    padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

    Or am I going wrong somewhere? I am not exactly sure what to pass to predict().

    • Avatar
      Jason Brownlee November 7, 2020 at 6:28 am #

      Looks good, it is predicting the probability of the document belonging to class 1.

      E.g. P(class=1) = 70%

      More on predicting with Keras models here:

      • Avatar
        Tanmay November 8, 2020 at 5:44 am #

        Yeah, it is going well. But not for all. When I test on input “poor job”, it predicts the same output “[[0.72645265]]” (and same for all test cases). I think I am providing the wrong input which is simply padded data ([[1,2,0,0]]) every time, not the exact words. How should I form input for the model.predict() function?

        • Avatar
          Jason Brownlee November 8, 2020 at 6:43 am #

          New input data must be prepared in an identical manner as the training dataset – same encoding, vocab, etc.
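
A minimal sketch of what "identical preparation" means here. It uses a simplified pure-Python stand-in for a word-index tokenizer; the helper names `fit_vocab`, `encode` and `pad_post` are hypothetical and are not Keras functions, and the training documents are made up. It illustrates one likely cause of the symptom above: if a fresh vocabulary is fitted on each new text, every two-word text maps to the same integers, so the model sees the same input every time.

```python
def fit_vocab(texts):
    """Assign each distinct word an integer, starting at 1 (0 is kept for padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(text, vocab):
    """Map words to integers with a fixed vocabulary; unseen words are dropped."""
    return [vocab[w] for w in text.lower().split() if w in vocab]


def pad_post(seq, maxlen):
    """Truncate or right-pad with zeros to exactly maxlen entries."""
    return (seq + [0] * maxlen)[:maxlen]


train_docs = ["well done", "good work", "poor effort", "not good", "poor work"]
train_vocab = fit_vocab(train_docs)  # built once, on the training documents

# Wrong: a fresh vocabulary per new text gives every two-word text the same ids.
wrong_a = pad_post(encode("good job", fit_vocab(["good job"])), 4)
wrong_b = pad_post(encode("poor job", fit_vocab(["poor job"])), 4)

# Right: reuse the training vocabulary, so different words get different ids.
right_a = pad_post(encode("good work", train_vocab), 4)
right_b = pad_post(encode("poor work", train_vocab), 4)

print(wrong_a, wrong_b)
print(right_a, right_b)
```

With the Keras Tokenizer the same idea applies: keep the tokenizer object that was fitted on the training documents and call its texts_to_sequences() on new text, rather than fitting a new one.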

          • Avatar
            Juan December 14, 2020 at 12:15 pm #

            I have the same error. Could you please show us the code to predict correctly?

          • Avatar
            Jason Brownlee December 14, 2020 at 1:36 pm #

            Sorry I don’t understand. What error are you having exactly?

          • Avatar
            Juan December 14, 2020 at 2:45 pm #

            To answer myself and Tanmay:

            encoded_try = t.texts_to_sequences(try_docs)
            padded_try = pad_sequences(encoded_try, maxlen=4, padding='post')

            Only the trained words work.

            Jason, good blog. Next time, please also show how to evaluate the model; we are noobs 😉

            PS: (the problem was that we didn’t know how to evaluate text with “model.predict”, because we thought it always gave the same result)

  187. Avatar
    Pratik Sen November 12, 2020 at 5:21 am #

    The output from the Embedding layer is 3D, but in your blog it is written that the output is 2D.

    • Avatar
      Juan December 14, 2020 at 2:42 pm #

      To answer myself and Tanmay:

      encoded_try = t.texts_to_sequences(try_docs)
      padded_try = pad_sequences(encoded_try, maxlen=4, padding='post')

      Only the trained words work.

      Jason, good blog. Next time, please also show how to evaluate the model; we are noobs 😉

  188. Avatar
    Dipankar Porey November 14, 2020 at 11:32 pm #

    what does Embedding layer in model ?

    • Avatar
      Jason Brownlee November 15, 2020 at 6:26 am #

      Sorry, I don’t understand your question, can you please elaborate?

    Dipankar Porey November 14, 2020 at 11:34 pm #

    I mean, why are we using an Embedding layer in a model?

    • Avatar
      Jason Brownlee November 15, 2020 at 6:27 am #

      Embedding layers can be used to learn the relationship between words.

    Sheema Egbert November 19, 2020 at 1:04 am #

    Hi Jason, thanks for your amazing tutorials.

    How do you add bias to the embedding?

    Thanks in advance

    • Avatar
      Jason Brownlee November 19, 2020 at 7:46 am #

      You’re welcome!

      What do you mean by “add bias to the embedding”?

  191. Avatar
    Sheema Egbert November 20, 2020 at 3:16 pm #

    I mean when you train the embedding, is it advantageous to have a single bias term also for each learned embedding?

    Thanks for your interest!

    Sheema Egbert November 21, 2020 at 3:45 pm #

    Thanks for clearing that up, Jason.

    Puneesh Khanna December 10, 2020 at 4:38 pm #

    Hi Jason,
    Is there any way that I can convert a 2D input into embeddings? Basically, I have an adjacency matrix (2-D array) representing a graph, and training would happen over a batch of graphs as data points. Is there any way that each node of the graph (which is basically represented by a row in the adjacency matrix) can be converted to a more meaningful embedding?

  194. Avatar
    vian December 12, 2020 at 3:36 am #

    Thanks, Mr. Jason. Please help me.
    1- If I have one word in each statement and there is no relation between them, how can I represent class labels? I really don’t know the benefit of class labels here.
    2- What is the mathematical model behind the embedding process?

  195. Avatar
    SULAIMAN KHAN December 29, 2020 at 10:54 pm #

    pre-train means test data

    • Avatar
      Jason Brownlee December 30, 2020 at 6:38 am #

      No. Pre-trained means trained on a dataset using a different algorithm, resulting in a standalone model that can be used as part of a subsequent model (like a neural net).

    Arwinder Singh January 3, 2021 at 2:39 am #

    Hi… I want to concatenate the hidden states with the word embeddings. Can we do this?

    • Avatar
      Jason Brownlee January 3, 2021 at 5:57 am #

      Perhaps test a merge or concat layer in keras to see if you can achieve your desired effect?

      • Avatar
        Arwinder Singh January 3, 2021 at 3:48 pm #

        thanks…I have tried the following and it’s giving dimension error:

        encoder_inputs = Input(shape=(None,))
        enc_emb = Embedding(eng_vocab_size, latent_dim, mask_zero = True)(encoder_inputs)
        encoder_lstm = LSTM(latent_dim*4, return_sequences=True, return_state=True)
        encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
        encoder_states = [state_h, state_c]

        error is:
        ValueError: A Concatenate layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(None, 512), (None, None, 128)]

        • Avatar
          Jason Brownlee January 4, 2021 at 6:04 am #

          Sorry, I don’t have the capacity to debug your code.

  197. Avatar
    Anil Rahate January 10, 2021 at 4:21 am #

    Hi Jason, I am trying to understand how we can extract embedded features. E.g. for each training example, how can I get a textual feature of, say, 300 dimensions x 50 sequence length, so that I am not required to use an embedding layer in a sequential model and can directly use LSTM and other layers? Any guidance will really help.

    • Avatar
      Jason Brownlee January 10, 2021 at 5:46 am #

      Each word is mapped to one vector. Retrieve the vectors for your words directly.

  198. Avatar
    vycki January 11, 2021 at 9:32 pm #

    Hello, if I have a document containing 300k sentences and the longest sentence in the document has 451 words, what would my vocab_size and input_length be in the embedding layer? I still don’t quite get it.

    • Avatar
      Jason Brownlee January 12, 2021 at 7:51 am #

      You can choose the vocab size and number of words.

      It could be chosen arbitrarily or chosen based on summary stats calculated from your data.

  199. Avatar
    Patrick January 18, 2021 at 2:08 pm #

    Not knowing the underlying algorithm for embedding makes it puzzling how the output_dim can be of any size.

    Please tell me that the input_length should actually be the length of the longest sentence. I would not be surprised if the input_length can be padded to exceed the longest sentence, but what would be the point?

    • Avatar
      Jason Brownlee January 19, 2021 at 6:32 am #

      There is no algorithm – it’s just a vector of weights for each word in the vocab and the weights are updated like any other weights in the net during training.

      The input length is the number of words you want to pass in in a sample. It can be as short or long as you like and zero padded to all have the same length.

      Perhaps experiment and discover what works well for your dataset and model.

  200. Avatar
    Ebtesam Almansor January 19, 2021 at 3:36 pm #

    What is the difference between the Embedding layer and ELMo embeddings in Keras?

  201. Avatar
    Ebtesam Almansor January 20, 2021 at 11:49 am #

    How, so what is the accuracy for ??
    I used them in classification problem

    • Avatar
      Jason Brownlee January 20, 2021 at 12:42 pm #

      Agreed, accuracy is probably not relevant in the example.

    Dinesh January 20, 2021 at 5:03 pm #

    Hi Jason,
    You have explained vocabulary size as “This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.”

    Please explain vocabulary and vocabulary size to me.

    You mentioned “integer encoded” in your explanation of vocabulary size. But when we use integer encoding, it will be based on the number of words in our data, right? Then why are we giving it as 200? Please explain. I am new to ML; forgive me if my question is silly.


    • Avatar
      Jason Brownlee January 21, 2021 at 6:44 am #

      Vocab is the number of words known to the model, each has a unique number that you assign.

      Which part is confusing to you?

  203. Avatar
    Dinesh January 21, 2021 at 4:54 pm #

    Thank you for your kind reply.

    My doubt is: in the above example, “doc” has 12 unique words. Then why are we using a “vocab_size” of 50? Is the vocabulary size based on the number of words present in the data?
    Where does the value 50 come from?
    Please explain what vocabulary size is with an example.


    • Avatar
      Jason Brownlee January 22, 2021 at 7:18 am #

      In that part of the example, vocab size is contrived.

      • Avatar
        Dinesh January 22, 2021 at 9:13 pm #

        Thank you so much.

    Lele January 30, 2021 at 5:35 am #


    One doubt: in the example given, the embedding layer has ‘vocab_size’ x 8 = 400 weights.

    It is as if we have 50 inputs and 8 processing units (vectors).

    When a sequence of distinct words (length = 4, as in the example) is provided as input to the embedding layer will we have the “activation” of only 32 weights of the embedding layer?

    32 = 4 (number of words in the input string) x 8 (dimension of the vectors)

    • Avatar
      Jason Brownlee January 30, 2021 at 6:41 am #

      If you provide a sequence of 4 words and the dimensionality of the embeddings is 8, then you will have one 8 element vector as output from the embedding for each word, e.g. a 4×8 matrix.
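
A small sketch of the arithmetic in this exchange, written in plain Python rather than Keras. The table values are random and the word ids are made up for illustration. The layer holds a 50 x 8 table (400 weights), and a forward pass simply picks one row per word id, so a document of 4 words reads 4 rows, that is at most 4 x 8 = 32 of the 400 weights.

```python
import random

vocab_size = 50   # rows in the table, one per integer word id
embed_dim = 8     # length of each word vector

random.seed(1)
# The layer's weights: a vocab_size x embed_dim table of small random numbers.
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

doc = [6, 16, 21, 0]   # one padded document of 4 integer-encoded words (made-up ids)

# The forward pass is a lookup: pick row i of the table for word id i.
output = [table[word_id] for word_id in doc]

n_weights = vocab_size * embed_dim
print(n_weights)                      # 400
print(len(output), len(output[0]))    # 4 8
```

The rows for word ids that do not appear in the document are not read at all for that document.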

  205. Avatar
    ADITYA GHOSH February 13, 2021 at 9:10 pm #

    If I add oov_token='UNK' in the Tokenizer, do I still need to use vocab_size = len(t.word_index) + 1?

    lower=True, split=' ', char_level=False, oov_token='UNK'


    • Avatar
      Jason Brownlee February 14, 2021 at 5:06 am #

      Yes, 0 is always reserved for unknown/missing whether you use it or not.
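
To make the indexing concrete, here is a simplified pure-Python stand-in for how the Keras Tokenizer numbers words. The function `build_word_index` is hypothetical and written only for illustration. It mirrors two behaviours of the real Tokenizer: indices start at 1, so 0 is never assigned to a word, and when `oov_token` is set that token takes index 1 and is counted in `word_index`.

```python
from collections import Counter


def build_word_index(texts, oov_token=None):
    """Simplified stand-in for the Tokenizer's indexing: most frequent word first,
    indices start at 1, and the OOV token (if any) takes index 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    words = [w for w, _ in counts.most_common()]
    if oov_token is not None:
        words = [oov_token] + words
    return {w: i + 1 for i, w in enumerate(words)}


docs = ["good work", "poor work", "good effort"]

plain = build_word_index(docs)
with_oov = build_word_index(docs, oov_token="UNK")

# Index 0 is never assigned, so the largest index equals len(word_index)
# and the Embedding layer needs len(word_index) + 1 rows in both cases.
vocab_size_plain = len(plain) + 1
vocab_size_oov = len(with_oov) + 1

print(plain)
print(with_oov)
print(vocab_size_plain, vocab_size_oov)
```

In both cases the largest integer the model can receive is len(word_index), which is why the + 1 is still needed for the Embedding layer's input dimension.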

  206. Avatar
    Marco Cetraro March 2, 2021 at 6:25 am #

    Hello Jason,

    Thank you very much for your great posts.

    I have reviewed almost all the questions and answers in this post and I have noticed that several people asked the same question that I have but your answer was not related to what they asked.

    You mentioned:
    “The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word.”

    I understand that the 8 dimensions is an arbitrary number. But what I do not understand is the 4 vectors. If there are 10 docs and each doc was converted into a vector of 4 dimensions, why is the output of the Embedding layer going to be 4 vectors?

    Thank you.

    • Avatar
      Jason Brownlee March 2, 2021 at 8:13 am #

      Docs are not translated to vectors, words are. Each word will be a vector with length 8, a doc will be a matrix of such vectors. 10 docs will be 10 matrices of such vectors.

      Does that help?

      Run the code and see for yourself!

  207. Avatar
    MS March 22, 2021 at 12:24 am #

    Can we use a TF-IDF vectorizer or OHE to convert characters to vectors instead of texts_to_sequences() in the GloVe embedding case?

  208. Avatar
    David March 23, 2021 at 7:49 pm #

    Thank you very much Jason for your wonderfully well written blog post.

    Harsha March 31, 2021 at 8:59 pm #

    For example, I have a data frame say with 3 columns,i.e Review, Funny or Not, Positive or Negative. So I converted my review column into a padded_seq column, used the column to get a resultant embedding matrix from Word2Vec, and binarized the Funny or Not column. Okay so now along with the padded Sequences how can I send another column as a feature to my model.

    • Avatar
      Jason Brownlee April 1, 2021 at 8:13 am #

      Sorry, I don’t understand your question.

      Perhaps you can summarise or rephrase it?

  210. Avatar
    Meriem April 22, 2021 at 11:59 am #

    Thank you Jason!!

    So, in general (not only for text), when we use embedding layer ( in Keras) for converting categorical features into numerical vectors, is it just adding an embedding layer in our main machine learning model (and the embedding layer just convert the categorical features)? Or we separately train two different models, the first model with embedding layer for producing numeric features (converting categorical features) and the second model for prediction (our main problem) where we feed the embeddings as features?

    Thank you!

      Jason Brownlee April 23, 2021 at 4:57 am #

      You’re welcome.

      Yes, embedding just converts ordinal encoded values to real-valued vectors for any model you like. Neural nets let you train the embedding as part of the model.

  211. Avatar
    Mitra April 26, 2021 at 2:34 pm #

    Hi, thanks for creating amazing blogs. I have a question regarding the source of our embeddings. Given the fact that models like BERT learn contextualized embeddings, is there any chance that we can use their tokenizer and then their embeddings in, for example, a translation model? I have pretrained the Persian BERT model on my domain-specific data, and now I’m wondering if I can use these embeddings in a translation model.

  212. Avatar
    RESEENA MOL April 30, 2021 at 5:00 pm #

    When I try to run this code, I get this error. What’s wrong with this code? Thanks in advance.
    embeddings_index = dict()
    f = open('glove.6B.100d.txt', 'r', errors='ignore', encoding='utf8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    print('Loaded %s word vectors.' % len(embeddings_index))

    TypeError Traceback (most recent call last)
    in ()
    4 values = line.split()
    5 word = values[0]
    ----> 6 coefs = array(values[1:], dtype='float32')
    7 embeddings_index[word] = coefs
    8 f.close()

    TypeError: array.array() takes no keyword arguments
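
A possible reading of this traceback: the message names `array.array`, which is the class from Python's standard `array` module, so the name `array` in the failing line probably does not refer to NumPy. The sketch below assumes that is the cause and uses NumPy's `asarray` instead. It parses two made-up lines held in memory (in the GloVe text layout) rather than the real glove.6B.100d.txt file.

```python
import io

from numpy import asarray  # NumPy's asarray accepts dtype=...; array.array does not

# Two made-up lines in the GloVe text layout: a word followed by its coefficients.
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")

embeddings_index = {}
for line in sample:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype="float32")
    embeddings_index[word] = coefs

print("Loaded %s word vectors." % len(embeddings_index))
```

If the script contains something like `from array import array`, removing it and importing `asarray` (or `array`) from numpy should make the original loop behave the same way as this sketch.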

  213. Avatar
    Alok June 18, 2021 at 8:04 pm #

    Hi, thank you for the amazing blog. Is there a way to validate how well my embedding is working?

    • Avatar
      Jason Brownlee June 19, 2021 at 5:51 am #

      Perhaps use it in a model and compare the performance of a model with and without it, or with different encodings/embeddings.

  214. Avatar
    Amardeep July 11, 2021 at 5:00 am #

    I am looking at an approach for combining numerical and text features for prediction. Is there any way to use word embeddings and get this done with neural networks?

  215. Avatar
    Radhika July 20, 2021 at 2:12 am #

    Hello Jason,

    Thank you for the very well-written post. A question on the third scenario, please:
    “How to use a pre-trained word embedding in a neural network”

    In this tutorial, we used the GloVe word embeddings as “weights” for the embedding layer. Could we instead represent/encode each word in the input (training data) with its corresponding GloVe word embedding and then train the neural network? In this case, would we no longer need an embedding layer at all?

    • Avatar
      Jason Brownlee July 20, 2021 at 5:35 am #

      You’re welcome.

      Yes, it is the same thing. Either the embedding is used directly in the model or as pre-processing for input data. The same.

      • Avatar
        Radhika July 21, 2021 at 12:56 am #

  216. Avatar
    saba August 9, 2021 at 4:07 am #

    Hi Jason,
    I am using BERT for word embedding. I am doing sentiment analysis of news headlines and classifying the data into positive and negative classes. Is it fine to use only the embedding of the CLS token and then classify it? I am using an SVM classifier with BERT embeddings. I am getting 84% accuracy, which is not what I want; I want higher accuracy. Any alternative solution will be appreciated. Thanks.

    • Avatar
      Jason Brownlee August 9, 2021 at 5:59 am #

      Perhaps try alternate embeddings, alternate data preparations, alternate predictive models and configurations, etc.

    Parita Shah August 11, 2021 at 9:43 pm #

    Keras provides the one_hot() function that creates a hash of each word as an efficient integer encoding. We will estimate the vocabulary size of 50, which is much larger than needed to reduce the probability of collisions from the hash function.

    For the above sentence, I didn’t understand why we took a vocabulary size of 50 when we had around 15. Can you please explain in layman’s terms? Do we always need to take values larger than our actual vocab_size?

    • Adrian Tam
      Adrian Tam August 12, 2021 at 5:55 am #

      The vocab should be large enough to capture the complexity of the data set.
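
To illustrate the collision point in the quoted sentence, the sketch below hashes words into the range [1, n - 1], which is the same range Keras's one_hot() produces for a size of n. It is only an illustration: the function names are hypothetical, the word list is a made-up sample, and a stable md5 hash is swapped in for the default hash so that results repeat between runs.

```python
from hashlib import md5


def hash_encode(words, n):
    """Map each word to an integer in [1, n - 1] by hashing (a stable md5 hash is
    used here so results repeat between runs)."""
    return [int(md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1 for w in words]


def count_collisions(words, n):
    """How many distinct integers are lost because different words share a value."""
    ids = hash_encode(words, n)
    return len(set(words)) - len(set(ids))


words = ["well", "done", "good", "work", "great", "effort", "nice",
         "excellent", "weak", "poor", "not", "could", "have", "better"]

# Compare a size far too small, a size close to the word count, and larger sizes.
for n in (5, 15, 50, 1000):
    print(n, count_collisions(words, n))
```

With 14 distinct words and a size close to 14, several words are likely to share an integer and become indistinguishable to the model; a larger size makes that less likely. An exact word-to-integer mapping (such as the Tokenizer used later in the tutorial) avoids collisions entirely, so the extra headroom is only needed for the hashing approach.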

    Francesco August 28, 2021 at 2:27 am #

    Hello, Jason, thank you for your tutorials which help me a lot. I have one question about the problem you tackled in this tutorial.
    I’m doing a project about text summarization and I have a dataframe consisting of two columns, the first contains the texts, the second contains the summaries. I want to learn the embeddings of the words before feeding the texts to my seq2seq model for the summarization (I want to create a separate model for the embeddings otherwise there would be too many parameters to learn by just one model).

    In the preprocessing phase, I one-hot encoded both the texts and the summaries, and then I padded them. Now each of the texts is a sequence with a length of 462, while each summary is a sequence of length 140.

    My question is: since I want to learn embeddings for the whole vocabulary of words that my dataset contains, do I need to stretch the length of the summary sequences so that they reach a length of 462, so that all the sequences have the same length and can be fed to the embedding model?

    • Adrian Tam
      Adrian Tam August 28, 2021 at 4:12 am #

      Usually you need to do alignments like what you described, but there are many ways to do it. Padding is one. If you have word2vec embeddings, you may also sum the vectors of the words, so that instead of length 462 or 140 you get a vector of length N (for length-N word vectors) regardless of how long your sentence was. I am sure you can think of many different ways to do it. It is your call to justify the method you choose.
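
The sum-of-vectors idea can be sketched in a few lines of plain Python. The 3-dimensional word vectors below are made up for illustration; with real word2vec vectors N would typically be in the hundreds.

```python
def sum_vectors(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


# Made-up 3-dimensional word vectors, standing in for pretrained ones.
word_vectors = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.5, -0.1, 0.3],
    "sat": [0.0, 0.4, -0.2],
    "down": [0.2, 0.2, 0.2],
}

text = "the cat sat down".split()   # a longer sequence
summary = "cat sat".split()         # a shorter sequence

text_vec = sum_vectors([word_vectors[w] for w in text])
summary_vec = sum_vectors([word_vectors[w] for w in summary])

# Both results have length N = 3, however many words went in.
print(len(text_vec), len(summary_vec))
```

The trade-off is that word order is lost in the sum, which may or may not matter for the task.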

    Arwa September 2, 2021 at 1:51 am #

    Hello dear,
    thank you for this tutorial

    I tried the code, but it depends on the problem that I want to solve. I have questions:

    The problem is comparing 2 sentences to see if they have the same meaning or not,
    so the training data consists of [sentence1, sentence2] and a label [0 or 1].

    1- I extract the word embeddings from the train and test data; is that right?
    2- How do I pass the input to the model? All the examples I find pass just one doc/sentence,
    but I want to pass 2 sentences.

    Could you help, please?

  220. Avatar
    Arwa September 2, 2021 at 8:14 pm #

    Thank you very much for replying.

    Yes, I did the word embedding and my dataset is ready to use, but I am stuck on fitting the model!
    How do I input 2 embedded sentences?
    My data looks like this:
    x_trine_data = [ [ [0.2 , 0.3, 0.9], [0.7, …,],[…],..] [ [0.4,0.3..],[0.32,..] ] ]
    x_trine_data[0] represents sen1
    x_trine_data[1] represents sen2

    sen1                    sen2                    label
    [0] hello word there    [0] welcome to class    0
    [1] A went to school    [1] A in the school     1


  221. Avatar
    Syed Ali Raza Hamdani September 11, 2021 at 11:11 pm #

    In Francois Chollet’s Fig 61, “From text to tokens to vectors”, there are zeros. I don’t understand the zeros.

  222. Avatar
    Sergey September 14, 2021 at 10:06 pm #

    Thanks a lot for the tutorial! Explained simply and nicely!

      Adrian Tam September 15, 2021 at 10:57 pm #

      Glad you liked it. Thank you.

  223. Avatar
    Sourabh September 18, 2021 at 7:42 pm #

    Hi, I did not understand how the embedding layer will map the weight matrix to its respective keys (words). Also, what about the words that you pass to the model in padded_docs but that were absent from the GloVe vectors?

    • Adrian Tam
      Adrian Tam September 19, 2021 at 6:42 am #

      For words unknown to the pretrained vectors, you can predefine a replacement embedding for them (or you can do some preprocessing to replace those undefined words before looking up the embedding).
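
A minimal sketch of both points in this exchange, using NumPy and made-up 3-dimensional vectors in place of real GloVe vectors. Row i of the weight matrix holds the vector for the word whose integer id is i, which is how the layer "knows" which vector belongs to which word. The replacement used here for unknown words, the mean of the known vectors, is just one assumed choice; leaving the row as zeros is another.

```python
import numpy as np

embedding_dim = 3

# Made-up pretrained vectors standing in for the loaded GloVe dictionary.
pretrained = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "work": np.array([0.4, 0.0, -0.1], dtype="float32"),
}

# Word-to-integer mapping from the tokenizer; "zzyzx" has no pretrained vector.
word_index = {"good": 1, "work": 2, "zzyzx": 3}
vocab_size = len(word_index) + 1  # row 0 is left for padding

# One possible replacement vector for unknown words (an assumption, not a rule).
fallback = np.mean(list(pretrained.values()), axis=0)

embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype="float32")
missing = []
for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is None:
        missing.append(word)
        vector = fallback
    embedding_matrix[i] = vector  # row i <-> word with integer id i

print(embedding_matrix.shape)
print("No pretrained vector for:", missing)
```

Logging the `missing` list is a cheap way to see how much of the vocabulary the pretrained vectors actually cover.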

  224. Avatar
    Julia October 6, 2021 at 4:40 am #

    Hi Jason!
    Thank you for this tutorial.
    My question is: can Keras Embedding layer be used only for textual data?

    Thank you!

    • Adrian Tam
      Adrian Tam October 6, 2021 at 11:29 am #

      Not only textual data, but English words are the easiest because there is a lot of embedding data out there for download. Technically, that doesn’t mean it’s the only application. You can use an embedding layer for anything as long as you can justify it.

    Rania November 13, 2021 at 2:49 am #

    Dear Dr. Brownlee,

    I would like to thank you for your valuable articles, I really learned a lot.

    My question is: Can we use fasttext pretrained embedding the same way as Glove?

    • Adrian Tam
      Adrian Tam November 14, 2021 at 2:28 pm #

      Should work. Just the dimension would be different.

    Raja December 29, 2021 at 4:24 pm #

    Thank you Jason for the nice article. I have below query:

    # integer encode the documents
    vocab_size = 50
    # pad documents to a max length of 4 words
    max_length = 4

    Can’t the method find out the total number of words in the supplied input and decide vocab_size automatically? We may not know the actual vocab_size and hence may supply wrong values. For example, instead of 50 we may pass 5. Then many words could have the same number.

    In the same way, can’t the method find the maximum padding length based on the number of words in each sentence? As mentioned above, we may make mistakes while passing this value as well.

    Kindly throw some light on this. Thanks.
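
Both values can be derived from the data when an exact word-to-integer mapping is used instead of hashing. The sketch below does this in plain Python with a hypothetical `tokenize` helper and a handful of made-up documents; it is an illustration of the idea, not the tutorial's code.

```python
docs = ["Well done!", "Good work", "Great effort", "nice work", "Excellent!"]


def tokenize(text):
    """Lower-case the text, split on spaces and strip simple punctuation."""
    return [w.strip("!.,?").lower() for w in text.split()]


tokens = [tokenize(d) for d in docs]

# Exact mapping: every distinct word gets its own integer, starting at 1.
vocab = sorted({w for doc in tokens for w in doc})
word_to_id = {w: i + 1 for i, w in enumerate(vocab)}

vocab_size = len(word_to_id) + 1                  # + 1 because 0 is kept for padding
max_length = max(len(doc) for doc in tokens)      # longest document in the data

encoded = [[word_to_id[w] for w in doc] for doc in tokens]
padded = [seq + [0] * (max_length - len(seq)) for seq in encoded]

print(vocab_size, max_length)
print(padded)
```

The Keras Tokenizer offers the same convenience through len(t.word_index) + 1, as used in the GloVe part of the tutorial. A fixed guess such as 50 is only needed for the hashing-based one_hot() encoding, where the true vocabulary is never built.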

    • Avatar
      James Carmichael February 28, 2022 at 12:13 pm #

      Hi Raja…Please narrow your question or provide a more specific query so that we may better assist you.

    sam January 1, 2022 at 12:46 am #

    Thanks, Dr. Brownlee, for this great blog. I used GloVe as an embedding layer in a captioning model by getting the embedding matrix, but if I need to use BERT, how can I start, please?

    Will January 6, 2022 at 11:54 pm #

    Hi Jason

    I’m confused about how the output from the embedding layer in section 3 is 4 by 8.

    If the sentence is a 1 by 4 vector and the embedding matrix is a 8 by 50 matrix,
    if we multiply them together you don’t get a 4 by 8 matrix, you’d get an error because it confounds the laws of matrix multiplication.

    So how are output dimensions what they are? I’m sure I’m missing something! many thanks.
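
One way to see where the 4 by 8 shape comes from, sketched with NumPy and random weights. The sentence is not multiplied as a 1 by 4 vector of numbers. Its 4 entries are integer ids, and each id selects one row of the weight table, which the Keras Embedding layer stores as 50 by 8 (vocabulary size by vector length). That lookup gives the same result as multiplying a 4 by 50 one-hot matrix by the 50 by 8 table.

```python
import numpy as np

vocab_size, embed_dim = 50, 8
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, embed_dim))  # 50 x 8, one row per word id

doc = np.array([6, 16, 21, 0])  # 4 integer word ids (made up), not 4 real-valued features

# What the layer does: select one row per id.
looked_up = weights[doc]                      # shape (4, 8)

# The equivalent matrix product: one-hot rows (4 x 50) times the table (50 x 8).
one_hot = np.zeros((len(doc), vocab_size))
one_hot[np.arange(len(doc)), doc] = 1.0
multiplied = one_hot @ weights                # shape (4, 8)

print(looked_up.shape, multiplied.shape)
print(np.allclose(looked_up, multiplied))
```

The lookup form is used in practice because it avoids building the large, mostly zero one-hot matrix.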

  229. Avatar
    Dev March 10, 2022 at 9:43 pm #

    Exactly what I was looking for! love you man

    • Avatar
      James Carmichael March 11, 2022 at 1:01 pm #

      Great feedback Dev!

    zi May 13, 2022 at 2:49 am #

    in this example,
    it is something like transfer learning, yes ?

  231. Avatar
    Majida September 14, 2022 at 6:44 pm #

    hi Jason,

    Kindly guide me: how can I use pretrained word embedding models for local languages that are not available in the pretrained model? Or do I have to use an embedding layer for the embedding matrix?

    How can I get benefit from pretrained models for local language?

    • Avatar
      James Carmichael September 15, 2022 at 5:39 am #

      Hi Majida…Please clarify what is meant by “local language” so that we may better assist you.

  232. Avatar
    Majida September 20, 2022 at 1:29 pm #

    hi, Thank you for your kind reply.

    English sentence : “He is a good boy”
    Roman Urdu: “ye acha larka ha”

    ye for he.
    ha for is.
    acha for good
    larka for boy.

    Actually this is not proper Urdu language spoken in Pakistan, India, and all around the world. But this form of writing is popular among social media users.

  233. Avatar
    Silvio September 20, 2022 at 4:47 pm #

    Hi Jason

    tf.keras.preprocessing.text.one_hot is deprecated, and it recommends using tf.keras.layers.Hashing with output_mode='one_hot' instead.
    But when I use this, it says: “('Keyword argument not understood:', 'output_mode')”

    Do you have any idea?

    • Avatar
      James Carmichael September 21, 2022 at 5:28 am #

      Hi Silvio…Please clarify the source of the recommendation and error you are referring to so that we may better assist you.

  234. Avatar
    Subodh October 31, 2022 at 2:55 am #

    Hello! I am building a model using an Embedding layer followed by a GlobalAveragePooling1D() layer and other layers, as shown below:

    tf.keras.layers.Embedding(input_dim=10000, output_dim=300, input_length=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='tanh'),
    tf.keras.layers.Dense(16, activation='tanh'),
    tf.keras.layers.Dense(1, activation='sigmoid')

    what I understood was after the Embedding layer the following layers obtain the data or feature vectors of the corresponding words. And that feature vector is used for further processing in other layers. So, the Embedding layer was self-sufficient to generate the feature vector.

    But From this article, which is very much informative though, what I understood is for generating the feature vectors, it needs to propagate through all other layers. If data propagate through the whole layer for generating a feature vector, then how is that feature vector used?

    My other question is if I am using RNN layers, do I have to use an embedding layer?

  235. Avatar
    Iraj Koohi November 24, 2022 at 9:03 am #

    Great article, thanks.
    For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.
    Regarding the following setup of embedded layer:
    e = Embedding(200, 32, input_length=50)
    What would be the setup of embedded layer or layers if we want to process three separate inputs at a time instead of one input above?
    For example, I have three inputs each 100 vocabulary size?

  236. Avatar
    Iraj Koohi November 24, 2022 at 9:17 am #

    To clarify my question:
    I have three sub-addresses, each a unique integer 0-99, and the combination of those three makes a unique address. Instead of using a dense layer with 100*100*100=1000000 inputs, I am using three identical embedding layers, one after each input layer. Next comes concatenation of the outputs of the embedding layers, and finally I process the output of the dense layer.
    My question is if I can use a single dense layer or not? If so, what will be the setup of dense layer?

  237. Avatar
    Tao November 29, 2022 at 8:48 am #

    Given: Max 3 words per case, Embedding size = 8. After flattening, each case will end up having 3 x 8 = 24 ‘features’ for any downstream architecture in the model.

    Consider the sentences ‘cat sat mat’, ‘mat cat sat’, ‘sat mat cat’ etc. Each has the same 24 features, just in different order/chunks.

    It seems like ‘feature X’ or ‘column X’ is not conceptually the same case to case. For example, sometimes feature 9 is the first latent embedding value for CAT, sometimes it is the first latent embedding element for SAT, sometimes it is for MAT. It’s not about the token, it’s about the token that *happens* (bag of words) to be there for that case.

    How can downstream learning from these 24 new features have consistent results? I think I am missing something conceptually.


  238. Avatar
    Sandeep March 12, 2024 at 11:23 pm #

    If I use tokenizer = Tokenizer(oov_token=oov_token), then I don’t have to add +1 to vocab_size, right?

    • Avatar
      James Carmichael March 13, 2024 at 8:59 am #

      Hi Sandeep…Please proceed with your idea and let us know what you find.

Leave a Reply