How to Use Word Embedding Layers for Deep Learning with Keras

Word embeddings provide a dense representation of words and their relative meanings.

They are an improvement over sparse representations used in simpler bag of word model representations.

Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.

In this tutorial, you will discover how to use word embeddings for deep learning in Python with Keras.

After completing this tutorial, you will know:

  • About word embeddings and that Keras supports word embeddings via the Embedding layer.
  • How to learn a word embedding while fitting a neural network.
  • How to use a pre-trained word embedding in a neural network.


Let’s get started.

  • Updated Feb/2018: Fixed a bug due to a change in the underlying APIs.
  • Updated Oct/2019: Updated for Keras 2.3 and TensorFlow 2.0.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Word Embedding
  2. Keras Embedding Layer
  3. Example of Learning an Embedding
  4. Example of Using Pre-Trained GloVe Embedding


1. Word Embedding

A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the more traditional bag-of-words model encoding schemes, where large sparse vectors were used to represent each word or to score each word within a vector to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector comprised mostly of zero values.

Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

The position of a word in the learned vector space is referred to as its embedding.

Two popular examples of methods of learning word embeddings from text include:

  • Word2Vec.
  • GloVe.

In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but tailors the model to a specific training dataset.

2. Keras Embedding Layer

Keras offers an Embedding layer that can be used for neural networks on text data.

It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a variety of ways, such as:

  • It can be used alone to learn a word embedding that can be saved and used in another model later.
  • It can be used as part of a deep learning model where the embedding is learned along with the model itself.
  • It can be used to load a pre-trained word embedding model, a type of transfer learning.

The Embedding layer is defined as the first hidden layer of a network. It must specify 3 arguments:


  • input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
  • output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
  • input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.

For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.

The Embedding layer has weights that are learned. If you save your model to file, this will include weights for the Embedding layer.

The output of the Embedding layer is a 2D matrix with one embedding vector for each word in the input sequence of words (input document).

If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
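To make the shapes concrete, here is a minimal plain-Python sketch of what an embedding lookup and a flatten step compute. It is not Keras code: the table is filled with random numbers standing in for learned weights, and the names embedding_table, embed and flatten are made up for illustration.

```python
import random

random.seed(1)

vocab_size = 200   # input_dim: integers 0..199
embed_dim = 32     # output_dim: size of each word vector
seq_length = 50    # input_length: words per document

# One row of random weights per word index, standing in for learned weights.
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    """Look up one vector per integer-encoded word."""
    return [embedding_table[index] for index in sequence]

def flatten(matrix):
    """Concatenate the rows of a 2D matrix into one 1D list."""
    return [value for row in matrix for value in row]

document = [random.randrange(vocab_size) for _ in range(seq_length)]
embedded = embed(document)    # 50 rows of 32 values each
flat = flatten(embedded)      # 50 * 32 = 1600 values

print(len(embedded), len(embedded[0]), len(flat))
```

The lookup returns one row per input integer, which is why the output is two-dimensional and needs flattening before a layer that expects a single vector.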

Now, let’s see how we can use an Embedding layer in practice.

3. Example of Learning an Embedding

In this section, we will look at how we can learn a word embedding while fitting a neural network on a text classification problem.

We will define a small problem where we have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem.

First, we will define the documents and their class labels.

Next, we can integer encode each document. This means that as input the Embedding layer will have sequences of integers. We could experiment with other more sophisticated bag of word model encoding like counts or TF-IDF.

Keras provides the one_hot() function that creates a hash of each word as an efficient integer encoding. We will estimate the vocabulary size as 50, which is much larger than needed, in order to reduce the probability of collisions from the hash function.
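The idea behind hash-based integer encoding can be sketched in plain Python. This is an illustration of the technique rather than the Keras implementation: hash_encode is a hypothetical helper, it uses MD5 so that results are repeatable between runs, and the sample documents are made up.

```python
import hashlib
import re

def hash_encode(text, n):
    """Map each word to an integer in 1..n-1 via a stable hash."""
    words = re.findall(r"[a-z']+", text.lower())
    encoded = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # 0 is kept free so it can be used for padding later.
        encoded.append(int(digest, 16) % (n - 1) + 1)
    return encoded

vocab_size = 50
docs = ["Well done!", "Could have done better."]  # illustrative documents
encoded_docs = [hash_encode(doc, vocab_size) for doc in docs]
print(encoded_docs)
```

Because different words can hash to the same integer, a vocabulary size well above the true number of unique words makes such collisions less likely, though it does not rule them out.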

The sequences have different lengths and Keras prefers inputs to be vectorized and all inputs to have the same length. We will pad all input sequences to have the length of 4. Again, we can do this with a built in Keras function, in this case the pad_sequences() function.
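As a rough sketch of what padding does, the plain-Python helper below brings every sequence to the same length. The name pad_post is hypothetical, the integer encodings are made up, and appending zeros at the end is an assumption made for this illustration (the Keras function can also pad at the front).

```python
def pad_post(sequences, max_length, value=0):
    """Truncate or right-pad each sequence to exactly max_length items."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:max_length])
        padded.append(trimmed + [value] * (max_length - len(trimmed)))
    return padded

encoded_docs = [[6, 16], [42], [7, 2, 16, 31]]  # made-up integer encodings
padded_docs = pad_post(encoded_docs, max_length=4)
print(padded_docs)  # [[6, 16, 0, 0], [42, 0, 0, 0], [7, 2, 16, 31]]
```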

We are now ready to define our Embedding layer as part of our neural network model.

The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.

The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this to a single 32-element vector to pass on to the Dense output layer.
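The sizes involved can be checked with a little arithmetic. The sketch below assumes the Dense output layer has a single unit, which is a common choice for binary classification but is an assumption here.

```python
vocab_size = 50      # input_dim of the Embedding layer
embed_dim = 8        # output_dim of the Embedding layer
input_length = 4     # words per padded document

embedding_params = vocab_size * embed_dim   # one 8-value vector per word index
flattened_size = input_length * embed_dim   # 4 vectors of 8 values, concatenated
dense_params = flattened_size * 1 + 1       # one weight per input plus a bias (assumed single unit)

print(embedding_params, flattened_size, dense_params)  # 400 32 33
```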

Finally, we can fit and evaluate the classification model.

The complete code listing is provided below.

Running the example first prints the integer encoded documents.

Then the padded versions of each document are printed, making them all uniform length.

After the network is defined, a summary of the layers is printed. We can see that as expected, the output of the Embedding layer is a 4×8 matrix and this is squashed to a 32-element vector by the Flatten layer.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Finally, the accuracy of the trained model is printed, showing that it learned the training dataset perfectly (which is not surprising).

You could save the learned weights from the Embedding layer to file for later use in other models.

You could also use this model generally to classify other documents that have the same kind of vocabulary seen in the training dataset.

Next, let’s look at loading a pre-trained word embedding in Keras.

4. Example of Using Pre-Trained GloVe Embedding

The Keras Embedding layer can also use a word embedding learned elsewhere.

It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.

For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website, released under a public domain license.

The smallest package of embeddings is 822 MB. It was trained on a dataset of 6 billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes, including 50, 100, 200 and 300 dimensions.

You can download this collection of embeddings and we can seed the Keras Embedding layer with weights from the pre-trained embedding for the words in your training dataset.

This example is inspired by an example in the Keras project.

After downloading and unzipping, you will see a few files, one of which is “glove.6B.100d.txt“, which contains a 100-dimensional version of the embedding.

If you peek inside the file, you will see a token (word) followed by the weights (100 numbers) on each line. For example, below is the first line of the embedding ASCII text file showing the embedding for “the“.

As in the previous section, the first step is to define the examples, encode them as integers, then pad the sequences to be the same length.

In this case, we need to be able to map words to integers as well as integers to words.

Keras provides a Tokenizer class that can be fit on the training data, can convert text to sequences consistently by calling the texts_to_sequences() method on the Tokenizer class, and provides access to the dictionary mapping of words to integers in a word_index attribute.
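The two mappings can be illustrated with a simplified plain-Python stand-in. This is not the Keras Tokenizer itself: fit_word_index and this texts_to_sequences function are hypothetical helpers that only mimic the behaviour described above (integers starting at 1, more frequent words first), and the sample documents are made up.

```python
import re
from collections import Counter

def fit_word_index(texts):
    """Assign integers starting at 1, most frequent words first."""
    counts = Counter(word for text in texts
                     for word in re.findall(r"[a-z']+", text.lower()))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def texts_to_sequences(texts, word_index):
    """Encode each text as a list of integers, skipping unknown words."""
    return [[word_index[w] for w in re.findall(r"[a-z']+", text.lower())
             if w in word_index]
            for text in texts]

docs = ["Well done!", "Good work", "Could have done better."]  # made up
word_index = fit_word_index(docs)
index_word = {i: w for w, i in word_index.items()}  # integers back to words
sequences = texts_to_sequences(docs, word_index)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved
print(word_index, sequences, vocab_size)
```

Inverting the dictionary gives the integer-to-word direction, and the vocabulary size is one larger than the number of unique words because 0 is left free for padding.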

Next, we need to load the entire GloVe word embedding file into memory as a dictionary of word to embedding array.

This is pretty slow. It might be better to filter the embedding for the unique words in your training data.
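One way to do that filtering is sketched below, assuming the file format described earlier (a word followed by space-separated numbers on each line). The function name load_embeddings and its keep parameter are hypothetical, and a tiny made-up in-memory file with 3 values per word stands in for the real one.

```python
import io

def load_embeddings(lines, keep=None):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats.

    If keep is given, only words in that set are stored.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if keep is not None and word not in keep:
            continue
        embeddings[word] = [float(v) for v in values]
    return embeddings

# Tiny made-up stand-in for the real file, with 3 values per word.
fake_file = io.StringIO("the 0.1 -0.2 0.3\nwork 0.4 0.5 -0.6\ncat 0.7 0.8 0.9\n")
embeddings_index = load_embeddings(fake_file, keep={"work", "done"})
print(embeddings_index)  # {'work': [0.4, 0.5, -0.6]}
```

With the real file, the same function could be passed an open file handle, for example open('glove.6B.100d.txt', encoding='utf8'); stating the encoding explicitly avoids decode errors on platforms whose default is not UTF-8.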

Next, we need to create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.

The result is a matrix of weights only for words we will see during training.
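A minimal sketch of this step, using plain lists instead of a NumPy array, might look as follows. The helper name build_embedding_matrix and the 3-dimensional sample vectors are made up; leaving all-zero rows for words that have no pre-trained vector is an assumption of this sketch.

```python
def build_embedding_matrix(word_index, embeddings_index, dim):
    """Return one row per integer index; row 0 and unknown words stay zero."""
    vocab_size = len(word_index) + 1
    matrix = [[0.0] * dim for _ in range(vocab_size)]
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix

# Made-up inputs with 3-dimensional vectors for brevity.
word_index = {"done": 1, "work": 2, "zzzunknown": 3}
embeddings_index = {"done": [0.1, 0.2, 0.3], "work": [0.4, 0.5, 0.6]}
embedding_matrix = build_embedding_matrix(word_index, embeddings_index, dim=3)
print(embedding_matrix)
```

Row i of the matrix holds the vector for the word whose integer code is i, so the matrix lines up with the integer-encoded documents.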

Now we can define our model, fit, and evaluate it as before.

The key difference is that the embedding layer can be seeded with the GloVe word embedding weights. We chose the 100-dimensional version, therefore the Embedding layer must be defined with output_dim set to 100. Finally, we do not want to update the learned word weights in this model, therefore we will set the trainable argument of the Embedding layer to False.

The complete worked example is listed below.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example may take a bit longer, but then demonstrates that it is just as capable of fitting this simple problem.

In practice, I would encourage you to experiment with three options: learning a word embedding from scratch, using a pre-trained embedding that is kept fixed, and using a pre-trained embedding as a starting point that is then updated during training.

See what works best for your specific problem.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


In this tutorial, you discovered how to use word embeddings for deep learning in Python with Keras.

Specifically, you learned:

  • About word embeddings and that Keras supports word embeddings via the Embedding layer.
  • How to learn a word embedding while fitting a neural network.
  • How to use a pre-trained word embedding in a neural network.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


652 Responses to How to Use Word Embedding Layers for Deep Learning with Keras

  1. Avatar
    Mohammad October 4, 2017 at 7:58 am #

    Thank you Jason,
    I am excited to read more NLP posts.

    • Avatar
      Jason Brownlee October 4, 2017 at 8:03 am #


      • Avatar
        Ajit March 15, 2020 at 12:40 am #

        Thanks man, It was really helpful.

    • Avatar
      sherry July 22, 2019 at 7:22 pm #

      after embedding,have to have a “Flatten()” layer? In my project, I used a dense layer directly after embedding. is it ok?

    • Avatar
      Peter Nduru October 8, 2019 at 6:06 am #

      I appreciate how well updated you keep these tutorials. the first thing I always look at, when I start reading is the update date. thank you very much.

      • Avatar
        Jason Brownlee October 8, 2019 at 8:10 am #

        You’re welcome.

        I require all of the code to work and keep working!

    • Avatar
      Martin October 11, 2019 at 4:09 am #

      Hi, Jason:

      when one_hot encoding is used, why is padding necessary? Doesn’t one_hot encoding already create an input of equal length?

  2. Avatar
    shiv October 5, 2017 at 10:07 am #

    I split my data into 80-20 test-train and I’m still getting 100% accuracy. Any idea why? It is ~99% on epoch 1 and the rest its 100%.

  3. Avatar
    Sandy October 6, 2017 at 2:44 pm #

    Thank you Jason. I always find things easier when reading your post.
    I have a question about the vector of each word after training. For example, the word “done” in sentence “Well done!” will be represented in different vector from that word in sentence “Could have done better!”. Is that right? I mean the presentation of each word will depend on the context of each sentence?

    • Avatar
      Jason Brownlee October 7, 2017 at 5:48 am #

      No, each word in the dictionary is represented differently, but the same word in different contexts will have the same representation.

      It is the word in its different contexts that is used to define the representation of the word.

      Does that help?

      • Avatar
        Sandy October 7, 2017 at 5:37 pm #

        Yes, thank you. But I still have a question. We will train each context separately, then after training the first context, in this case is “Well done!”, we will have a vector representation of the word “done”. After training the second context, “Could have done better”, we have another vector representation of the word “done”. So, which vector will we choose to be the representation of the word “done”?
        I might misunderstand the procedure of training. Thank you for clarifying it for me.

        • Avatar
          Jason Brownlee October 8, 2017 at 8:32 am #

          No. All examples where a word is used are used as part of the training of the representation of the word. There is only one representation for each word during and after training.

          • Avatar
            Sandy October 8, 2017 at 2:46 pm #

            I got it. Thank you, Jason.

  4. Avatar
    Chiedu October 7, 2017 at 5:36 pm #

    Hi Jason,
    any ideas on how to “filter the embedding for the unique words in your training data” as mentioned in the tutorial?

    • Avatar
      Jason Brownlee October 8, 2017 at 8:32 am #

      The mapping of word to vector dictionary is built into Gensim, you can access it directly to retrieve the representations for the words you want: model.wv.vocab

      • Avatar
        mahna April 28, 2018 at 2:31 am #

        HI Jason,
        I am really appreciated the time U spend to write this tutorial and also replying.
        My question is about “model.wv.vocab” you wrote. is it an address site?
        It does not work actually.

  5. Avatar
    Abbey October 8, 2017 at 2:19 am #

    Hi, Jason

    Good day.

    I just need your suggestion and example. I have two different dataset, where one is structured and the other is unstructured. The goal is to use the structured to construct a representation for the unstructured, so apply use word embedding on the two input data but how can I find the average of the two embedding and flatten it to one before feeding the layer into CNN and LSTM.

    Looking forward to your response.

    • Avatar
      Jason Brownlee October 8, 2017 at 8:40 am #

      Sorry, what was your question?

      If your question was if this is a good approach, my advice is to try it and see.

      • Avatar
        Abiodun Modupe October 9, 2017 at 7:46 pm #

        Hi, Jason
        How can I find the average of the word embedding from the two input?

        • Avatar
          Jason Brownlee October 10, 2017 at 7:43 am #

          Perhaps you could retrieve the vectors for each word and take their average?

          Perhaps you can use the Gensim API to achieve this result?

      • Avatar
        Rafael Sá June 17, 2019 at 2:47 am #

        Hi Jason,

        I have a set of documents(1200 text of movie Scripts) and i want to use pretrained embeddings. But i want to update the vocabulary and train again adding the words of my corpus. Is that possible ?

        • Avatar
          Jason Brownlee June 17, 2019 at 8:24 am #


          Load the pre-trained vectors. Add new random vectors for the new words in the vocab and train the whole lot together.

  6. Avatar
    Vinu October 9, 2017 at 5:54 pm #

    Hi Jason…Could you also help us with R codes for using Pre-Trained GloVe Embedding

    • Avatar
      Jason Brownlee October 10, 2017 at 7:43 am #

      Sorry, I don’t have R code for word embeddings.

  7. Avatar
    Hao October 12, 2017 at 5:49 pm #

    Hi Jason, really appreciate that you answered all the replies! I am planning to try both CNN and RNN (maybe LSTM & GRU) on text classification. Most of my documents are less than 100 words long, but about 5 % are longer than 500 words. How do you suggest to set the max length when using RNN?If I set it to be 1000, will it degrade the learning result? Should I just use 100? Will it be different in the case of CNN?
    Thank you!

  8. Avatar
    Michael October 13, 2017 at 10:22 am #

    I’d like to thank you for this post. I’ve been struggling to understand this precise way of using keras for a week now and this is the only post I’ve found that actually explains what each step in the process is doing – and provides code that self-documents what the data looks like as the model is constructed and trained. This makes it so much easier to adapt to my particular requirements.

  9. Avatar
    Azim October 17, 2017 at 5:47 pm #

    In the above Keras example, how can we predict a list of context words given a word? Lets say i have a word named ‘sudoku’ and want to predict the sourrounding words. how can we use word2vec from keras to do that?

  10. Avatar
    Willie October 21, 2017 at 5:55 pm #

    Hi Jason,

    Thanks for your useful blog I have learned a lots.

    I am wondering if I already have pretrained word embedding, is that possible to set keras embedding layer trainable to be true? If it is workable, will I get a better result, when I only use small size of data to pretrain the word embedding model. Many thanks!

    • Avatar
      Jason Brownlee October 22, 2017 at 5:16 am #

      You can. It is hard to know whether it will give better results. Try it.

  11. Avatar
    cam October 28, 2017 at 5:34 am #

    Hey Jason,

    Is it possible to perform probability calculations on the label? I am looking at a case where it is not simply +/- but that a given data entry could be both but more likely one and not the other.

    • Avatar
      Jason Brownlee October 29, 2017 at 5:48 am #

      Yes, a neural network with a sigmoid or softmax output can predict a probability-like score for each class.

      • Avatar
        David Stancu November 3, 2017 at 6:10 am #

        I’m doing something like this except with my own feature vectors — but to the point of the labels — I do ternary classification using categorical_crossentropy and a softmax output. I get back an answer of the probability of each label.

  12. Avatar
    Ravil November 3, 2017 at 5:43 am #

    Hey Jason!

    Thanks for a wonderful and detailed explanation of the post. It helped me a lot.

    However, I’m struggling to understand how the model predicts a sentence as positive or negative.
    i understand that each word in the document is converted into a word embedding, so how does our model evaluate the entire sentence as positive or negative? Does it take the sum of all the word vectors? Perhaps average of them? I’ve not been able to figure this part out.

    • Avatar
      Jason Brownlee November 3, 2017 at 2:13 pm #

      Great question!

      The model interprets all of the words in the sequence and learns to associate specific patterns (of encoded words) with positive or negative sentiment

      • Avatar
        Ken April 2, 2018 at 8:37 am #

        Hi Jason,

        Thanks a lot for your amazing posts. I have the same question as Ravil. Can you elaborate a bit more on “learns to associate specific patterns?”

        • Avatar
          Jason Brownlee April 2, 2018 at 2:49 pm #

          Good question Ken, perhaps this post will make it clearer how ml algorithms work (a functional mapping):

          Does that help?

          • Avatar
            Ken April 2, 2018 at 10:40 pm #

            Thanks for your reply. But I was trying to ask is that how does keras manage to produce a document level representation by having the vectors of each word? I don’t seem to find how was this being done in the code.


          • Avatar
            Jason Brownlee April 3, 2018 at 6:34 am #

            The model such as the LSTM or CNN will put this together.

            In the case of LSTMs, you can learn more here:

            Does that help?

          • Avatar
            Alexi September 27, 2018 at 1:48 am #

            Hi Jason,

            First, thanks for all your really useful posts.

            If I understand well your post and answers to Ken and Ravil, the neural network you build in fact reduces the sequence of embedding vectors corresponding to all the words of a document to a one-dimensional vector with the Flatten layer, and you just train this flattening, as well as the embedding, to get the best classification on your training set, isn’t it?

            Thank you in advance for your answer.

          • Avatar
            Jason Brownlee September 27, 2018 at 6:04 am #

            Sort of.

            words => integers => embedding

            The embedding has a vector per word which the network will use as a representation for the word.

            We have a sequence of vectors for the words, so we flatten this sequence to one long vector for the Dense model to read. Alternately, we could wrap the dense in a timedistributed layer.

          • Avatar
            Alexi September 27, 2018 at 5:49 pm #

            Aaah! So nothing tricky is done when flattening, more or less just concatenating the fixed number of embedding vectors that is the output of the embedding layer, and this is why the number of words per document has to be fixed as a setting of this layer. If this is correct, I think I’m finally understanding how all this works.

            I’m sorry to bother you more, but how does the network works if a document much shorter than the longest document (the number of its words being set as the number of words per document to the embedding layer) is given to the network as training or testing? It just fills the embedding vectors of this non-appearing words as 0? I’ve been looking for ways to convert all the word embeddings of a text to some sort of document embedding, and this just seems a solution too simple to work, or that may work but for short documents (as well as other options like averaging the word vectors or taking the element-wise maximum of minimum).

            I’m trying to do sentiment analysis for spanish news, and I have news with like 1000 or more words, and wanted to use pre-trained word embeddings of 300 dimensions each. Wouldn’t it be a size way too huge per document for the network to train properly, or fast enough? I imagine you do not have a precise answer, but I’d like to know if you have tried the above method with long documents, or know that someone has.

            Thank you again, I’m sorry for such a long question.

          • Avatar
            Jason Brownlee September 28, 2018 at 6:07 am #


            We can use padding for small documents and a Masking input layer to ignore padded values. More here:

            Try different sized embeddings and use results to guide the configuration.

          • Avatar
            Alexi October 1, 2018 at 5:26 pm #

            Okay, thank you very much! I will give it a try.

  13. Avatar
    chengcheng November 9, 2017 at 2:56 am #

    the chinese word how to vector sequence

    • Avatar
      Jason Brownlee November 9, 2017 at 10:03 am #


    • Avatar
      lstmbot December 16, 2017 at 10:30 pm #

      me bot trying interact comments born with lstm

  14. Avatar
    Hilmi Jauffer November 16, 2017 at 4:30 pm #

    Hi Jason,
    I have successfully trained a model using the word embedding and Keras. The accuracy was at 100%.

    I saved the trained model and the word tokens for predictions.‘model.h5’, True)

    TOKENIZER = Tokenizer(num_words=MAX_NB_WORDS)
    pickle.dump(TOKENIZER, open(‘tokens’, ‘wb’))

    When predicting:
    – Load the saved model.
    – Setup the tokenizer, by loading the saved word tokens.
    – Then predict the category of the new data.

    I am not sure the prediction logic is correct, since I am not seeing the expected category from the prediction.

    The source code is in Github:

    Appreciate if you can have a look and let me know what I am missing.

    Best regards,

    • Avatar
      Jason Brownlee November 17, 2017 at 9:20 am #

      Your process sounds correct. I cannot review your code sorry.

      What was the problem exactly?

      • Avatar
        Tony July 11, 2018 at 8:00 am #

        Thank you, Jason! Your examples are very helpful. I hope to get your attention with my question. At training, you prepare the tokenizer by doing:

        t = text.Tokenizer();

        Which creates a dictionary of words:numbers. What do we do if we have a new doc with lost of new words at prediction time? Will all these words go the unknown token? If so, is there a solution for this, like can we fit the tokenizer on all the words in the English vocab?

        • Avatar
          Jason Brownlee July 11, 2018 at 2:52 pm #

          You must know the words you want to support at training time. Even if you have to guess.

          To support new words, you will need a new model.

  15. Avatar
    Fabrício Melo November 17, 2017 at 7:35 am #

    Hello Jason!

    In Example of Using Pre-Trained GloVe Embedding, do you use the word embedding vectors as weights of the embedding layer?

  16. Avatar
    Alex November 21, 2017 at 11:15 pm #

    Very nice set of Blogs of NLP and Keras – thanks for writing them.

    As a quick note for others

    When I tried to load the glove file with the line:
    f = open('../glove_data/glove.6B/glove.6B.100d.txt')

    I got the error
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2776: character maps to <undefined>

    To fix I added:
    f = open('../glove_data/glove.6B/glove.6B.100d.txt', encoding="utf8")

    This issue may have been caused by using Windows.

    • Avatar
      Jason Brownlee November 22, 2017 at 11:12 am #

      Thanks for the tip Alex.

      • Avatar
        Liliana November 26, 2017 at 12:00 pm #

        Hi Jason,

        Wonderful tutorials!

        I have a question. Why do we have to one-hot vectorize the labels? Also, if I have a pad sequence of ex. [2,4,0] what the one hot will be? I’m trying to understand better one hot vectorzer.

        I appreciate your response!

  17. Avatar
    Wassim November 28, 2017 at 1:06 am #

    Hi Jason,
    Thank you for your excellent tutorial. Do you know if there is a way to build a network for a classification using both text embedded data and categorical data ?
    Thank you

  18. Avatar
    ashish December 2, 2017 at 8:29 pm #

    How to do sentence classification using CNN in keras ? please help

  19. Avatar
    Stuart December 7, 2017 at 12:11 am #

    Fantastic explanation, thanks so much. I’m just amazed at how much easier this has become since the last time I looked at it.

  20. Avatar
    Stuart December 11, 2017 at 3:30 pm #

    Hi Jason …the 14 word vocab from your docs is “well done good work great effort nice excellent weak poor not could have better” For a vocab_size of 14, this one_shot encodes to [13 8 7 6 10 13 3 6 10 4 9 2 10 12]. Why does 10 appear 3 times, for “great”, “weak” and “have”?

    • Avatar
      Jason Brownlee December 11, 2017 at 4:54 pm #

      Sorry, I don’t follow Stuart. Could you please restate the question?

  21. Avatar
    Stuart December 11, 2017 at 9:34 pm #

    Hi Jason, the encodings that I provided in the example above came from kerasR with a vocab_size of 14. So let me ask the same question about the uniqueness of encodings using your Part 3 one_hot example above with a vocab_size of 50.
    Here different encodings are produced for different new kernels (using Spyder3/Python 3.4):
    [[31, 33], [27, 33], [48, 41], [34, 33], [32], [5], [14, 41], [43, 27], [14, 33], [22, 26, 33, 26]]
    [[6, 21], [48, 44], [7, 26], [46, 44], [45], [45], [10, 26], [45, 48], [10, 44], [47, 3, 21, 27]]
    [[7, 8], [16, 42], [24, 13], [45, 42], [23], [17], [34, 13], [13, 16], [34, 42], [17, 31, 8, 19]]

    Pleas note that in the first line, “33” encodes for the words “done”, “work”, “work”, “work” & “done”. In the second line “45” encodes for the words “excellent” & “weak” & “not”. In the third line, “13” encodes “effort”, “effort” & “not”.

    So I’m wondering why the encodings are not unique? Secondly, if vocab_size must be much larger then the actual size of the vocabulary?


  22. Avatar
    Nadav December 25, 2017 at 8:28 am #

    Great article Jason.
    How do you convert back from an embedding to a one-hot? For example if you have a seq2seq model, and you feed the inputs as word embeddings, in your decoder you need to convert back from the embedding to a one-hot representing the dictionary. If you do it by using matrix multiplication that can be quite a large matrix (e.g embedding size 300, and vocab of 400k).

    • Avatar
      Jason Brownlee December 26, 2017 at 5:12 am #

      The output layer can predict integers directly that you can map to words in your vocabulary. There would be no embedding layer on the output.

  23. Avatar
    Hitkul January 9, 2018 at 9:05 pm #

    Very helpful article.
    I have created word2vec matrix of a sentence using gensim and pre-trained Google News vector. Can I just flatten this matrix to a vector and use that as a input to my neural network.
    For example:
    each sentence is of length 140 and I am using a pre-trained model of 100 dimensions, therefore:- I have a 140*100 matrix representing the sentence, can i just flatten it to a 14000 length vector and feed it to my input layer?

  24. Avatar
    Paul January 12, 2018 at 6:40 pm #

    Great article, could you shed some light on how do Param # of 400 and 1500 in two neural networks come from? Thanks

    • Avatar
      Paul Lo January 12, 2018 at 9:55 pm #

      Oh! Is it just vocab_size * # of dimension of embedding space?
      1. 50 * 8 = 400
      2. 15* 100 = 1500

    • Avatar
      Jason Brownlee January 13, 2018 at 5:31 am #

      What do you mean exactly?
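
The arithmetic in Paul Lo’s reply can be checked with a short sketch. It assumes an Embedding layer stores one weight per (word, dimension) pair and a Dense layer stores one weight per input plus one bias per unit. The helper names are hypothetical and not part of Keras.

```python
def embedding_params(vocab_size, embedding_dim):
    # One learned value for every word in every embedding dimension.
    return vocab_size * embedding_dim


def dense_params(n_inputs, n_units):
    # One weight per input plus one bias, for each output unit.
    return (n_inputs + 1) * n_units


print(embedding_params(50, 8))    # 400
print(embedding_params(15, 100))  # 1500
print(dense_params(4 * 8, 1))     # 33 (flattened 4 words x 8 dimensions)
```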

  25. Avatar
    Andy Brown January 14, 2018 at 2:02 pm #

    Great post! I’m working with my own corpus. How would I save the weight vector of the embedding layer in a text file like the glove data set?

    My thinking is it would be easier for me to apply the vector representations to new data sets and/or machine learning platforms (mxnet etc) and make the output human readable (since the word is associated with the vector).

    • Avatar
      Jason Brownlee January 15, 2018 at 6:57 am #

      You could use get_weights() in the Keras API to retrieve the vectors and save directly as a CSV file.
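
A minimal sketch of that suggestion, writing one word per line followed by its vector, similar in layout to the GloVe text files. The NumPy array below is a stand-in for the matrix you would get from the embedding layer (for example `model.layers[0].get_weights()[0]`, assuming the Embedding is the first layer), and `word_index` is a hypothetical word-to-integer mapping such as the one a Tokenizer produces.

```python
import numpy as np


def save_embedding_as_text(weights, word_index, path):
    """Write 'word v1 v2 ... vn' lines, ordered by the word's integer index."""
    with open(path, "w", encoding="utf-8") as f:
        for word, i in sorted(word_index.items(), key=lambda kv: kv[1]):
            values = " ".join("%.6f" % v for v in weights[i])
            f.write("%s %s\n" % (word, values))


# Stand-in for the learned weights, shape (vocab_size, embedding_dim).
weights = np.arange(12, dtype="float32").reshape(3, 4)
# Hypothetical mapping; row 0 is left unused for padding.
word_index = {"good": 1, "work": 2}

save_embedding_as_text(weights, word_index, "embedding_vectors.txt")

with open("embedding_vectors.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines[0])  # good 4.000000 5.000000 6.000000 7.000000
```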

  26. Avatar
    jae January 17, 2018 at 8:10 am #

    Clear Short Good reading, always thank you for your work!

  27. Avatar
    Murali Manohar January 17, 2018 at 4:58 pm #

    Hello Jason,
    I have a dataset with which I’ve attained a 0.87 F-score by 5-fold cross-validation using an SVM. The maximum context window is 20 and the input is one hot encoded.

    Now, I’ve done whatever has been mentioned and getting an accuracy of 13-15 percent for RNN models where each one has one LSTM cell with 3,20,150,300 hidden units. Dimensions of my pre-trained embeddings is 300.

    Loss is decreasing and even reaching negative values, but no change in accuracy.

    I’ve tried the same with the CNN and basic ANN models you’ve mentioned for text classification.

    Would you please suggest some solution. Thanks in advance.

  28. Avatar
    Carsten January 17, 2018 at 8:35 pm #

    When I copy the code of the first box I get the error:

    AttributeError: ‘int’ object has no attribute ‘ndim’

    in the line :, labels, epochs=50, verbose=0)

    Where is the problem?

    • Avatar
      Jason Brownlee January 18, 2018 at 10:07 am #

      Copy the code from the “complete example”.

      • Avatar
        Thiziri February 8, 2018 at 12:45 am #

        Hi Jason,
        I’ve got the same error, also while running the “complete example”.
        What can be the cause?

        • Avatar
          Gokul February 8, 2018 at 6:49 pm #

          Try casting the labels to numpy arrays.

        • Avatar
          soren February 9, 2018 at 6:31 am #

          i get the same!

          • Avatar
            Jason Brownlee February 9, 2018 at 9:22 am #

            I have fixed and updated the examples.

    • Avatar
      ademyanchuk February 8, 2018 at 3:34 pm #

      Carsten, you need labels to be numpy.array not just list.
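
A small sketch of the fix suggested in this thread. A plain Python list of integers has no `ndim` attribute, whereas a NumPy array does. The label values below are assumed to match the ten toy documents in the tutorial.

```python
import numpy as np

# Labels as a plain list: the individual items are ints with no 'ndim'.
labels_list = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(hasattr(labels_list[0], "ndim"))  # False

# Casting to a NumPy array gives an object with shape information.
labels = np.array(labels_list)
print(labels.ndim, labels.shape)  # 1 (10,)
```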

  29. Avatar
    Willie January 17, 2018 at 9:18 pm #

    Hi Jason,

    If I have unknown words in the training set, how can I assign the same randomly initialized vector to all of the unknown words when using a pre-trained vector model like GloVe or w2v? Thanks!

    • Avatar
      Jason Brownlee January 18, 2018 at 10:08 am #

      Why would you want to do that?

      • Avatar
        Willie January 18, 2018 at 1:05 pm #

        If my data is in a specific domain and I still want to leverage a general word embedding model (e.g. glove.6b.100d trained on wiki), then there must be some OOV words in the domain data. So no matter whether it is training time or inference time, some unknown words will probably appear.

        • Avatar
          Jason Brownlee January 19, 2018 at 6:27 am #

          It may.

          You could ignore these words.

          You could create a new embedding, set vectors from existing embedding and learn the new words.
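
One way to sketch the idea from this thread: copy vectors for known words from the pre-trained embedding and give every out-of-vocabulary word the same randomly initialized vector. The function name, the toy `pretrained` dictionary, and the made-up words are all hypothetical.

```python
import numpy as np


def build_embedding_matrix(word_index, pretrained, dim, seed=1):
    """Return (matrix, oov_words); all OOV words share one random vector."""
    rng = np.random.default_rng(seed)
    oov_vector = rng.normal(scale=0.1, size=dim)
    # Row 0 stays all zeros and is reserved for padding.
    matrix = np.zeros((len(word_index) + 1, dim))
    oov_words = []
    for word, i in word_index.items():
        vector = pretrained.get(word)
        if vector is None:
            matrix[i] = oov_vector
            oov_words.append(word)
        else:
            matrix[i] = vector
    return matrix, oov_words


# Toy stand-in for vectors loaded from a GloVe or word2vec file.
pretrained = {"good": np.array([0.1, 0.2, 0.3]), "work": np.array([0.4, 0.5, 0.6])}
word_index = {"good": 1, "work": 2, "zyxel": 3, "qwop": 4}

matrix, oov_words = build_embedding_matrix(word_index, pretrained, dim=3)
print(oov_words)  # ['zyxel', 'qwop']
```

If the embedding layer is later trained, these rows would be updated independently; mapping every unknown word to a single reserved integer is one way to keep them identical.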

  30. Avatar
    Vladimir January 21, 2018 at 1:46 pm #

    Amazing Dr. Jason!
    Thanks for a great walkthrough.

    Kindly advise on the following.
    On the step of encoding each word to integer you said: “We could experiment with other more sophisticated bag of word model encoding like counts or TF-IDF”. Could you kindly elaborate on how can it be implemented, as tfidf encodes tokens with floats. And how to tie it with Keras, passing it to an Embedding layer please? I’m keen to experiment with it, hope it could yield better results.

    Another question is about input docs. Suppose I’ve preprocessed text by the means of nltk up to lemmatization, thus, each sample is a list of tokens. What is the best approach to pass it to Keras embedding layer in this case then?

    • Avatar
      Jason Brownlee January 22, 2018 at 4:42 am #

      I have more posts on BoW, learn more here:

      You can encode your tokens as integers manually or use the Keras Tokenizer.
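
A minimal sketch of the manual option for documents that are already lists of tokens (for example after lemmatization). The helper names are hypothetical. Integers start at 1 so that 0 stays free for padding.

```python
def build_word_index(tokenized_docs):
    """Assign each distinct token a unique integer, starting at 1."""
    word_index = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in word_index:
                word_index[token] = len(word_index) + 1
    return word_index


def encode(tokenized_docs, word_index):
    """Replace every token with its integer."""
    return [[word_index[t] for t in tokens] for tokens in tokenized_docs]


docs = [["well", "do"], ["good", "work"], ["poor", "work"]]
word_index = build_word_index(docs)
encoded = encode(docs, word_index)
print(encoded)  # [[1, 2], [3, 4], [5, 4]]
```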

      • Avatar
        Vladimir January 22, 2018 at 11:54 pm #

        Well, Keras Tokenizer can accept only texts or sequences. Seems the only way is to glue tokens together using ‘ ‘.join(token_list) and then pass onto the Tokenizer.

        As for the BOW articles, I’ve worked through them and they are very valuable. Thank you!

        Using BOW differs so much from using Embeddings. As BOW would introduce huge sparse array of features for each sample, while Embeddings aim to represent those features (tokens) very densely up to a few hundreds items.

        So, BOW in the other article gives incredibly good results with just very simple NN architecture (1 layer of 50 or 100 neurons). While I struggled to get good results using Embeddings along with convolutional layers…
        From your experience, would you please advise on that? Are Embeddings actually viable and is it just a matter of finding the correct architecture?

        • Avatar
          Jason Brownlee January 23, 2018 at 8:01 am #

          Nice! And well done for working through the tutorials. I love to see that and few people actually “do the work”.

          Embeddings make more sense on larger/hard problems generally – e.g. big vocab, complex language model on the front end, etc.

          • Avatar
            Vladimir January 23, 2018 at 10:15 am #

            I see, thank you.

  31. Avatar
    joseph January 31, 2018 at 9:20 pm #

    Thanks jason for another great tutorial.

    I have some questions :

    Isn’t the definition of one hot a binary one, a vector of 0s and 1s?
    So [[1,2]] would be encoded to [[0,1,0],[0,0,1]].

    How is the embedding algorithm done in Keras: word2vec/GloVe, or simply a dense layer (or something else)?

    • Avatar
      Jason Brownlee February 1, 2018 at 7:20 am #

      Sorry, I don’t follow your question. Perhaps you could rephrase it?

  32. Avatar
    Anna February 4, 2018 at 9:40 pm #

    Amazing Dr. Jason!
    Thanks for a great walkthrough.

    The dimension for each word vector like above example e.g. 8, is set randomly?

    Thank you

    • Avatar
      Jason Brownlee February 5, 2018 at 7:45 am #

      The dimensionality is fixed and specified.

      In the first example it is 8, in the second it is 100.

      • Avatar
        Anna February 5, 2018 at 2:17 pm #

        Thank you Dr. Jason for your quick feedback!

        Ok, I see that the pre-trained word embedding is set to 100 dimensionality because the original file “glove.6B.100d.txt” contained a fixed number of 100 weights for each line of ASCII.

        However, the first example as you mentioned in here, “The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.”

        You chose 8 dimensions for the first example. Does that mean it can be set to any number other than 8? I’ve tried changing the dimension to 12. It didn’t produce any errors, but the accuracy dropped from 100% to 89%.

        Layer (type)              Output Shape     Param #
        embedding_1 (Embedding)   (None, 4, 12)    600
        flatten_1 (Flatten)       (None, 48)       0
        dense_1 (Dense)           (None, 1)        49
        Total params: 649
        Trainable params: 649
        Non-trainable params: 0

        Accuracy: 89.999998

        So, how is the dimensionality set? Does the number of dimensions affect the accuracy?

        Sorry, I am trying to grasp the basic concepts of NLP. Your help is much appreciated, Dr. Jason.

        Thank you

        • Avatar
          Jason Brownlee February 5, 2018 at 2:54 pm #

          No problem, please ask more questions.

          Yes, you can choose any dimensionality you like. Larger means more expressive, required for larger vocabs.

          Does that help Anna?

          • Avatar
            Anna February 5, 2018 at 4:40 pm #

            Yes indeed, Dr. Now I can see that the dimensionality is chosen depending on the size of the vocabulary.

            Thank you again, Dr. Jason, for enlightening me! 🙂

          • Avatar
            Jason Brownlee February 6, 2018 at 9:11 am #

            You asked good questions.

  33. Avatar
    Miroslav February 5, 2018 at 7:09 am #

    Hi Jason,
    thanks for amazing tutorial.

    I have a question. I am trying to do semantic role labeling with context window in Keras. How can I implement context window with embedding layer?

    Thank you

  34. Avatar
    Gabriel February 6, 2018 at 4:08 am #

    Hi, great website! I’ve been learning a lot from all the tutorials. Thank you for providing all these easy to understand information.

    How would I go about using other data for the CNN model? At the moment, I am using just textual data for my model using the word embeddings. From what I understand, the first layer of the model has to be the Embeddings, so how would I use other input data such as integers along with the Embeddings?

    Thank you!

  35. Avatar
    Aditya February 6, 2018 at 4:59 am #

    Hi Jason, this tutorial is simple and easy to understand. Thanks.

    However, I have a question. While using the pre-trained embedding weights such as Glove or word2vec, what if there exists few words in my dataset, which weren’t present in the dataset on which word2vec or Glove was trained. How does the model represent such words?

    My understanding is that in your second section (Using Pre-Trained GloVe Embedding), you are mapping the words from the loaded weights to the words present in your dataset, hence the question above.

    Correct me, if it’s not the way I think it is.

    • Avatar
      Jason Brownlee February 6, 2018 at 9:22 am #

      You can ignore them, or assign them to zero vector, or train a new model that includes them.

      • Avatar
        Serhiy February 7, 2019 at 10:52 pm #

        Hi Jason. Thanks for this and other VERY clear and informative articles.
        Wanted to add one question to Aditya post:

        “mapping the words from the loaded weights to the words present in your dataset”

        How is the mapping done? Does it use matrix index == word number from (padded_docs or)?

        I am asking because: what if I pass embedding_matrix in its original order, but shuffle padded_docs beforehand?

        • Avatar
          Jason Brownlee February 8, 2019 at 7:49 am #

          Words must be assigned unique integers that remain consistent across all data and embeddings.
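
To illustrate the point, here is a NumPy sketch of the lookup, assuming the embedding layer behaves like row indexing into the weight matrix: the integer stored for a word selects the row. Shuffling the order of the documents therefore does not change which vector each word receives. The matrix values are made up.

```python
import numpy as np

# Row i holds the vector for the word encoded as integer i (row 0 = padding).
embedding_matrix = np.array([
    [0.0, 0.0],
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
])

# Three padded documents of integer-encoded words.
padded_docs = np.array([[1, 2, 0], [3, 1, 0], [2, 2, 3]])


def embed(docs):
    # Each integer selects the matching row of the matrix.
    return embedding_matrix[docs]


order = np.array([2, 0, 1])   # a shuffled document order
shuffled = padded_docs[order]

print(embed(padded_docs).shape)  # (3, 3, 2)
print(np.array_equal(embed(shuffled), embed(padded_docs)[order]))  # True
```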

  36. Avatar
    Han February 20, 2018 at 4:03 pm #

    Hi Jason,

    I am trying to train a Keras LSTM model on some text sentiment data. I am also using GridSearchCV in sklearn to find the best parameters. I am not quite sure what went wrong but the classification report from sklearn says:

    UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

    Below is what the classification report looks like:

                 precision    recall  f1-score   support

    negative          0.00      0.00      0.00        98
    positive          0.70      1.00      0.83       232

    avg / total       0.49      0.70      0.58       330

    Do you know what the problem is?

    • Avatar
      Jason Brownlee February 21, 2018 at 6:35 am #

      Perhaps you are trying to use keras metrics with sklearn? Watch the keywords you use when specifying the keras model vs the sklearn evaluation (CV).

  37. Avatar
    Alberto Nogales Moyano March 2, 2018 at 2:10 am #

    Hi Jason,
    your blog is really really interesting. I have a question: What is the difference between using word2vec and texts_to_sequences from the Tokenizer in Keras? I mean in the way the texts are represented.
    Is any of the two options better than the other?
    Thanks a lot.
    Kind regards.

    • Avatar
      Jason Brownlee March 2, 2018 at 5:33 am #

      word2vec encodes words (integers) to vectors. texts_to_sequences encodes words to integers. It is a step before word2vec or a step before bag of words.

  38. Avatar
    Souraj Adhikary March 2, 2018 at 5:39 pm #

    Hi Jason,

    I have a dataframe which contains texts and corresponding labels. I have used gensim module and used word2vec to make a model from the text. Now I want to use that model for input into Conv1D layers. Can you please tell me how to load the word2vec model in Keras Embedding layer? Do I need to pre-process the model in some way before loading? Thanks in advance.

    • Avatar
      Jason Brownlee March 3, 2018 at 8:07 am #

      Yes, load the weights into an Embedding layer, then define the rest of your network.

      The tutorial above will help a lot.

  39. Avatar
    Ankush Chandna March 13, 2018 at 1:57 am #

    This is really helpful. You make us awesome at what we do. Thanks!!

  40. Avatar
    R.L. March 20, 2018 at 4:13 pm #

    Thank you for this extremely helpful blog post. I have a question regarding to interpreting the model. Is there a way to know / visualize the word importance after the model is trained? I am looking for a way to do so. For instance, is there a way to find like the top 10 words that would trigger the model to classify a text as negative and vice versa? Thanks a lot for your help in advance

    • Avatar
      Jason Brownlee March 21, 2018 at 6:31 am #

      There may be methods, but I am not across them. Generally, neural nets are opaque, and even weight activations in the first/last layers might be misleading if used as importance scores.

  41. Avatar
    Mohit March 30, 2018 at 9:35 pm #

    Hi Jason ,

    Can you please tell me the logic behind this:

    vocab_size = len(t.word_index) + 1

    Why we added 1 here??

    • Avatar
      Jason Brownlee March 31, 2018 at 6:36 am #

      So that the word indexes are 1-offset, and 0 is reserved for padding / no data.
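
A tiny sketch of why the extra row is needed, using a hypothetical three-word `word_index`. The word integers run from 1 to 3, so an embedding with only 3 rows (indexes 0 to 2) could not hold the last word.

```python
# Hypothetical mapping in the style of a Tokenizer's word_index (1-offset).
word_index = {"work": 1, "good": 2, "done": 3}

largest_index = max(word_index.values())
vocab_size = len(word_index) + 1

# Valid row indexes are 0 .. vocab_size - 1; 0 is reserved for padding.
print(largest_index < len(word_index))  # False: 3 rows would be too few
print(largest_index < vocab_size)       # True: 4 rows cover indexes 0..3
```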

  42. Avatar
    Ryan March 31, 2018 at 9:59 pm #

    Hi Jason,

    If I want to use this model to predict next word, can I just change the output layer to Dense(100, activation = ‘linear’) and change the loss function to MSE?

    Many thanks,


  43. Avatar
    Coach April 17, 2018 at 10:23 pm #

    Thanks for this tutorial! Really clear and useful!

  44. Avatar
    Maryam April 28, 2018 at 4:53 am #

    Hi Jason,
    U R the best in keras tutorials and also replying the questions. I am really grateful.
    Although I have understood the content of the context and codes U have written above, I am not able to understand what you mean about this sentence:[It might be better to filter the embedding for the unique words in your training data.].
    what does “to filter the embedding” mean??
    Thank you for replying.

    • Avatar
      Jason Brownlee April 28, 2018 at 5:34 am #

      It means, only have the words in the embedding that you know exist in your dataset.

      • Avatar
        Maryam April 28, 2018 at 11:45 pm #

        Hi Jason,
        Thank U 4 replying but as I am not a native English speaker, I am not sure whether I got it or not. You mean to remove all the words which exist in the glove but do not exist in my own dataset?? in order to raise the speed of implementation?
        I am sorry to ask it again as I did not understand clearly.
        Thank U in advance Jason
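
As a rough illustration of what “filter the embedding” could look like: keep only the pre-trained vectors for words that occur in your own data, so the remaining GloVe entries are not held in memory. The function name and the toy values are hypothetical.

```python
def filter_embedding(embeddings_index, vocabulary):
    """Keep only the vectors for words that appear in the dataset vocabulary."""
    return {word: vec for word, vec in embeddings_index.items() if word in vocabulary}


# Toy stand-in for the dictionary loaded from a GloVe file.
embeddings_index = {
    "the": [0.1, 0.2],
    "good": [0.3, 0.4],
    "work": [0.5, 0.6],
    "zebra": [0.7, 0.8],
}
# Words that occur in the training documents.
vocabulary = {"good", "work"}

filtered = filter_embedding(embeddings_index, vocabulary)
print(sorted(filtered))  # ['good', 'work']
```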

  45. Avatar
    Aiza May 9, 2018 at 6:14 am #

    This post is great. I am new to machine learning, so I have a question which might be basic. From what I understand, the model takes the embedding matrix and text along with labels at once. What I am trying to do is concatenate a POS tag embedding with each pre-trained word embedding, but the POS tag can be different for the same word depending on the context. This essentially means that I can’t alter the embedding matrix and add it to the network’s embedding layer. I want to take each sentence, find its embedding, concatenate it with the POS tag embedding, and then feed it into the neural network. Is there a way to do the training sentence by sentence or something? Thanks

    • Avatar
      Jason Brownlee May 9, 2018 at 6:32 am #

      You might be able to use the embedding and pre-calculate the vector representations for each sentence.

      • Avatar
        Aiza May 9, 2018 at 7:06 am #

        Sorry, but I didn’t quite understand. Can you please elaborate a little?

        • Avatar
          Jason Brownlee May 9, 2018 at 2:54 pm #

          Sorry, I mean that you can prepare a word2vec model standalone. Then pass each sentence through it to build up a list of vectors. Concat the vectors together and you have a distributed sentence representation.

          • Avatar
            Aiza May 9, 2018 at 10:16 pm #

            Thanks a lot! One more thing: is it possible to pass other information to the embedding layer than just weights? For example, I was thinking: what if I don’t change the embedding matrix at all and create a separate matrix of POS tags for the whole training data, which is also passed to the embedding layer, which then concatenates both sequentially?

          • Avatar
            Jason Brownlee May 10, 2018 at 6:32 am #

            You could develop a model that has multiple inputs, for example see this post for examples:

  46. Avatar
    Aiza May 15, 2018 at 8:46 am #

    Thanks. I saw this post. Your model has separate inputs but they get merged after flattening. In my case I want to pass the embeddings to the first convolutional layer only after they are concatenated. Up until now, what I have done is create another integer-encoded sequence of my data according to POS tags (embedding_pos) to pass as another input, and another embedding matrix that contains the embeddings of all the POS tags.
    e=(Embedding(vocab_size, 50, input_length=23, weights=[embedding_matrix], trainable=False))
    e1=(Embedding(38, 38, input_length=23, weights=[embedding_matrix_pos], trainable=False))
    merged_input = concatenate([e,e1], axis=0)
    model_embed = Sequential()
    model_embed.add(merged_input),embedding_pos, final_labels, epochs=50, verbose=0)

    I know this is wrong, but I am not sure how to concatenate both sequences. If you can point me in the right direction, it would be great. The error is
    ‘Layer concatenate_6 was called with an input that isn’t a symbolic tensor. Received type: . Full input: [, ]. All inputs to the layer should be tensors.’

    • Avatar
      Jason Brownlee May 15, 2018 at 2:42 pm #

      Perhaps you could experiment and compare the performance of models with different merge layers for combining the inputs.
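
The shapes involved can be sketched with NumPy, under the assumption that each embedding layer acts as a row lookup. Word vectors of shape (batch, 23, 50) and POS vectors of shape (batch, 23, 38) are joined on the last (feature) axis rather than axis 0, so every time step carries both. The matrices below are random or identity stand-ins, not real embeddings. In Keras, the quoted error suggests the concatenation was applied to layer objects; the equivalent there would be to call each Embedding on its own input tensor and concatenate the resulting tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_pos_tags, seq_len = 100, 38, 23

# Stand-ins for the two weight matrices from the comment above.
word_matrix = rng.normal(size=(vocab_size, 50))
pos_matrix = np.eye(n_pos_tags)  # assumed one-hot style 38 x 38 POS matrix

# Two example sentences: integer word ids and integer POS tag ids.
word_ids = rng.integers(1, vocab_size, size=(2, seq_len))
pos_ids = rng.integers(0, n_pos_tags, size=(2, seq_len))

word_vectors = word_matrix[word_ids]  # shape (2, 23, 50)
pos_vectors = pos_matrix[pos_ids]     # shape (2, 23, 38)

# Join per time step along the feature axis.
merged = np.concatenate([word_vectors, pos_vectors], axis=-1)
print(merged.shape)  # (2, 23, 88)
```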

  47. Avatar
    Franco May 16, 2018 at 6:51 pm #

    Hi Jason, awesome post as usual!

    Your last sentence is tricky though. You write:

    “In practice, I would encourage you to experiment with learning a word embedding using a pre-trained embedding that is fixed and trying to perform learning on top of a pre-trained embedding.”

    Without the original corpus, I would argue, that’s impossible.

    In Google’s case, the original corpus of around 100 billion words is not publicly available. Solution? I believe you’re suggesting “Transfer Learning for NLP.” In this case, the only solution I see is to add manually words.

    E.g. you need ‘dolares’ which is not in Google’s Word2Vec. You want to have similar vectors as ‘money’. In this case, you add ‘dolares’ + 300 vectors from money. Very painful, I know. But it’s the only way I see to do “Transfer Learning with NLP”.

    If you have a better solution, I’d love your input.

    Cheerio, a big fan

    • Avatar
      Jason Brownlee May 17, 2018 at 6:30 am #

      Not impossible, you can use an embedding trained on another corpus and ignore the difference or fine tune the embedding while fitting your model.

      You can also add missing words to the embedding and learn those.

      Remember, we want a vector for each word that best captures its usage. Some inconsistencies do not result in a useless model; it is not a binary useful/useless case.

      • Avatar
        Franco May 17, 2018 at 4:42 pm #

        Thank you very much for the detailed answer!

  48. Avatar
    Ashar May 31, 2018 at 6:11 am #

    The link for the Tokenizer API is this same webpage. Can you update it please?

  49. Avatar
    Andreas Papandreou May 31, 2018 at 10:55 am #

    Hi Jason, great post!
    I have successfully trained my model using the word embedding and Keras. I saved the trained model and the word tokens. Now, in order to make some predictions, do I have to use the same tokenizer that I used in training?

  50. Avatar
    zedom June 2, 2018 at 5:15 pm #

    Hi Jason, when i was looking for how to use pre-trained word embedding,
    I found your article along with this one:
    They have many similarities.

  51. Avatar
    Jack June 13, 2018 at 10:58 pm #

    Hey jason,

    I am trying to do this but sometime keras gives the same integer to different words. Would it be better to use scikit encoder that converts words to integers?

    • Avatar
      Jason Brownlee June 14, 2018 at 6:08 am #

      This might happen if you are using a hash encoder, as Keras does, but calls it a one hot encoder.

      Perhaps try a different encoding scheme of words to integers
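
A sketch of why a hash-based encoder can give different words the same integer. It assumes the encoder computes a hash of the word modulo the number of available slots; `hashlib.md5` is used here only as a stable stand-in hash, and the helper names are hypothetical. When there are more distinct words than slots, collisions are unavoidable.

```python
import hashlib


def hash_encode(words, n):
    """Map each word to an integer in [1, n - 1] using a stable hash."""
    codes = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        codes.append(int(digest, 16) % (n - 1) + 1)
    return codes


def find_collisions(words, n):
    """Return {code: words} for every code shared by more than one word."""
    by_code = {}
    for word, code in zip(words, hash_encode(words, n)):
        by_code.setdefault(code, set()).add(word)
    return {code: ws for code, ws in by_code.items() if len(ws) > 1}


words = ["well", "done", "good", "work", "great", "effort", "nice",
         "excellent", "weak", "poor", "not", "could", "have", "better"]

# 14 distinct words but only 4 possible codes: some words must share a code.
collisions = find_collisions(words, 5)
print(len(collisions) >= 1)  # True
```

An explicit word-to-integer dictionary, as built by a tokenizer, avoids this because each word is given its own integer.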

  52. Avatar
    Meysam June 18, 2018 at 5:13 am #

    Hi Jason,
    I have implemented the above tutorial and the code works fine with GloVe. I am so grateful for the tutorial, Jason.
    but when I download GoogleNews-vectors-negative300.bin which is a pre-trained embedding word2vec it gave me this error:

    File “/home/mary/anaconda3/envs/virenv/lib/python3.5/site-packages/gensim/models/”, line 171, in __getitem__
    return vstack([self.get_vector(entity) for entity in entities])

    TypeError: ‘int’ object is not iterable.

    I wrote the code as the same as your code which you wrote for loading glove but with a little change.

    model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

    for line in model:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    print('Loaded %s word vectors.' % len(embeddings_index))

    embedding_matrix = zeros((vocab_dic_size, 300))
    for word in vocab_dic.keys():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[vocab_dic[word]] = embedding_vector

    I saw you wrote a tutorial about creating word2vec by yourself in this link “”,
    but I have not seen a tutorial about applying pre-trained word2vec like GloVe.
    Please guide me to solve the error and show how to apply the GoogleNews-vectors-negative300.bin pre-trained word2vec.
    I am so sorry to write a lot as I wanted to explain in detail to be clear.
    any guidance will be appreciated.

    • Avatar
      Jason Brownlee June 18, 2018 at 6:46 am #

      Perhaps try the text version as in the above tutorial?

      • Avatar
        Meysam June 19, 2018 at 12:28 am #

        Hi Jason
        thank you very much for replying, but as I am weak at the English language, the meaning of this sentence is not clear. what do you mean by “try the text version”??
        In fact, GloVe contains txt files and I implemented it correctly, but when I want to run a program with GoogleNews-vectors-negative300.bin, which is a pre-trained word2vec embedding, it gives me the error. Also, this file is a binary one and there is no pre-trained word2vec embedding file with a txt extension.
        can you help me though I know you are busy?

  53. Avatar
    Kavit Gangar June 18, 2018 at 7:17 pm #

    How can we use pre-trained word embedding on mobile?

    • Avatar
      Jason Brownlee June 19, 2018 at 6:29 am #

      I don’t see why not, other than disk/ram size issues.

  54. Avatar
    NewToDeepNLP June 28, 2018 at 12:17 pm #

    Great post! What changes are necessary if the labels are more than binary such as with 4 classes:
    labels = array([2,1,1,1,2,0,-1,0,-1,0])
    E.g. instead of ‘binary_crossentropy’ perhaps ‘categorical_crossentropy’?
    And how should the Dense layer change?
    If I use: model.add(Dense(4, activation=’sigmoid’)), I get an error:

    ValueError: Error when checking target: expected dense_1 to have shape (None, 4) but got array with shape (10, 1)

    thanks for your work!
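
The error above says the 4-unit output layer expects targets of shape (None, 4), while the labels have shape (10, 1). One common approach, sketched here with NumPy only, is to map the four label values to class ids 0 to 3 and one hot encode them. This would typically be paired with a softmax output and ‘categorical_crossentropy’, which is an assumption about the intended setup rather than something confirmed in this thread.

```python
import numpy as np

labels = np.array([2, 1, 1, 1, 2, 0, -1, 0, -1, 0])

# Sorted distinct label values: [-1, 0, 1, 2].
classes = np.unique(labels)

# Position of each label in 'classes' gives a class id in 0..3.
class_ids = np.searchsorted(classes, labels)

# One row per sample, one column per class.
one_hot = np.eye(len(classes))[class_ids]

print(class_ids.tolist())  # [3, 2, 2, 2, 3, 1, 0, 1, 0, 1]
print(one_hot.shape)       # (10, 4)
```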

  55. Avatar
    James June 28, 2018 at 9:58 pm #

    Hi Jason,

    For starters, thanks for this post. Ideal to get things going quickly. I have a couple of questions if you don’t mind:

    1) I don’t think that one-hot encoding the string vectors is ideal. Even with the recommended vocab size (50), I still got collisions which defeats the purpose even in a toy example such as this. Even the documentation states that uniqueness is not guaranteed. Keras’ Tokenizer(), which you used in the pre-trained example is a more reliable choice in that no two words will share the same integer mapping. How come you proposed one-hot encoding when Tokenizer() does the same job better?

    2) Getting Tokenizer()’s word_index property, returns the full dictionary. I expected the vocab_size to be equal to len(t.word_index) but you increment that value by one. This is in fact necessary because otherwise fitting the model fails. But I cannot get the intuition of that. Why is the input dimension size equal to vocab_size + 1?

    3) I created a model that expects a BoW vector representation of each “document”. Naturally, the vectors were larger and sparser [ (10,14) ] which means more parameters to learn, no. However, in your document you refer to this encoding or tf-idf as “more sophisticated”. Why do you believe so? With that encoding don’t you lose the word order which is important to learn word embeddings? For the record, this encoding worked well too but it’s probably due to the nature of this little experiment.

    Thank you in advance.

    • Avatar
      Jason Brownlee June 29, 2018 at 6:05 am #

      The keras one hot encoding method really just takes a hash. It is better to use a true one hot encoding when needed.

      I do prefer the Tokenizer class in practice.

      The words are 1-offset, leaving room for 0 for “unknown” word.

      tf-idf gives some idea of the frequency over the simple presence/absence of a word in BoW.

      Hope that helps.

  56. Avatar
    abbas July 12, 2018 at 3:27 am #

    where can i find the file “../glove_data/glove.6B/glove.6B.100d.txt”??because i come up with the following error.
    File “”, line 36
    f = open(‘../glove_data/glove.6B/glove.6B.100d.txt’)
    SyntaxError: invalid character in identifier

    • Avatar
      Jason Brownlee July 12, 2018 at 6:28 am #

      You must download it and place it in your current working directory.

      Perhaps re-read section 4.

  57. Avatar
    Harry July 17, 2018 at 1:02 pm #

    Excellent work! This is quite helpful to novice.
    And I wonder is this useful to other language apart from English? Since I am a Chinese, and I wonder whether I can apply this to Chinese language and vocabulary.
    Thanks again for your time devoted!

  58. Avatar
    sree harsha July 18, 2018 at 6:02 am #


    can you explain how can the word embeddings be given as hidden state input to LSTM?

    thanks in advance

    • Avatar
      Jason Brownlee July 18, 2018 at 6:39 am #

      Word embeddings don’t have hidden state. They don’t have any state.

      • Avatar
        sree harsha July 18, 2018 at 6:57 pm #

        for example, I have a word and its 50 (dimensional) embeddings. How can I give these embeddings as hidden state input to an LSTM layer?

        • Avatar
          Jason Brownlee July 19, 2018 at 7:47 am #

          Why would you want to give them as the hidden state instead of using them as input to the LSTM?

  59. Avatar
    zbk July 20, 2018 at 8:39 am #

    hi, what a practical post!
    I have a question, I work in a sentiment analysis project with word2vec as an embedding model with keras. my problem is when I want to predict a new sentence as an input I face this error:

    ValueError: Error when checking input: expected conv1d_1_input to have shape (15, 512) but got array with shape (3, 512)

    consider that I want to enter a simple sentence like:”I’m really sad” with the length 3- and my input shape has the length of 15- I don’t know how can I reshape it or doing what to get rid of this error.

    and this is the related part of my code:

    model = Sequential()
    model.add(Conv1D(32, kernel_size=3, activation=’elu’, padding=’same’, input_shape=(15, 512)))
    model.add(Conv1D(32, kernel_size=3, activation=’elu’, padding=’same’))
    model.add(Conv1D(32, kernel_size=3, activation=’elu’, padding=’same’))
    model.add(Conv1D(32, kernel_size=3, activation=’elu’, padding=’same’))

    model.add(Conv1D(32, kernel_size=2, activation=’elu’, padding=’same’))
    model.add(Conv1D(32, kernel_size=2, activation=’elu’, padding=’same’))
    model.add(Conv1D(32, kernel_size=2, activation=’elu’, padding=’same’))
    model.add(Conv1D(32, kernel_size=2, activation=’elu’, padding=’same’))

    model.add(Dense(256, activation=’relu’))
    model.add(Dense(256, activation=’relu’))
    model.add(Dense(2, activation=’softmax’))

    would you please help me to solve this problem?

    • Avatar
      Jason Brownlee July 21, 2018 at 6:27 am #

      You must prepare the new sentence in exactly the same way as the training data, including length and integer encoding.
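
For the shape mismatch in this thread, a minimal sketch of bringing a 3-word sentence of 512-dimensional vectors up to the expected length of 15 by appending zero vectors. The helper name is hypothetical, and padding at the end is an assumption; whichever side was padded during training should be used for new data too.

```python
import numpy as np


def pad_vectors(vectors, max_len):
    """Truncate or zero-pad a (n_words, dim) array to (max_len, dim)."""
    vectors = np.asarray(vectors, dtype="float32")[:max_len]
    padded = np.zeros((max_len, vectors.shape[1]), dtype="float32")
    padded[:len(vectors)] = vectors
    return padded


# Stand-in for the word2vec vectors of a 3-word sentence.
sentence = np.ones((3, 512), dtype="float32")

# Add a batch dimension so the model receives shape (1, 15, 512).
x = pad_vectors(sentence, 15)[np.newaxis, ...]
print(x.shape)  # (1, 15, 512)
```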

  60. Avatar
    zbk July 20, 2018 at 6:09 pm #

    At least would you mind sharing some suitable source for me to solve this problem please?
    I hope you answer my question as you always do. Thanks

    • Avatar
      Jason Brownlee July 21, 2018 at 6:32 am #

      I’m eager to help and answer specific questions, but I don’t have the capacity to write code for you sorry.

  61. Avatar
    Alalehrz July 28, 2018 at 5:13 am #

    Hi Jason,

    I have two questions. First, for new instances if the length is greater than this model input shall we truncate the sentence?
    Also, since the embedding input is just for the seen training set words, what happens to the predictions-to-word process? I assume it just returns some words similar to the training words not considering all the dictionary words. Of course I am talking about a language model with Glove.

    • Avatar
      Jason Brownlee July 28, 2018 at 6:40 am #

      Yes, truncate.

      Unseen words are mapped to nothing or 0.

      The training dataset MUST be representative of the problem you are solving.

      • Avatar
        az July 31, 2018 at 3:30 am #

        From what I understood from this comment, it is about prediction on test data. Let’s assume that there are 50 words in the vocabulary, which means sentences will have unique integers up to 50. Now, since test data must be tokenized with the same instance of the tokenizer, if it has some new words, they would get integers 51, 52 and so on. In this case, would the model automatically use 0 for the word embeddings, or can it raise an out-of-bounds type exception? Thanks

        • Avatar
          Jason Brownlee July 31, 2018 at 6:11 am #

          You would encode unseen words as 0.
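
A sketch of that encoding step, reusing the word-to-integer mapping built from the training data. Words that were not seen in training become 0, long inputs are truncated, and short inputs are padded with 0. The function name and the toy mapping are hypothetical.

```python
def encode_text(text, word_index, max_len):
    """Integer-encode text with a fixed mapping; unseen words become 0."""
    tokens = text.lower().split()
    ids = [word_index.get(token, 0) for token in tokens][:max_len]
    # Pad at the end so every encoded document has the same length.
    return ids + [0] * (max_len - len(ids))


# Mapping learned from the training data only.
word_index = {"good": 1, "work": 2, "poor": 3}

print(encode_text("Good solid work", word_index, 4))           # [1, 0, 2, 0]
print(encode_text("poor poor poor work good", word_index, 4))  # [3, 3, 3, 2]
```

Because new words are never given integers above the training vocabulary, the lookup stays within the rows of the embedding matrix.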

          • Avatar
            Yuedong Wu August 30, 2020 at 4:25 pm #

            All 50 vocabulary words should start from index 1 to 50, while leave 0 for unseen word in vocabulary. am I right?

          • Avatar
            Jason Brownlee August 31, 2020 at 6:07 am #


  62. Avatar
    Jaskaran July 29, 2018 at 11:21 pm #

    can this be described as transfer learning?

  63. Avatar
    eden July 31, 2018 at 1:07 am #

    I have trained and tested my own network. During my work, when I integer encoded the sentences and created a corresponding word embedding matrix, it included embeddings for the train, validation and test data as well.
    Now if I want to reload my model to test on some other similar data, I am confused about how the words from this new data would relate to the embedding matrix.
    You should have embeddings for the test data as well, right? Or do you exclude the test data when you create the embedding matrix? Thanks

    • Avatar
      Jason Brownlee July 31, 2018 at 6:09 am #

      The embedding is created from the training dataset.

      It should be sufficiently rich and representative to cover all the data you expect to see in the future.

      New data must have the same integer encoding as the training data prior to being mapped onto the embedding when making a prediction.

      Does that help?

      • Avatar
        eden July 31, 2018 at 8:04 am #

        Yes, I understand that I should be using the same tokenizer object for encoding both train and test data, but I am not sure how the embedding layer would behave for a word or index which isn’t part of the embedding matrix. Obviously test data would have similar words, but some words are bound to be new. Would you say it is right to include test data too when creating the embedding matrix for the model? If I want to predict using some pre-trained model, how can I deal with this issue? A small example would be really helpful. Thanks a lot for all the help and time!

        • Avatar
          Jason Brownlee July 31, 2018 at 2:55 pm #

          It won’t. The encoding will set unknown words to 0.

          It really depends on the goal of what you want to evaluate.
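
A small sketch of the idea discussed in this thread, written in plain Python rather than Keras. The helper names `build_word_index` and `encode` are hypothetical. It assumes you choose to map unseen words to 0 yourself; as far as I know, the Keras Tokenizer skips unseen words by default unless an `oov_token` is set, so check how your own encoder behaves.

```python
def build_word_index(train_docs):
    """Assign integers 1..N to words in order of first appearance in the training docs."""
    word_index = {}
    for doc in train_docs:
        for word in doc.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode(doc, word_index, unknown=0):
    """Encode a document with the training vocabulary; unseen words become `unknown`."""
    return [word_index.get(word, unknown) for word in doc.lower().split()]


train_docs = ["well done", "good work", "poor effort"]
word_index = build_word_index(train_docs)

# "amazing" was never seen in training, so it is encoded as 0.
encoded = encode("amazing work", word_index)
print(encoded)  # [0, 4]
```

Because the vocabulary is built from the training data only, new data never produces an integer larger than the embedding was sized for.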

  64. Avatar
    Jaskaran August 1, 2018 at 5:04 pm #

    I want to train my model to predict the target word given a 5-word sequence. How can I represent my target word?

  65. Avatar
    Bharath August 3, 2018 at 12:48 pm #

    Hello Jason,

    This is regarding the output shape of the first embedding layer : (None,4,8).
    Am I correct in understanding that the 4 represents the input size which is 4 words and the 8 is the number of features it has generated using those words?

  66. Avatar
    Sreenivasa August 10, 2018 at 7:33 pm #

    Hi Jason,
    Thanks for sharing your knowledge.
    My task is to classify set of documents into different categories.( I have a training set of 100 documents and say 10 categories).
    The idea is to extract top M words ( say 20) from the first few lines of each doc, convert words to word embeddings and use it as feature vectors for the neural network.

    Question: Since I take the top M words from the document, they may not be in the “right” order each time, meaning there can be different words at a given position in the input layer (unlike a bag-of-words model). Won’t this approach stop the neural network from converging?


    • Avatar
      Jason Brownlee August 11, 2018 at 6:08 am #

      The key is to assign the same integer value to each word prior to feeding the data into the embedding.

      You must use the same text tokenizer for all documents.

  67. Avatar
    Fatemeh August 17, 2018 at 7:15 am #

    Hi Jason,
    Thank you for your great explanation. I have used the pre-trained google embedding matrix in my seqtoseq project by using encoder-decoder. but in my test, I have a problem. I don’t know how to make a reverse for my embedding matrix. Do you have a sample project? My solution is: when my decoder predicts a vector, I should search for that in my pre-trained embedding matrix, and then find its index and then understand its related word. Am I right?

    • Avatar
      Jason Brownlee August 17, 2018 at 7:40 am #

      Why would you need to reverse it?

      • Avatar
        Vivien September 29, 2018 at 1:45 pm #

        Hi Jason

        Thanks for an excellent tutorial. Using your methods, I’ve converted text into word index and applied word embeddings.

        Like Fatemeh, I’m wondering if it’s possible to reverse the process, and convert embedding vectors back into text? This could be useful for applications such as text summarising.

        Thank you.

        • Avatar
          Jason Brownlee September 30, 2018 at 6:00 am #

          Yes, each vector has an int value known by the embedding and each int has a mapping to a word via the tokenizer.

          Random vectors in the space do not have such a mapping; you will have to find the closest vector in the embedding using Euclidean distance.
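
As a rough illustration of the nearest-vector lookup described in this reply, here is a plain-Python sketch. The toy `embedding` and `index_word` dictionaries and the `nearest_word` helper are hypothetical; in practice the vectors would come from the trained Embedding layer and the reverse mapping from the tokenizer.

```python
import math

# Hypothetical toy embedding: integer index -> vector.
embedding = {
    1: [0.9, 0.1],
    2: [0.1, 0.8],
    3: [0.5, 0.5],
}
index_word = {1: "good", 2: "poor", 3: "work"}  # reverse of the tokenizer's word_index


def nearest_word(vector, embedding, index_word):
    """Return the word whose embedding vector is closest to `vector` (Euclidean distance)."""
    best_index = min(embedding, key=lambda i: math.dist(vector, embedding[i]))
    return index_word[best_index]


predicted = [0.8, 0.2]  # e.g. a vector produced by a decoder
print(nearest_word(predicted, embedding, index_word))  # good
```

For a large vocabulary a linear scan like this is slow, so a vectorized or approximate nearest-neighbour search would be the more practical choice.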

  68. Avatar
    fatma August 18, 2018 at 10:09 am #

    Dear Dr. Jason,

    Accuracy: 89.999998 on my laptop. Do results differ from one computer to another?

  69. Avatar
    Ivan September 1, 2018 at 2:00 pm #


    So many thanks for this tutorial!

    I’ve been trying to train a network that consists of an Embedding layer, an LSTM and a softmax as the output layer. However, it seems that the loss and accuracy get stuck at some point.

    Do you have any advice?

    Thanks in advance.

  70. Avatar
    Aramis September 28, 2018 at 10:28 am #

    Thank you so much,
    It helped me a lot in learning how to use pre-trained embeddings in neural nets.

  71. Avatar
    Rafael sa October 24, 2018 at 5:28 am #

    Hi Jason, thank you for the great material.

    I have one question. I want to build an embedding from a list of 1200 documents and use it as input to a classification model that predicts movie box office based on the movie script text…
    My question is: if I train the embedding on the vocabulary of the real dataset, how can I afterwards classify the rest of the dataset that was not used for training? Can I use the embeddings learned during training as input to the classification model?

    • Avatar
      Jason Brownlee October 24, 2018 at 6:33 am #

      Good question.

      You must ensure that the training dataset is representative of the broader problem. Any words unseen during training will be marked zero (unknown) by the tokenizer unless you update your model.

      • Avatar
        Rafael SA November 1, 2018 at 9:06 pm #

        Thank You Jason. As soon as I get the results I’ll try to share it here.
        I’d like to thank you too about your great platform, it is being very helpful to me.

  72. Avatar
    Dimitris October 26, 2018 at 2:06 am #

    Nice post once again! It seems that in each batch all embeddings are updated, which I think should not happen. Do you have any idea how to update only the ones that are passed in each time? That is for computational reasons or other reasons related to the problem definition.

    • Avatar
      Jason Brownlee October 26, 2018 at 5:38 am #

      I’m not sure what you mean exactly, can you elaborate?

  73. Avatar
    Mahdi November 18, 2018 at 1:57 am #

    Hello Jason, I would like to thank you for this post; it’s really interesting and understandable.

    I’ve reused the script, but instead of using the “docs” and “labels” lists, I used the IMDB movie reviews dataset. The problem is that I can’t reach more than 50% accuracy and the loss is stable across all epochs at a value of 0.6932.

    What do you think about that ?

  74. Avatar
    Mehran November 18, 2018 at 12:55 pm #

    Thanks for the article. Could you also provide an example of how to train a model with only one Embedding layer? I’m trying to do the same with Keras but the problem is that the fit method asks for labels which I don’t have. I mean I only have a bunch of text files that I’m trying to come up with the mapping for.

    • Avatar
      Jason Brownlee November 19, 2018 at 6:42 am #

      Models typically only have one embedding layer. What do you mean exactly?

  75. Avatar
    Vishal November 22, 2018 at 4:07 pm #


    Thank you for the excellent explanation!

    I have a few questions related to unknown words.

    Some pretrained word embeddings like the GoogleNews embeddings have an embedding vector for a token called ‘UNKNOWN’ as well.

    1. Can I use this vector to represent words that are not present in the training set instead of the vector of all zeros? If so, how should I go about loading this vector into the Keras Embedding layer? Should it be loaded at the 0th index in the embedding matrix?

    2. Also, can I use the Tokenizer API to help me convert all unknown words (words not in the training set) to ‘UNKNOWN’?

    Thank you.

    • Avatar
      Jason Brownlee November 23, 2018 at 7:43 am #

      Yes, find the integer value for the unknown word and use that to assign to all words not in the vocab.

  76. Avatar
    Saurabh November 27, 2018 at 4:46 pm #

    If word embedding doesn’t contain a word we input to a model , How to address this issue?
    1) Is it possible to load additional words (besides those in our vocabulary) in embedding matrix.
    Or may be any other elegant way you would like to suggest?

  77. Avatar
    mohammad November 28, 2018 at 11:15 pm #

    Hi .thanks a lot for your post . i’m new in python and deep learning !

    I have a training set of 240,000 tweets (50% male and 50% female classes) and a test set of 120,000 tweets (50% male and 50% female). I want to use an LSTM in Python, but I get the following error at the “fit” method:

    ValueError: Error when checking input: expected lstm_16_input to have 3 dimensions, but got array with shape (120000, 400)

    can you help me?

    • Avatar
      Jason Brownlee November 29, 2018 at 7:40 am #

      It looks like a mismatch between your data and the model, you can change the data or change the model.

  78. Avatar
    Mohamed December 5, 2018 at 3:46 am #

    Hi Jason, Thanks for this article.

    I am getting this error

    TypeError: ‘OneHotEncoder’ object is not callable

    How to overcome this?


  79. Avatar
    Rushi December 6, 2018 at 5:40 pm #

    Hi , i have 2 models with this embedding layers , how do i merge those model ?


  80. Avatar
    Utkarsh Rai December 7, 2018 at 12:29 am #

    Hi Jason, great tutorial. I am very new to all this. I have a query: you are using GloVe for the embedding layer, but during fitting you are directly using padded_docs. The vectors in padded_docs have no correlation to GloVe. I am sure that I am missing something; please enlighten me.

    • Avatar
      Jason Brownlee December 7, 2018 at 5:23 am #

      The padding just adds ‘0’ to ensure the sequences are the same length. It does not affect the encoding.

  81. Avatar
    Eduardo Andrade December 7, 2018 at 3:48 pm #

    Hi, Jason. Considering the “3. Example of Learning an Embedding”, I’m adding “model.add(LSTM(32, return_sequences=True))” after the embedding layer and I would like to understand what happens. The number of parameters returned for this LSTM layer is “5248” and I don’t know how to calculate it. Thank you.

    • Avatar
      Jason Brownlee December 8, 2018 at 6:58 am #

      Each unit in the LSTM will take the entire embedding as input, therefore must have one weight for each dimension in the embedding.
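
The figure of 5248 can be reproduced by hand. The sketch below assumes the 8-dimensional embedding from the tutorial as input and the standard LSTM layout of four gates, each with input weights, recurrent weights and one bias per unit. The function name `lstm_param_count` is hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Parameters in a standard LSTM layer: 4 gates, each with
    input weights, recurrent weights and one bias per unit."""
    per_gate = input_dim * units + units * units + units
    return 4 * per_gate


# Embedding output of 8 dimensions feeding LSTM(32):
# 4 * (8*32 + 32*32 + 32) = 4 * 1312
print(lstm_param_count(input_dim=8, units=32))  # 5248
```

Note that `return_sequences=True` changes the output shape of the layer but not the number of weights.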

  82. Avatar
    Vic` December 18, 2018 at 10:12 am #

    Hi Jason,

    Do you have any example showing how we can use a bi-directional LSTM on text (i.e., using word embeddings)?

  83. Avatar
    Matt December 24, 2018 at 9:57 pm #

    I am interested in using the predict function to predict new docs. For instance, ‘Struck out!’ My understanding is that if one or more words in the doc you want to predict weren’t involved in training, then the model can’t predict it. Is the solution to simply train on enough docs to make sure the vocabulary is extensive enough to make new predictions in this way?

    • Avatar
      Jason Brownlee December 25, 2018 at 7:21 am #

      Yes, or mark new words as “unknown” at prediction time.

  84. Avatar
    Ajay December 26, 2018 at 11:53 pm #

    Hello Jason, Is there any reason that the output of the Embedding layer is a 4×8 matrix?

    • Avatar
      Jason Brownlee December 27, 2018 at 5:43 am #

      No, it is just an example to demonstrate the concept.

  85. Avatar
    Zeyu December 30, 2018 at 7:09 am #

    Hi, Jason. Thanks a lot for this excellent tutorial. I have a quick question about the Keras Embedding layer.

    vocab_size = len(t.word_index) + 1

    t.word_index starts from 1 and ends with 15. Therefore, there are 15 words in total in the vocabulary. Then why do we need to add 1 here, please?

    Thanks a lot for the help!

    • Avatar
      Jason Brownlee December 31, 2018 at 6:02 am #

      The words are 1-offset and index 0 is left for “unknown” words.

      • Avatar
        Niccola March 3, 2021 at 7:49 am #

        Would you mind specifying this answer? This is still not quite clear to me

        • Avatar
          Jason Brownlee March 3, 2021 at 8:09 am #

          Sure, which part is not clear?

          • Avatar
            Niccola March 3, 2021 at 1:49 pm #

            What exactly does 1-offset mean in this context?

            And if “index 0 is left for unknown words”, wouldn’t that imply that you could ignore them?

            And if I use this instead:

            vocab_size = len(t.word_counts.keys()) + 1

            Do I also have to add the 1?

          • Avatar
            Jason Brownlee March 3, 2021 at 1:59 pm #

            Words not in the vocab are assigned the value of 0 which maps to the index of the first vector and represents unknown words.

            The first word in the vocab is mapped to the vector at index 1. The second word maps to the vector at index 2, and so on until all vectors are assigned.

            Does that help?

          • Avatar
            Niccola March 3, 2021 at 2:09 pm #

            Thank you for your response, but I am not quite getting this. My understanding is, when I work with t.word_counts instead of t.word_index (not sure if it makes a difference) that I get something like this:

            OrderedDict([(‘word1’, 35), (‘word2’, 232)])

            Then, if I use len(t.word_counts) it gives me 2 in this case. Why am I then adding a 1.

            That is, if I use t.word_counts I don’t see any unknown word when I print it out.

          • Avatar
            Jason Brownlee March 4, 2021 at 5:44 am #

            We are not adding 1 to the counts. We are not touching the ordered dict of words.

            We are just adding one more word to our vocabulary called “unknown” that maps to the first vector in the embedding.

          • Avatar
            Niccola March 3, 2021 at 5:29 pm #

            The tensorflow documentation also says to add 1, but I am really not sure why:


          • Avatar
            Jason Brownlee March 4, 2021 at 5:46 am #

            To make room for “unknown” – words not in the vocab.

            Adding 1 means adding one word to the vocab, the first word mapped to index 0 – word number zero.

          • Avatar
            Niccola March 4, 2021 at 8:05 am #

            Are you aware of any resource that might explain why we have to add 1? I understand it is because of unknown words, so we increase the given vocab size by 1. But I don’t understand why that is.

          • Avatar
            Jason Brownlee March 4, 2021 at 8:24 am #

            We don’t have to do it, we could simply ignore words that are not in the vocab.
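
One way to see why the `+ 1` appears: the embedding is a lookup table whose rows are numbered from 0, while the tokenizer numbers words from 1. The largest word index therefore equals `len(word_index)`, and the table needs `len(word_index) + 1` rows for that index to exist. The plain-Python sketch below uses a hypothetical `make_table` helper as a stand-in for the layer's weights.

```python
word_index = {"well": 1, "done": 2, "good": 3, "work": 4}  # hypothetical, 1-offset


def make_table(rows, dims=2):
    """A stand-in for the embedding weights: one vector per row."""
    return [[0.0] * dims for _ in range(rows)]


largest_index = max(word_index.values())      # 4

too_small = make_table(len(word_index))       # rows 0..3
right_size = make_table(len(word_index) + 1)  # rows 0..4

try:
    too_small[largest_index]
    lookup_failed = False
except IndexError:
    lookup_failed = True                      # index 4 has no row

print(lookup_failed)                   # True
print(len(right_size[largest_index]))  # 2, so the lookup succeeds
```

Row 0 is never assigned to a real word; it is what padded positions (and unknown words, if you encode them as 0) look up.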

  86. Avatar
    Anna January 3, 2019 at 4:13 pm #

    Hi Jason,

    If I have three columns of (string type) multivariate data, one column is categorical, the other two columns are not. Is it ok if I integer encode them using LabelEncoding(), and then scale the encoded data using feature scaling method like MinMax, StandardScaler etc. before feed into anomaly detection model? Even though the ROC shows an impressive result. But is it valid to pre-process text data like that?

    Thank you.

    • Avatar
      Jason Brownlee January 4, 2019 at 6:26 am #

      Perhaps try it and compare performance.

      • Avatar
        Anna January 4, 2019 at 2:49 pm #

        I have tried it and it shows nearly 100% ROC. What I mean is that, is it accurate to pre-process the text data like that? Because when I checked your post regarding pre-processing text data, there is no feature scaling (MinMax, StandardScaler etc) on text data after encode them to integer. I’m afraid if the way I pre-process data is not accurate.

        • Avatar
          Jason Brownlee January 5, 2019 at 6:47 am #

          Generally text is encoded as integers then either one hot encoded or mapped to a word embedding. Other data preparation (e.g. scaling) is not required.

  87. Avatar
    Kafeel Basha January 6, 2019 at 7:46 pm #


    I was trying to do multi class classification of text data using Keras in R and Python.

    In Python I was able to get predicted labels using inverse_transform() method from encoded class values. But when I try to do the same in R using CatEncoders library, getting some of the labels as NAs. Any reason for that.

    • Avatar
      Jason Brownlee January 7, 2019 at 6:28 am #

      No need to transform the prediction, the model can make a class or probability prediction directly.

  88. Avatar
    Li Xiaohong January 7, 2019 at 12:49 am #

    Hi Jason,

    Thanks for your sharing! I have a question on word embedding. Correct me if I am wrong: noticed the word embedding created here only contains words in the training/test set. I would think a word embedding including all vocab in GloVE file will be better? For example, if in production, we encounter a new word than in training/test set, but it is part of the GloVE vocab, in this case, we can capture the meaning of the production words although we don’t see it in training/test set. I think this will benefit sentiment classification problems with smaller training set?


    • Avatar
      Jason Brownlee January 7, 2019 at 6:36 am #

      Generally, you carefully choose your vocab. If you want to maintain a larger vocab than is required of the project “just in case”, go for it.

  89. Avatar
    Rahul Sangole January 17, 2019 at 5:12 am #


    Are there non-text applications of embeddings? For example – I have large sets of categorical variables, each with very large number of levels, which go into a classification model. Could I use embeddings in such a case?


    • Avatar
      Jason Brownlee January 17, 2019 at 5:30 am #

      Yes, embeddings are fantastic for categorical data!

  90. Avatar
    Dipawesh Pawar January 18, 2019 at 2:11 am #

    Hi Jason…

    Thanks a lot for such a nice post. It enriched my knowledge a lot.

    I have one question about the text_to_sequence and one_hot methods provided by Keras. Both of them give the same encoded docs with your example. If they give the same output, when should we use text_to_sequence and when should we go for one_hot?

    Again thanx a lot for such a nice post.

  91. Avatar
    Ritesh January 21, 2019 at 6:18 pm #

    Hi Jason,

    Your post are really superb. Thanks for writing such great post .

    I have one query , why people use Embedding Layer when we have already got the vector representation of a word from word2vec or glove . Using these two pre trained model we have already got the same size vector representation of each word and if word in not found we can assign the random value of same size. After getting the vector representation why we are passing to the Embedding layer?


    • Avatar
      Jason Brownlee January 22, 2019 at 6:21 am #

      Often the learned embedding in the neural net performs better because it is specific to the model and prediction task.

  92. Avatar
    Ritesh January 22, 2019 at 4:54 pm #

    Hi Jason,

    Thanks for the reply .

    What if I set trainable = False . Then also it is needed to use Embedding layer when I have already vector representation of each word in sequence using word2vec or glove.


    • Avatar
      Jason Brownlee January 23, 2019 at 8:45 am #

      Yes, you can keep the embedding fixed during training if you like. Results are often not as good in my experience.

  93. Avatar
    hayj February 1, 2019 at 10:08 pm #

    Hello Jason, thank you for this very good tutorial. I have a question: I trained your model on the IMDB sentiment analysis dataset. The model has 80% accuracy, but I have very bad embeddings.

    First, a small detail: I used one_hot, which uses a hashing trick, but I used md5 because the default hash function is not consistent across runs. This is mentioned in the Keras docs (so it is not good for saving the model and predicting new documents, am I right?).

    But the important thing is that I have very bad embeddings. I created a dict which maps lowercased words to embeddings (of size 8). Following this: I didn’t use GloVe vectors for now.

    I tested searching for the most similar words and I got random words for “good” (holt, christmas, stodgy, artistic, szabo, mandatory…). I set the vocab size to 100000. Of course, due to the hashing trick, 2 words can have the same index, so I don’t take into account similarities of 1.0. I think the bad embedding vectors are due to the fact that we train embeddings on the entire document and not on context like word2vec. What do you think?

    • Avatar
      Jason Brownlee February 2, 2019 at 6:17 am #

      Generally, the embeddings learned by a model work better than pre-fit embeddings, at least in my experience.

  94. Avatar
    Despoina February 7, 2019 at 6:47 pm #

    Hello, such a great tutorial. All your tutorials are very helpful! Thank you.

    I want to find embeddings of three different Greek sentences (for classification). Then I want to merge them in pairs and fit my model.
    I have read your tutorial ‘How to Use the Keras Functional API for Deep Learning’, which is very helpful for the merge.
    My question is: is there any way to calculate the embeddings beforehand and use them as input to my model? Or must I have three different models to calculate the embeddings?

    Thank you in Advance.

    • Avatar
      Jason Brownlee February 8, 2019 at 7:46 am #

      Yes, you could prepare the input manually if you wish: each word is mapped to an integer, and the integer will be an index in the embedding. From that you can retrieve the vector for each word and use that as input to a model.

      • Avatar
        Despoina February 8, 2019 at 8:56 pm #

        Thank you!!!

  95. Avatar
    Mohit February 10, 2019 at 12:19 am #

    Hi Jason,

    Thank you so much for this wonderful explanation. After reading many other resources I understand the embedding layers only after reading this. I have few questions and I’d really appreciate if you could take out the time and answer them.

    1) In your code, you used a Flatten layer after the Embedding layer and before the Dense layer. In a few other places I noticed that a GlobalAveragePooling1D() layer is used in place of the Flatten. Can you explain what Global Average Pooling does and why it’s used for text classification?

    2)You explained in of the comments that each word will have only one vector representation before and after the training. So just to confirm, when a word x is inputted to the embedding layer, the output always updates the same vector that represents x? For example, for vocab size 500 and embedding dimension 10, ([500, 10] output shape)if word x is the first vector([0,10]) in the output, every time the word x is inputted the first vector([0, 10]) will be updated and not if the word is not present?

    3) What’s the intuition behind choosing the size of the embedding dimension?

    Thank you again Jason. Will be waiting for your response.

    • Avatar
      Jason Brownlee February 10, 2019 at 9:43 am #

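      Either approach can be used; both reduce the output of the Embedding layer to a single vector per document (Flatten by concatenating the word vectors, global average pooling by averaging them). Perhaps try both and see what works best for your specific dataset.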

      The vectors for a given word are learned. After the model is finished training, the vectors are fixed for each word. If a word is not present during training, the model will not be able to support it.

      Embedding must be large enough to capture the relationship between words. Often 100 to 300 is more than enough for a large vocab.
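
On the first question, here is a rough sketch of what the two layers do to the embedded words of one document. It is plain Python with toy numbers; `flatten` and `global_average` are hypothetical stand-ins for the Keras `Flatten` and `GlobalAveragePooling1D` layers, not their implementations.

```python
# One document: 4 words, each embedded as a 2-dimensional vector (toy numbers).
embedded_doc = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]


def flatten(doc):
    """Concatenate the word vectors into one long vector (length = words * dims)."""
    return [value for vector in doc for value in vector]


def global_average(doc):
    """Average over the word axis, giving one value per embedding dimension."""
    return [sum(column) / len(doc) for column in zip(*doc)]


print(flatten(embedded_doc))         # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(global_average(embedded_doc))  # [4.0, 5.0]
```

Flattening keeps word positions, and its output grows with the input length. Averaging discards word order and always yields a vector the size of the embedding, which means fewer weights in the following Dense layer.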

  96. Avatar
    vamsi February 15, 2019 at 4:27 pm #

    Very useful post for beginners, but I have a doubt regarding the size of the embedding vector. As you mentioned in the post:
    “The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.
    The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. ”

    I did not understand why the output from the embedding layer should be 4 vectors. If it is related to input length, please explain me how ?. also the phrase “one for each word” , i did not understand this as well

    • Avatar
      Jason Brownlee February 16, 2019 at 6:15 am #

      A good rule of thumb is to start with a length of 50, and try increasing to 100 or 200 and see if it impacts model performance.

  97. Avatar
    Ratha February 21, 2019 at 9:14 pm #

    Hi Jason, I’m facing trouble while using bag of words as features in a CNN. Do you have any idea how to implement a BoW-based CNN?

    • Avatar
      Jason Brownlee February 22, 2019 at 6:18 am #

      It is not clear to me why/how you would pass a BoW encoded doc to a CNN – as all spatial information would be lost. Sounds like a bad idea.

  98. Avatar
    Alexa February 21, 2019 at 10:17 pm #

    Hi! Thanks for a good tutorial. I have a question regarding embeddings. How is the performance when using Embedding Layer for training embeddings VS using pre trained embedding? Is it faster and does the model require less time when training if using a pre trained embedding?

    Thanks for answer.

    • Avatar
      Jason Brownlee February 22, 2019 at 6:19 am #

      Embeddings trained as part of the model seem to perform better in my experience.

  99. Avatar
    Despoina February 24, 2019 at 8:16 pm #

    Hello, great post!

    I want to know more about vocabulary size in Keras Embedding Layer. I am working with Greek and Italian languages. Do you have any scientific paper to suggest?

    Thank you very much!

    • Avatar
      Jason Brownlee February 25, 2019 at 6:40 am #

      Perhaps test a few different vocab sizes for your dataset and see how it impacts model performance?

      • Avatar
        Despoina February 25, 2019 at 10:45 pm #

        Ok, thank you very much!

  100. Avatar
    Yoni Pick February 25, 2019 at 6:48 am #

    Hi Jason

    What needs to change when using the unsupervised model on pre-train embedding matrix?

    Also, what if you want to use weak supervision?

  101. Avatar
    ark March 5, 2019 at 9:58 pm #


  102. Avatar
    Hadij March 10, 2019 at 7:47 am #

    Hi Jason,
    For one-hot encoder, if we are given the new test set, how can we use one_hot function to have the same matrix space as we had for training set? Since we can not have a separate one_hot encoder for the test set.
    Thank you very much.

    • Avatar
      Jason Brownlee March 10, 2019 at 8:19 am #

      You can keep the encoder objects used to prepare the data and use them on new data.
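
The Keras `one_hot` function is a hashing trick, so there is no fitted object to keep. What has to stay the same between training and test data is the vocabulary size and a hash function that is stable across runs (Python's built-in `hash()` for strings is not, by default). The sketch below shows the idea in plain Python with `hashlib`; `stable_one_hot` is a hypothetical helper, not the Keras function.

```python
import hashlib


def stable_one_hot(text, n):
    """Map each word to an integer in 1..n-1 using md5, which gives the
    same result in every Python process (unlike the built-in hash())."""
    encoded = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        encoded.append(int(digest, 16) % (n - 1) + 1)
    return encoded


vocab_size = 50
train_encoded = stable_one_hot("good work", vocab_size)
test_encoded = stable_one_hot("work", vocab_size)

# The same word gets the same integer in training and test data.
print(train_encoded[1] == test_encoded[0])  # True
```

As with any hashing approach, two different words can collide on the same integer, so the vocabulary size is usually set well above the number of distinct words.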

  103. Avatar
    Miguel Won March 31, 2019 at 1:01 am #

    I often see implementations like yours, where the embedding layer is built from a word_index that contains all words, built e.g. with the Keras preprocessing Tokenizer API. But if I fit my corpus with the Tokenizer and with a vocabulary size limit (with num_words), why should I need an embedding layer the size of the total number of unique words? Wouldn’t that be a waste of space? Is there any issue with building an embedding layer sized to the vocabulary limit I need?

    • Avatar
      Jason Brownlee March 31, 2019 at 9:31 am #

      Not really, the space is small.

      It can be good motivation to ensure your vocab only contains words required to make accurate predictions.

  104. Avatar
    Leena Agrawal April 15, 2019 at 9:47 pm #

    Hi Mr Jason,

    Excellent tutorial indeed!

    I have a question regarding dimensions of embedding layer , could you please help:

    e = Embedding(200, 32, input_length=50)

    How do we decide on the size of output_dim, which is 32 here? Is there any specific reason for this value?

    Thanks in advance:

    • Avatar
      Jason Brownlee April 16, 2019 at 6:48 am #

      You can use trial and error and evaluate how different sized output dimensions impact model performance on your specific dataset.

      50 or 100 is a good starting point if you are unsure.

  105. Avatar
    Mohammad April 16, 2019 at 3:28 pm #

    I trained my model using a word embedding with GloVe, but kindly let me know how to prepare the test data for predicting results with the trained weights. I still have not found any post which follows the whole process, especially for word embeddings with GloVe.

    • Avatar
      Jason Brownlee April 17, 2019 at 6:51 am #

      The word embedding cannot predict anything. I don’t understand, perhaps you can elaborate?

  106. Avatar
    Alex April 19, 2019 at 12:20 pm #

    Hello, Jason,

    thanks for the post. I have a question about the embedding data you actually fit. I print the padded_docs after the model compile. It seems to me that the printed matrix is not an embedding matrix. It’s a integer matrix. So I think what you fit in the CNN is not embedding but the integer matrix you define. Could you please help me explain it? Thanks a lot.

    • Avatar
      Jason Brownlee April 19, 2019 at 3:05 pm #

      Yes, the padded_docs is integers that are fed to the embedding layer that maps each integer to an 8-element vector.

      The values of these vectors are then defined by training the network.
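
To make the integer-to-vector step concrete, here is a minimal plain-Python sketch of what the lookup amounts to. The `weights` list is a hypothetical stand-in for the layer's weight matrix, using the vocabulary size of 50, input length of 4 and 8 dimensions from the tutorial.

```python
import random

random.seed(1)
vocab_size, dims = 50, 8

# Stand-in for the layer's weights: one 8-element vector per integer,
# random before training.
weights = [[random.uniform(-0.05, 0.05) for _ in range(dims)] for _ in range(vocab_size)]

padded_doc = [6, 2, 0, 0]  # one integer-encoded document, padded to length 4

# The "embedding" step is just a row lookup for each integer.
embedded = [weights[i] for i in padded_doc]

print(len(embedded), len(embedded[0]))  # 4 8
```

So the model is fit on integers, and the vectors themselves live inside the layer as weights that training adjusts.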

  107. Avatar
    Oscar April 21, 2019 at 4:25 am #

    Hi Jason,

    I am working on character embedding. My dataset consists of raw HTTP traffic both normal and malicious. I have used the Tokenizer API to integer encode my data with each character having an index assigned to it.

    Please let me know if I understood this correctly:

    My data is integer encoded to values between 1-55, therefore my input_dim is 55.

    I will start with an output_dim of 32 and modify this value as needed.

    Now for the input_length, I am a bit confused how to set this value.
    I have different lengths for the numerical strings in my dataset the longest is 666. Do I set input-length to 666? And if I do this what will happen to the sequences with shorter length?

    Thank you for your help!

    • Avatar
      Oscar April 21, 2019 at 6:39 am #

      Also, should I set the input dim to a value higher than 55?

    • Avatar
      Jason Brownlee April 21, 2019 at 8:27 am #

      Do you mean word embedding instead of char embedding?

      I don’t have any examples of embedding char’s – I’m not sure it would be effective.

      • Avatar
        Oscar April 21, 2019 at 5:34 pm #

        I meant character embedding. I used the tokenizer and set the character level to True.
        I am not sure how to use word embedding for query strings of http traffic when they are not made of real words and just strings of characters.
        I am designing a character level neural network for detecting parameter injection in http requests.
        The result would be in a binary format 0 if request is normal and 1 if it’s malicious.
        So you don’t think character embedding is helpful here?

        • Avatar
          Jason Brownlee April 22, 2019 at 6:21 am #

          Sorry, I don’t have an example of a character embedding.

          Nevertheless, you should be able to provide strings of integer-encoded chars to the embedding in your model. It will look much like an embedding for words, just with lower cardinality (<100 chars perhaps). Also, I don't expect good results.

          What problem are you having exactly?
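
On the earlier questions in this thread about `input_dim` and `input_length`, the sketch below shows one way the numbers could be derived for character-level input. It is plain Python with toy request strings; the `pad` helper is hypothetical and pads on the right, whereas the Keras `pad_sequences` function pads on the left unless told otherwise.

```python
requests = ["GET /a?id=1", "GET /b"]  # toy stand-ins for raw HTTP request strings

# Character vocabulary, numbered from 1 so that 0 stays free for padding.
chars = sorted(set("".join(requests)))
char_index = {c: i + 1 for i, c in enumerate(chars)}

encoded = [[char_index[c] for c in r] for r in requests]


def pad(seq, length, value=0):
    """Right-pad a sequence with `value` up to `length`."""
    return seq + [value] * (length - len(seq))


input_length = max(len(s) for s in encoded)  # longest sequence in the data
padded = [pad(s, input_length) for s in encoded]
input_dim = len(char_index) + 1              # +1 for the padding value 0

print(input_length)              # 11
print([len(p) for p in padded])  # [11, 11]
```

By the same reasoning, characters encoded as 1 to 55 would suggest an `input_dim` of 56, and a longest sequence of 666 would suggest an `input_length` of 666 with shorter sequences padded with zeros.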

  108. Avatar
    Mario May 2, 2019 at 6:09 pm #

    I have found vocab_size = len(t.word_index) + 1 to be wrong. This index not only ignores the Tokenizer(num_words=X) parameter, but also stores more words than are ever actually going to be encoded.

    I fit my text without word limit, and then encode the same text using the tokenizer, and the length of the word_index is larger than the max(max(encoded_texts)).
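
A sketch of the behaviour described here, mimicked in plain Python rather than calling Keras. It rests on my understanding that the Keras Tokenizer always stores every word it has seen in `word_index` and applies `num_words` only at encoding time, keeping words whose index is below `num_words`. The `encode` helper is hypothetical.

```python
from collections import Counter

docs = ["good work", "good effort", "good work done"]
counts = Counter(word for doc in docs for word in doc.split())

# Rank words by frequency; the most frequent word gets index 1.
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}


def encode(doc, word_index, num_words=None):
    """Encode a document, dropping words whose index is >= num_words."""
    ids = [word_index[w] for w in doc.split() if w in word_index]
    if num_words is not None:
        ids = [i for i in ids if i < num_words]
    return ids


num_words = 3
encoded = [encode(d, word_index, num_words) for d in docs]

print(len(word_index))                         # 4 words were seen in total
print(max(i for doc in encoded for i in doc))  # 2, so input_dim = num_words is enough
```

Under that assumption, sizing the Embedding layer with `num_words` instead of `len(word_index) + 1` avoids allocating rows that can never be looked up.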

  109. Avatar
    lamesa May 17, 2019 at 6:46 pm #

    hello jason, how are you?
    I am doing my masters thesis on text summarization using word embeddings, and now i am in the middle of many questions, how could I use these features and which neural network alg is best. please could you give some guide….

  110. Avatar
    lamesa May 20, 2019 at 9:36 pm #

    thanks jason…it is really helpful..

  111. Avatar
    Siddhartha June 9, 2019 at 6:06 pm #

    Hi Jason,

    This entire article is very useful. It helped me in writing my initial implementation.
    I have one question related to Input() and Embedding() in Keras.
    If I already have a pretrained word embedding. In that case, should I use Embedding or Input ?

    • Avatar
      Jason Brownlee June 10, 2019 at 7:36 am #

      Yes, the embedding vectors are loaded into the Embedding layer and the layer may be marked as not trainable.

  112. Avatar
    alphabeta June 23, 2019 at 7:44 pm #

    what if in test we have a new words which were not there in the training text ?
    It will not proceed from the embedding layer correct?

  113. Avatar
    Zeinab June 29, 2019 at 10:53 pm #

    Hi, Jason
    I want to ask you about how can i save my learned word embedding?

  114. Avatar
    Zeinab July 2, 2019 at 4:03 am #

    Can I construct a network for learning the embedding matrix only?

  115. Avatar
    zeinab July 10, 2019 at 6:28 pm #

    I have a text similarity application where I measure Pearson correlation coefficient as keras metrics.
    In many epochs, I noticed that the correlation value is nan.
    Is this normal or is there a problem in the model?

    • Avatar
      Jason Brownlee July 11, 2019 at 9:46 am #

      You may have to debug the model to see what the cause of the NAN is.

      Perhaps an exploding gradient or vanishing gradient?

      • Avatar
        Zeinab July 12, 2019 at 2:55 pm #

        Do you mean that I have to adjust the activation function?

        I use elu activation function and Adam optimization function,

        Do you mean that I have to change any of them and see the results?

        • Avatar
          Jason Brownlee July 13, 2019 at 6:52 am #


          Try relu.
          Try batch norm.
          Try smaller learning rate.

      • Avatar
        Zeinab July 13, 2019 at 10:19 pm #

        Can I know what you mean by debugging the model?

        • Avatar
          Jason Brownlee July 14, 2019 at 8:10 am #

          Yes, here are some ideas:

          – Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
          – Consider cutting the problem back to just one or a few simple examples.
          – Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
          – Consider posting your question and code to StackOverflow.

  116. Avatar
    Leon July 20, 2019 at 12:19 pm #

    vocab_size = len(t.word_index) + 1

    why do we need to increase the vocabulary size by 1???

    • Avatar
      Jason Brownlee July 21, 2019 at 6:23 am #

      This is so that we can reserve 0 for unknown words and start known words at 1.
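
      A small sketch of why the extra row is needed, assuming a word index whose numbering starts at 1 (the three words here are made up):

```python
import numpy as np

# A word index in the style produced by a fitted tokenizer: numbering starts at 1.
word_index = {"good": 1, "work": 2, "done": 3}

vocab_size = len(word_index) + 1      # rows 0..3, so the largest index (3) is valid
embedding_matrix = np.zeros((vocab_size, 8))

# Without the +1 the matrix would have rows 0..2 and index 3 would be out of range.
largest_index = max(word_index.values())
print(vocab_size, largest_index < embedding_matrix.shape[0])
```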

  117. Avatar
    Zineb_Morocco July 26, 2019 at 5:48 am #

    Thanks a lot for this wonderful work .. we’re in July 2019 and still taking advantage of it.

    • Avatar
      Jason Brownlee July 26, 2019 at 8:33 am #

      Thanks. I try to write evergreen tutorials that remain useful.

  118. Avatar
    Christiane July 29, 2019 at 5:24 am #

    Dear Jason,

    thanks a lot for your detailed explanations and the code examples.

    What I am still wondering about, however, is how I would combine embeddings with other variables, i.e. having a few numeric or categorical variables and then one or two text variables. As I see you input the padded docs (only text) in But how would I add the other variables? It doesn’t seem realistic to always only have one text variable as input.

  119. Avatar
    Zineb_Morocco July 30, 2019 at 11:43 pm #

    Thank you Jason for this wonderful work and examples. That really help.

  120. Avatar
    Youssef MELLAH August 4, 2019 at 6:19 am #

    Thank u Jason Brownlee, that’s very interesting and clear.

    And if i have two inputs?

    for example i am working on Text-to-SQL task and that necessit 2 inputs : user question and table schema (columns names).

    how can i process? how can i do embeddig? with 2 embeddings layers?

    Thank u for help.

  121. Avatar
    Youssef MELLAH August 4, 2019 at 7:30 am #

    Ah okay, that’s interesting too, thanks!!

    Can you please confirm me the architecture above to encode both user Questions and table Schema in the same model?

    (1) userQuestion ==> One-hot-encoding ==> Embedding(GloVe) ==> LSTM
    (2) tableSchema ==> One-hot-encoding ==> Embedding(GloVe) ==> LSTM

    (1) concatenate (merge) (2) ==> rest of model layers….

    thanks Jason.

    • Avatar
      Jason Brownlee August 5, 2019 at 6:42 am #

      No need for the one hot encoding, the text can be encoded as integers, and the integers mapped to vectors via the embedding.
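
      To see why the one hot step is redundant, this NumPy sketch (random stand-in weights, not a trained embedding) shows that multiplying a one-hot vector by the embedding matrix gives the same result as looking up the row by its integer index:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
embedding_matrix = rng.normal(size=(vocab_size, dim))

word_ids = np.array([3, 7, 0])               # integer-encoded words

# Route 1: explicit one-hot vectors multiplied by the matrix.
one_hot = np.eye(vocab_size)[word_ids]       # shape (3, 10)
via_one_hot = one_hot @ embedding_matrix     # shape (3, 4)

# Route 2: direct row lookup, which is what an embedding does with integer input.
via_lookup = embedding_matrix[word_ids]      # shape (3, 4)

print(np.allclose(via_one_hot, via_lookup))
```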

      • Avatar
        Youssef Mellah August 5, 2019 at 7:58 am #

        ok that’s clear, thanks.

        The attention mechanism can be done only with the merge?

        • Avatar
          Jason Brownlee August 5, 2019 at 1:59 pm #

          No, you can use attention on each input if you wish.

          • Avatar
            Youssef MELLAH August 5, 2019 at 6:42 pm #

            Should i do attention on both inputs (user Question & table Schema) separately or can i do it after merging the 2 inputs?

          • Avatar
            Jason Brownlee August 6, 2019 at 6:31 am #

            Test many different model types and see what works well for your specific dataset.

          • Avatar
            Youssef MELLAH August 15, 2019 at 12:41 am #

            okay thank you Jason!!

          • Avatar
            Jason Brownlee August 15, 2019 at 8:12 am #

            No problem.

  122. Avatar
    Elizabeth August 5, 2019 at 12:06 am #

    I wonder about pre-trained word2vec, there is no such good tutorial for that. I am looking to implement pre-trained word2vec but I do not know should I follow the same steps of Glove or look for another source for that?
    thanks MR.Jason I am very inspired by you in machine learning

  123. Avatar
    Dean August 7, 2019 at 1:06 am #

    Why don’t you use mask_zero=True in your embedding layers? It seems necessary since you are padding sentences with 0’s.

    • Avatar
      Jason Brownlee August 7, 2019 at 8:01 am #

      Great suggestion, not sure if that argument existed back when I wrote these tutorials. I was using masking input layers instead.

  124. Avatar
    Nathalia August 9, 2019 at 8:04 am #

    hi, can you help me with a question?

    Im working with a dataset that has a city column as a feature and thats has a lot of different cities. So, I create a embeddinglayer for this feature. First, I used this command :
    data['city']= data['city'].astype('category')
    data['city']= data['city']
    After that, for each different city a value was assigned starting at 0

    So, I’m confused about how this embedding layer works when the test data has an input that was not in the training data. I saw that you said that when this occurs, we have to put 0 as input, but 0
    is already related to some city. Should I start assigning these values to the cities from 1?

    • Avatar
      Jason Brownlee August 9, 2019 at 8:20 am #

      Excellent question.

      Typically, unseen categories are assigned 0 for “unknown”.

      0 should be reserved and real numbering should start at 1.
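
      A minimal sketch of that numbering scheme in plain Python. The city names and the encode_city() helper are invented for the example:

```python
# Hypothetical training and test values for the city column.
train_cities = ["Lima", "Quito", "Lima", "Bogota"]
test_cities = ["Quito", "Caracas"]          # "Caracas" was never seen in training

# Number the known categories from 1; 0 is reserved for "unknown".
city_index = {city: i + 1 for i, city in enumerate(sorted(set(train_cities)))}

def encode_city(city):
    """Return the integer for a known city, or 0 for an unseen one."""
    return city_index.get(city, 0)

train_ids = [encode_city(c) for c in train_cities]
test_ids = [encode_city(c) for c in test_cities]

# The embedding needs one row per known city plus the reserved row 0.
input_dim = len(city_index) + 1
print(train_ids, test_ids, input_dim)
```

      Note that row 0 only learns a useful vector if some training examples are mapped to it; otherwise it keeps its initial values.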

      • Avatar
        Nathalia August 9, 2019 at 8:25 am #

        thank you, you always help me a lot!

  125. Avatar
    ravi August 20, 2019 at 1:38 pm #

    Hi Jason..

    Thanks for such a great tutorial. I am confused on when we talk about learned word embeddings , do we consider weighs of the embedding layer or output of embedding layer.

    let me ask in other way as well, when we utilize pretrained embedding let us say “glove.6B.50d.txt” . those word embeddings are weights or the output of the layer?

    • Avatar
      Jason Brownlee August 20, 2019 at 2:14 pm #

      They are the same thing. Integers are mapped to vectors, those vectors are the output of the embedding.

  126. Avatar
    Ralph August 23, 2019 at 7:39 pm #

    Hi Jason,

    I am new to ML, trying out different things, and your posts are the most helpful I encountered, it helps me a lot to understand, thank you!

    Here I think I understood the procedure, but I still have a deeper question, on the point of embeddings. If I understand correctly, this embedding kind of maps a set of words as points onto another dimensionnal space. The surprising fact in your example is that we pass from a space of dimension 4 to a space of dimension 8, so it might not be seen as an improvement at first.

    Still I imagine that the embedding makes it so that points in the new space are more equally placed, am I right? Then I don’t understand several things:
    -How does the context where one words appear come into play? Other words which are often close by will also be represented by closer points in the new space?
    -Why does it have to be integers? And why is it more applied to word encodings? I mean we could imagine the same process could be helpful for images as well. Or is it just a dimension reduction technique tailored for words documents?

    Thank you for your insights anyway

    • Avatar
      Jason Brownlee August 24, 2019 at 7:50 am #

      Not equally spaced, but spaced in a way that preserves or best captures their relationships.

      Context defines the relationships captured in the embedding, e.g. what words appear with what other words. Their closeness.

      Each word gets one vector. The simplest way is to map words to integers and integers to the index of vectors in a matrix. No other reason.

      Great questions!

  127. Avatar
    Anand August 24, 2019 at 4:43 pm #

    Jason,Thank you so much for your time and effort!

    My question is related to the line-
    “e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)”

    Here you are using-weights=[embedding_matrix], but there is no relation telling for which word which vector. Then, it produces for each document one 4*100 matrix(example for [ 6 2 0 0]).How it will extract the vectors related to 6,2,0,0 accurately?

    • Avatar
      Jason Brownlee August 25, 2019 at 6:34 am #

      The vectors are ordered, with an array index you can retrieve the vector for each word, e.g. word 0, word 1, word 2, word 3, etc.
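
      In NumPy terms the lookup can be sketched like this. The matrix is filled with random numbers as a stand-in for the GloVe-based embedding_matrix, and the vocabulary size of 10 is arbitrary:

```python
import numpy as np

vocab_size, dim = 10, 100
rng = np.random.default_rng(1)

# Stand-in for the GloVe-based matrix: row i holds the vector for the word with index i.
embedding_matrix = rng.normal(size=(vocab_size, dim))
embedding_matrix[0] = 0.0                    # index 0 is the padding/unknown row

padded_doc = np.array([6, 2, 0, 0])          # the document from the question
doc_vectors = embedding_matrix[padded_doc]   # one row per position

print(doc_vectors.shape)                     # (4, 100)
```

      The relation between word and vector is therefore carried entirely by the row position: the integer assigned to a word when building the matrix must be the same integer used to encode the documents.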

  128. Avatar
    Youssef MELLAH September 2, 2019 at 12:36 am #

    Hello Jason,

    I m searching for interesting formation on Python (numpy pandas … and tools for ML DL and NLP) and formation on Keras !!

    Some suggestions please?

  129. Avatar
    Emre Calisir September 3, 2019 at 6:32 pm #

    Thanks for article, I will run it for the Italian language documents. Is there any GoogleNews pretrained word2vec covering Italian vocabulary?

    • Avatar
      Jason Brownlee September 4, 2019 at 5:56 am #

      Good question, I’m not sure off the cuff sorry.

  130. Avatar
    Muhammad Usgan September 7, 2019 at 5:59 pm #

    Hallo jason, Can I put Dropout into this model ?

  131. Avatar
    ASHWARYA ASHIYA September 19, 2019 at 6:49 pm #

    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)

    What does the parameter – weights=[embedding_matrix] – stand for ? weights or inputs for the Embedding Layer ?

    • Avatar
      Jason Brownlee September 20, 2019 at 5:37 am #

      The weights are the vectors, e.g. one vector for each word in your vocab.

  132. Avatar
    Munisha October 3, 2019 at 7:30 pm #

    Layer (type) Output Shape Param #
    embedding_1 (Embedding) (None, 4, 8) 400
    flatten_1 (Flatten) (None, 32) 0
    dense_1 (Dense) (None, 1) 33
    Total params: 433
    Trainable params: 433
    Non-trainable params: 0

    Could you please clarify no. of weights learnt for embedding layer. Now we have 10 documents and embedding layer from 4 to 8. How many weights parameters will actually be learnt here. My understanding was that there should only be (4+1)*8 = 40 weights to be learnt, including the bias term. Why is it learning weights for all the documents separately (10*(4+1)*8 = 400) ?

    • Avatar
      Jason Brownlee October 4, 2019 at 5:40 am #

      The number of weights in an embedding is vector length times number of vectors (words).
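
      The numbers in the summary can be reproduced with a small NumPy sketch, using the values from the tutorial example (vocabulary of 50, 8-element vectors, documents padded to 4 integers). The arrays are zero-filled stand-ins, only the shapes matter:

```python
import numpy as np

vocab_size, dim, input_length, n_docs = 50, 8, 4, 10

embedding_weights = np.zeros((vocab_size, dim))           # one row per vocabulary index
docs = np.zeros((n_docs, input_length), dtype=int)        # 10 padded documents of 4 integers

embedded = embedding_weights[docs]                        # (10, 4, 8)
flattened = embedded.reshape(n_docs, -1)                  # (10, 32)

embedding_params = embedding_weights.size                 # 50 * 8 = 400, no bias term
dense_params = flattened.shape[1] * 1 + 1                 # 32 weights + 1 bias = 33
print(embedded.shape, flattened.shape, embedding_params + dense_params)
```

      The count depends on the vocabulary size and the vector length, not on the number of documents or on the input length.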

  133. Avatar
    martin October 12, 2019 at 4:29 pm #

    Hi, Jason:

    In this example, what type of neural architecture it is? It is not LSTM, not CNN. Is it a Multi-Layer Perceptron model?

    • Avatar
      Jason Brownlee October 13, 2019 at 8:26 am #

      We are just working with embeddings. I guess with an output layer, you can call it an MLP.

  134. Avatar
    Zafari October 13, 2019 at 5:52 am #

    Hi, Thanks for this excellent article. I tried to use a pre-trained word embedding instead of random number in a Keras-based classifier. However, after constructing the embedding matrix and adding it to the embedding layer as follow, during training epochs all of the accuracy values are the same and no learning happens. However, after removing the “weights=[embedding_matrix]” it works well and reached the accuracy of 90%.

    layers.Embedding(input_dim=vocab_size, weights=[embedding_matrix],

    What can be the reason of this strange behavior?

    • Avatar
      Jason Brownlee October 13, 2019 at 8:35 am #

      An embedding specialized on your own data is often better than something generic.

      • Avatar
        Hussain July 28, 2020 at 8:46 am #

        Quick question. While using pre-trained embeddings (MUSE) in an Embeddings layer, is it okay to set trainable=True?

        Note: The model doesn’t overfit when i set trainable=True. The model doesn’t predict well if i set trainable=False.

        • Avatar
          Jason Brownlee July 28, 2020 at 8:52 am #

          Yes, although you might want to use a small learning rate to ensure you don’t wash away the weights.

          • Avatar
            Hussain July 28, 2020 at 9:24 am #

            Thank you very much for your reply. Currently I am using a learning rate of 0.0001.


          • Avatar
            Jason Brownlee July 28, 2020 at 10:54 am #

            Perhaps use SGD instead of Adam as Adam will change the learning rate for each model parameter and could get quite large.

            Or at least compare results with adam vs sgd.

  135. Avatar
    Xuesong Wang October 13, 2019 at 8:32 pm #

    Hi Jason,
    Thank you for your post. I have an issue. The data I used include both categorical and numerical features. Say, Some features are cost and time while others are post code. How should I write the code? Build separate models and concatenate them together? Thank you

    • Avatar
      Jason Brownlee October 14, 2019 at 8:07 am #

      Great question!

      The features must be prepared separately then aggregated.

      The two ways I like to do this is:

      1. Manually. Prepare each feature type separately then concatenate into input vectors.
      2. Multi-Input Model. Prepare features separately and feed different features into different inputs of a model, and let the model concatenate the features.

      Does that help?
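
      A rough NumPy sketch of what the concatenation amounts to. The columns (cost, time, post code) follow the question; the values and the random embedding are made up, and in a real multi-input model the embedding would be learned rather than fixed:

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 5

# Hypothetical inputs: two numeric columns (cost, time) and one integer-encoded post code.
numeric = rng.normal(size=(n_samples, 2))
post_code_ids = np.array([1, 3, 2, 0, 3])      # 0 reserved for unknown

# Stand-in for a learned embedding: 4 post-code indices, 3 dimensions each.
post_code_embedding = rng.normal(size=(4, 3))
post_code_vectors = post_code_embedding[post_code_ids]   # (5, 3)

# Concatenate along the feature axis, as a concatenate/merge layer would.
features = np.concatenate([numeric, post_code_vectors], axis=1)
print(features.shape)                           # (5, 5)
```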

      • Avatar
        martin October 15, 2019 at 4:39 pm #

        Should the categorical be converted into numerical using one-hot encoding?

      • Avatar
        Franz Götz-Hahn October 28, 2019 at 9:04 pm #

        If you have multivariate time series of which you know they are meaningfully connected (say trajectories in x and y), does it make sense to put a Conv layer before feeding them into the embedding?

        Could you explain what you mean by “preparing” the features?

        • Avatar
          Jason Brownlee October 29, 2019 at 5:22 am #

          No, embedding is typically the first step, e.g. an interpretation of input.

          Prepared means transformed in whatever way you want, such as encoding or scaling.

          • Avatar
            Franz Götz-Hahn October 29, 2019 at 5:03 pm #

            Thanks for the answer! If I may ask a follow up: is embedding of multivariate numerical data uncommon? I have seen fairly little work that uses it.

          • Avatar
            Jason Brownlee October 30, 2019 at 5:57 am #

            Embedding is used for categorical data, not numerical data.

  136. Avatar
    Zineb_Morocco October 16, 2019 at 1:28 am #

    hi Jason,

    I use an example where a put the vocabulary size = 200 and the training sample contain about 20 different words.
    When I check the embeddings ( the vectors) using ** layers[0].get_weights()[0]** I obtain an array with 200 rows.

    1/ how can I know the vector corresponding to each word (from the 20 words I ‘ve got)?
    2/ where the 180 (200 – 20) vectors come from since I use only 20 words?

    Thanks in advance.

    • Avatar
      Jason Brownlee October 16, 2019 at 8:08 am #

      The vocab size and number of words are the same thing.

      I think you might be confusing the size of the embedding and the vocab?

      Each word is assigned a number (0, 1, 2, etc), the index of vectors maps to the words, vector 0 is word 0, etc.

  137. Avatar
    Zineb_Morocco October 16, 2019 at 11:38 pm #

    Thanks for your answer Jason,
    I ‘ll clarify my question:
    the vocab size is 200 that means that the number of words is 200.
    But effectively i’m working with 20 words only ( the words of my training sample) : let say word[0] to word[19].
    So, after the embedding, the vector[0] corresponds to word[0] and so on. but vector[20].. vector [30] … what do they match ?
    I have no word[20] or word[30] .

    • Avatar
      Jason Brownlee October 17, 2019 at 6:37 am #

      If you define the vocab with 200 words but only have 20 in the training set, then the words not in the training set will have random vectors.
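
      The reason can be sketched in NumPy, assuming plain gradient descent with no regularization: only the rows that are actually looked up receive an update, so the other rows keep their random initial values. The gradient here is random noise standing in for a real one:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, dim = 200, 8
weights = rng.normal(size=(vocab_size, dim))    # random initialisation
initial = weights.copy()

# Indices that actually occur in a hypothetical training set (20 words, numbered 1..20).
seen_ids = np.arange(1, 21)

# Stand-in for one gradient step: only looked-up rows receive a non-zero gradient.
fake_gradient = rng.normal(size=(len(seen_ids), dim))
weights[seen_ids] -= 0.1 * fake_gradient

changed = np.any(weights != initial, axis=1)
print(changed.sum(), vocab_size - changed.sum())
```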

      • Avatar
        Zineb_Morocco October 18, 2019 at 12:22 am #

        Ok. Thank you.

  138. Avatar
    Elizabeth October 27, 2019 at 7:28 am #

    I want to save my own pretrained model in the same way GloVe saved their model, as a txt file with each word followed by its vector. How would I do that?
    thank you

    • Avatar
      Jason Brownlee October 28, 2019 at 6:00 am #

      You could extract the weights from the embedding layer via layer.get_weights() then enumerate the vectors and save to a file in the format you prefer.

  139. Avatar
    Elizabeth October 28, 2019 at 7:47 am #

    beginner in python I did not understand what you mean by enumerating..and which layer should I get weight from?…

    • Avatar
      Jason Brownlee October 28, 2019 at 1:16 pm #

      You can get the vectors from the embedding layer.

      You can either hold a reference to the embedding layer from when you constructed the model, or retrieve the layer by index (e.g. model.layers[0]) or by name, if you name it.

      Enumerating means looping.
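
      A minimal sketch of the loop, writing one word per line followed by its vector. The word_index and weights below are tiny stand-ins (in practice they would come from the fitted tokenizer and from get_weights()[0] on the embedding layer), and write_glove_style() and the file name in the comment are hypothetical:

```python
import io
import numpy as np

# Stand-ins: in practice word_index comes from the fitted tokenizer and
# weights from embedding_layer.get_weights()[0].
word_index = {"good": 1, "work": 2}
weights = np.array([[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]])

def write_glove_style(word_index, weights, handle):
    """Write one line per word: the word followed by its space-separated vector."""
    for word, i in word_index.items():          # "enumerating" is just this loop
        values = " ".join(f"{v:.6f}" for v in weights[i])
        handle.write(f"{word} {values}\n")

buffer = io.StringIO()                          # swap for open("my_embedding.txt", "w")
write_glove_style(word_index, weights, buffer)
print(buffer.getvalue())
```

      A file written this way can be read back with the same loading loop used for the GloVe file in the tutorial.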

  140. Avatar
    Michael November 19, 2019 at 4:47 am #

    Hello, Jason!

    Thanks for the article!
    I have been wondering about the input_dim of the learnable embedding layer.
    You set it to vocab_size, that in your case is 50 (the hashing trick upper limit), which is much larger than the actual vocabulary size of 15.

    The documentation of Embedding in keras says:
    “Size of the vocabulary, i.e. maximum integer index + 1.”
    Which is ambiguous.

    I have experimented with some numbers for vocab_size, and cannot see any systematic difference.

    Would it actually matter for more realistically sized examples?

    Could you say a couple of words about it?
    Thanks again

    • Avatar
      Jason Brownlee November 19, 2019 at 7:50 am #

      Smaller vocabs means you will have fewer words/word vectors and in turn a simpler model which is faster/easier to learn. The cost is it might perform worse.

      This is the trade-off large/slow but good, small/fast but less good.

      • Avatar
        Michael November 19, 2019 at 7:48 pm #

        Thanks, Jason!

        I may have not explained myself properly:
        The *actual* number of words in the vocabulary is the same (14).
        The difference is the value of input_dim to Embedding().

        In the example, you chose 50 as high enough to prevent collisions in encoding, but also
        used it as an input_dim in one of the cases.


        • Avatar
          Jason Brownlee November 20, 2019 at 6:11 am #

          I see.

          • Avatar
            martin November 22, 2019 at 6:28 pm #

            I thought the question is “Size of the vocabulary, i.e. maximum integer index + 1.”. Since there are 14 words in this example, why vocab size isn’t 15, instead of 50?

          • Avatar
            Jason Brownlee November 23, 2019 at 6:49 am #

            There is the size of the vocab, there is also the size of the embedding space. They are different – in case that is causing confusion.

            We must have size(vocab) + 1 vectors in the embedding, to have space for “unknown”, e.g. vector at index 0.

  141. Avatar
    martin November 22, 2019 at 5:52 pm #

    Jason: In this example, ‘one_hot’ function instead of ‘to_categorical’ function is used. The 2nd is the real one-hot representation, and the 1st is simply creating an integer for each word. Why isn’t to_categorical used here? They are different, right?

  142. Avatar
    moSaber November 24, 2019 at 11:11 am #

    Thanks a lot Jason! in 3. Example of Learning an Embedding section, could you please elaborate what is 400 params that are being trained in the embedding layer? Thnx

    • Avatar
      Jason Brownlee November 25, 2019 at 6:19 am #

      Yes, each word is mapped to an 8 element vector, and the vocab is 50 words. Therefore 50*8 = 400.

      • Avatar
        Mohit September 26, 2020 at 6:54 pm #

        Jason, why the output shape of embedding layer is: (4,8)?
        It should be (50,8) as the vocab size is 50 and we are creating the embeddings of all words in our vocabulary.

        • Avatar
          Jason Brownlee September 27, 2020 at 6:51 am #

          vocab size is the total vectors in the layer – the number of words supported, not the output.

          The output is the number of input words (8) where each word has the same vector length (4).

          • Avatar
            Sean February 11, 2023 at 8:57 pm #

            is it typo? The output is the number of input words (4) where each word has the same vector length (8).

          • Avatar
            James Carmichael February 12, 2023 at 9:26 am #

            Hi Sean…We do not believe it is a typo. What occurs when you execute the code?

  143. Avatar
    criz December 30, 2019 at 3:13 am #

    Hi i need some help when running the file.

    (wordembedding) C:\Users\Criz Lee\Desktop\Python Projects\wordembedding>
    Traceback (most recent call last):
    File “C:\Users\Criz Lee\Desktop\Python Projects\wordembedding\”, line 1, in
    from numpy import array
    File “C:\Users\Criz Lee\Anaconda3\lib\site-packages\numpy\”, line 140, in
    from . import _distributor_init
    File “C:\Users\Criz Lee\Anaconda3\lib\site-packages\numpy\”, line 34, in
    from . import _mklinit
    ImportError: DLL load failed: The specified module could not be found.

    pls advise.. thanks

    • Avatar
      Jason Brownlee December 30, 2019 at 6:02 am #

      Looks like there is a problem with your development environment.

      This tutorial may help:

      • Avatar
        criz December 31, 2019 at 3:20 am #

        Hi jason, i’ve tried the url u provided but still didnt manage to solve it.

        basically i typed
        1. conda create -n wordembedding
        2. activate wordembedding
        3. pip install numpy (installed ver 1.16)
        4. ran

        error shows
        File “C:\Users\Criz Lee\Desktop\Python Projects\wordembedding\”, line 2, in
        from numpy import array
        ModuleNotFoundError: No module named ‘numpy’

        pls advise.. thanks

        • Avatar
          Jason Brownlee December 31, 2019 at 7:35 am #

          Sorry to hear that. I am not an expert in debugging workstations, perhaps try posting to stackoverflow.

  144. Avatar
    martin December 30, 2019 at 8:23 am #

    Hi, Jason: How to encode a new document using the Tokenizer object fit on training data? It seems there is no function to return an encoder from the tokenizer object.

    • Avatar
      Jason Brownlee December 31, 2019 at 7:24 am #

      You can save the tokenizer and use it to prepare new data in an identical manner as you did the training data after the tokenizer was fit.

  145. Avatar
    congmin min December 31, 2019 at 11:59 am #

    What do you mean by ‘save the tokenizer’? Tokenizer is an object, not a model.

    • Avatar
      Jason Brownlee January 1, 2020 at 6:30 am #

      It is as important as the model, in that sense it is part of the model.

      You can save Python objects to file using pickle.
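
      For example, a save and load round trip with pickle might look like this. The dictionary is only a stand-in for a fitted tokenizer, and the temporary file path is chosen for the example:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted tokenizer; any picklable Python object works the same way.
tokenizer_state = {"word_index": {"good": 1, "work": 2}, "num_words": None}

path = os.path.join(tempfile.mkdtemp(), "tokenizer.pkl")

# Save after fitting on the training data.
with open(path, "wb") as f:
    pickle.dump(tokenizer_state, f)

# Load later, e.g. in the script that prepares new documents for prediction.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == tokenizer_state)
```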

  146. Avatar
    Sintayehu January 3, 2020 at 5:08 pm #

    Brief post… and am interested to read about pre-trained word embedding for sequence labeling task.

  147. Avatar
    Rishang January 25, 2020 at 4:56 pm #

    Hello Sir,

    I am not able to understand the significance of vector space?
    You have given 8 for the first problem, glove vectors has 100 dimension for each word.
    What is the idea behind these vector spaces and what does each value of the dimension tells us?

    Thankyou 🙂

    • Avatar
      Jason Brownlee January 26, 2020 at 5:15 am #

      The size of the vector space does not matter too much.

      More importantly, the model learns a representation where similar words will have a similar representation (coordinate) in the vector space. We don’t have to specify these relationships, they are learned automatically.

      • Avatar
        Rishang February 3, 2020 at 1:07 am #

        Thankyou Sir for your answer. I clearly understood what is vector space.
        I have one more question- If I declare the vocabulary size as 50 and if there are more than 50 words in my training data, what happens to those extra words?

        For the same reason I could not understand this line of glove vectors-
        “The smallest package of embeddings is 822Mb, called ““. It was trained on a dataset of one billion tokens (words) with a vocabulary of 400 thousand words.”
        What about the 600 thousand words ?

        • Avatar
          Jason Brownlee February 3, 2020 at 5:46 am #

          Words not in the vocab are marked as 0 or unknown.

  148. Avatar
    asma January 31, 2020 at 4:22 pm #

    I have a list of words as my dataset/training data. So i run your code for glove as follows:

    —-Error is——
    ValueError Traceback (most recent call last)
    in ()
    8 values = line.split()
    9 words = values[0]
    —> 10 coefs = asarray(values[1:], dtype=’float32′)
    11 embeddings_index[words] = coefs
    12 f.close()

    /usr/local/lib/python3.6/dist-packages/numpy/core/ in asarray(a, dtype, order)
    84 “””
    —> 85 return array(a, dtype, copy=False, order=order)

    ValueError: could not convert string to float: ‘0.1076.097748’

    could you please help
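
    The value '0.1076.097748' in the error looks like two numbers run together, which suggests a damaged line in the embedding file rather than a problem in the loop itself. One possible workaround, sketched here on made-up file contents, is to skip and record lines that fail to parse so they can be inspected:

```python
import io
import numpy as np

# Hypothetical file contents; the second line reproduces the malformed value from the error.
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "bad 0.1076.097748 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
)

embeddings_index = {}
skipped = []
for line_number, line in enumerate(sample, start=1):
    values = line.split()
    if not values:
        continue                      # ignore empty lines
    word = values[0]
    try:
        coefs = np.asarray(values[1:], dtype="float32")
    except ValueError:
        skipped.append(line_number)   # record the bad line instead of crashing
        continue
    embeddings_index[word] = coefs

print(len(embeddings_index), skipped)
```

    If many lines are skipped, downloading the file again is probably a better fix than ignoring them.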

  149. Avatar
    Sanpreet Singh February 5, 2020 at 5:46 pm #

    Hello Jason Brownlee

    I hope you are fine and thanks for writing such an article. I would like to ask an question to you where I am struck.

    I had created keras deep learning model with glove word embedding on imdb movie set. I had used 50 dimension but during validation there is lot of gap between accuracy of training and validation data. I have done pre-processing very carefully using custom approach but still I am getting overfitting.

    During prediction phase instead of testing dataset, I have taken real time text and goes for prediction but if I do prediction again and again for same input data, results varies. I am trying my best why my results varies.

    I have 8 classes which depicts the score of each review and data is categorical. Softmax layer is predicting different result every time same data in fed to trained model

  150. Avatar
    Saad Farooq February 7, 2020 at 2:40 am #

    I have a dataset in which training data are test data in different files. How do I pass test data to the model for evaluation?

    • Avatar
      Jason Brownlee February 7, 2020 at 8:22 am #

      Load them into memory separately then use one for train and one for test/val.

      Perhaps I don’t understand the problem?

  151. Avatar
    sachin February 24, 2020 at 10:58 pm #

    How to give text input to a pre-trained LSTM model with input layer shape (200, ) .
    I want to know how convert the text input and give it as an input to the model.

  152. Avatar
    Nati February 26, 2020 at 10:07 pm #

    Thanks you so much for this detailed tutorial. I think that I missed something… When using a Pre-Trained GloVe Embedding, in case you want to do a train test split, when is the right time to perform it?
    Should I need to do it after creating the weight matrix?

    • Avatar
      Jason Brownlee February 27, 2020 at 5:47 am #

      Sorry, I don’t follow your question – how are the embedding and train/test split related? Can you please elaborate?

      • Avatar
        Nati February 27, 2020 at 9:27 pm #

        Sure. I would like to use the embedding to create a model that will predict a class out of 30 different classes. The dataframe that I’m using contains a text column which I want to use to predict the class. In order to do so, I thought about using the pre-trained embedding (same as you did but with more classes). Now, in order to test the model I want to do a split, so my question is when do you recommend to do it? I tried to create a weight matrix for the all data and then split it to train and test but it gave me a very poor results on the test set.

        Any idea what am I doing wrong?

        • Avatar
          Jason Brownlee February 28, 2020 at 6:07 am #

          If you train your own embedding, it is prepared on the training dataset.

          If you use a pre-trained embedding, a split does not make sense for the embedding itself, as it was already trained on different data.

          • Avatar
            Nati February 28, 2020 at 10:23 am #

            Thanks I think that I understand (sorry I’m kind of newbie) . And assuming that I want to use the trained model to predict the labels on an unseen data, what should be the input? Should it be a padded docs with the same shape?

          • Avatar
            Jason Brownlee February 28, 2020 at 1:27 pm #

            New input must be prepared in an identical way to the training data.

  153. Avatar
    Despoina M March 19, 2020 at 9:01 am #

    Hello, another great post!

    I would like to ask, regarding with the Example of Using Pre-Trained Embedding, is it possible to reduce the vocabulary size? I had pre-trained embedding with 300 dimension. My vocabulary size is 7000.
    When I put vocabulary size=300 I have this error:

    embedding_matrix[i] = embedding_vector

    IndexError: index 300 is out of bounds for axis 0 with size 300

    Thanks in advance

    • Avatar
      Jason Brownlee March 19, 2020 at 10:14 am #

      Yes, but you must set the vocab size in the Tokenizer.
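
      The IndexError in the question appears to come from using 300 (the vector length) as the number of rows. A sketch of building a smaller matrix safely, assuming the tokenizer is limited to the same number of words so that larger indices never reach the layer; the word_index and vectors are toy stand-ins, and max_words is a name chosen for the example:

```python
import numpy as np

dim = 3                                  # 300 for the embedding in the question
max_words = 4                            # rows in the matrix, including the reserved row 0

# Stand-ins for tokenizer.word_index (ordered by frequency, starting at 1)
# and for the loaded pre-trained vectors.
word_index = {"the": 1, "cat": 2, "sat": 3, "mat": 4, "rare": 5}
embeddings_index = {w: np.full(dim, i, dtype="float32") for w, i in word_index.items()}

embedding_matrix = np.zeros((max_words, dim))
for word, i in word_index.items():
    if i >= max_words:
        continue                         # skip indices that would fall outside the matrix
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix.shape)
```

      The number of rows is the (reduced) vocabulary size and the number of columns is the vector length; the two are set independently.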

      • Avatar
        Despoina M March 20, 2020 at 1:59 am #

        Thank you for the quick response. I did it.

  154. Avatar
    Tyler March 20, 2020 at 6:57 am #

    Thanks for the tutorials! I’m using the 300d embedding for my image to caption model, but filter out rare words in my vocabulary. Say I have 10k words, should I just:

    tokenizer.word_index.get(word, vocab_size + 1) <= vocab_size

    to filter out the words I don't want?

    Also, do you think it's worth retraining the embedding weights at a later point in training to fine-tune? I'm thinking of it in the same context as freezing a pre-trained encoder, then gradually unfreezing layers as the decoder reaches a certain level of validity.

  155. Avatar
    Aman Savaria March 22, 2020 at 6:23 am #

    Hi Jason,
    I have a question regarding the code. You have taken the vocab_size to be 1 greater than the number of unique words. I am unable to understand why you did that. Can you please tell me?
    I am a total newbie and I’m not from a programming background so I’m sorry if this was a silly question.
    Thank You

    • Avatar
      Jason Brownlee March 22, 2020 at 6:59 am #

      Yes, we use a 1-offset for words and save index 0 for “unknown” words.
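To make the 1-offset concrete, here is a small library-free sketch that mimics how a word index numbered from 1 leads to `vocab_size = len(word_index) + 1`. It is an illustration under assumptions, not the Keras implementation: the three documents are made up, and the index is built with `collections.Counter` rather than the Tokenizer. In Keras, index 0 is reserved and never assigned to a word, and it is also the default fill value used by `pad_sequences`.

```python
from collections import Counter

# Made-up documents for illustration.
docs = ["well done", "good work", "poor work"]

# Count words, then number them from 1 (most frequent first), leaving 0 unused.
counts = Counter(word for doc in docs for word in doc.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# The largest index equals the number of unique words, so a lookup table
# needs one extra row to make index 0 valid as well.
vocab_size = len(word_index) + 1

print(word_index)
print(vocab_size)
```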

  156. Avatar
    Aman Savaria March 24, 2020 at 6:41 am #

    Oh…Thanks jason

  157. Avatar
    Ahmed B. March 27, 2020 at 2:43 am #

    Hi Jason,
    I work on a multivariate time series with categorical and numerical features. I use 3 years of data: 20 stocks over a 30-day window for training and 20 stocks over 7 days as the target, so X_train.shape = (N_samples, 20*30, N_features) and y_train.shape = (N_samples, 20*7, N_features). My question is: how can I apply an embedding layer to 3 categorical variables in these 3D arrays?
    I tried to use this code, but it doesn’t work:

    cat1_input = Input(shape=(1,), name='cat1')
    cat2_input = Input(shape=(1,), name='cat2')
    cat3_input = Input(shape=(1,), name='cat3')

    cat1_emb = Flatten()(Embedding(33, 1)(cat1_input))
    cat2_emb = Flatten()(Embedding(18, 1)(cat2_input))
    cat3_emb = Flatten()(Embedding( 2, 1)(cat3_input))

  158. Avatar
    Ahmed B. March 27, 2020 at 7:24 am #

    Thank you for your prompt reply Jason. This tutorial is about Embedding for a 2D array, but in my case I need to build a model that takes a 3D array (N_samples, time_steps, N_features) as input and (time_steps, N_stocks) as output.

    • Avatar
      Jason Brownlee March 27, 2020 at 8:04 am #

      Perhaps you can adapt the example for your needs.

  159. Avatar
    Despina M April 12, 2020 at 1:50 am #

    Hello again,

    This time I am wondering whether it’s worthwhile to compare two embedding matrices from two different languages. I want to find the similarity between two matrices from different languages.

    Thank you very much

  160. Avatar
    sangam April 16, 2020 at 12:07 pm #

    Hello Jason, this post really helped me. I think the parameters have changed:
    the weights parameter is now “embeddings_initializer” in the Keras documentation.

  161. Avatar
    Charles April 20, 2020 at 6:08 am #

    Hi Jason, I am facing a memory issue since my vocab_size is around 400,000. I have around 195847 training examples and each sentence output has maxLen of 5 words.

    So when I try to create a one_hot encoding for the decoder_outputs, it tries to create a (195847, 400000, 5) matrix and obviously runs out of memory.

    How do I get around this problem?

    • Avatar
      Jason Brownlee April 20, 2020 at 7:35 am #

      Good question.

      Try reducing the size of the vocab.
      Try reducing the size of the dataset.
      Try reducing the size of the model.
      Try an alternate representation, such as hashing.
      Try running on an ec2 instance with more RAM.
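A quick back-of-the-envelope calculation shows why the one-hot targets in this question cannot fit in memory, using the sizes quoted above and assuming 4-byte float32 values. For comparison, it also sizes the same targets stored as integer class indices. Keeping integer targets and training with Keras's `sparse_categorical_crossentropy` loss is one option not mentioned in the reply; it is offered here as a suggestion to test, not as the tutorial's method.

```python
# Sizes taken from the question above.
n_samples = 195_847
max_len = 5
vocab_size = 400_000
bytes_per_value = 4  # assumption: float32 / int32 storage

# One float per (sample, time step, vocabulary entry).
one_hot_bytes = n_samples * max_len * vocab_size * bytes_per_value

# One integer class index per (sample, time step).
integer_bytes = n_samples * max_len * bytes_per_value

print(f"one-hot targets: {one_hot_bytes / 1e12:.2f} TB")
print(f"integer targets: {integer_bytes / 1e6:.2f} MB")
```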

      • Avatar
        Charles April 22, 2020 at 10:56 pm #

        An additional point could be to run it in batch using fit_generator of keras.

        There is one more query that I have and i know it may sound silly but:

        Why do you need the inference model or decoding model for testing? Then how does the training model help? The only thing that I can see is that it helps in deciding the encoder inputs. Is there something I am missing. It seems to me that the entire purpose of training is lost.

        • Avatar
          Jason Brownlee April 23, 2020 at 6:06 am #

          I don’t understand. You train a model and then use it for inference. Without training it is useless. Without inference, you trained for no reason.

          Perhaps you can elaborate?

  162. Avatar
    Mohamed April 21, 2020 at 11:02 am #


    I would like to ask: if my labels are not only positive and negative but also neutral, how can I add that to y_training?

  163. Avatar
    Charlène April 25, 2020 at 5:38 am #


    Thank you for this very well explained article.

    I don’t understand how the embeddings are learnt in the Embedding layer. Is this ordinary backpropagation?

  164. Avatar
    Rahim May 5, 2020 at 1:28 am #

    Dear Jason
    Thanks for your wonderful and helpful guides. When I run your suggested code in part 4 of this page (Example of Using Pre-Trained GloVe Embedding), I get the following error, while I have downloaded glove.6B.100d file from the URL you provided above. Can you please help me find the cause of error?

    Traceback (most recent call last):
    File "C:/Users/Dehkhargani/PycharmProjects/test2/", line 40, in
    coefs = asarray(values[1:], dtype='float32')
    File "C:\Users\Dehkhargani\Anaconda\lib\site-packages\numpy\core\", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
    ValueError: could not convert string to float: 'ng'
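The traceback above shows a non-numeric fragment ('ng') ending up among the values that are converted to floats. One possible cause, stated here as an assumption rather than a diagnosis, is a line whose token was split in an unexpected place, for example because the file was read with the wrong encoding or the token itself contains whitespace. The sketch below parses each line from the right so that exactly `dim` trailing fields are treated as numbers. `parse_glove_line` is a hypothetical helper, and the two sample lines are made up rather than taken from the GloVe file.

```python
import io


def parse_glove_line(line, dim):
    """Split off exactly `dim` trailing numbers; whatever remains is the token."""
    parts = line.rstrip().rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError(f"expected a token and {dim} values, got: {line!r}")
    word = parts[0]
    vector = [float(x) for x in parts[1:]]
    return word, vector


# Made-up sample in the same layout as a GloVe text file (token, then values).
sample = io.StringIO("good 0.1 0.2 0.3\nhong kong 0.4 0.5 0.6\n")
embeddings_index = dict(parse_glove_line(line, dim=3) for line in sample)

print(embeddings_index)
```

When reading the real file, passing an explicit encoding, e.g. `open(path, encoding="utf8")`, may also be worth trying, since the platform default encoding on Windows can differ.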

  165. Avatar
    Mohammad May 17, 2020 at 9:50 pm #

    Hi Jason, thanks for your helpful post.

    Do pre-trained word embedding like Glove or Google news still work fine on non-English texts?


  166. Avatar
    Roshan Nayak May 18, 2020 at 3:36 am #

    Could anyone please tell me why are we actually adding 1 to the vocab size.
    vocab_size = len(t.word_index) + 1
    Please let me know. Thank you!!

    • Avatar
      Jason Brownlee May 18, 2020 at 6:21 am #

      To make space for “0” which is “unknown” words.

  167. Avatar
    Tanay Gupta May 19, 2020 at 5:00 pm #

    “The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word.”

    How did you know the output will be 4 vectors?

    • Avatar
      Jason Brownlee May 20, 2020 at 6:20 am #

      We can inspect the shape of the model by calling model.summary()
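The shape can also be reasoned out by treating an embedding as a table lookup: each of the 4 integers in a padded document selects one row of 8 values. The sketch below uses NumPy rather than Keras, with a random table standing in for the learned weights and two made-up padded documents; the sizes (vocabulary of 50, length 4, 8 dimensions) follow the tutorial's example.

```python
import numpy as np

vocab_size, embedding_dim, max_length = 50, 8, 4

# Random table standing in for learned embedding weights (one row per word index).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Two made-up documents, integer encoded and padded to length 4.
padded_docs = np.array([[6, 2, 0, 0], [3, 1, 0, 0]])

# Indexing the table with the documents looks up one 8-d vector per word.
embedded = embedding_table[padded_docs]

# Flattening joins the 4 vectors of each document into one 32-element vector.
flattened = embedded.reshape(len(padded_docs), -1)

print(embedded.shape)   # (2, 4, 8)
print(flattened.shape)  # (2, 32)
```

Per document the result is 2D (4 x 8); counting the batch dimension it is 3D, which is why both descriptions appear in discussions of the layer.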

  168. Avatar
    JG June 22, 2020 at 9:46 pm #

    Hi Jason:

    Great tutorial! This is the first time I have approached NLP, and “text classification/sentiment analysis” in particular, so I am a little confused about the technical definitions, procedures, their meaning, etc.

    Let me summarize them for everybody, in my own words, so that I can share them and get corrections/comments from you.

    1º) From a dataset consisting of a list of text docs, which we are going to use to train a text classification model (e.g.), we read the docs and get a list of words for each one (using e.g. the Keras API “text_to_word_sequence()”).

    2º) We can then apply some sort of word cleaning process, using string methods to keep only alphabetical words, strip punctuation, drop words below a minimum length, remove repeated words, etc.
    Using a Keras “Tokenizer” object (e.g.), we can define a vocabulary (the set of distinct words that make up all the docs in the dataset), along with word counts, through the tokenizer method “fit_on_texts()” (applied to the words of all docs).

    3º) Now that we have all the (cleaned) words for all docs, we must perform the most important step in machine learning: converting the words (tokens) into numbers so that an ML model can be trained on them. How? By encoding them!
    These numbers must be arranged in tensors (or arrays) made up of samples (the rows of the tensor), columns (the features per doc, e.g. the number of words or the vocabulary size) and even a third dimension (a depth axis?) obtained by embedding the words into a vector space (of an arbitrary dimension).

    To perform this encoding (number conversion) we can use Keras tokenizer methods such as “texts_to_matrix()” and “texts_to_sequences()”, or other Keras functions such as “one_hot()”, always together with the Keras “pad_sequences()” function to give every doc the same length (number of features), e.g. the maximum number of words per doc, or the full vocabulary size if we want that as the feature length for every doc.

    As a result of this whole process we should have a clean dataset input, a tensor X of numbers, with 2 dimensions (docs, features), or 3 dimensions (docs, features, an extra embedding dimension) if we use Deep Learning techniques (e.g. embedding).

    4º) We can also take advantage of a sort of “transfer learning” when we use embedding layers, by loading previously trained weights (trained elsewhere) such as the GloVe dictionary (e.g. 400,000 different words, each with its own 100 coordinate weights), which we must adapt to the vocabulary used in our own docs (e.g. our own arbitrarily selected 100 different words, with 100 coordinates as the word vector for each of them).

    5º) Now that we have converted a list of docs (each a list of words) into a single tensor of numbers X, along with the associated Y labels, we must define the model with an input layer plus Embedding(), Conv1D() and MaxPool1D() layers (e.g.) if we are using DL techniques, plus a final Flatten(), in order to present the extracted word features to the fully connected head (consisting mainly of Dense() layers), which finally learns the numeric patterns (representing the set of words of each doc) associated with each document label.

    This is my summary of my understanding of NLP text classification. Thanks.

    • Avatar
      Jason Brownlee June 23, 2020 at 6:21 am #

      Not a bad summary, generally accurate from my skim.

  169. Avatar
    JG June 23, 2020 at 3:21 am #

    Sorry for the long comment, Jason!

    – I would also like to share my code results, based on your great tutorial, because I like the way you teach us: learning not theoretically but by doing and experimenting with complete, working code! Thanks one more time!

    – My first main concern is that even though we have string methods, “Tokenizer” objects (classes), etc., we still have to write many lines of code. I imagine that higher-level APIs for text processing and encoding may already exist, which take appropriate arguments (the docs to read, filters for how to clean up the words, the type of encoding to apply, whether to use pre-trained weights) and return the prepared dataset as an “X” input tensor, instead of writing all these tedious lines of code!

    – Anyway, I went through it and got the following results:

    – I got 100% accuracy on evaluation using a basic ML model, not deep learning, with the words encoded by the Keras “texts_to_matrix()” method and a 2D X input (10 doc samples as rows and two options for vocab_size: 50 (yours) and a smaller one (15), closer to the number of words actually used). I used a fully connected model made up of 2 Dense layers (the output layer and a preceding one with 10 nodes).

    – I also got 100% accuracy using the DL Embedding layer with the “one_hot()” encoding (which you suggest), without GloVe pre-trained weights; in addition to the Embedding layer I added Conv1D, MaxPool1D and Flatten layers (the CNN part!). I used a 2D X tensor with two options for the feature columns: a max_length of 4, and later 50 columns matching the vocabulary size (using “pad_sequences()” for both). I later trained the Embedding layer with GloVe pre-trained weights, using both trainable = False and trainable = True, and the same fully connected head (with 2 Dense layers).

    – As a third option I also got 100% accuracy using the DL Embedding layer with the Keras “texts_to_sequences()” encoding, again for two options (max_length, and the vocab_size of 50 after applying pad_sequences()). I also trained the embedding with GloVe (trainable = False and trainable = True) and without GloVe weights, with the same fully connected head.

    My conclusion is that the most stable and quickest way to reach maximum accuracy is the Embedding layer with pre-trained GloVe weights (trainable = True) and the “texts_to_sequences()” encoding. I judge it the best performer because I can reduce the 50 epochs, or use fewer units in my second Dense layer, and still get the same 100% accuracy.


    I hope this helps, Jason!

    • Avatar
      Jason Brownlee June 23, 2020 at 6:30 am #

      Great write-up!

      Yes, it is possible that text data prep has come a long way over the last 3-4 years since I wrote this stuff up.

  170. Avatar
    JG June 24, 2020 at 9:43 pm #

    Yes… I think “spaCy” … is one of the new powerful ML/DL NLP libraries

  171. Avatar
    Vivek June 25, 2020 at 3:06 am #

    Hi Jason

    Good read ! I would like to have more clarity on this :

    Why do we set trainable = False?

    Should we not set it to ‘True’ if we want to use the GloVe 100-D weights present in the embeddings matrix ?

    I don’t quite understand the difference between setting it to ‘True’ and ‘False’


    • Avatar
      Jason Brownlee June 25, 2020 at 6:28 am #

      It means the weights in the embedding cannot change during training if set to “False”. As in, they cannot be trained.

  172. Avatar
    Murat Karakaya July 13, 2020 at 9:41 pm #

    I think there is a typo here: “For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive),…..” Since 0 (zero) is used for padding, words are encoded from 1 (one, not zero!) to 199, inclusive. Regards,

  173. Avatar
    Fa July 21, 2020 at 3:32 am #

    For a multilabel classifier where each text could be linked to zero or more topics, what changes to the above example would be needed?

    For example, say I have a list of topics like [“politics”, “sports”, “entertainment”, “finance”, “health”].

    A document mentioning healthcare and legislation would be labeled [1,0,0,0,1].

    Would it be enough to encode the “y” labels in a binary array format above?

    • Avatar
      Jason Brownlee July 21, 2020 at 6:12 am #

      For multi-class classification, we typically integer encode the labels, then one hot encode them, and use a softmax activation function in the output layer.
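For the multi-label case asked about here, where a document may belong to zero or several topics, the binary array in the question can be produced with a small helper. The sketch below is library-free; `multi_hot` is a hypothetical helper name. A common setup for such targets, stated as general practice rather than something from the reply above, is one sigmoid output per topic trained with binary cross-entropy, since softmax forces the outputs to sum to one.

```python
topics = ["politics", "sports", "entertainment", "finance", "health"]


def multi_hot(doc_topics, all_topics):
    """Return a 0/1 list with a 1 at the position of every topic the document has."""
    present = set(doc_topics)
    unknown = present - set(all_topics)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1 if topic in present else 0 for topic in all_topics]


# A document about healthcare legislation, and one with no topic at all.
y = [
    multi_hot(["politics", "health"], topics),
    multi_hot([], topics),
]
print(y)  # [[1, 0, 0, 0, 1], [0, 0, 0, 0, 0]]
```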

  174. Avatar
    Ramesh Ravula July 27, 2020 at 9:09 pm #

    In “Example of Learning an Embedding” section why 8 dimensions?

  175. Avatar
    Keshav Rawat August 1, 2020 at 6:11 pm #


    In the pretrained Glove embedding section, you have used vocab_size = len(t.word_index) + 1.

    Why is 1 added? I can’t understand the reason.

    • Avatar
      Jason Brownlee August 2, 2020 at 5:40 am #

      To make room for index 0==unknown, so the vocab starts at index 1.

  176. Avatar
    dvl August 4, 2020 at 11:10 pm #

    Hi Jason Brownlee,
    Which one is better according to you?

  177. Avatar
    Ramki August 10, 2020 at 1:06 pm #

    Are there any good word or token embedding models for log files (syslogs, application logs etc.)

    Can you please suggest some links on machine learning applications for log files? e.g.
    – Does a log file have sensitive data? e.g. personal data, topology data like IP address etc,
    – What types of sensitive data are in a log file

    • Avatar
      Jason Brownlee August 10, 2020 at 1:38 pm #

      I would recommend training one for your specific dataset.

  178. Avatar
    SJ August 11, 2020 at 3:11 am #

    Where can I get this entire code?

  179. Avatar
    Andre August 27, 2020 at 1:14 pm #

    Hi jason,

    Could you give me some insight into this paper, please?

    I mean, I’m confused about how they feed the output of the model to NCRF++. They said “For converting the instances into embeddings, two features were used namely, sigmoid output from MNN (fe1), character embedding (fe2) of size 30”. What do they mean, and how is it done?

    Maybe if you have a tutorial on sequence tagging using NCRF++, it would be very useful. Thank you

    • Avatar
      Jason Brownlee August 27, 2020 at 1:38 pm #

      This is a common question that I answer here:

      • Avatar
        Andre August 27, 2020 at 4:55 pm #

        Ahh sorry for that, so let me explain it.

        tldr, the author uses a deep learning model which outputs sigmoid probabilities.

        Then the author feeds it to NCRF++; the author said “For converting the instances into embeddings, two features were used namely, sigmoid output from MNN (fe1), character embedding (fe2) of size 30”.

        Basically, what I want to ask is: do you have any articles about how NCRF++ works? Thank you

  180. Avatar
    Elvin Aghammadzada September 13, 2020 at 7:49 am #

    guys, plz consider adding a dark mode (theme) to the website

  181. Avatar
    MATHAV RAJ J September 17, 2020 at 6:37 pm #

    Hi, I am trying to concatenate two embedding layers in Keras; one will be fixed and the other one will be trained:
    model = Sequential()
    fixed_weights = np.array([word2glove[w] for w in words_in_glove])

    e1 = Embedding(input_dim = fixed_weights.shape[0], output_dim = EMBEDDING_DIM, weights = [fixed_weights], input_length = MAX_NEWS_LENGTH, trainable = False)

    e2 = Embedding(input_dim = len(word2index) - fixed_weights.shape[0],output_dim = EMBEDDING_DIM, input_length = MAX_NEWS_LENGTH)
    # model.add()
    model.add(concatenate([e1, e2]))

    but this is unsuccessful and I get the error ('NoneType' object is not subscriptable). Could you guide me on where I am going wrong?

    • Avatar
      Jason Brownlee September 18, 2020 at 6:42 am #

      Perhaps have 2 inputs to the model, 1 embedding and LSTM for each input, then concat the outputs of the lstms.

  182. Avatar
    singularityISnear September 25, 2020 at 7:44 pm #

    Hi, I want to write a common Python function for model building that works with a custom embedding layer and also with GloVe’s pre-trained embeddings.

  183. Avatar
    Par September 30, 2020 at 2:05 pm #

    Hi Jason,

    Thanks for the amazing tutorials. Always helpful.

    I have a question. I am trying to extract features from some annotated symptoms (from dialogue). So my sequences are ‘headache extent heavy headache frequency sometimes’ with different lengths each assigned with a time point.

    I want to extract features to later use them for time-series classification (so after flattening I would add a dense layer of size 100 for example). The problem is that I do not want to use the same labels here for embedding (in I’m afraid this will cause bias when I am using these features for training the main classifier with the same labels assigned to the time points containing these samples.

    Is there any way to do the embedding without assigning labels to each sample?

    Thank you,

    • Avatar
      Jason Brownlee September 30, 2020 at 2:16 pm #

      Perhaps you can use a standalone doc2vec technique, e.g. an equivalent of standalone word2vec methods.

  184. Avatar
    Cre October 28, 2020 at 9:01 am #

    Hi Jason,
    a short question regarding the accuracy and the predict method. When I use:
    model.predict(pad_sequences([one_hot('poor work', vocab_size)], maxlen=max_length, padding='post'))
    I get an output of 0.5076578 {array([[0.5076578]], dtype=float32)}, which by my understanding means that it is neither good nor bad. If I run model.evaluate I get an accuracy of 100%, but this can’t be true, as ‘poor work’ is in my test set and I get a score of 0.50… and not 0.
    How can this be?

    Thank you very much

  185. Avatar
    Mahesh Chauhan October 29, 2020 at 3:56 pm #

    Thanks Jason,
    after 2-3 runs I am able to get 100% accuracy, as you said.

  186. Avatar
    Tanmay November 6, 2020 at 8:53 pm #

    Hi Jason,

    I am trying to predict the output after training with this code. When I look at the outputs, it is always predicting [[0.72645265]]. Why is it doing so?

    I am predicting using the following model call-
    output = model.predict(padded_docs)

    Following is the code for getting padded_docs-
    t = Tokenizer()
    t.fit_on_texts(["good job"])
    vocab_size = len(t.word_index) + 1
    encoded_docs = t.texts_to_sequences(docs)
    max_length = 4
    padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

    Or am I going wrong somewhere? I am not exactly sure what to pass to predict().

    • Avatar
      Jason Brownlee November 7, 2020 at 6:28 am #

      Looks good, it is predicting the probability of the document belonging to class 1.

      E.g. P(class=1) = 70%

      More on predicting with Keras models here:

      • Avatar
        Tanmay November 8, 2020 at 5:44 am #

        Yeah, it is going well. But not for all. When I test on input “poor job”, it predicts the same output “[[0.72645265]]” (and same for all test cases). I think I am providing the wrong input which is simply padded data ([[1,2,0,0]]) every time, not the exact words. How should I form input for the model.predict() function?

        • Avatar
          Jason Brownlee November 8, 2020 at 6:43 am #

          New input data must be prepared in an identical manner as the training dataset – same encoding, vocab, etc.

          • Avatar
            Juan December 14, 2020 at 12:15 pm #

            I have the same error. Could you please show us the code to predict correctly?

          • Avatar
            Jason Brownlee December 14, 2020 at 1:36 pm #

            Sorry I don’t understand. What error are you having exactly?

          • Avatar
            Juan December 14, 2020 at 2:45 pm #

            Answering myself and Tanmay:

            encoded_try = t.texts_to_sequences(new_docs)
            padded_try = pad_sequences(encoded_try, maxlen=4, padding='post')

            Only words seen during training work.

            Jason, good blog. Next time, please also show how to evaluate the model; we are noobs 😉

            P.S. The problem was that we didn’t know how to feed text to “model.predict”, because we always got the same result.
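The constant prediction discussed in this thread is consistent with fitting a fresh tokenizer on each new document: any two-word text then encodes to [1, 2, 0, 0], so the model always sees the same input. The library-free sketch below illustrates the effect. `fit_word_index` and `encode` are hypothetical helpers that only approximate the Keras Tokenizer and pad_sequences (they number words in first-seen order rather than by frequency, and drop unseen words), and the training documents are made up.

```python
def fit_word_index(texts):
    """Assign each new word the next free index, starting at 1 (0 is kept for padding)."""
    index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index


def encode(texts, word_index, max_length):
    """Integer encode with an existing index, drop unseen words, pad with 0 at the end."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[:max_length]
        rows.append(ids + [0] * (max_length - len(ids)))
    return rows


# Fit once, on the training documents only, and keep the index for later use.
train_docs = ["well done", "good work", "poor work", "poor effort"]
word_index = fit_word_index(train_docs)

# Reusing the training index gives different inputs for different texts.
print(encode(["good work"], word_index, 4))
print(encode(["poor work"], word_index, 4))

# Refitting on each new text makes every two-word document look identical.
print(encode(["good job"], fit_word_index(["good job"]), 4))
print(encode(["poor job"], fit_word_index(["poor job"]), 4))
```

With Keras, the equivalent is to keep the Tokenizer object that was fitted on the training documents and call its `texts_to_sequences()` on new text, followed by `pad_sequences()` with the same `maxlen` and `padding`.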

  187. Avatar
    Pratik Sen November 12, 2020 at 5:21 am #

    The output from the Embedding layer is 3D (Ref: but in your blog its written that the output is 2D


  188. Avatar
    Dipankar Porey November 14, 2020 at 11:32 pm #

    what does Embedding layer in model ?

    • Avatar
      Jason Brownlee November 15, 2020 at 6:26 am #

      Sorry, I don’t understand your question, can you please elaborate?

  189. Avatar
    Dipankar Porey November 14, 2020 at 11:34 pm #

    I mean why are we using Embedding layer in a model ????

    • Avatar
      Jason Brownlee November 15, 2020 at 6:27 am #

      Embedding layers can be used to learn the relationship between words.

  190. Avatar
    Sheema Egbert November 19, 2020 at 1:04 am #

    Hi Jason, thanks for your amazing tutorials.

    How do you add bias to the embedding?

    Thanks in advance

    • Avatar
      Jason Brownlee November 19, 2020 at 7:46 am #

      You’re welcome!

      What do you mean by “add bias to the embedding”?

  191. Avatar
    Sheema Egbert November 20, 2020 at 3:16 pm #

    I mean when you train the embedding, is it advantageous to have a single bias term also for each learned embedding?

    Thanks for your interest!

  192. Avatar
    Sheema Egbert November 21, 2020 at 3:45 pm #

    Thanks for clearing that up, Jason.

  193. Avatar
    Puneesh Khanna December 10, 2020 at 4:38 pm #

    Hi Jason,
    Is there any way that I can convert a 2D input into embeddings? Basically I have an adjacency matrix (2-D array) representing a graph, and training would happen over a batch of graphs as data points. Is there any way that each node of the graph (which is represented by a row in the adjacency matrix) can be converted to a more meaningful embedding?

  194. Avatar
    vian December 12, 2020 at 3:36 am #

    Thanks, Mr. Jason, please help me.
    1- If I have one word in each statement and there is no relation between them, how can I represent class labels? I really don’t understand the benefit of class labels here.
    2- What is the mathematical model behind the embedding process?

  195. Avatar
    SULAIMAN KHAN December 29, 2020 at 10:54 pm #

    pre-train means test data

    • Avatar
      Jason Brownlee December 30, 2020 at 6:38 am #

      No. Pre-trained means trained on a dataset using a different algorithm, resulting in a standalone model that can be used as part of a subsequent model (like a neural net).

  196. Avatar
    Arwinder Singh January 3, 2021 at 2:39 am #

    hi…I want to concatenate the hidden states with the word embeddings. How can we do this?

    • Avatar
      Jason Brownlee January 3, 2021 at 5:57 am #

      Perhaps test a merge or concat layer in keras to see if you can achieve your desired effect?

      • Avatar
        Arwinder Singh January 3, 2021 at 3:48 pm #

        thanks…I have tried the following and it’s giving dimension error:

        encoder_inputs = Input(shape=(None,))
        enc_emb = Embedding(eng_vocab_size, latent_dim, mask_zero = True)(encoder_inputs)
        encoder_lstm = LSTM(latent_dim*4, return_sequences=True, return_state=True)
        encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
        encoder_states = [state_h, state_c]

        error is:
        ValueError: A Concatenate layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(None, 512), (None, None, 128)]

        • Avatar
          Jason Brownlee January 4, 2021 at 6:04 am #

          Sorry, I don’t have the capacity to debug your code.

  197. Avatar
    Anil Rahate January 10, 2021 at 4:21 am #

    Hi Jason, I am trying to understand how we can extract embedded features. E.g. for each training example, how can I get a textual feature of, say, 300 dimensions x 50 sequence length, so that I do not need an embedding layer in the sequential model and can use LSTM and other layers directly? Any guidance will really help.

    • Avatar
      Jason Brownlee January 10, 2021 at 5:46 am #

      Each word is mapped to one vector. Retrieve the vectors for your words directly.

  198. Avatar
    vycki January 11, 2021 at 9: