Building Transformer Models with Attention Crash Course. Build a Neural Machine Translator in 12 Days

Last Updated on January 9, 2023

Transformer is a recent breakthrough in neural machine translation. Natural languages are complicated. A word in one language can be translated into multiple words in another, depending on the context. But what exactly a context is, and how you can teach the computer to understand the context was a big problem to solve. The invention of the attention mechanism solved the problem of how to encode a context into a word, or in other words, how you can present a word and its context together in a numerical vector. Transformer brings this to one level higher so that we can build a neural network for natural language translation using only the attention mechanism but no recurrent structure. This not only makes the network simpler, easier to train, and parallelizable in algorithm but also allows a more complicated language model to be built. As a result, we can see computer-translated sentences almost flawlessly.

Indeed, such a powerful deep learning model is not difficult to build. In TensorFlow and Keras, you have almost all the building blocks readily available, and training a model is only a matter of several hours. It is fun to see a transformer model built and trained. It is even more fun to see a trained model to translate sentences from one language to another.

In this crash course, you will build a transformer model in the similar design as the original research paper.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Building Transformer Models with Attention (12-day Mini-Course).
Photo by Norbert Braun, some rights reserved.

Who Is This Crash-Course For?

Before you get started, let’s make sure you are in the right place.

This course is for developers who are already familiar with TensorFlow/Keras. The lessons in this course do assume a few things about you, such as:

  • You know how to build a custom model, including the Keras functional API
  • You know how to train a deep learning model in Keras
  • You know how to use a trained model for inference

You do NOT need to be:

  • A natural language processing expert
  • A speaker of many languages

This crash course can help you get a deeper understanding of what a transformer model is and what it can do.

This crash course assumes you have a working TensorFlow 2.10 environment installed. If you need help with your environment, you can follow the step-by-step tutorial here:

Crash-Course Overview

This crash course is broken down into 12 lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the 12 lessons that will get you started and learn about the construction of a transformer model.

  • Lesson 01: Obtaining Data
  • Lesson 02: Text Normalization
  • Lesson 03: Vectorization and Making Datasets
  • Lesson 04: Positional Encoding Matrix
  • Lesson 05: Positional Encoding Layer
  • Lesson 06: Transformer Building Blocks
  • Lesson 07: Transformer Encoder and Decoder
  • Lesson 08: Building a Transformer
  • Lesson 09: Preparing the Transformer Model for Training
  • Lesson 10: Training the Transformer
  • Lesson 11: Inference from the Transformer Model
  • Lesson 12: Improving the Model

Each lesson could take you between 15 and up to 60 minutes. Take your time and complete the lessons at your own pace. Ask questions, and even post results in the comments online.

The lessons might expect you to go off and find out how to do things. This guide will give you hints, but even if you just follow the code in the lesson, you can finish a transformer model that works quite well.

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Lesson 01: Obtaining Data

As you are building a neural machine translator, you need data for training and testing. Let’s build a sentence-based English-to-French translator. There are many resources on the Internet. An example would be the user-contributed data for the flash card app Anki. You can download some data files at The data file would be a ZIP file containing a SQLite database file, from which you can extract the English-French sentence pairs.

However, you may find it more convenient to have a text file version, which you can find it at Google hosts a mirror of this file as
well, which we will be using.

The code below will download the compressed data file and extract it:

The data file will be a plaintext file named fra.txt. Its format would be lines of:

Your Task

Try to run the above code and open the file extracted. You should verify that the format of each line is like the above.

In the next lesson, you will process this file and prepare the dataset suitable for training and testing.

Lesson 02: Text Normalization

Just like all NLP tasks, you need to normalize the text before you use it. French letters have accents which would be represented as Unicode characters, but such representation is not unique in Unicode. Therefore, you will convert the string into NFKC (compatibility and composition normal form).

Next, you will tokenize the sentences. Each word should be a separate token as well as each punctuation mark. However, the punctuation used in contractions such as don’t, va-t-il, or c’est are not separated from the words. Also, convert everything into lowercase in the expectation that this will reduce the number of distinct words in the vocabulary.

Normalization and tokenization can go a lot deeper, such as subword tokenization, stemming, and lemmatization. But to keep things simple, you do not do these in this project.

Starting from scratch, the code to normalize the text is below. You will use the Python module unicodedata to convert a Unicode string into NFKC normal form. Then you will use regular expression to add space around punctuation marks. Afterward, you will wrap the French sentences (i.e., the target language) with sentinels [start] and [end]. You will see the purpose of the sentinels in later lessons.

When you run this, you should see the result from a few samples, such as these:

We saved the normalized sentence pairs in a pickle file, so we can reuse it in subsequent steps.

When you use it for your model, you want to know some statistics about this dataset. In particular, you want to see how many distinct tokens (words) in each language and how long the sentences are. You can figure these out as follows:

Your Task

Run the above code. See not only the sample sentences but also the statistics you collected. Remember the output as they will be useful for your next lesson. Besides, knowing the maximum length of sentences is not as useful as knowing their distribution. You should plot a histogram for that. Try out this to produce the following chart:

Sentence lengths in different languages

In the next lesson, you will vectorize this normalized text data and create datasets.

Lesson 03: Vectorization and Making Datasets

In the previous lesson, you cleaned up the sentences, but they are still text. Neural networks can handle only numbers. One way to convert the text into numbers is through vectorization. What this means is to transform the tokens from the text into an integer. Hence a sentence with $n$ tokens (words) will become a vector of $n$ integers.

You can build your own vectorizer. Simply build a mapping table of each unique token to a unique integer. When it is used, you look up the token one by one in the table and return the integers in the form of a vector.

In Keras, you have TextVectorization layer to save us the effort of building a vectorizer. It supports padding, i.e., integer 0 is reserved to mean “empty.” This is useful when you give a sentence of $m < n$ tokens but want the vectorizer always to return a fixed length $n$ vector.

You will first split the sentence pair into training, validation, and testing sets as you need them for the model training. Then, create a TextVectorization layer and adapt it to the training set only (because you should not peek into the validation or testing dataset until the model training is completed).

Note that the parameter max_tokens to TextVectorization object can be omitted to let the vectorizer figure it out. But if you set them to a value smaller than the total vocabulary (such as this case), you limit the the vectorizer to learn only the more frequent words and make the rare words as out-of-vocabulary (OOV). This may be useful to skip the words of little value or with spelling mistakes. You also fix the output length of the vectorizer. We assumed that a sentence should have no more than 20 tokens in the above.

The next step would be to make use of the vectorizer and create a TensorFlow Dataset object. This will be helpful in your later steps to train our model.

You will reuse this code later to make the train_ds and val_ds dataset objects.

Your Task

Run the above code. Verify that you can see an output similar to the below:

The exact vector may not be the same, but you should see that the shape should all be (batch size, sequence length). Some code above is borrowed from the tutorial by François Chollet, English-to-Spanish translation with a sequence-to-sequence Transformer. You may also want to see how his implementation of transformer is different from this mini-course.

In the next lesson, you will move to the topic of position encoding.

Lesson 04: Positional Encoding Matrix

When a sentence is vectorized, you get a vector of integers, where each integer represents a word. The integer here is only a label. We cannot assume two integers closer to each other means the words they represent are related.

In order to understand the meaning of words and hence quantify how two words are related to each other, you will use the technique word embeddings. But to understand the context, you also need to know the position of each word in a sentence. This is done by positional encoding.

In the paper Attention Is All You Need, positional encoding represents each token position with a vector. The elements of the vector are values of the different phase and frequency of sine waves. Precisely, at position $k=0, 1, \cdots, L-1$, the positional encoding vector (of length $d$) is

[P(k,0), P(k,1), \cdots, P(k,d-2), P(k,d-1)]

where for $i=0, 1, \cdots, d/2$,

P(k, 2i) &= \sin\Big(\frac{k}{n^{2i/d}}\Big) \\
P(k, 2i+1) &= \cos\Big(\frac{k}{n^{2i/d}}\Big)

In the paper, they used $n=10000$.

Implementing the positional encoding is not difficult, especially if you can use vector functions from NumPy.

You can see that we created a function to generate the positional encoding. We tested it out with $L=2048$ and $d=512$ above. The output would be a $2048\times 512$ matrix. We also plot the encoding in a heatmap. This should look like the following.

Heatmap representation of the positional encoding matrix

Your Task

The heatmap above may not be very appealing to you. A better way to visualize it is to separate the sine curves from the cosine curves. Try out the code below to reuse the pickled positional encoding matrix and obtain a clearer visualization:

If you wish, you may check that the different “depth” in the matrix represents a sine curve of different frequency. An example to visualize them is the following:

But if you visualize one “position” of the matrix, you see an interesting curve:

which shows you this:

One encoding vector

The encoding matrix is useful in the sense that, when you compare two encoding vectors, you can tell how far apart their positions are. The dot-product of two normalized vectors is 1 if they are identical and drops quickly as they move apart. This relationship can be visualized below:

Dot-product of normalized positional encoding vectors

In the next lesson, you will make use of the positional encoding matrix to build a positional encoding layer in Keras.

Lesson 05: Positional Encoding Layer

The transformer model

The transformer model from the paper Attention Is All You Need is illustrated below:

The positional encoding layer is at the entry point of a transformer model. However, the Keras library does not provide us one. You can create a custom layer to implement the positional encoding, as follows.

This layer is indeed combining an embedding layer with position encoding. The embedding layer creates word embeddings, namely, converting an integer token label from the vectorized sentence into a vector that can carry the meaning of the word. With the embedding, you can tell how close in meaning the two different words are.

The embedding output depends on the tokenized input sentence. But the positional encoding is a constant matrix as it depends only on the position. Hence you create a constant tensor for that at the time you created this layer. TensorFlow is smart enough to match the dimensions when you add the embedding output to the positional encoding matrix, in the call() function.

Two additional functions are defined in the layer above. The compute_mask() function is passed on to the embedding layer. This is needed to tell which positions of the output are padded. This will be used internally by Keras. The get_config() function is defined to remember all the config parameters of this layer. This is a standard practice in Keras so that you remember all the parameters you passed on to the constructor and return them in get_config(), so the model can be saved and loaded.

Your Task

Combine the above code together with the dataset train_ds created in Lesson 03 and the code snippet below:

You should see the output like the following:

You can see that the first tensor printed above is one batch (64 samples) of the vectorized input sentences, padded with zero to length 20. Each token is an integer but will be converted into an embedding of dimension 512. Hence the shape of en_emb above is (64, 20, 512).

The last tensor printed above is the mask used. This essentially matches the input where the position is not zero. When you compute the accuracy, you have to remember the padded locations should not be counted.

In the next lesson, you will complete the other building block of the transformer model.

Lesson 06: Transformer Building Blocks

Reviewing the diagram of transformer in Lesson 05, you will see that beyond the embedding and positional encoding, you have the encoder (left half of the figure) and decoder (right half of the figure). They share some similarities. Most notably, they have a multi-head attention block at the beginning and a feed forward block at the end.

It would be easier if you create each building block as separate submodels and later combine them into a bigger model.

First, you create the self-attention model. It is in the part of the diagram that is at the bottom of both encoder and decoder. A multi-head attention layer will take three inputs, namely, the key, the value, and the query. If all three inputs are the same, we call this multi-head attention layer self-attention. This submodel will have an add & norm layer with skip connection to normalize the output of the attention layer. Its implementation is as follows:

The function defined above is generic for both encoder and decoder. The decoder will set the option mask=True to apply causal mask to the input.

Set some parameters and create a model. The model plotted would look like the following.

Self-attention architecture with key dimension=128

In the decoder, you have a cross-attention model that takes input from the self-attention model as well as the encoder. In this case, the value and key are the output from the encoder whereas the query is the output from the self-attention model. At the high level, it is based on what the encoder understands about the context of the source sentence, and takes the partial sentence at the decoder’s input as the query (which can be empty), to predict how to complete the sentence. This is the only difference from the self-attention model; hence the code is very similar:

The model plotted would look like the following. Note that there are two inputs in this model, one for the context and another for the input from self-attention.

Cross-attention architecture with key dimension=128

Finally, there are feed forward models at the output of both encoder and decoder. It is implemented as Dense layers in Keras:

The model plotted would look like the following. Note that the first Dense layer uses ReLU activation and the second has no activation. A dropout layer is then appended for regularization.

Feed-forward submodel

Your Task

Run the above code and verify you see the same model diagram. It is important you match the layout as the final transformer model depends on them.

In the code above, Keras functional API is used. In Keras, you can build a model using sequential API, functional API, or subclass the Model class. Subclassing can also be used here, but sequential API cannot. Can you tell why?

In the next lesson, you will make use of these building block to create the encoder and decoder.

Lesson 07: Transformer Encoder and Decoder

Look again at the diagram of the transformer in Lesson 05. You will see that the encoder is the self-attention submodel connected to the feed-forward submodel. The decoder, on the other hand, is a self-attention submodel, a cross-attention submodel, and a feed-forward submodel connected in tandem.

Making an encoder and a decoder is therefore not difficult once you have these submodels as building blocks. Firstly, you have the encoder. It is simple enough that you can build an encoder model using Keras sequential API.

Plotting the model would see that it is simple as the following:

Encoder submodel

The decoder is a bit complicated because the cross-attention block takes input from the encoder as well; hence it is a model that takes two input. It is implemented as follows:

The model will look like the following:

Decoder submodel

Your Task

Copy over the three building block functions from Lesson 06 and run the above code to make sure you see the same layout as shown, in both the encoder and decoder.

In the next lesson, you will complete the transformer model with the building block you have created so far.

Lesson 08: Building a Transformer

Indeed, a transformer has encoder and decoder parts, and each part is not one but a series of encoders or decoders. It sounds complicated but not if you have the building block submodels to hide the details.

Refer to the figure in Lesson 05, and you see the encoder and decoder parts are just a chain of encoder and decoder blocks. Only the output of the final encoder block is used as input to the decoder blocks.

Therefore, the complete transformer model can be built as follows:

The tryexcept block in the code is to handle a bug in certain versions of TensorFlow that may cause the training error calculated erroneously. The model plotted above would be like the following. Not very simple, but the architecture is still tractable.

Transformer with 4 layers in encoder and 4 layers in decoder

Your Task

Copy over the three building block functions from Lessons 05, 06, and 07, so you can run the above code and generate the same diagram. You will reuse this model in the subsequent lessons.

In the next lesson, you will set up the other training parameters for this model.

Lesson 09: Prepare the Transformer Model for Training

Before you can train your transformer, you need to decide how you should train it.

According to the paper Attention Is All You Need, you are using Adam as the optimizer but with a custom learning rate schedule,

$$\text{LR} = \frac{1}{\sqrt{d_{\text{model}}}} \min\big(\frac{1}{\sqrt{n}}, \frac{n}{\sqrt{m^3}}\big)$$

It is implemented as follows:

The learning rate schedule is designed in such a way that it learns slowly at the beginning but accelerates as it learns. This helps because the model is totally random at the beginning, and you cannot even trust the output much. But as you train the model enough, the result should be sufficiently sensible and thus you can learn faster to help convergence. The learning rate as plotted would look like the following:

Customized learning rate schedule

Next, you also need to define the loss metric and accuracy metric for training. This model is special because you need to apply a mask to the output to calculate the loss and accuracy only on the non-padding elements. Borrow the implementation from TensorFlow’s tutorial Neural machine translation with a Transformer and Keras:

With all these, you can now compile your Keras model as follows:

Your Task

If you have implemented everything correctly, you should be able to provide all building block functions to make the above code run. Try to keep everything you made so far in one Python script or one Jupyter notebook and run it once to ensure no errors produced and no exceptions are raised.

If everything run smoothly, you should see the summary() above prints the following:

Moreover, when you look at the diagram of the transformer model and your implementation here, you should notice the diagram shows a softmax layer at the output, but we omitted that. The softmax is indeed added in this lesson. Do you see where is it?

In the next lesson, you will train this compiled model, on 14 million parameters as we can see in the summary above.

Lesson 10: Training the Transformer

Training the transformer depends on everything you created in all previous lessons. Most importantly, the vectorizer and dataset from Lesson 03 must be saved as they will be reused in this and the next lessons.

That’s it!

Running this script will take several hours, but once it is finished, you will have the model saved and the loss and accuracy plotted. It should look like the following:

Loss and accuracy history from the training

Your Task

In the training set up above, we did not make use of the early stopping and checkpoint callbacks in Keras. Before you run it, try to modify the code above to add these callbacks.

The early stopping callback can help you interrupt the training when no progress is made. The checkpoint callback can help you keep the best-score model rather than return to you only the final model at the last epoch.

In the next lesson, you will load this trained model and test it out.

Lesson 11: Inference from the Transformer Model

In Lesson 03, you split the original dataset into training, validation, and test sets in the ratio of 70%-15%-15%. You used the training and validation dataset in the training of the transformer model in Lesson 10. And in this lesson, you are going to use the test set to see how good your trained model is.

You saved your transformer model in the previous lesson. Because you have some custom made layers and functions in the model, you need to create a custom object scope to load your saved model.

The transformer model can give you a token index. You need the vectorizer to look up the word that this index represents. You have to reuse the same vectorizer that you used in creating the dataset to maintain consistency.

Create a loop to scan the generated tokens. In other words, do not use the model to generate the entire translated sentence but consider only the next generated word in the sentence until you see the end sentinel. The first generated word would be the one generated by the start sentinel. It is the reason you processed the target sentences this way in Lesson 02.

The code is as follows:

Your Task

First, try to run this code and observe the inference result. Some examples are below:

The second line of each test is the expected output while the third line is the output from the transformer.

The token [UNK] means “unknown” or out-of-vocabulary, which should appear rarely. Comparing the output, you should see the result is quite accurate. It will not be perfect. For example, they in English can map to ils or elles in French depending on the gender, and the transformer cannot always distinguish that.

You generated the translated sentence word by word, but indeed the transformer outputs the entire sentence in one shot. You should try to modify the program to decode the entire transformer output pred in the for-loop to see how the transformer gives you a better sentence as you provide more leading words in dec_tokens.

In the next lesson, you will review what you did so far and see if any improvements can be made.

Lesson 12: Improving the Model

You did it!

Let’s go back and review what you did and what can be improved. You made a transformer model that takes an entire English sentence and a partial French sentence (up to the $k$-th token) to predict the next (the $(k+1)$-th) token.

In training, you observed that the accuracy is at 70% to 80% at the best. How can you improve it? Here are some ideas, but surely, not exhaustive:

  • You used a simple tokenizer for your text input. Libraries such as NLTK can provide better tokenizers. Also, you didn’t use subword tokenization. It is less a problem for English but problematic for French. That’s why you have vastly larger vocabulary size in French in your model (e.g., l’air (the air) and d’air (of air) would become distinct tokens).
  • You trained your own word embeddings with an embedding layer. There are pre-trained embeddings (such as GloVe) readily available, and they usually provide better quality embeddings. This may help your model to understand the context better.
  • You designed the transformer with some parameters. You used 8 heads for multi-head attention, output vector dimension is 128, sentence length was limited to 20 tokens, drop out rate is 0.1, and so on. Tuning these parameters will surely impact the transformer one way or another. Similarly important are the training parameters such as number of epochs, learning rate schedule, and loss function.

Your Task

Figure out how to change the code to accomodate the above changes. But if we test it out, do you know the right way to tell if one model is better than another?

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson.

The End! (Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

  • You learned how to take a plaintext sentence, process it, and vectorize it
  • You analyzed the building block of a transformer model according to the paper Attention Is All You Need, and implemented each building block using Keras
  • You connected the building blocks into a complete transformer model, and train it
  • Finally, you can witness the trained model to translate English sentences into French with high accuracy


How did you do with the mini-course?
Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.

Learn Transformers and Attention!

Building Transformer Models with Attention

Teach your deep learning model to read a sentence

...using transformer models with attention

Discover how in my new Ebook:
Building Transformer Models with Attention

It provides self-study tutorials with working code to guide you into building a fully-working transformer models that can
translate sentences from one language to another...

Give magical power of understanding human language for
Your Projects

See What's Inside

25 Responses to Building Transformer Models with Attention Crash Course. Build a Neural Machine Translator in 12 Days

  1. Martin January 8, 2023 at 6:04 am #

    Hi, Adrian:

    For the first lesson, you set:
    vocab_size_en = 10000
    vocab_size_fr = 20000
    seq_length = 20

    But the tokens analysis shows:
    Total english tokens: 14969
    Total french tokens: 31271
    Max English length: 51
    Max french length: 58

    which means you set a much lower count of vocabulary sizes than actual token counts in the text, and ‘seq_length=20’ is much smaller than max token lengths 51 & 58. What type of tokens will be thrown away when sentence token length exceeds 20?

    Is this understanding correct?

    • Adrian Tam January 9, 2023 at 2:12 pm #

      Correct. That’s on purpose to discard the less frequently used words and ignore the longer but rare case of long sentences. This trade off is to allow us to focus on the common use cases and produce a better translator.

  2. Furio January 11, 2023 at 12:39 am #

    Hi Adrian,

    Running the model on my computer is difficult , is possible to have the model file “eng-fra-transformer.h5” ?

    Thanks in advance.

    • James Carmichael January 11, 2023 at 7:59 am #

      Hi Furio…Have you tried Google Colab? There is even a GPU option available.

  3. Tom January 12, 2023 at 7:52 am #

    Hi Adrian!

    Great stuff as always!

    Maybe you can help me with this issue:
    when calling the function to create the self-attention layers, namely:

    > model = self_attention(input_shape=(seq_length, key_dim),
    > num_heads=num_heads, key_dim=key_dim)

    I receive this error message:

    TypeError: call() got an unexpected keyword argument ‘use_causal_mask’

    If I remove ‘use_causal_mask’ argument from

    > attout = attention(query=inputs, value=inputs, key=inputs, use_causal_mask=mask)

    the model works fine.
    However, as far as I understand this argument has to be set to True for the decoder.
    Do you know how I could solve this issue?

    I have updated tensorflow to the latest version but it didn’t help.

    Thank you very much!

    • James Carmichael January 12, 2023 at 8:37 am #

      Hi Tom…You may find the following resource of interest:

    • Martin January 15, 2023 at 1:59 pm #

      I have the same issue with Tensorflow/keras 2.9, in which “use_causal_mask” wasn’t supported. This argument was added after 2.10. If I can’t update my Tensorflow version, if there a easy way to fix this issue? I can remove this argument without changing anything else in the code and the code runs fine. However, the last part of the course doesn’t generate expected French output. Instead it always generated an incorrect string for each run in the test data. For example:

      do as i tell you .
      == [start]fais comme je te dis .[end]
      -> [start] 40 40 programmeur programmeur programmeur prendrais programmeur programmeur patiner patiner patiner c’était erreur .[end] ? ? ? ? ? ?

      Test 1:
      i suggested that we go fishing .
      == [start]j’ai proposé que nous allions pêcher .[end]
      -> [start] 40 40 programmeur programmeur programmeur prendrais programmeur programmeur patiner patiner patiner c’était erreur .[end] ? ? ? ? ? ?

      I suspect that this is due to the removal of the argument “use_causal_mask”. @Adrian, is there an easy way to fix this for Tensorflow 2.9?

      • Adrian Tam January 18, 2023 at 3:09 am #

        I think that’s a bit difficult, although not impossible. The causal mask is important here because the way inference is run. We would give you a complete English sentence, and a partial French sentence (at least the start sentinel), and the model should predict the next word to the French sentence. The causal mask is to hide out the later half of a sentence which is to be predicted during training.

        If you cannot use TensorFlow 2.10, the only way out is to avoid the default training loop provided by fit() function, but to write your own training loop using Gradient Tape. That’s more code to write.

  4. N Prinja January 21, 2023 at 10:02 am #

    When I run the following task in Lesson 5 , it gives error:
    # From Lesson 03:
    # train_ds = make_dataset(train_pairs)

    vocab_size_en = 10000
    seq_length = 20

    # test the dataset
    for inputs, targets in train_ds.take(1):
    embed_en = PositionalEmbedding(seq_length, vocab_size_en, embed_dim=512)
    en_emb = embed_en(inputs[“encoder_inputs”])

    The error messages are:

    InvalidArgumentError Traceback (most recent call last)
    ~\AppData\Local\Temp\ipykernel_41016\ in
    9 print(inputs[“encoder_inputs”])
    10 embed_en = PositionalEmbedding(seq_length, vocab_size_en, embed_dim=512)
    —> 11 en_emb = embed_en(inputs[“encoder_inputs”])
    12 print(en_emb.shape)
    13 print(en_emb._keras_mask)

    ~\.conda\envs\Tensorflow2\lib\site-packages\keras\utils\ in error_handler(*args, **kwargs)
    65 except Exception as e: # pylint: disable=broad-except
    66 filtered_tb = _process_traceback_frames(e.__traceback__)
    —> 67 raise e.with_traceback(filtered_tb) from None
    68 finally:
    69 del filtered_tb

    ~\AppData\Local\Temp\ipykernel_41016\ in call(self, inputs)
    51 with position vectors”””
    52 embedded_tokens = self.token_embeddings(inputs)
    —> 53 return embedded_tokens + self.position_embeddings
    55 # this layer is using an Embedding layer, which can take a mask

    InvalidArgumentError: Exception encountered when calling layer “positional_embedding” (type PositionalEmbedding).

    Expected ‘tf.Tensor(False, shape=(), dtype=bool)’ to be true. Summarized data: b’Unable to broadcast: dimension size mismatch in dimension’

    Please help.

    • James Carmichael January 22, 2023 at 11:23 am #

      Hi N Prinja…Did you type the code in or copy and paste it? Also, please try it in Google Colab to rule out any issues with your local Python environment.

    • Adrian Tam January 24, 2023 at 2:39 am #

      Please see if your inputs["encoder_inputs"] and the PositionalEmbedding() match in the dimension. It seems to be the error is about this.

  5. Fabio January 26, 2023 at 8:19 am #

    Hello Adrian Tam,

    Good job. Excellent idea… You and your team need to keep this going…. I know this takes a lot of time, so what do you think about the idea of launching a friendly priced membership/subscription for new projects, where the mentoring for the current project would be recorded where those joining later could use them????? I believe that today there are several problems to be solved and maybe even the contribution of subscribers with masses of data would help in current projects. It would be something like kaggle, but more interactive with a focus on knowledge dissemination.

    good job,

    • James Carmichael January 27, 2023 at 10:59 am #

      Thank you Fabio for your feedback, support and suggestions!

  6. Eric Lamey February 8, 2023 at 4:19 am #

    To confirm, would this code and model be appropriate for training a “custom language” model? In other words, i’d like to train my own “language.” For context, the “language” will be a propriety encoding of numeric data into English alpha characters. My current models predict the overall average data well, but struggle with the random outliers that occur less often. I’m interested in accurate prediction of the outliers.

    Thank you!

    And I just want to add, I’ve purchased a past version of your library, maybe 5 years ago, it was and is a fantastic resource. Thanks again!

    • Adrian Tam February 9, 2023 at 7:30 am #

      Appropriate as it should be. But don’t expect to fit all outliers. In fact, no machine learning model can be perfect and it is a hard problem on how to draw the line on which outlier can we tolerate. Besides, it is not a very big model but it does it job on many language pairs (you can try English vs Italian on the same code, for example). If you custom language is not very sophisticated, it may work too.

  7. Mohamed Lach March 5, 2023 at 2:41 am #

    I can’t see where the softmax is used. Can you explain ?
    Thanks for the turorial.

  8. Andre March 16, 2023 at 12:14 pm #

    Hi there! Thanks for the crash course!

    I have used your crash course as a reference to build a transformer for machine translation of English texts into Bengali. There aren’t any errors but my model keeps on overfitting, with increasing validation loss and very low validation accuracy. I tried increasing the dropout value to 0.5, reducing the batch size to 32, and increasing the dataset size but to no avail.

    I was wondering whether you could provide some insights on this?

  9. orhan April 11, 2023 at 7:55 pm #

    Hidden softmax -> logits=True . But when you to apply beam search in inference you need softmax on output.

    • James Carmichael April 12, 2023 at 7:36 am #

      Thank you for your reply orhan!

  10. orhan April 11, 2023 at 7:59 pm #

    pred = model([enc_tokens, dec_tokens])
    pred = tf.nn.softmax(pred, 2)

  11. orhan May 1, 2023 at 4:03 pm #

    Hi James,
    I have a question that i think is very important for machine translation. How can i get alignment information between source and target?

Leave a Reply