Joining the Transformer Encoder and Decoder Plus Masking

Last Updated on November 16, 2022

We have arrived at a point where we have implemented and tested the Transformer encoder and decoder separately, and we may now join the two together into a complete model. We will also see how to create padding and look-ahead masks by which we will suppress the input values that will not be considered in the encoder or decoder computations. Our end goal remains to apply the complete model to Natural Language Processing (NLP).

In this tutorial, you will discover how to implement the complete Transformer model and create padding and look-ahead masks. 

After completing this tutorial, you will know:

  • How to create a padding mask for the encoder and decoder
  • How to create a look-ahead mask for the decoder
  • How to join the Transformer encoder and decoder into a single model
  • How to print out a summary of the encoder and decoder layers

Let’s get started. 

Joining the Transformer encoder and decoder and Masking
Photo by John O’Nolan, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  • Recap of the Transformer Architecture
  • Masking
    • Creating a Padding Mask
    • Creating a Look-Ahead Mask
  • Joining the Transformer Encoder and Decoder
  • Creating an Instance of the Transformer Model
    • Printing Out a Summary of the Encoder and Decoder Layers

Prerequisites

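For this tutorial, we assume that you are already familiar with the Transformer model and with the implementations of the Transformer encoder and decoder from the earlier tutorials.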

Recap of the Transformer Architecture

Recall that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need”

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen how to implement the Transformer encoder and decoder separately. In this tutorial, you will join the two into a complete Transformer model and apply padding and look-ahead masking to the input values.  

Let’s start first by discovering how to apply masking. 

Masking

Creating a Padding Mask

You should already be familiar with the importance of masking the input values before feeding them into the encoder and decoder. 

As you will see when you proceed to train the Transformer model, the input sequences fed into the encoder and decoder will first be zero-padded up to a specific sequence length. The padding mask ensures that these zero values are not processed along with the actual input values by either the encoder or the decoder.

Let’s create the following function to generate a padding mask for both the encoder and decoder:
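A minimal sketch of such a function, assuming TensorFlow 2.x. The function name follows the text; the argument name and the choice of `float32` for the mask are assumptions:

```python
import tensorflow as tf


def padding_mask(input):
    # Mark every zero (padding) entry with 1.0 and every real token with 0.0
    mask = tf.math.equal(input, 0)
    return tf.cast(mask, tf.float32)
```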

Upon receiving an input, this function will generate a tensor that holds a value of one wherever the input contains a zero, and a value of zero everywhere else.

Hence, if you input the following array:

Then the output of the padding_mask function would be the following:
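As an illustration, take a hypothetical input of four token indices followed by three padding zeros. The values are chosen purely for demonstration, and the mask is computed inline with the same comparison-and-cast logic described above:

```python
import tensorflow as tf

# Hypothetical token indices, zero-padded to a length of 7
tokens = tf.constant([1, 2, 3, 4, 0, 0, 0])

# Same logic as the padding mask: 1.0 where the input is zero, 0.0 elsewhere
mask = tf.cast(tf.math.equal(tokens, 0), tf.float32)

print(mask.numpy())  # → [0. 0. 0. 0. 1. 1. 1.]
```

The three trailing ones line up with the three padded positions.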

Creating a Look-Ahead Mask

A look-ahead mask is required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.

For this purpose, let’s create the following function to generate a look-ahead mask for the decoder:
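One way to sketch this, assuming TensorFlow 2.x, is to build a lower-triangular matrix of ones with `tf.linalg.band_part` and subtract it from one. The argument name `shape` is an assumption:

```python
import tensorflow as tf


def lookahead_mask(shape):
    # band_part(..., -1, 0) keeps the lower triangle (diagonal included);
    # subtracting it from 1 leaves ones strictly above the diagonal,
    # which are the "future" positions to be masked out
    return 1 - tf.linalg.band_part(tf.ones((shape, shape)), -1, 0)
```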

You will pass to it the length of the decoder input. Let’s make this length equal to 5, as an example:

Then the output that the lookahead_mask function returns is the following:
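For a length of 5, the mask can be reproduced inline as below. This is a sketch that assumes the mask is built from a lower-triangular matrix of ones:

```python
import tensorflow as tf

sequence_length = 5

# Ones strictly above the diagonal mark the succeeding (future) positions
mask = 1 - tf.linalg.band_part(tf.ones((sequence_length, sequence_length)), -1, 0)

print(mask.numpy())
# → [[0. 1. 1. 1. 1.]
#    [0. 0. 1. 1. 1.]
#    [0. 0. 0. 1. 1.]
#    [0. 0. 0. 0. 1.]
#    [0. 0. 0. 0. 0.]]
```

Row i describes what position i may attend to: the zeros cover position i itself and everything before it.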

Again, the one values mask out the entries that should not be used. In this manner, the prediction of every word only depends on those that come before it. 

Joining the Transformer Encoder and Decoder

Let’s start by creating the class, TransformerModel, which inherits from the Model base class in Keras:

Our first step in creating the TransformerModel class is to initialize instances of the Encoder and Decoder classes implemented earlier and assign them to the attributes, encoder and decoder, respectively. If you saved these classes in separate Python scripts, do not forget to import them. I saved my code in the Python scripts encoder.py and decoder.py, so I need to import them accordingly.

You will also include one final dense layer that produces the final output, as in the Transformer architecture of Vaswani et al. (2017). 

Next, you shall create the class method, call(), to feed the relevant inputs into the encoder and decoder.

A padding mask is first generated to mask the encoder input, as well as the encoder output, when the latter is fed into the second multi-head attention block of the decoder (the encoder-decoder attention):

A padding mask and a look-ahead mask are then generated to mask the decoder input. These are combined together through an element-wise maximum operation:
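The combination can be sketched on a hypothetical batch containing one decoder input with two padded positions. The sketch assumes that the padding mask is given two extra axes so that it broadcasts against the square look-ahead mask:

```python
import tensorflow as tf

# Hypothetical decoder input: one sequence of length 5, the last two entries padded
decoder_input = tf.constant([[5, 7, 9, 0, 0]])
seq_len = decoder_input.shape[1]

# Padding mask with shape (batch, 1, 1, seq_len), so it can broadcast
padding = tf.cast(tf.math.equal(decoder_input, 0), tf.float32)
padding = padding[:, tf.newaxis, tf.newaxis, :]

# Look-ahead mask with shape (seq_len, seq_len)
lookahead = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

# The element-wise maximum masks a position if either mask marks it
combined = tf.maximum(padding, lookahead)  # shape (1, 1, 5, 5)

print(combined[0, 0].numpy())
# → [[0. 1. 1. 1. 1.]
#    [0. 0. 1. 1. 1.]
#    [0. 0. 0. 1. 1.]
#    [0. 0. 0. 1. 1.]
#    [0. 0. 0. 1. 1.]]
```

The last two columns are masked in every row because those positions hold padding, regardless of what the look-ahead mask allows.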

Next, the relevant inputs are fed into the encoder and decoder, and the Transformer model output is generated by feeding the decoder output into one final dense layer:

Combining all the steps gives us the following complete code listing:
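The overall structure can be sketched as follows. This is an illustration under stated assumptions, not the tutorial's exact listing. To keep it self-contained, the encoder and decoder are passed into the constructor rather than built inside it, and `StandInEncoder` and `StandInDecoder` are hypothetical placeholders for the Encoder and Decoder classes of the earlier tutorials. The call signatures of those two classes are also assumptions:

```python
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Embedding, Layer


class StandInEncoder(Layer):
    """Hypothetical placeholder for the Encoder class of the earlier tutorial."""

    def __init__(self, vocab_size, d_model, **kwargs):
        super().__init__(**kwargs)
        self.embedding = Embedding(vocab_size, d_model)

    def call(self, x, padding_mask, training=False):
        return self.embedding(x)  # a real encoder would also apply padding_mask


class StandInDecoder(Layer):
    """Hypothetical placeholder for the Decoder class of the earlier tutorial."""

    def __init__(self, vocab_size, d_model, **kwargs):
        super().__init__(**kwargs)
        self.embedding = Embedding(vocab_size, d_model)

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training=False):
        # A real decoder would attend over encoder_output using both masks
        context = tf.reduce_mean(encoder_output, axis=1, keepdims=True)
        return self.embedding(x) + context


class TransformerModel(Model):
    def __init__(self, encoder, decoder, dec_vocab_size, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        # Final dense layer projecting onto the decoder vocabulary
        self.model_last_layer = Dense(dec_vocab_size)

    def padding_mask(self, input):
        mask = tf.cast(tf.math.equal(input, 0), tf.float32)
        # Extra axes make the mask broadcastable to the attention weights
        return mask[:, tf.newaxis, tf.newaxis, :]

    def lookahead_mask(self, shape):
        return 1 - tf.linalg.band_part(tf.ones((shape, shape)), -1, 0)

    def call(self, encoder_input, decoder_input, training=False):
        # Padding mask for the encoder input (reused on the encoder output)
        enc_padding_mask = self.padding_mask(encoder_input)

        # Padding and look-ahead masks for the decoder input, merged by maximum
        dec_in_padding_mask = self.padding_mask(decoder_input)
        dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
        dec_in_lookahead_mask = tf.maximum(dec_in_padding_mask, dec_in_lookahead_mask)

        encoder_output = self.encoder(encoder_input, enc_padding_mask, training=training)
        decoder_output = self.decoder(decoder_input, encoder_output,
                                      dec_in_lookahead_mask, enc_padding_mask,
                                      training=training)
        return self.model_last_layer(decoder_output)


# Usage with dummy token indices (batch of 2, sequence length 5, vocabulary of 20)
model = TransformerModel(StandInEncoder(20, 16), StandInDecoder(20, 16), dec_vocab_size=20)
enc_in = tf.constant([[1, 2, 3, 0, 0], [4, 5, 6, 7, 0]])
dec_in = tf.constant([[1, 8, 9, 0, 0], [1, 3, 2, 6, 0]])
output = model(enc_in, dec_in, training=False)
print(output.shape)  # → (2, 5, 20)
```

With the actual Encoder and Decoder classes in place of the stand-ins, the masking and the final dense projection would stay the same.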

Note that you have made a small change to the output that is returned by the padding_mask function. Its shape is made broadcastable to the shape of the attention weight tensor that it will mask when you train the Transformer model.

Creating an Instance of the Transformer Model

You will work with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
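For reference, the base model in the paper uses the values below. Most variable names follow those that appear in the constructor calls quoted in the reader comments; `n` is an assumed name for the number of stacked layers:

```python
h = 8               # Number of attention heads
d_k = 64            # Dimensionality of the linearly projected queries and keys
d_v = 64            # Dimensionality of the linearly projected values
d_ff = 2048         # Dimensionality of the inner feed-forward layer
d_model = 512       # Dimensionality of the model sub-layers' outputs
n = 6               # Number of layers in the encoder and decoder stacks (assumed name)
dropout_rate = 0.1  # Dropout rate
```

Note that `h * d_k` equals `d_model`, so the concatenated heads have the same width as the model.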

As for the input-related parameters, you will work with dummy values for now until you arrive at the stage of training the complete Transformer model. At that point, you will use actual sentences:
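A sketch of such placeholder values follows. The sequence length of 5 matches the tensor shapes visible in the reader comments; the vocabulary sizes and their variable names are arbitrary assumptions:

```python
enc_vocab_size = 20  # Assumed dummy vocabulary size for the encoder
dec_vocab_size = 20  # Assumed dummy vocabulary size for the decoder

enc_seq_length = 5   # Maximum length of the input sequence
dec_seq_length = 5   # Maximum length of the target sequence
```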

You can now create an instance of the TransformerModel class as follows:

The complete code listing is as follows:

Printing Out a Summary of the Encoder and Decoder Layers

You may also print out a summary of the encoder and decoder blocks of the Transformer model. Printing them out separately lets you see the details of their individual sub-layers. To do so, add the following line of code to the __init__() method of both the EncoderLayer and DecoderLayer classes:

Then you need to add the following method to the EncoderLayer class:
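The sketch below shows the idea on `TinyEncoderLayer`, a hypothetical stand-in that contains only a feed-forward sub-layer. The body of `build_graph()` follows the lines quoted in the reader comments. Note that both `Input` and `Model` need to be imported, otherwise a NameError is raised:

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import Add, Dense, Input, Layer, LayerNormalization


class TinyEncoderLayer(Layer):
    """Hypothetical stand-in for EncoderLayer with a single feed-forward sub-layer."""

    def __init__(self, sequence_length, d_model, d_ff, **kwargs):
        super().__init__(**kwargs)
        # Stored so that build_graph() can declare a matching symbolic input
        self.sequence_length = sequence_length
        self.d_model = d_model
        self.dense_1 = Dense(d_ff, activation="relu")
        self.dense_2 = Dense(d_model)
        self.add = Add()
        self.norm = LayerNormalization()

    def build_graph(self):
        # Wrap call() in a functional Model so that summary() lists the sub-layers
        input_layer = Input(shape=(self.sequence_length, self.d_model))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))

    def call(self, x, padding_mask, training):
        # padding_mask and training are unused in this simplified stand-in
        feedforward_output = self.dense_2(self.dense_1(x))
        return self.norm(self.add([x, feedforward_output]))


encoder_layer = TinyEncoderLayer(sequence_length=5, d_model=16, d_ff=32)
graph = encoder_layer.build_graph()
graph.summary()
```

Because the functional model is built from a symbolic input, every operation inside call() must be able to work with a partially unknown shape (the batch dimension is None).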

And the following method to the DecoderLayer class:

This results in the EncoderLayer class being modified as follows (the three dots under the call() method mean that it remains the same as the one implemented earlier):

Similar changes can be made to the DecoderLayer class too.

Once you have the necessary changes in place, you can proceed to create instances of the EncoderLayer and DecoderLayer classes and print out their summaries as follows:

The resulting summary for the encoder is the following:

While the resulting summary for the decoder is the following:

Summary

In this tutorial, you discovered how to implement the complete Transformer model and create padding and look-ahead masks.

Specifically, you learned:

  • How to create a padding mask for the encoder and decoder
  • How to create a look-ahead mask for the decoder
  • How to join the Transformer encoder and decoder into a single model
  • How to print out a summary of the encoder and decoder layers

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

9 Responses to Joining the Transformer Encoder and Decoder Plus Masking

  1. Helen October 16, 2022 at 3:44 am #

    Thanks for your great tutorial!
    I found some errors when printing out the encoder and decoder structure information.
    It seems the reason is in the MultiHeadAttention/reshape_tensor function. Precisely, reshape is not correctly working for tensors with None dimension.
    I solved the problem by changing the function as below.
    _________________________________________________
    def reshape_tensor(self, x, heads, flag):
        if flag:
            # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, int(x.shape[2]/heads)))
            x = transpose(x, perm=(0, 2, 1, 3))
        else:
            # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_model)
            x = transpose(x, perm=(0, 2, 1, 3))
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], int(x.shape[2]*x.shape[3])))
        return x
    _________________________________________________
    It seems weird somehow. I want to know if there are any better solutions. Thanks again!

    • James Carmichael October 17, 2022 at 10:45 am #

      Thank you Helen for your feedback! What specific error messages did you encounter? That will help us take note of any code listings that require review.

      • Helen October 17, 2022 at 11:00 pm #

        Thanks for your reply! The error is like below.

        in test_transformer
        encoder.build_graph().summary()
        in build_graph
        return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))
        in call
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        in call *
        return self.W_o(output)
        ValueError: The last dimension of the inputs to a Dense layer should be defined. Found None. Full input shape received: (None, 5, None)
        Call arguments received by layer “multi_head_attention_18” (type MultiHeadAttention):
        • queries=tf.Tensor(shape=(None, 5, 512), dtype=float32)
        • keys=tf.Tensor(shape=(None, 5, 512), dtype=float32)
        • values=tf.Tensor(shape=(None, 5, 512), dtype=float32)
        • mask=None

  2. Chunhua October 29, 2022 at 8:36 am #

    The last part of adding graph output is quite confusing.
    It involves many changes in different places.
    In the end, I still got some error:

    49
    50 def build_graph(self):
    —> 51 input_layer = Input(shape=(self.sequence_length, self.d_model))
    52 return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))
    53 def call(self, x, padding_mask, training):

    The collab reproducing the error:

    NameError: name ‘Input’ is not defined

    • michel November 19, 2022 at 8:34 pm #

      I had the same error.

  3. michel November 19, 2022 at 8:08 pm #

    I got the following error when building the decoder summary, same as encoder summary

    TypeError Traceback (most recent call last)
    in
    2 # encoder.build_graph().summary()
    3
    —-> 4 decoder = DecoderLayer(dec_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
    5 decoder.build_graph().summary()

    TypeError: __init__() takes 7 positional arguments but 8 were given

    • James Carmichael November 20, 2022 at 11:54 am #

      Hi Michel…I would recommend trying the code in Google Colab.

  4. Michel November 19, 2022 at 8:36 pm #

    ValueError Traceback (most recent call last)
    in
    2
    3 encoder = EncoderLayer(enc_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
    —-> 4 encoder.build_graph().summary()
    5
    6 decoder = DecoderLayer(dec_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)

    3 frames
    /tmp/__autograph_generated_file2sd23ix_.py in tf__call(self, queries, keys, values, mask)
    15 try:
    16 do_return = True
    —> 17 retval_ = ag__.converted_call(ag__.ld(self).W_o, (ag__.ld(output),), None, fscope)
    18 except:
    19 do_return = False

    ValueError: Exception encountered when calling layer “multi_head_attention_155” (type MultiHeadAttention).

    in user code:

    File “”, line 48, in call *
    return self.W_o(output)
    File “/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py”, line 67, in error_handler **
    raise e.with_traceback(filtered_tb) from None
    File “/usr/local/lib/python3.7/dist-packages/keras/layers/core/dense.py”, line 141, in build
    raise ValueError(‘The last dimension of the inputs to a Dense layer ‘

    ValueError: The last dimension of the inputs to a Dense layer should be defined. Found None. Full input shape received: (None, 5, None)

    Call arguments received by layer “multi_head_attention_155” (type MultiHeadAttention):
    • queries=tf.Tensor(shape=(None, 5, 512), dtype=float32)
    • keys=tf.Tensor(shape=(None, 5, 512), dtype=float32)
    • values=tf.Tensor(shape=(None, 5, 512), dtype=float32)
    • mask=None

    • James Carmichael November 20, 2022 at 11:51 am #

      Hi Michel…Did you type in the code or copy and paste it? You may wish to try it in Google Colab.
