Last Updated on January 6, 2023
There are many similarities between the Transformer encoder and decoder, such as their implementation of multi-head attention, layer normalization, and a fully connected feed-forward network as their final sub-layer. Having implemented the Transformer encoder, we will now go ahead and apply our knowledge in implementing the Transformer decoder as a further step toward implementing the complete Transformer model. Your end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the Transformer decoder from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The layers that form part of the Transformer decoder
- How to implement the Transformer decoder from scratch
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can translate sentences from one language to another...
Let’s get started.

Implementing the Transformer decoder from scratch in TensorFlow and Keras
Photo by François Kaiser, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Recap of the Transformer Architecture
- The Transformer Decoder
- Implementing the Transformer Decoder From Scratch
- The Decoder Layer
- The Transformer Decoder
- Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The Transformer model
- The scaled dot-product attention
- The multi-head attention
- The Transformer positional encoding
- The Transformer encoder
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. This tutorial will explore these similarities.
The Transformer Decoder
Similar to the Transformer encoder, the Transformer decoder also consists of a stack of $N$ identical layers. The Transformer decoder, however, implements an additional multi-head attention block for a total of three main sub-layers:
- The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
- The second sub-layer comprises a second multi-head attention mechanism.
- The third sub-layer comprises a fully-connected feed-forward network.

The decoder block of the Transformer architecture
Taken from “Attention Is All You Need“
Each of these three sub-layers is also followed by layer normalization, where the input to the layer normalization step is the sum of the sub-layer's input (carried over through a residual connection) and its output.
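As a quick reminder, a minimal sketch of what the AddNormalization helper imported from encoder.py can look like is shown below. The exact class from the encoder tutorial may differ slightly in detail, but the operation it performs is a residual sum followed by layer normalization.

from tensorflow.keras.layers import Layer, LayerNormalization

# A sketch of the Add & Norm helper (assumed to live in encoder.py);
# your own encoder script may differ slightly in detail.
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # normalizes over the last (d_model) axis

    def call(self, x, sublayer_x):
        # Sum the sub-layer input (residual connection) and the sub-layer output
        add = x + sublayer_x
        # Apply layer normalization to the sum
        return self.layer_norm(add)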
On the decoder side, the queries, keys, and values that are fed into the first multi-head attention block also represent the same input sequence. However, this time around, it is the target sequence that is embedded and augmented with positional information before being supplied to the decoder. On the other hand, the second multi-head attention block receives the encoder output in the form of keys and values and the normalized output of the first decoder attention block as the queries. In both cases, the dimensionality of the queries and keys remains equal to $d_k$, whereas the dimensionality of the values remains equal to $d_v$.
Vaswani et al. introduce regularization into the model on the decoder side, too, by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before these are fed into the decoder.
Let’s now see how to implement the Transformer decoder from scratch in TensorFlow and Keras.
Implementing the Transformer Decoder from Scratch
The Decoder Layer
Since you have already implemented the required sub-layers when you covered the implementation of the Transformer encoder, you will create a class for the decoder layer that makes use of these sub-layers straight away:
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from encoder import AddNormalization, FeedForward

class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()
        ...
Notice here that since the code for the different sub-layers was saved into separate Python scripts (namely, multihead_attention.py and encoder.py), it is necessary to import them in order to use the required classes.
As you did for the Transformer encoder, you will now create the class method, call(), that implements all the decoder sub-layers:
...
def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
    # Multi-head self-attention layer
    multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in a dropout layer
    multihead_output1 = self.dropout1(multihead_output1, training=training)

    # Followed by an Add & Norm layer
    addnorm_output1 = self.add_norm1(x, multihead_output1)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Followed by another multi-head attention layer
    multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

    # Add in another dropout layer
    multihead_output2 = self.dropout2(multihead_output2, training=training)

    # Followed by another Add & Norm layer
    addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

    # Followed by a fully connected layer
    feedforward_output = self.feed_forward(addnorm_output2)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in another dropout layer
    feedforward_output = self.dropout3(feedforward_output, training=training)

    # Followed by another Add & Norm layer
    return self.add_norm3(addnorm_output2, feedforward_output)
The multi-head attention sub-layers can also receive a padding mask or a look-ahead mask. As a brief reminder of what was said in a previous tutorial, the padding mask is necessary to prevent the zero padding in the input sequence from being processed along with the actual input values. The look-ahead mask prevents the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
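If you would like to see what such masks can look like before reaching the complete model tutorial, the short sketch below builds both of them, assuming the convention that a value of 1 marks a position whose attention score should be suppressed. The functions shown here are illustrative and not part of the decoder code in this tutorial.

import tensorflow as tf

def lookahead_mask(seq_length):
    # 1s above the diagonal flag the future positions the decoder may not attend to
    return 1 - tf.linalg.band_part(tf.ones((seq_length, seq_length)), -1, 0)

def padding_mask(input_seq):
    # 1s at the zero-padded token positions of the input sequence
    return tf.cast(tf.math.equal(input_seq, 0), tf.float32)

print(lookahead_mask(4))                           # strictly upper-triangular pattern of 1s
print(padding_mask(tf.constant([[7, 6, 0, 0]])))   # [[0. 0. 1. 1.]]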
The same call() class method can also receive a training flag to apply the Dropout layers only during training, when the flag’s value is set to True.
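To see the effect of this flag in isolation, consider a standalone Dropout layer; this is a small illustrative example rather than part of the decoder code:

import tensorflow as tf
from tensorflow.keras.layers import Dropout

drop = Dropout(0.5)
x = tf.ones((1, 4))
print(drop(x, training=True))   # some values zeroed out, survivors scaled by 1 / (1 - rate)
print(drop(x, training=False))  # identical to the input: dropout is bypassed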
The Transformer Decoder
The Transformer decoder takes the decoder layer you have just implemented and replicates it identically $N$ times.
You will create the following Decoder() class to implement the Transformer decoder:
from positional_encoding import PositionEmbeddingFixedWeights

class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...
As in the Transformer encoder, the input to the first multi-head attention block on the decoder side receives the input sequence after it has undergone word embedding and positional encoding. For this purpose, an instance of the PositionEmbeddingFixedWeights class (covered in this tutorial) is initialized, and its output is assigned to the pos_encoding variable.
The final step is to create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result, together with the encoder output, to $N$ decoder layers:
...
def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
    # Generate the positional encoding
    pos_encoding_output = self.pos_encoding(output_target)
    # Expected output shape = (number of sentences, sequence_length, d_model)

    # Add in a dropout layer
    x = self.dropout(pos_encoding_output, training=training)

    # Pass on the positionally encoded values to each decoder layer
    for i, layer in enumerate(self.decoder_layer):
        x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

    return x
The code listing for the full Transformer decoder is the following:
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights
from encoder import AddNormalization, FeedForward

# Implementing the Decoder Layer
class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head self-attention layer
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)

# Implementing the Decoder
class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positionally encoded values to each decoder layer
        for i, layer in enumerate(self.decoder_layer):
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x
Testing Out the Code
You will work with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
As for the input sequence, you will work with dummy data for the time being until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences:
...
dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))
...
Next, you will create a new instance of the Decoder class, assigning it to the decoder variable, subsequently passing in the input arguments, and printing the result. You will set the padding and look-ahead masks to None for the time being, but you will return to these when you implement the complete Transformer model:
...
decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
Tying everything together produces the following code listing:
from numpy import random

dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))

decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and the parameter values of the Dense layers.
tf.Tensor(
[[[-0.04132953 -1.7236308   0.5391184  ... -0.76394725  1.4969798   0.37682498]
  [ 0.05501875 -1.7523409   0.58404493 ... -0.70776534  1.4498456   0.32555297]
  [ 0.04983566 -1.8431275   0.55850077 ... -0.68202156  1.4222856   0.32104644]
  [-0.05684051 -1.8862512   0.4771412  ... -0.7101341   1.431343    0.39346313]
  [-0.15625843 -1.7992781   0.40803364 ... -0.75190556  1.4602519   0.53546077]]

 ...

 [[-0.58847624 -1.646842    0.5973466  ... -0.47778523  1.2060764   0.34091905]
  [-0.48688865 -1.6809179   0.6493542  ... -0.41274604  1.188649    0.27100053]
  [-0.49568555 -1.8002801   0.61536175 ... -0.38540334  1.2023914   0.24383534]
  [-0.59913146 -1.8598882   0.5098136  ... -0.3984461   1.2115746   0.3186561 ]
  [-0.71045107 -1.7778647   0.43008155 ... -0.42037937  1.2255307   0.47380894]]],
 shape=(64, 5, 512), dtype=float32)
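If you prefer a programmatic check over inspecting the printed tensor, you can also verify the shape directly; the short snippet below reuses the decoder, input_seq, and enc_output variables from the listing above.

# Optional sanity check, reusing the variables from the complete listing
output = decoder(input_seq, enc_output, None, None, True)
print(output.shape)  # (64, 5, 512), i.e. (batch_size, input_seq_length, d_model)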
Summary
In this tutorial, you discovered how to implement the Transformer decoder from scratch in TensorFlow and Keras.
Specifically, you learned:
- The layers that form part of the Transformer decoder
- How to implement the Transformer decoder from scratch
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.