How to Implement Scaled Dot-Product Attention from Scratch in TensorFlow and Keras

Having familiarized ourselves with the theory behind the Transformer model and its attention mechanism, we’ll start our journey of implementing a complete Transformer model by first seeing how to implement the scaled dot-product attention. The scaled dot-product attention is an integral part of the multi-head attention, which, in turn, is an important component of both the Transformer encoder and decoder. Our end goal will be to apply the complete Transformer model to Natural Language Processing (NLP).

In this tutorial, you will discover how to implement scaled dot-product attention from scratch in TensorFlow and Keras. 

After completing this tutorial, you will know:

  • The operations that form part of the scaled dot-product attention mechanism
  • How to implement the scaled dot-product attention mechanism from scratch   

Let’s get started. 

Tutorial Overview

This tutorial is divided into three parts; they are:

  • Recap of the Transformer Architecture
    • The Transformer Scaled Dot-Product Attention
  • Implementing the Scaled Dot-Product Attention From Scratch
  • Testing Out the Code

Prerequisites

For this tutorial, we assume that you are already familiar with the theory behind the Transformer model and its attention mechanism.

Recap of the Transformer Architecture

Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need”

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. One of the core components that both the encoder and decoder share within their multi-head attention blocks is the scaled dot-product attention. 

The Transformer Scaled Dot-Product Attention

First, recall the queries, keys, and values as the important components you will work with. 

In the encoder stage, they each carry the same input sequence after it has been embedded and augmented with positional information. Similarly, on the decoder side, the queries, keys, and values fed into the first attention block represent the same target sequence after it has also been embedded and augmented with positional information. The second attention block of the decoder receives the encoder output in the form of keys and values, and the normalized output of the first attention block as the queries. The dimensionality of the queries and keys is denoted by $d_k$, whereas the dimensionality of the values is denoted by $d_v$.

The scaled dot-product attention receives these queries, keys, and values as inputs and first computes the dot product of the queries with the keys. The result is then divided by the square root of $d_k$, producing the attention scores. The scores are fed into a softmax function to obtain a set of attention weights. Finally, the attention weights are used to compute a weighted sum of the values. This entire process can be expressed mathematically as follows, where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ denote the queries, keys, and values, respectively:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{Q} \mathbf{K}^\mathsf{T}}{\sqrt{d_k}} \right) \mathbf{V}$$
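To make the formula concrete, here is a minimal NumPy sketch that evaluates it on a tiny example. The function names `softmax` and `scaled_dot_product_attention` and the toy shapes are illustrative assumptions, not part of the tutorial's TensorFlow implementation.

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise maximum for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Hypothetical helper: softmax(Q K^T / sqrt(d_k)) V for 2-D inputs
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # shape (len_q, len_k)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.random((3, 4))  # 3 queries, d_k = 4
K = rng.random((3, 4))  # 3 keys, d_k = 4
V = rng.random((3, 2))  # 3 values, d_v = 2

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # (3, 2)
print(weights.shape)  # (3, 3)
```

Each output row is a convex combination of the rows of $\mathbf{V}$, which is why the output has as many rows as there are queries and as many columns as $d_v$.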

Each multi-head attention block in the Transformer model implements a scaled dot-product attention operation as shown below:

Scaled dot-product attention and multi-head attention
Taken from “Attention Is All You Need”

You may note that the scaled dot-product attention can also apply a mask to the attention scores before feeding them into the softmax function. 

Since the word embeddings are zero-padded to a specific sequence length, a padding mask needs to be introduced in order to prevent the zero tokens from being processed along with the input in both the encoder and decoder stages. Furthermore, a look-ahead mask is also required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.

These look-ahead and padding masks are applied inside the scaled dot-product attention to set to $-\infty$ all the values in the input to the softmax function that should not be considered. For each of these large negative inputs, the softmax function will, in turn, produce an output value that is close to zero, effectively masking them out. The use of these masks will become clearer when you progress to the implementation of the encoder and decoder blocks in separate tutorials.
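The effect of the two masks can be sketched in NumPy. This is only an illustration under assumed conventions: a mask value of 1 marks a position to hide, -1e9 stands in for $-\infty$, and the variable names, the sequence length of 4, and the choice of the last token as padding are all hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # dummy attention scores, all equal

# Look-ahead mask: 1 above the diagonal marks succeeding positions to hide
look_ahead_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

# Padding mask: assume the last token is zero padding (1 = ignore)
padding_mask = np.array([0, 0, 0, 1])[np.newaxis, :]

# A position is hidden if either mask flags it
combined_mask = np.maximum(look_ahead_mask, padding_mask)

# Large negative scores turn into weights that are effectively zero
weights = softmax(scores + -1e9 * combined_mask)
print(np.round(weights, 2))
```

In the printed matrix, each row spreads its weight only over the positions at or before its own index, and the padded last column receives no weight at all.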

For the time being, let’s see how to implement the scaled dot-product attention from scratch in TensorFlow and Keras.

Implementing the Scaled Dot-Product Attention from Scratch

For this purpose, you will create a class called DotProductAttention that inherits from the Layer base class in Keras. 

In it, you will create the method call(), which takes as input arguments the queries, keys, and values, as well as the dimensionality, $d_k$, and a mask (that defaults to None):
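A possible skeleton for this class is sketched below. The argument names follow the description above; the placeholder body and the use of `tf.keras.layers.Layer` as the import path are assumptions.

```python
import tensorflow as tf

class DotProductAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Placeholder: the body is built up step by step in this section
        raise NotImplementedError
```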

The first step is to perform a dot-product operation between the queries and the keys, transposing the latter. The result will be scaled through a division by the square root of $d_k$. You will add the following line of code to the call() method:
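As a standalone sketch of this step, the line can be tried on dummy tensors. The batch size of 2 and sequence length of 5 are arbitrary assumptions chosen only to show the resulting shape.

```python
import tensorflow as tf

# Dummy inputs: a batch of 2 sequences, 5 tokens each, d_k = 64
d_k = 64
queries = tf.random.uniform((2, 5, d_k))
keys = tf.random.uniform((2, 5, d_k))

# Inside call(): dot product of the queries with the transposed keys,
# divided by the square root of d_k
scores = tf.matmul(queries, keys, transpose_b=True) / tf.math.sqrt(tf.cast(d_k, tf.float32))
print(scores.shape)  # (2, 5, 5)
```

The scores have one row per query and one column per key, for every sequence in the batch.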

Next, you will check whether the mask argument has been set to a value that is not the default None. 

The mask will contain 0 values to indicate that the corresponding token in the input sequence should be considered in the computations, or 1 values to indicate otherwise. The mask will be multiplied by -1e9 to turn the 1 values into large negative numbers (as mentioned in the previous section), and the result is then added to the attention scores:
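A sketch of the masking step on dummy data follows. The score and mask tensors are made up for illustration; only the `if` block corresponds to what would go inside call().

```python
import tensorflow as tf

scores = tf.zeros((1, 3, 3))  # dummy attention scores
# 1 marks positions to hide; here, the last key position in every row
mask = tf.constant([[[0.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0]]])

# Inside call(): skip when no mask is given, otherwise push the
# masked scores towards minus infinity
if mask is not None:
    scores += -1e9 * mask

print(scores.numpy())
```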

The attention scores will then be passed through a softmax function to generate the attention weights:
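This step might look as follows. Using `tf.nn.softmax` is an assumption made here for the sketch; any softmax applied along the last (key) axis would serve the same purpose.

```python
import tensorflow as tf

# Dummy scores with the last position of each row already masked out
scores = tf.constant([[[0.0, 0.0, -1e9],
                       [1.0, 2.0, -1e9]]])

# Inside call(): normalize each row of scores into attention weights
weights = tf.nn.softmax(scores, axis=-1)
print(weights.numpy())
```

Every row of the weights sums to 1, and the masked position receives a weight of effectively zero.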

The final step weights the values with the computed attention weights through another dot-product operation:
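On hand-picked dummy tensors (assumed values, chosen so the result is easy to check by eye), this last step can be sketched as:

```python
import tensorflow as tf

# Dummy weights (batch, len_q, len_k) and values (batch, len_k, d_v)
weights = tf.constant([[[0.5, 0.5],
                        [1.0, 0.0]]])
values = tf.constant([[[1.0, 2.0, 3.0],
                       [3.0, 4.0, 5.0]]])

# Inside call(): weighted sum of the values
output = tf.matmul(weights, values)
print(output.numpy())
```

The first output row is the average of the two value vectors, while the second row copies the first value vector because all of its weight sits on that position.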

The complete code listing is as follows:
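Putting the steps described above together, a listing along these lines should work. It is a reconstruction from the prose rather than a verbatim copy of the original listing, and it assumes `tf.nn.softmax` for the softmax step.

```python
import tensorflow as tf

class DotProductAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Score the queries against the keys, then scale by sqrt(d_k)
        scores = tf.matmul(queries, keys, transpose_b=True) / tf.math.sqrt(tf.cast(d_k, tf.float32))

        # Push masked positions (mask value 1) towards minus infinity
        if mask is not None:
            scores += -1e9 * mask

        # Turn the scores into attention weights along the key axis
        weights = tf.nn.softmax(scores, axis=-1)

        # Weighted sum of the values
        return tf.matmul(weights, values)
```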

Testing Out the Code

You will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
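The paper uses $d_k = d_v = 64$. The batch size below is an assumption chosen for this test only, and the variable names are illustrative.

```python
d_k = 64         # dimensionality of the queries and keys, as in the paper
d_v = 64         # dimensionality of the values, as in the paper
batch_size = 64  # assumption: an arbitrary batch size for this test
```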

As for the sequence length and the queries, keys, and values, you will be working with dummy data for the time being until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences. Similarly, for the mask, leave it set to its default value for the time being:
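One way to generate such dummy data is with NumPy's random module. The sequence length of 5, the seed, and the cast to float32 are assumptions made for this sketch; seeding is optional and only makes the run repeatable.

```python
from numpy import random

d_k, d_v, batch_size = 64, 64, 64
input_seq_length = 5  # assumption: a short, arbitrary sequence length

random.seed(42)  # optional: fix the seed for repeatable dummy data

# Uniform random numbers in [0, 1) standing in for embedded sequences
queries = random.random((batch_size, input_seq_length, d_k)).astype("float32")
keys = random.random((batch_size, input_seq_length, d_k)).astype("float32")
values = random.random((batch_size, input_seq_length, d_v)).astype("float32")

mask = None  # left at its default for now

print(queries.shape, keys.shape, values.shape)
```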

In the complete Transformer model, values for the sequence length and the queries, keys, and values will be obtained through a process of word tokenization and embedding. You will be covering this in a separate tutorial. 

Returning to the testing procedure, the next step is to create a new instance of the DotProductAttention class, assigning its output to the attention variable:

Since the DotProductAttention class inherits from the Layer base class, the call() method of the former will be automatically invoked by the magic __call__() method of the latter. The final step is to feed in the input arguments and print the result:

Tying everything together produces the following code listing:
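A complete test script along these lines is sketched below. It is reconstructed from the steps described in this tutorial, not copied from the original listing; the sequence length, the seed, and the use of `tf.nn.softmax` are assumptions.

```python
import tensorflow as tf
from numpy import random

class DotProductAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scaled dot product of the queries with the keys
        scores = tf.matmul(queries, keys, transpose_b=True) / tf.math.sqrt(tf.cast(d_k, tf.float32))
        # Optional masking (mask value 1 hides a position)
        if mask is not None:
            scores += -1e9 * mask
        # Attention weights, then weighted sum of the values
        weights = tf.nn.softmax(scores, axis=-1)
        return tf.matmul(weights, values)

d_k = 64              # dimensionality of the queries and keys
d_v = 64              # dimensionality of the values
batch_size = 64       # assumption: arbitrary batch size
input_seq_length = 5  # assumption: arbitrary sequence length

random.seed(42)  # optional, for repeatable dummy data
queries = random.random((batch_size, input_seq_length, d_k)).astype("float32")
keys = random.random((batch_size, input_seq_length, d_k)).astype("float32")
values = random.random((batch_size, input_seq_length, d_v)).astype("float32")

# Create the layer instance, then call the instance (not the class)
attention = DotProductAttention()
output = attention(queries, keys, values, d_k)
print(output.shape)
```

Note that the inputs are passed to the `attention` instance rather than to the `DotProductAttention` constructor; calling the instance is what triggers call() through the Layer base class.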

Running this code produces an output of shape (batch size, sequence length, values dimensionality). Note that you will likely see a different output due to the random initialization of the queries, keys, and values.

Summary

In this tutorial, you discovered how to implement scaled dot-product attention from scratch in TensorFlow and Keras.

Specifically, you learned:

  • The operations that form part of the scaled dot-product attention mechanism
  • How to implement the scaled dot-product attention mechanism from scratch

Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.

5 Responses to How to Implement Scaled Dot-Product Attention from Scratch in TensorFlow and Keras

  1. Janarddan Sarkar, October 18, 2022 at 5:36 pm

    TypeError: DotProductAttention.__init__() takes 1 positional argument but 5 were given.

    I am getting this error. Instead I have to call attention.cal(….). Any fix for this?

    • Emanuele Fittipaldi, December 30, 2022 at 12:42 am

      you have to import “from dumpy import random”

  2. Emanuele Fittipaldi, December 30, 2022 at 12:43 am

    *numpy

  3. Giovanni, May 5, 2023 at 8:42 pm

    Hi, thanks for the great article! I appreciate you took the time to add parameters for testing the function, I think the testing would be way more useful if you add random seeding before the random generation of Q, K, V.

    As an example, by calling random.seed(42) before generating the three random vector, I get the following output:

    [[[0.42829984 0.5291363 0.48467717 … 0.60236526 0.6314437 0.36796492]
    [0.42059597 0.51898783 0.46809804 … 0.59751767 0.63140476 0.39604473]
    [0.45291767 0.53372955 0.4822161 … 0.5861658 0.61705434 0.35611778]
    [0.43538865 0.52972203 0.47826144 … 0.5917443 0.6259302 0.36665624]
    [0.42998832 0.5189111 0.48113108 … 0.61032706 0.63044846 0.39192218]]

    [[0.6105153 0.50249505 0.40130395 … 0.71487725 0.36341453 0.5512418 ]
    [0.58420086 0.5239525 0.4311911 … 0.72335523 0.36001056 0.5697574 ]
    [0.5644941 0.5598139 0.44120124 … 0.69758904 0.34060007 0.57147545]
    [0.58783877 0.5212065 0.42275837 … 0.70439875 0.34812242 0.5561169 ]
    [0.5880349 0.52016133 0.43390357 … 0.70503277 0.35547623 0.56170976]

    • James Carmichael, May 6, 2023 at 10:25 am

      Thank you for your feedback and suggestions Giovanni!
