The Attention Mechanism from Scratch

By Stefania Cristina on January 6, 2023 in Attention

The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation. The idea behind the attention mechanism was to permit the decoder to utilize the most relevant parts of the input sequence in a flexible manner, by a weighted combination of all the encoded input vectors, with the most relevant vectors being attributed the highest weights.

In this tutorial, you will discover the attention mechanism and its implementation.

After completing this tutorial, you will know:

How the attention mechanism uses a weighted sum of all the encoder hidden states to flexibly focus the attention of the decoder on the most relevant parts of the input sequence
How the attention mechanism can be generalized for tasks where the information may not necessarily be related in a sequential fashion
How to implement the general attention mechanism in Python with NumPy and SciPy

Let’s get started.

Tutorial Overview

This tutorial is divided into three parts; they are:

The Attention Mechanism
The General Attention Mechanism
The General Attention Mechanism with NumPy and SciPy

The Attention Mechanism

The attention mechanism was introduced by Bahdanau et al. (2014) to address the bottleneck problem that arises with the use of a fixed-length encoding vector, where the decoder would have limited access to the information provided by the input. This is thought to become especially problematic for long and/or complex sequences, where the dimensionality of their representation would be forced to be the same as for shorter or simpler sequences.

Note that Bahdanau et al.’s attention mechanism is divided into the step-by-step computations of the alignment scores, the weights, and the context vector:

Alignment scores: The alignment model takes the encoded hidden states, $\mathbf{h}_i$, and the previous decoder output, $\mathbf{s}_{t-1}$, to compute a score, $e_{t,i}$, that indicates how well the element of the input sequence at position $i$ aligns with the current output at position $t$. The alignment model is represented by a function, $a(\cdot)$, which can be implemented by a feedforward neural network:

$$e_{t,i} = a(\mathbf{s}_{t-1}, \mathbf{h}_i)$$

Weights: The weights, $\alpha_{t,i}$, are computed by applying a softmax operation to the previously computed alignment scores:

$$\alpha_{t,i} = \text{softmax}(e_{t,i}) = \frac{\exp(e_{t,i})}{\sum_{k=1}^T \exp(e_{t,k})}$$

Context vector: A unique context vector, $\mathbf{c}_t$, is fed into the decoder at each time step. It is computed as a weighted sum of all $T$ encoder hidden states:

$$\mathbf{c}_t = \sum_{i=1}^T \alpha_{t,i} \mathbf{h}_i$$
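
The three steps above can be sketched in NumPy for a single decoder time step. This is only an illustrative sketch: it assumes the alignment model $a(\cdot)$ is a one-hidden-layer feedforward network with a tanh activation, and the parameter names `W_s`, `W_h`, and `v`, the dimensions, and the random values are all made up for the example rather than taken from a trained model.

```python
import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """One decoder step of Bahdanau-style attention (illustrative sketch).

    s_prev : (d_s,)   previous decoder output s_{t-1}
    H      : (T, d_h) encoder hidden states h_1..h_T, one per row
    W_s, W_h, v : parameters of the assumed alignment network (hypothetical names)
    """
    # alignment scores e_{t,i} = v . tanh(W_s s_{t-1} + W_h h_i), one per input position
    e = np.tanh(s_prev @ W_s + H @ W_h) @ v        # shape (T,)
    # weights alpha_{t,i}: softmax over the T input positions
    e = e - e.max()                                 # subtract the max for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()             # shape (T,)
    # context vector c_t = sum_i alpha_{t,i} h_i
    c = alpha @ H                                   # shape (d_h,)
    return alpha, c

# random stand-ins for the encoder states, decoder state, and parameters
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_s, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)

alpha, c = additive_attention(s_prev, H, W_s, W_h, v)
print(alpha.sum())   # the weights sum to 1, up to floating-point error
print(c.shape)       # the context vector has the same size as one hidden state
```

In a real encoder-decoder model, the parameters of the alignment network would be learned jointly with the rest of the network, and this computation would be repeated at every decoder time step.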

Bahdanau et al. implemented an RNN for both the encoder and decoder.

However, the attention mechanism can be re-formulated into a general form that can be applied to any sequence-to-sequence (abbreviated to seq2seq) task, where the information may not necessarily be related in a sequential fashion.

In other words, the database doesn’t have to consist of the hidden RNN states at different steps, but could contain any kind of information instead.

– Advanced Deep Learning with Python, 2019.

The General Attention Mechanism

The general attention mechanism makes use of three main components, namely the queries, $\mathbf{Q}$, the keys, $\mathbf{K}$, and the values, $\mathbf{V}$.

If you had to compare these three components to the attention mechanism as proposed by Bahdanau et al., then the query would be analogous to the previous decoder output, $\mathbf{s}_{t-1}$, while the values would be analogous to the encoded inputs, $\mathbf{h}_i$. In the Bahdanau attention mechanism, the keys and values are the same vector.

In this case, we can think of the vector $\mathbf{s}_{t-1}$ as a query executed against a database of key-value pairs, where the keys are vectors and the hidden states $\mathbf{h}_i$ are the values.

– Advanced Deep Learning with Python, 2019.

The general attention mechanism then performs the following computations:

Each query vector, $\mathbf{q} = \mathbf{s}_{t-1}$, is matched against a database of keys to compute a score value. This matching operation is computed as the dot product of the specific query under consideration with each key vector, $\mathbf{k}_i$:

$$e_{\mathbf{q},\mathbf{k}_i} = \mathbf{q} \cdot \mathbf{k}_i$$

The scores are passed through a softmax operation to generate the weights:

$$\alpha_{\mathbf{q},\mathbf{k}_i} = \text{softmax}(e_{\mathbf{q},\mathbf{k}_i}) = \frac{\exp(e_{\mathbf{q},\mathbf{k}_i})}{\sum_j \exp(e_{\mathbf{q},\mathbf{k}_j})}$$

The generalized attention is then computed by a weighted sum of the value vectors, $\mathbf{v}_{\mathbf{k}_i}$, where each value vector is paired with a corresponding key:

$$\text{attention}(\mathbf{q}, \mathbf{K}, \mathbf{V}) = \sum_i \alpha_{\mathbf{q},\mathbf{k}_i} \mathbf{v}_{\mathbf{k}_i}$$
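
As a rough illustration of these three computations, the sketch below applies them to a single query against a tiny, hand-made database of three key-value pairs. The function name `attention` and all the numbers are invented for the example; the point is only to show that keys similar to the query receive larger weights, so their values dominate the output.

```python
import numpy as np

def attention(q, K, V):
    """General attention for one query vector (illustrative sketch)."""
    scores = K @ q                                   # dot product of q with every key
    scores = scores - scores.max()                   # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the keys
    return weights @ V, weights                      # weighted sum of the values

# a toy database: each row of K is a key, the matching row of V is its value
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])
q = np.array([4.0, 0.0])

output, weights = attention(q, K, V)
print(weights)   # the first and third keys score highest against q
print(output)    # so the output is dominated by the first and third values
```

Unlike a lookup in an ordinary dictionary, no single value is selected: every value contributes to the output in proportion to its weight.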

Within the context of machine translation, each word in an input sentence would be attributed its own query, key, and value vectors. These vectors are generated by multiplying the encoder’s representation of the specific word under consideration with three different weight matrices that would have been generated during training.

In essence, when the generalized attention mechanism is presented with a sequence of words, it takes the query vector attributed to some specific word in the sequence and scores it against each key in the database. In doing so, it captures how the word under consideration relates to the others in the sequence. It then scales the values according to the attention weights (computed from the scores) to retain focus on those words relevant to the query, and sums them to produce an attention output for the word under consideration.

The General Attention Mechanism with NumPy and SciPy

This section will explore how to implement the general attention mechanism using the NumPy and SciPy libraries in Python.

For simplicity, you will initially calculate the attention for the first word in a sequence of four. You will then generalize the code to calculate an attention output for all four words in matrix form.

Hence, let’s start by first defining the word embeddings of the four different words to calculate the attention. In actual practice, these word embeddings would have been generated by an encoder; however, for this particular example, you will define them manually.

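from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])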

The next step generates the weight matrices, which you will eventually multiply with the word embeddings to generate the queries, keys, and values. Here, you shall generate these weight matrices randomly; however, in actual practice, these would have been learned during training.

...
# generating the weight matrices
random.seed(42) # to allow us to reproduce the same attention values
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

Notice how the number of rows of each of these matrices is equal to the dimensionality of the word embeddings (which in this case is three) to allow us to perform the matrix multiplication.
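
To make the shape requirement concrete, here is a small sketch that is separate from the running example. The sizes `d_k` and `d_v` and the all-ones matrices are arbitrary choices for illustration: only the number of rows has to match the embedding dimensionality, while the number of columns determines the size of the resulting query, key, and value vectors.

```python
import numpy as np

d_model = 3          # dimensionality of each word embedding
d_k, d_v = 2, 5      # hypothetical sizes for the queries/keys and for the values

word = np.array([1, 0, 0])             # one embedding, shape (d_model,)
W_Q = np.ones((d_model, d_k))          # rows match d_model, columns set the output size
W_K = np.ones((d_model, d_k))
W_V = np.ones((d_model, d_v))

query, key, value = word @ W_Q, word @ W_K, word @ W_V
print(query.shape, key.shape, value.shape)   # (2,) (2,) (5,)

# a weight matrix with the wrong number of rows cannot be multiplied
try:
    word @ np.ones((4, d_k))
except ValueError as err:
    print("shape mismatch:", err)
```

The queries and keys must share the same size because they are later compared with a dot product; the values may have a different size.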

Subsequently, the query, key, and value vectors for each word are generated by multiplying each word embedding by each of the weight matrices.

...
# generating the queries, keys and values
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V

query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V

query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V

query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V

Considering only the first word for the time being, the next step scores its query vector against all the key vectors using a dot product operation.

...
# scoring the first query vector against all key vectors
scores = array([dot(query_1, key_1), dot(query_1, key_2), dot(query_1, key_3), dot(query_1, key_4)])

The score values are subsequently passed through a softmax operation to generate the weights. Before doing so, it is common practice to divide the score values by the square root of the dimensionality of the key vectors (in this case, three) to keep the gradients stable.

...
# computing the weights by a softmax operation
weights = softmax(scores / key_1.shape[0] ** 0.5)
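
The effect of this scaling can be seen in a small experiment that is independent of the running example. The sketch below uses randomly drawn queries and keys (an assumption made purely for illustration) and a hand-written `softmax` helper. As the key dimensionality grows, the unscaled dot products tend to grow in magnitude, the softmax concentrates almost all of its mass on one key, and gradients through it become very small; dividing by the square root of the dimensionality keeps the weights more spread out.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
for d_k in (3, 64, 512):
    q = rng.normal(size=d_k)            # a random query
    K = rng.normal(size=(4, d_k))       # four random keys
    scores = K @ q
    unscaled = softmax(scores)
    scaled = softmax(scores / d_k ** 0.5)
    # largest attention weight without and with scaling
    print(d_k, round(float(unscaled.max()), 3), round(float(scaled.max()), 3))
```

For any set of scores, the largest weight after scaling is never greater than the largest weight before scaling, since dividing by a number greater than one flattens the softmax.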

Finally, the attention output is calculated by a weighted sum of all four value vectors.

...
# computing the attention by a weighted sum of the value vectors
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)

print(attention)

[0.98522025 1.74174051 0.75652026]

For faster processing, the same calculations can be implemented in matrix form to generate an attention output for all four words in one go:

from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)

[[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]
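
It may help to look at the shapes involved when reading this result. The following sketch repeats the matrix computation with NumPy only and then checks a few properties. Note the assumptions: it uses `default_rng(42)`, which draws a different random stream from `random.seed(42)`, so its numbers will not match the output printed above, and `softmax_rows` is a hand-written helper standing in for `scipy.special.softmax(..., axis=1)`.

```python
import numpy as np

def softmax_rows(x):
    # numerically stable softmax applied to each row independently
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

words = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])

rng = np.random.default_rng(42)   # not the same stream as random.seed(42)
W_Q = rng.integers(3, size=(3, 3))
W_K = rng.integers(3, size=(3, 3))
W_V = rng.integers(3, size=(3, 3))

Q, K, V = words @ W_Q, words @ W_K, words @ W_V
weights = softmax_rows(Q @ K.T / K.shape[1] ** 0.5)
attention = weights @ V

print(weights.shape)         # (4, 4): one weight per (query word, key word) pair
print(attention.shape)       # (4, 3): one output vector per word, in the value space
print(weights.sum(axis=1))   # each row of weights sums to 1

# row 0 recomputed as an explicit weighted sum of the four value vectors
row_0 = sum(weights[0, j] * V[j] for j in range(4))
print(np.allclose(row_0, attention[0]))   # True
```

In other words, it is the 4×4 weights matrix that relates each word to every other word. Each row of the 4×3 attention output is a weighted average of the four value vectors, so its three columns correspond to the dimensions of the value vectors, not to individual words.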

Summary

In this tutorial, you discovered the attention mechanism and its implementation.

Specifically, you learned:

How the attention mechanism uses a weighted sum of all the encoder hidden states to flexibly focus the attention of the decoder on the most relevant parts of the input sequence
How the attention mechanism can be generalized for tasks where the information may not necessarily be related in a sequential fashion
How to implement the general attention mechanism with NumPy and SciPy

Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.

27 Responses to The Attention Mechanism from Scratch

Shad April 8, 2022 at 8:43 pm #

Hi Stefania,
I have a rather naïve question, can attention be used for Tabular data (where the values are floating point units, i.e similar to Iris dataset) ? If yes, could you kindly suggest (or if possible) how ? I am dealing with highly-imbalanced multi-dimensional tabular-data and I would like to implement Attention to my 1d-cnn. I would greatly appreciate your input.

Thanks.

- James Carmichael April 9, 2022 at 8:41 am #
  
  Hi Shad…the following resource may prove helpful regarding tabular data and deep learning:
  
  https://medium.com/codon-consulting/tabular-data-and-deep-learning-where-do-we-stand-209b202e443a
  
Deep Learner August 6, 2022 at 5:23 pm #

Sir, what does the last computed attention matrix show us?
Is it showing the correlation between words?

- James Carmichael August 7, 2022 at 7:15 am #
  
  Hi Deep Learner…Your understanding is correct! Keep up the great work!
  
Shuja August 12, 2022 at 8:57 pm #

Hi, I have a small question, new to all this, what is this operator ‘@’ in:

query_1 = word_1 @ W_Q

Thanks,

- James Carmichael August 13, 2022 at 6:13 am #
  
  Hi Shuja…The following resource may help clarify:
  
  https://www.codingem.com/numpy-at-operator/
  
  - Shuja October 4, 2022 at 6:04 am #
    
    Thanks James, I really appreciate
    
    - James Carmichael October 4, 2022 at 7:00 am #
      
      You are very welcome Shuja!
      
Diego September 4, 2022 at 12:03 am #

Hi professor,

I have a naive doubt as to the form and meaning of the attention result: Is the number of columns dictated by the encoder (which has dimension 3), or because we are using 4 words and then each word correlates with the other three words in the sentence as Deep Learner commented?

In the latter case, if I get, for example, word 3, then the first element of the corresponding attention vector (row) is related to the first word, the second element to the second, and the third element to the fourth word ?.

- James Carmichael September 4, 2022 at 9:58 am #
  
  Hi Diego…You may find the following of interest:
  
  https://machinelearningmastery.com/a-tour-of-attention-based-architectures/
  
Diego September 5, 2022 at 10:46 pm #

Thanks!

- Ricardo July 23, 2023 at 1:42 am #
  
  I would also like to have this clear, and I think James Carmichael answer could be straightforward…
  
Vaibhav September 8, 2022 at 4:03 am #

Within the context of machine translation, each word in an input sentence ….

Shouldn’t it be —- Within the context of self-attention, …

Because in self-attention, we take h for each word and multiply it by weights and then perform the dot product and other operations.

harshavardhana September 10, 2022 at 2:22 am #

Awesome… many thanks

- James Carmichael September 10, 2022 at 7:32 am #
  
  You are very welcome harshavardhana!
  
ginko September 11, 2022 at 2:17 pm #

Can someone elaborate on the meaning of the attention matrix? It’s not obvious to me why this tells us the “correlation” between words. Looking at the first and second row,

[0.98522025 1.74174051 0.75652026]
[0.90965265 1.40965265 0.5 ]

What does this tell us between the first and second word?

- James Carmichael September 12, 2022 at 5:07 am #
  
  Hi ginko…The following resource should add clarity:
  
  https://towardsdatascience.com/attention-and-its-different-forms-7fc3674d14dc
  
ginko September 12, 2022 at 10:08 am #

Thanks for this link! I learned a lot, but this link does not explicitly explain what these attention vectors explicitly mean. Here’s my understanding – is this an accurate analogy for this tutorial?

We have an English phrase with 4 words (rows). We translate it to French, and the translation has 4 words (columns).
The score matrix (which the other link says is the “Attention” matrix):

F1 F2 F3 F4
E1 [[2.36e-01, 7.39e-03, 7.49e-01, 7.39e-03],
E2 [4.55e-01, 4.52e-02, 4.55e-01, 4.52e-02],
E3 [2.39e-01, 7.44e-04, 7.59e-01, 7.44e-04],
E4 [9.00e-02, 2.82e-03, 9.06e-01, 1.58e-03]]

– suggests that all 4 English words (E1, E2, E3, E4) can be encapsulated by French word F3. However, the best corresponding word in English for French word F3 is E4 (highest score).
– suggests that for English word E2, it is providing context for French word F1 and F3 in ~equal magnitude.
– in theory, the diagonal should be 1 (if 1-to-1 translation).

The attention matrix (which the link calls ‘Context’ matrix) is a little ambiguous in interpretation. If someone is able to correct this, and/or provide an explanation for the 4 x 3 attention matrix I believe it would greatly improve this tutorial! Thank you for putting this together.

- Emanuele Fittipaldi December 27, 2022 at 1:52 am #
  
  Hi ginko, I was wondering the same thing “How do I have to interpret the attention matrix?”. I tried my best to give an answer to this question and this is my thought about it:
  – The score matrix is 4×4 so here we still have a 1:1 association about how well words scored related to each other.
  – The confusion comes when we compute the final attention matrix because multiplying a 4×4 matrix with a 4×3 matrix the end matrix will be 4×3
  
  But if we have thought for example the first row of the score matrix as:
  [ word1/word1 word1/word2 word1/word3 word1/word4 ]
  
  then the elements of the attention vector (a row of the attention matrix) might be interpreted as somewhat a coded version of the row of the score matrix since in the product between two matrix we do rows*columns product.
  So it is like each element in the attention vector carry the information of
  word1/word1 word1/word2 word1/word3 word1/word4
  in this case 3 times since the dimensionality of the key vectors is 3
  
zaid October 28, 2022 at 7:34 pm #

how can we implement it in a regression problem? is it possible? with xgboost or lightgbm?

Emanuele Fittipaldi December 27, 2022 at 1:54 am #

I made a mistake, it’s not 4×3 but 3×3 matrix we are multiplying the weights to

MAK January 18, 2023 at 3:03 am #

Hello,
I want to add attention to an AutoEncoder neural network.
My main goal in training the AutoEncoder is feature reduction.
Is it useful to use the attention mechanism here? If yes, how? (It looks like it is more suitable for seq2seq problems.)
Thanks,
MAK

zakaria January 26, 2023 at 11:37 pm #

Thank you for your remarkable explanation and all your work.

I have only one issue: the mathematical formulas are not rendered as LaTeX but still appear in their raw form. For example, I see all formulas this way:

$$e_{\mathbf{q},\mathbf{k}_i} = \mathbf{q} \cdot \mathbf{k}_i$$

- James Carmichael January 27, 2023 at 10:52 am #
  
  Thank you for your feedback and suggestions!
  
Ricardo July 23, 2023 at 1:52 am #

I may have misunderstood, but could we multiplicate a 4×4 matrix with a 3×3?

- James Carmichael July 23, 2023 at 2:56 pm #
  
  Hi Ricardo…The answer is no. The following resource explains matrix multiplication.
  
  https://www.mathsisfun.com/algebra/matrix-multiplying.html
  
John August 1, 2023 at 4:50 pm #

Since you have already provided the code and description, what is the benefit of buying the Ebook?
