The Bahdanau Attention Mechanism

By Stefania Cristina on January 6, 2023 in Attention

Conventional encoder-decoder architectures for machine translation encoded every source sentence into a fixed-length vector, regardless of its length, from which the decoder would then generate a translation. This made it difficult for the neural network to cope with long sentences, essentially resulting in a performance bottleneck.

The Bahdanau attention was proposed to address the performance bottleneck of conventional encoder-decoder architectures, achieving significant improvements over the conventional approach.

In this tutorial, you will discover the Bahdanau attention mechanism for neural machine translation.

After completing this tutorial, you will know:

Where the Bahdanau attention derives its name from and the challenge it addresses
The role of the different components that form part of the Bahdanau encoder-decoder architecture
The operations performed by the Bahdanau attention algorithm

Let’s get started.

The Bahdanau attention mechanism
Photo by Sean Oulashin, some rights reserved.

Tutorial Overview

This tutorial is divided into two parts; they are:

Introduction to the Bahdanau Attention
The Bahdanau Architecture
- The Encoder
- The Decoder
- The Bahdanau Attention Algorithm

Prerequisites

For this tutorial, we assume that you are already familiar with:

Introduction to the Bahdanau Attention

The Bahdanau attention mechanism takes its name from Dzmitry Bahdanau, the first author of the paper in which it was published.

It follows the work of Cho et al. (2014) and Sutskever et al. (2014), who also employed an RNN encoder-decoder framework for neural machine translation, specifically by encoding a variable-length source sentence into a fixed-length vector. The latter would then be decoded into a variable-length target sentence.

Bahdanau et al. (2014) argued that this encoding of a variable-length input into a fixed-length vector squashes the information of the source sentence, irrespective of its length, causing the performance of a basic encoder-decoder model to deteriorate rapidly with an increasing length of the input sentence. The approach they proposed replaces the fixed-length vector with a variable-length one to improve the translation performance of the basic encoder-decoder model.

The most important distinguishing feature of this approach from the basic encoder-decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.

– Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

The Bahdanau Architecture

The main components in use by the Bahdanau encoder-decoder architecture are the following:

$\mathbf{s}_{t-1}$ is the hidden decoder state at the previous time step, $t-1$.
$\mathbf{c}_t$ is the context vector at time step $t$. A new context vector is computed at each decoder step to generate a target word, $y_t$.
$\mathbf{h}_i$ is an annotation that captures the information contained in the words forming the entire input sentence, $\{ x_1, x_2, \dots, x_T \}$, with strong focus around the $i$-th word out of $T$ total words.
$\alpha_{t,i}$ is a weight value assigned to each annotation, $\mathbf{h}_i$, at the current time step, $t$.
$e_{t,i}$ is an attention score generated by an alignment model, $a(.)$, that scores how well $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ match.

These components find their use at different stages of the Bahdanau architecture, which employs a bidirectional RNN as an encoder and an RNN decoder, with an attention mechanism in between:

The Bahdanau architecture
Taken from “Neural Machine Translation by Jointly Learning to Align and Translate“

The Encoder

The role of the encoder is to generate an annotation, $\mathbf{h}_i$, for every word, $x_i$, in an input sentence of length $T$ words.

For this purpose, Bahdanau et al. employ a bidirectional RNN, which reads the input sentence in the forward direction to produce a forward hidden state, $\overrightarrow{\mathbf{h}_i}$, and then reads the input sentence in the reverse direction to produce a backward hidden state, $\overleftarrow{\mathbf{h}_i}$. The annotation for some particular word, $x_i$, concatenates the two states:

$$\mathbf{h}_i = \left[ \overrightarrow{\mathbf{h}_i^T} \; ; \; \overleftarrow{\mathbf{h}_i^T} \right]^T$$

The idea behind generating each annotation in this manner was to capture a summary of both the preceding and succeeding words.

In this way, the annotation $\mathbf{h}_i$ contains the summaries of both the preceding words and the following words.

– Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

The generated annotations are then passed to the decoder to generate the context vector.
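As a rough illustration of how the annotations are assembled, the following minimal NumPy sketch runs a plain tanh RNN over a toy sentence in both directions and concatenates the two hidden states at each position. The helper `rnn_states`, the toy sizes, and the random weights are all assumptions made for this example; the paper uses gated recurrent units rather than a plain tanh RNN.

```python
import numpy as np

def rnn_states(inputs, W_x, W_h):
    """Run a plain tanh RNN over `inputs` and return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 4, 3                    # toy sizes (assumed)
X = rng.normal(size=(T, d_in))            # stand-in word embeddings x_1..x_T

# Separate (hypothetical) parameters for the forward and backward RNNs.
Wx_f, Wh_f = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wx_b, Wh_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

forward = rnn_states(X, Wx_f, Wh_f)                # reads x_1 ... x_T
backward = rnn_states(X[::-1], Wx_b, Wh_b)[::-1]   # reads x_T ... x_1, then re-aligned to positions

# Annotation h_i = [forward_i ; backward_i], one row per source word.
H = np.concatenate([forward, backward], axis=1)
print(H.shape)  # (5, 6)
```

Each row of `H` is one annotation: its first half has seen the words up to position $i$, and its second half has seen the words from position $i$ to the end of the sentence.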

The Decoder

The role of the decoder is to produce the target words by focusing on the most relevant information contained in the source sentence. For this purpose, it makes use of an attention mechanism.

Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

– Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

The decoder takes each annotation and feeds it to an alignment model, $a(.)$, together with the previous hidden decoder state, $\mathbf{s}_{t-1}$. This generates an attention score:

$$e_{t,i} = a(\mathbf{s}_{t-1}, \mathbf{h}_i)$$

The function implemented by the alignment model here combines $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ using an addition operation. For this reason, the attention mechanism implemented by Bahdanau et al. is referred to as additive attention.

This can be implemented in two ways, either (1) by applying a weight matrix, $\mathbf{W}$, over the concatenated vectors, $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$, or (2) by applying the weight matrices, $\mathbf{W}_1$ and $\mathbf{W}_2$, to $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ separately:

$$a(\mathbf{s}_{t-1}, \mathbf{h}_i) = \mathbf{v}^T \tanh(\mathbf{W}[\mathbf{h}_i \; ; \; \mathbf{s}_{t-1}])$$
$$a(\mathbf{s}_{t-1}, \mathbf{h}_i) = \mathbf{v}^T \tanh(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_{t-1})$$

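Here, $\mathbf{v}$ is a learned weight vector. The output of the $\tanh$ layer is itself a vector, and the dot product with $\mathbf{v}$ reduces it to the single scalar score, $e_{t,i}$.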

The alignment model is parametrized as a feedforward neural network and jointly trained with the remaining system components.
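The two formulations of the alignment model can be checked against each other numerically. The sketch below is only an illustration with randomly initialized weights and assumed toy dimensions: it computes the score with separate matrices $\mathbf{W}_1$ and $\mathbf{W}_2$, then with a single matrix $\mathbf{W}$ built by placing $\mathbf{W}_1$ and $\mathbf{W}_2$ side by side, and confirms that the two scores agree.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_s, d_a = 6, 4, 5            # annotation, decoder-state, and hidden-layer sizes (assumed)

h_i = rng.normal(size=d_h)         # one annotation
s_prev = rng.normal(size=d_s)      # previous decoder state s_{t-1}

W1 = rng.normal(size=(d_a, d_h))   # acts on the annotation
W2 = rng.normal(size=(d_a, d_s))   # acts on the decoder state
v = rng.normal(size=d_a)           # maps the hidden layer to a scalar score

# Separate weight matrices, combined by addition.
e_separate = v @ np.tanh(W1 @ h_i + W2 @ s_prev)

# One matrix applied to the concatenation [h_i ; s_{t-1}].
W = np.concatenate([W1, W2], axis=1)
e_concat = v @ np.tanh(W @ np.concatenate([h_i, s_prev]))

print(np.isclose(e_separate, e_concat))  # True
```

The agreement follows from block matrix multiplication: $\mathbf{W}[\mathbf{h}_i \; ; \; \mathbf{s}_{t-1}] = \mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_{t-1}$ whenever $\mathbf{W} = [\mathbf{W}_1 \; \mathbf{W}_2]$.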

Subsequently, a softmax function is applied over the attention scores of all $T$ annotations to obtain the corresponding weight values:

$$\alpha_{t,i} = \text{softmax}(e_{t,i}) = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$$

The softmax function normalizes the attention scores to values between 0 and 1 that sum to 1 over the source positions; hence, the resulting weights can be considered probability values. Each probability (or weight) value reflects how important the annotation $\mathbf{h}_i$ is, with respect to the previous decoder state $\mathbf{s}_{t-1}$, in generating the next state, $\mathbf{s}_t$, and the next output, $y_t$.
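A small sketch of this normalization, using made-up scores for a four-word source sentence, might look as follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick and is an addition of this example, not something the paper prescribes; it leaves the resulting weights unchanged.

```python
import numpy as np

def softmax(scores):
    """Normalize a 1-D array of scores into positive weights that sum to 1."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

e_t = np.array([2.0, 1.0, 0.1, -1.0])   # made-up scores e_{t,1}, ..., e_{t,4}
alpha_t = softmax(e_t)

print(alpha_t.round(3))   # the largest score receives the largest weight
print(alpha_t.sum())      # the weights sum to 1 (up to floating-point rounding)
```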

Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector.

– Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

This is finally followed by the computation of the context vector as a weighted sum of the annotations:

$$\mathbf{c}_t = \sum^T_{i=1} \alpha_{t,i} \mathbf{h}_i$$
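With the annotations stacked as the rows of a matrix, the weighted sum reduces to a single vector-matrix product. The values below are invented purely for illustration.

```python
import numpy as np

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])               # three toy annotations h_1, h_2, h_3 (one per row)
alpha_t = np.array([0.5, 0.25, 0.25])    # made-up attention weights that sum to 1

c_t = alpha_t @ H                        # sum over i of alpha_{t,i} * h_i
print(c_t)
```

Because the weights are non-negative and sum to 1, the context vector is a weighted average of the annotations, pulled towards those with the largest weights.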

The Bahdanau Attention Algorithm

In summary, the attention algorithm proposed by Bahdanau et al. performs the following operations:

1. The encoder generates a set of annotations, $\mathbf{h}_i$, from the input sentence.
2. These annotations are fed to an alignment model together with the previous hidden decoder state. The alignment model uses this information to generate the attention scores, $e_{t,i}$.
3. A softmax function is applied to the attention scores, effectively normalizing them into weight values, $\alpha_{t,i}$, in a range between 0 and 1.
4. Together with the previously computed annotations, these weights are used to generate a context vector, $\mathbf{c}_t$, through a weighted sum of the annotations.
5. The context vector is fed to the decoder together with the previous hidden decoder state and the previous output to compute the final output, $y_t$.
6. Steps 2-5 are repeated until the end of the sequence.
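The operations above can be sketched end to end in a few lines of NumPy. This is a simplified illustration rather than the paper's implementation: the function name `attention_step`, the toy sizes, and the random weights are assumptions, the annotations are random stand-ins for real encoder outputs, and the decoder update is a plain tanh layer over $[\mathbf{s}_{t-1} \; ; \; \mathbf{c}_t]$ that leaves out both the previous output and the gated unit used by Bahdanau et al. The target word $y_t$ is not computed.

```python
import numpy as np

def attention_step(s_prev, H, W1, W2, v):
    """Compute scores, weights, and the context vector for one decoder step."""
    e = np.tanh(H @ W1.T + s_prev @ W2.T) @ v   # score e_{t,i} for every annotation, shape (T,)
    e = e - e.max()                             # numerical stability for the softmax
    alpha = np.exp(e) / np.exp(e).sum()         # attention weights, shape (T,)
    c = alpha @ H                               # context vector, shape (d_h,)
    return c, alpha

rng = np.random.default_rng(0)
T, d_h, d_s, d_a, steps = 5, 6, 4, 5, 3         # toy sizes (assumed)

H = rng.normal(size=(T, d_h))                   # annotations (random stand-ins for encoder outputs)
W1 = rng.normal(size=(d_a, d_h))
W2 = rng.normal(size=(d_a, d_s))
v = rng.normal(size=d_a)

# Hypothetical decoder update: a plain tanh layer on [s_{t-1} ; c_t].
U = rng.normal(size=(d_s, d_s + d_h))

s = np.zeros(d_s)                               # initial decoder state (assumed zero)
for t in range(steps):
    c, alpha = attention_step(s, H, W1, W2, v)
    s = np.tanh(U @ np.concatenate([s, c]))     # simplified next decoder state
    print(t, alpha.round(2))
```

Printing the weights at each step shows how the distribution over source positions shifts as the decoder state changes, which is the behavior the attention mechanism is designed to provide.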

Bahdanau et al. tested their architecture on the task of English-to-French translation. They reported that their model significantly outperformed the conventional encoder-decoder model, regardless of the sentence length.

Several improvements over the Bahdanau attention have since been proposed, such as those of Luong et al. (2015), which we shall review in a separate tutorial.

Summary

In this tutorial, you discovered the Bahdanau attention mechanism for neural machine translation.

Specifically, you learned:

Where the Bahdanau attention derives its name from and the challenge it addresses
The role of the different components that form part of the Bahdanau encoder-decoder architecture
The operations performed by the Bahdanau attention algorithm

Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.

7 Responses to The Bahdanau Attention Mechanism

Misaal November 26, 2021 at 4:24 pm #

Thank you so much machinelearningmastery. Everything is so well researched and explained. Have always searched for ur articles because it clears all the doubts and it is so easy to grasp the information. Thank you once again !!!!!!!!!!!!!! 🙂

- Adrian Tam November 29, 2021 at 8:33 am #
  
  Thanks. Glad you like it.
  
Binata December 4, 2022 at 7:29 pm #

Thanks for your post. Can you provide an example code for timeseries forecasting using Encoder Decoder with Bahdanau attention mechanism? Thanks.

- James Carmichael December 5, 2022 at 10:21 am #
  
  Hi Binata…The following resource may be of interest:
  
  https://www.kaggle.com/code/kmkarakaya/encoder-decoder-with-bahdanau-luong-attention
  
Partha June 22, 2023 at 2:25 pm #

Not able to understand a thing , get more confused !!

- James Carmichael June 23, 2023 at 8:14 am #
  
  Hi Partha…This is a challenging topic! Hang in there and keep learning. The following resource has excellent examples you can work through to gain better understanding:
  
  https://machinelearningmastery.com/transformer-models-with-attention/
  
Jack September 28, 2024 at 12:40 am #

This is a nice explanation, but what is v in the score calculation? Its purpose is never explained. It doesn’t seem necessary, and no explanation is offered. Weirdly, I don’t see a lot of explanation for it in most of the online tutorials on attention. It always just magically appears and gets handwaved off as a learned weight. But why? Isn’t weighting already handled by W? It seems redundant.
