The Bahdanau Attention Mechanism

Last Updated on October 17, 2021

Conventional encoder-decoder architectures for machine translation encoded every source sentence into a fixed-length vector, irrespective of its length, from which the decoder would then generate a translation. This made it difficult for the neural network to cope with long sentences, essentially resulting in a performance bottleneck.

The Bahdanau attention was proposed to address the performance bottleneck of conventional encoder-decoder architectures, achieving significant improvements over the conventional approach.

In this tutorial, you will discover the Bahdanau attention mechanism for neural machine translation.

After completing this tutorial, you will know:

• Where the Bahdanau attention derives its name from, and the challenge it addresses.
• The role of the different components that form part of the Bahdanau encoder-decoder architecture.
• The operations performed by the Bahdanau attention algorithm.

Let’s get started.

Tutorial Overview

This tutorial is divided into two parts; they are:

• Introduction to the Bahdanau Attention
• The Bahdanau Architecture
    • The Encoder
    • The Decoder
    • The Bahdanau Attention Algorithm

Prerequisites

For this tutorial, we assume that you are already familiar with:

Introduction to the Bahdanau Attention

The Bahdanau attention mechanism takes its name from Dzmitry Bahdanau, the first author of the paper in which it was published.

It follows the work of Cho et al. (2014) and Sutskever et al. (2014), who had also employed an RNN encoder-decoder framework for neural machine translation, specifically by encoding a variable-length source sentence into a fixed-length vector. The latter would then be decoded into a variable-length target sentence.

Bahdanau et al. (2014) argue that this encoding of a variable-length input into a fixed-length vector squashes the information of the source sentence, irrespective of its length, causing the performance of a basic encoder-decoder model to deteriorate rapidly with an increasing length of the input sentence. The approach they propose, on the other hand, replaces the fixed-length vector with a variable-length one, to improve the translation performance of the basic encoder-decoder model.

The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.

The Bahdanau Architecture

The main components in use by the Bahdanau encoder-decoder architecture are the following:

• $\mathbf{s}_{t-1}$ is the hidden decoder state at the previous time step, $t-1$.
• $\mathbf{c}_t$ is the context vector at time step, $t$. It is uniquely generated at each decoder step to generate a target word, $y_t$.
• $\mathbf{h}_i$ is an annotation that captures the information contained in the words forming the entire input sentence, $\{ x_1, x_2, \dots, x_T \}$, with strong focus around the $i$-th word out of $T$ total words.
• $\alpha_{t,i}$ is a weight value assigned to each annotation, $\mathbf{h}_i$, at the current time step, $t$.
• $e_{t,i}$ is an attention score generated by an alignment model, $a(.)$, that scores how well $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ match.

These components find their use at different stages of the Bahdanau architecture, which employs a bidirectional RNN as an encoder and an RNN decoder, with an attention mechanism in between.

The Encoder

The role of the encoder is to generate an annotation, $\mathbf{h}_i$, for every word, $x_i$, in an input sentence of length $T$ words.

For this purpose, Bahdanau et al. employ a bidirectional RNN, which reads the input sentence in the forward direction to produce a forward hidden state, $\overrightarrow{\mathbf{h}_i}$, and then reads the input sentence in the reverse direction to produce a backward hidden state, $\overleftarrow{\mathbf{h}_i}$. The annotation for some particular word, $x_i$, concatenates the two states:

$$\mathbf{h}_i = \left[ \overrightarrow{\mathbf{h}_i^T} \; ; \; \overleftarrow{\mathbf{h}_i^T} \right]^T$$

The idea behind generating each annotation in this manner was to capture a summary of both preceding and succeeding words.

In this way, the annotation $\mathbf{h}_i$ contains the summaries of both the preceding words and the following words.

The generated annotations are then passed to the decoder to generate the context vector.
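To make the concatenation concrete, here is a minimal NumPy sketch of how the annotations could be built. It is an illustration under stated assumptions rather than the authors' implementation: it uses a plain tanh RNN cell in place of the gated units used in the paper, random stand-in word embeddings, and hypothetical names (`rnn_pass`, `W_x_fwd`, and so on) and sizes chosen only for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_in, d_h = 5, 4, 3           # sentence length, embedding size, hidden size (illustrative)
X = rng.normal(size=(T, d_in))   # stand-in word embeddings for x_1..x_T

def rnn_pass(inputs, W_x, W_h):
    """Run a plain tanh RNN over `inputs` and return one hidden state per step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

# Each direction has its own parameters (hypothetical names)
W_x_fwd, W_h_fwd = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
W_x_bwd, W_h_bwd = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

h_fwd = rnn_pass(X, W_x_fwd, W_h_fwd)               # reads x_1 -> x_T
h_bwd = rnn_pass(X[::-1], W_x_bwd, W_h_bwd)[::-1]   # reads x_T -> x_1, then re-aligned to positions 1..T

# Row i of H is the annotation h_i = [forward state ; backward state]
H = np.concatenate([h_fwd, h_bwd], axis=1)
print(H.shape)  # (5, 6)
```

Note that the backward states are flipped back after the reverse pass, so that row $i$ of both matrices refers to the same word, $x_i$, before they are concatenated.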

The Decoder

The role of the decoder is to produce the target words by focusing on the most relevant information contained in the source sentence. For this purpose, it makes use of an attention mechanism.

Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

The decoder takes each annotation and feeds it to an alignment model, $a(.)$, together with the previous hidden decoder state, $\mathbf{s}_{t-1}$. This generates an attention score:

$$e_{t,i} = a(\mathbf{s}_{t-1}, \mathbf{h}_i)$$

The function implemented by the alignment model, here, combines $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ by means of an addition operation. For this reason, the attention mechanism implemented by Bahdanau et al. is referred to as additive attention.

This can be implemented in two ways, either (1) by applying a weight matrix, $\mathbf{W}$, over the concatenated vectors, $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$, or (2) by applying the weight matrices, $\mathbf{W}_1$ and $\mathbf{W}_2$, to $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ separately:

1. $$a(\mathbf{s}_{t-1}, \mathbf{h}_i) = \mathbf{v}^T \tanh(\mathbf{W}[\mathbf{h}_i \; ; \; \mathbf{s}_{t-1}])$$
2. $$a(\mathbf{s}_{t-1}, \mathbf{h}_i) = \mathbf{v}^T \tanh(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_{t-1})$$

Here, $\mathbf{v}$ is a weight vector.

The alignment model is parametrized as a feedforward neural network, and jointly trained with the remaining system components.
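The two formulations of the scoring function are closely related: if $\mathbf{W}$ is formed by placing $\mathbf{W}_1$ and $\mathbf{W}_2$ side by side, both produce the same score. The following NumPy sketch checks this numerically. The dimensions and random parameter values are assumptions made purely for illustration, and the function names `score_separate` and `score_concat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d_s, d_h, d_a = 4, 6, 5   # decoder state, annotation, and alignment layer sizes (illustrative)

s_prev = rng.normal(size=d_s)   # previous decoder state, s_{t-1}
h_i = rng.normal(size=d_h)      # a single annotation, h_i

W1 = rng.normal(size=(d_a, d_h))   # applied to h_i
W2 = rng.normal(size=(d_a, d_s))   # applied to s_{t-1}
v = rng.normal(size=d_a)           # weight vector

def score_separate(s_prev, h_i):
    # Form 2: v^T tanh(W1 h_i + W2 s_{t-1})
    return v @ np.tanh(W1 @ h_i + W2 @ s_prev)

# Placing W1 and W2 side by side gives the single matrix W of form 1
W = np.concatenate([W1, W2], axis=1)   # shape (d_a, d_h + d_s)

def score_concat(s_prev, h_i):
    # Form 1: v^T tanh(W [h_i ; s_{t-1}])
    return v @ np.tanh(W @ np.concatenate([h_i, s_prev]))

print(np.isclose(score_separate(s_prev, h_i), score_concat(s_prev, h_i)))  # True
```

In either form, the score is a single scalar for each pair of $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$.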

Subsequently, a softmax function is applied over the attention scores of all $T$ annotations to obtain the corresponding weight values:

$$\alpha_{t,i} = \text{softmax}(e_{t,i}) = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$$

The application of the softmax function essentially normalizes the attention scores to a range between 0 and 1 such that they sum to one and, hence, the resulting weights can be considered as probability values. Each probability (or weight) value reflects how important the annotation $\mathbf{h}_i$ is, with respect to the previous decoder state $\mathbf{s}_{t-1}$, in generating the next state, $\mathbf{s}_t$, and the next output, $y_t$.
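As a small illustration of this normalization, the sketch below applies a softmax to a handful of made-up attention scores for a single decoder step. The score values are arbitrary, and subtracting the maximum score before exponentiating is a common numerical-stability trick that we add here; it does not change the result.

```python
import numpy as np

def softmax(scores):
    """Normalize a 1-D array of scores into weights that sum to 1."""
    shifted = scores - np.max(scores)   # guards against overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

e_t = np.array([2.0, 1.0, 0.1, -1.0])   # made-up scores e_{t,1}..e_{t,4}
alpha_t = softmax(e_t)

print(alpha_t.round(3))   # every weight lies strictly between 0 and 1
print(alpha_t.argmax())   # 0: the annotation with the highest score gets the largest weight
```

Because the weights are computed jointly over all the scores, raising the score of one annotation necessarily lowers the weights of the others.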

Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector.

This is finally followed by the computation of the context vector as a weighted sum of the annotations:

$$\mathbf{c}_t = \sum^T_{i=1} \alpha_{t,i} \mathbf{h}_i$$
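The weighted sum can be written as a single vector-matrix product. In the toy example below, the three annotations and the attention weights are hand-picked values, assumed only so that the arithmetic is easy to follow by eye.

```python
import numpy as np

# Three toy annotations h_1..h_3, one per row, each of size 2
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Attention weights alpha_{t,1}..alpha_{t,3} for one decoder step (they sum to 1)
alpha_t = np.array([0.5, 0.3, 0.2])

# c_t = 0.5 * h_1 + 0.3 * h_2 + 0.2 * h_3
c_t = alpha_t @ H
print(c_t)   # approximately [0.7 0.5]
```

The context vector has the same size as a single annotation, regardless of how many words the source sentence contains.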

The Bahdanau Attention Algorithm

In summary, the attention algorithm proposed by Bahdanau et al. performs the following operations:

1. The encoder generates a set of annotations, $\mathbf{h}_i$, from the input sentence.
2. These annotations are fed to an alignment model together with the previous hidden decoder state. The alignment model uses this information to generate the attention scores, $e_{t,i}$.
3. A softmax function is applied to the attention scores, effectively normalizing them into weight values, $\alpha_{t,i}$, in a range between 0 and 1.
4. These weights together with the previously computed annotations are used to generate a context vector, $\mathbf{c}_t$, through a weighted sum of the annotations.
5. The context vector is fed to the decoder together with the previous hidden decoder state and the previous output, to compute the final output, $y_t$.
6. Steps 2-5 are repeated until the end of the sequence.
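The steps listed above can be sketched end to end in NumPy. This is a simplified illustration under several assumptions, not the model from the paper: the annotations are random stand-ins for the encoder output, the decoder is a plain tanh cell with a linear output layer (the paper uses gated units), the initial decoder state is set to zero, the output word is picked greedily, and a fixed number of steps replaces the end-of-sequence check. All parameter names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_h, d_s, d_a, vocab = 6, 8, 5, 7, 4   # illustrative sizes, not taken from the paper

# Step 1: annotations h_1..h_T (random stand-ins for the encoder output)
H = rng.normal(size=(T, d_h))

# Alignment model parameters (the separate-matrices form of the scoring function)
W1 = rng.normal(size=(d_a, d_h))
W2 = rng.normal(size=(d_a, d_s))
v = rng.normal(size=d_a)

# Hypothetical decoder parameters: a plain tanh cell plus a linear output layer
U_s = rng.normal(size=(d_s, d_s))
U_y = rng.normal(size=(d_s, vocab))
U_c = rng.normal(size=(d_s, d_h))
W_out = rng.normal(size=(vocab, d_s))

def attend(s_prev, H):
    e_t = np.tanh(H @ W1.T + W2 @ s_prev) @ v   # step 2: one score per annotation, shape (T,)
    exp = np.exp(e_t - e_t.max())
    alpha_t = exp / exp.sum()                   # step 3: softmax over the T scores
    c_t = alpha_t @ H                           # step 4: weighted sum of the annotations
    return alpha_t, c_t

s = np.zeros(d_s)          # initial decoder state (assumed to be zero here)
y_prev = np.zeros(vocab)   # one-hot of the previous output; all zeros before the first word
alphas, outputs = [], []

for t in range(3):         # a fixed 3 steps stands in for "until the end of the sequence"
    alpha_t, c_t = attend(s, H)
    s = np.tanh(U_s @ s + U_y @ y_prev + U_c @ c_t)   # step 5: new state from s_{t-1}, y_{t-1}, c_t
    y_t = int(np.argmax(W_out @ s))                   # greedy choice of the output word index
    y_prev = np.eye(vocab)[y_t]
    alphas.append(alpha_t)
    outputs.append(y_t)

alphas = np.stack(alphas)
print(alphas.shape)          # (3, 6)
print(alphas.sum(axis=1))    # each row sums to 1, up to floating-point error
```

Each row of `alphas` holds the attention weights for one decoder step, so the matrix as a whole can be read as a soft alignment between the output words and the source words.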

Bahdanau et al. tested their architecture on the task of English-to-French translation, and reported that their model outperformed the conventional encoder-decoder model significantly, irrespective of the sentence length.

Several improvements over the Bahdanau attention have been proposed since, such as those of Luong et al. (2015), which we shall review in a separate tutorial.

Summary

In this tutorial, you discovered the Bahdanau attention mechanism for neural machine translation.

Specifically, you learned:

• Where the Bahdanau attention derives its name from, and the challenge it addresses.
• The role of the different components that form part of the Bahdanau encoder-decoder architecture.
• The operations performed by the Bahdanau attention algorithm.

Do you have any questions?