Building a Seq2Seq Model with Attention for Language Translation

The attention mechanism, introduced by Bahdanau et al. in 2014, significantly improved sequence-to-sequence (seq2seq) models. In this post, you’ll learn how to build and train a seq2seq model with attention for language translation, focusing on:

  • Why attention mechanisms are essential
  • How to implement attention in a seq2seq model

Kick-start your project with my book Building Transformer Models From Scratch with PyTorch. It provides self-study tutorials with working code.

Let’s get started.

Photo by Esther T. Some rights reserved.

Overview

This post is divided into four parts; they are:

  • Why Attention Matters: Limitations of Basic Seq2Seq Models
  • Implementing Seq2Seq Model with Attention
  • Training and Evaluating the Model
  • Using the Model

Why Attention Matters: Limitations of Basic Seq2Seq Models

Traditional seq2seq models use an encoder-decoder architecture where the encoder compresses the input sequence into a single context vector, which the decoder then uses to generate the output sequence. This approach has a critical limitation: the decoder must rely on this single context vector regardless of the output sequence length.

This becomes problematic with longer sequences as the model struggles to retain important details from earlier parts of the sequence. Consider English to French translation: The decoder uses the context vector as its initial state to generate the first token, then uses each previous output as input for subsequent tokens. As the hidden state updates, the decoder gradually loses information from the original context vector.

Attention mechanisms solve this by:

  • Giving the decoder access to all encoder hidden states during generation
  • Allowing focus on relevant input parts for each output token
  • Eliminating reliance on a single context vector

Implementing Seq2Seq Model with Attention

Let’s implement a seq2seq model with attention following Bahdanau et al. (2014). You’ll use GRU (Gated Recurrent Unit) modules instead of LSTM for their simplicity and faster training while maintaining comparable performance.

Using the same training dataset as before, the encoder is implemented much like the one in the plain seq2seq model from the previous post:
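A minimal sketch of such a GRU encoder is below. The class and parameter names (`Encoder`, `vocab_size`, `embedding_dim`, `hidden_dim`) are assumptions, not necessarily those used in the previous post:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.dropout = nn.Dropout(dropout)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, input_seq):
        # input_seq: (batch_size, seq_len) of token ids
        embedded = self.dropout(self.embedding(input_seq))
        outputs, hidden = self.rnn(embedded)
        # outputs: (batch_size, seq_len, hidden_dim)
        # hidden:  (1, batch_size, hidden_dim)
        return outputs, hidden
```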

A dropout module is applied to the output of the embedding layer to help prevent overfitting. The RNN is an nn.GRU with batch_first=True, so it accepts input shaped as (batch_size, seq_len, embedding_dim). The encoder’s forward() method returns:

  • A 3D tensor of shape (batch_size, seq_len, hidden_dim) containing RNN outputs
  • A 2D tensor of shape (1, batch_size, hidden_dim) containing the final hidden state

The Bahdanau attention mechanism differs from modern transformer attention. Here’s its implementation:
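One possible implementation of additive (Bahdanau-style) attention, matching the formula below, is sketched here. The names `W_q`, `W_k`, and `v` are assumptions mirroring the $W^Q$, $W^K$, and $W^V$ projections:

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: scores = v^T tanh(W_q Q + W_k K)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(hidden_dim, hidden_dim)
        self.W_k = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, query, keys):
        # query: (batch, 1, hidden); keys: (batch, seq_len, hidden)
        # the two projections broadcast-sum to (batch, seq_len, hidden)
        scores = self.v(torch.tanh(self.W_q(query) + self.W_k(keys)))
        weights = torch.softmax(scores, dim=1)             # (batch, seq_len, 1)
        context = torch.sum(weights * keys, dim=1, keepdim=True)
        return context, weights                            # context: (batch, 1, hidden)
```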

The attention mechanism is defined mathematically as:

$$
y = \textrm{softmax}\big(W^V \tanh(W^Q Q + W^K K)\big) K
$$

Unlike scaled dot-product attention, it uses summed projections of query and key.

With the Bahdanau attention module, the decoder is implemented as follows:
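A sketch of such a decoder follows. The attention class is repeated here so the snippet runs standalone; all names are assumptions consistent with the encoder sketch above:

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention, as defined above (repeated so this sketch is self-contained)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(hidden_dim, hidden_dim)
        self.W_k = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, query, keys):
        scores = self.v(torch.tanh(self.W_q(query) + self.W_k(keys)))
        weights = torch.softmax(scores, dim=1)
        context = torch.sum(weights * keys, dim=1, keepdim=True)
        return context, weights

class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.dropout = nn.Dropout(dropout)
        self.attention = BahdanauAttention(hidden_dim)
        # GRU input is the embedded token concatenated with the context vector
        self.rnn = nn.GRU(embedding_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_token, hidden, encoder_outputs):
        # input_token: (batch, 1); hidden: (1, batch, hidden_dim)
        embedded = self.dropout(self.embedding(input_token))   # (batch, 1, emb)
        query = hidden.permute(1, 0, 2)                        # (batch, 1, hidden)
        context, _ = self.attention(query, encoder_outputs)    # (batch, 1, hidden)
        output, hidden = self.rnn(torch.cat([embedded, context], dim=-1), hidden)
        logits = self.out(output)                              # (batch, 1, vocab)
        return logits, hidden
```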

The decoder’s forward() method expects three inputs: a single-token input sequence, the latest RNN hidden state, and the encoder’s full output sequence. It aligns the input token with the encoder’s output sequence using attention to produce a context vector for the decoder. This context vector, together with the input token, is then used to generate the next token via the GRU module. The output is finally projected to a logit vector of the same size as the vocabulary.

The seq2seq model is then built by connecting the encoder and decoder modules, as follows:
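A minimal sketch of this wiring, assuming the encoder and decoder signatures above, could look like this (the constructor taking pre-built modules is an assumed design, as is the convention that the target sequence starts with [start]):

```python
import torch
import torch.nn as nn

class Seq2SeqRNN(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src_ids, tgt_ids):
        # tgt_ids includes the leading [start] token
        encoder_outputs, hidden = self.encoder(src_ids)
        logits = []
        for t in range(tgt_ids.size(1) - 1):
            # teacher forcing: feed the ground-truth token, not the prediction
            step_input = tgt_ids[:, t:t + 1]
            step_logits, hidden = self.decoder(step_input, hidden, encoder_outputs)
            logits.append(step_logits)
        return torch.cat(logits, dim=1)   # (batch, tgt_len - 1, vocab)
```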

The seq2seq model employs teacher forcing during training, where ground-truth tokens (instead of decoder outputs from the previous step) are used as inputs to accelerate learning. In this implementation, the encoder is invoked once, but the decoder is invoked multiple times to generate the output sequence.

Training and Evaluating the Model

With the modules you created in the previous section, you can initialize a seq2seq model:

The training loop is very similar to the one in the previous post:
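A sketch of the loop follows. For brevity it trains on a single full batch rather than iterating over a DataLoader, and the token id, tensor names, and hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

PAD_ID = 0  # hypothetical id of the [pad] token

def train_model(model, en_ids, fr_ids, epochs=50, lr=1e-3):
    """Training-loop sketch for the Seq2SeqRNN model built above.
    fr_ids is assumed to start with the [start] token."""
    # ignore_index excludes [pad] positions from the loss
    loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_ID)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(1, epochs + 1):
        model.train()
        optimizer.zero_grad()
        logits = model(en_ids, fr_ids)               # (batch, len - 1, vocab)
        # compare predictions against fr_ids shifted by one for alignment
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       fr_ids[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()
        if epoch % 5 == 0:                           # evaluate every five epochs
            model.eval()
            with torch.no_grad():
                eval_logits = model(en_ids, fr_ids)
                eval_loss = loss_fn(eval_logits.reshape(-1, eval_logits.size(-1)),
                                    fr_ids[:, 1:].reshape(-1))
            print(f"Epoch {epoch}: eval loss {eval_loss.item():.4f}")
    return model
```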

The training process uses cross-entropy loss to compare the output logits with the ground-truth French translation. The decoder begins with [start] and predicts one token at a time. Since the training data includes padding and special tokens, the output is compared with fr_ids[:, 1:] for alignment. Note that the [pad] token is included in the loss calculation by default, but you can exclude it by specifying the ignore_index parameter when you create the loss function.

The model is trained for 50 epochs. Evaluation is performed once every five epochs. Since you don’t have a separate test set, you can use the training data for evaluation. You should toggle the model to evaluation mode and use the model under torch.no_grad() to avoid computing the gradients.

Using the Model

A well-trained model typically achieves a mean cross-entropy loss around 0.1. While the training loop in the previous section outlines how you can use a model, you should use the encoder and decoder separately for inference since the forward() method of the Seq2SeqRNN class is created for training. Here’s how to use the trained model for translation:
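A greedy-decoding sketch under these assumptions (the function name, token ids, and maximum length are hypothetical) looks like this:

```python
import torch

def translate(encoder, decoder, src_ids, start_id, end_id, max_len=20):
    """Greedy decoding with the encoder and decoder modules built earlier."""
    encoder.eval()
    decoder.eval()
    with torch.no_grad():
        enc_out, hidden = encoder(src_ids)     # src_ids: (1, src_len)
        token = torch.tensor([[start_id]])     # start with the [start] token
        out_ids = []
        for _ in range(max_len):
            logits, hidden = decoder(token, hidden, enc_out)
            token = logits.argmax(dim=-1)      # (1, 1) greedy pick
            if token.item() == end_id:         # stop at the [end] token
                break
            out_ids.append(token.item())
    return out_ids
```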

During inference, you pass a tensor of sequence length 1 and batch size 1 to the decoder at each step. The decoder returns a logit vector of sequence length 1 and batch size 1, and you use argmax() to decode the output token ID. This output token then becomes the input to the next iteration of the loop, until the [end] token is generated or the maximum length is reached.

Sample outputs below demonstrate the model’s capabilities:

To further improve the model’s performance, you can:

  • Increase the size of the vocabulary in the tokenizer
  • Revise the model architecture, e.g., a larger embedding dimension, a larger hidden state dimension, or more GRU layers
  • Improve the training process, e.g., adjust the learning rate or number of epochs, try a different optimizer, or use a separate test set for evaluation

For completeness, below is the complete code you created in this post:

Note that the code above uses GRU as the RNN module in both the encoder and decoder. You can also use other RNN modules, such as LSTM or a bidirectional RNN. All you need to do is swap the nn.GRU module in the encoder and decoder for a different module. Below is an implementation of the encoder and decoder using LSTM and scaled dot-product attention. You can replace the implementation above with it, and the code should run just fine.
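A partial sketch of those two swaps follows; the decoder would be modified the same way as the encoder, and the class names here are assumptions. The key difference from GRU is that nn.LSTM returns a (hidden, cell) tuple, which must be threaded through every decoder step:

```python
import math
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    """Drop-in replacement for the additive attention module."""
    def forward(self, query, keys):
        # query: (batch, 1, hidden); keys: (batch, seq_len, hidden)
        scores = query @ keys.transpose(1, 2) / math.sqrt(query.size(-1))
        weights = torch.softmax(scores, dim=-1)    # (batch, 1, seq_len)
        context = weights @ keys                   # (batch, 1, hidden)
        return context, weights

class LSTMEncoder(nn.Module):
    """Same shape as the GRU encoder, but nn.LSTM returns a (hidden, cell)
    tuple that the LSTM decoder must carry through its own nn.LSTM."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.dropout = nn.Dropout(dropout)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, input_seq):
        embedded = self.dropout(self.embedding(input_seq))
        outputs, (hidden, cell) = self.rnn(embedded)
        return outputs, (hidden, cell)
```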

Further Readings

Below are some resources that you may find useful:

Summary

In this post, you learned how to build and train an attention-based seq2seq model for English to French translation. Specifically, you learned about:

  • How to build an encoder-decoder architecture with GRU
  • Implementing attention mechanisms to help the model focus on relevant input
  • Building a complete translation model in PyTorch
  • Training effectively using teacher forcing

Attention mechanisms significantly improve translation by enabling dynamic focus on relevant input parts during generation.

