Building a Plain Seq2Seq Model for Language Translation

Sequence-to-sequence (seq2seq) models are powerful architectures for tasks that transform one sequence into another, such as machine translation. These models employ an encoder-decoder architecture, where the encoder processes the input sequence and the decoder generates an output sequence based on the encoder’s output. The attention mechanism was developed for seq2seq models, and understanding how seq2seq works helps clarify the rationale behind attention. In this post, you will explore how to build and train a plain seq2seq model with LSTM for language translation. Specifically:

  • How to implement an encoder-decoder architecture with LSTM cells in PyTorch
  • How to train the model using sentence pairs from a dataset
  • How to generate a variable-length sequence with a seq2seq model

Kick-start your project with my book Building Transformer Models From Scratch with PyTorch. It provides self-study tutorials with working code.

Let’s get started.

Photo by Pourya Gohari. Some rights reserved.

Overview

This post is divided into five parts; they are:

  • Preparing the Dataset for Training
  • Implementing the Seq2Seq Model with LSTM
  • Training the Seq2Seq Model
  • Using the Seq2Seq Model
  • Improving the Seq2Seq Model

Preparing the Dataset for Training

In a previous post, you learned how to build a transformer model for translating French sentences to English. In this post, you will reuse the same dataset and build a seq2seq model, this time for English-to-French translation.

The seq2seq model consists of two main components: an encoder and a decoder. The encoder processes the input sequence (English sentences) and generates a fixed-size representation, known as the context vector. The decoder then uses this context vector to generate the output sequence (French sentences) one token at a time.

To train such a model, you need a dataset of sentence pairs. The model learns how to translate from the example sentence pairs in the dataset. You can source your own dataset. In this post, you will use the Anki dataset, which can be downloaded from https://www.manythings.org/anki/, and you can also use the copy hosted in Google:

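As a sketch, the download step might look like the following. It assumes the requests library; the fra-eng.zip filename follows the site's naming convention and is an assumption, as are the helper names download() and extract():

```python
import zipfile

import requests


def download(url, dest="fra-eng.zip"):
    """Fetch the dataset archive; some servers reject requests
    that lack a browser-like User-Agent header."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    with open(dest, "wb") as f:
        f.write(resp.content)


def extract(zip_path="fra-eng.zip"):
    """The archive contains a single plain-text file, fra.txt."""
    with zipfile.ZipFile(zip_path) as z:
        z.extract("fra.txt")
```

Calling `download("https://www.manythings.org/anki/fra-eng.zip")` followed by `extract()` leaves `fra.txt` in the working directory.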
This is how you can use the requests library to download a file in Python. This zip file contains only one file, fra.txt, which is a plain text file. Each line consists of an English sentence, followed by a tab character, and then a corresponding sentence in French.

To make the data useful for training, it needs to be normalized. First, the French sentences are in Unicode, and some characters have multiple possible representations. To help your model understand the sentences better, you should normalize the Unicode to a canonical form, such as NFKC. You may also want to convert the text to lowercase to reduce the size of the vocabulary (otherwise, the model would treat the same word in different cases as different words). You can read the sentence pairs and perform the normalization as follows:
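A minimal sketch of this step; the helper names normalize() and load_pairs() are illustrative, and any extra tab-separated columns in fra.txt (such as attribution text) are assumed to be discarded:

```python
import unicodedata


def normalize(line):
    """NFKC-normalize and lowercase a sentence to shrink the vocabulary."""
    return unicodedata.normalize("NFKC", line.strip()).lower()


def load_pairs(path="fra.txt"):
    """Read tab-separated English-French sentence pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # each line: English sentence, a tab, the French sentence,
            # and possibly extra columns to discard
            eng, fra = line.rstrip("\n").split("\t")[:2]
            pairs.append((normalize(eng), normalize(fra)))
    return pairs
```

Running `text_pairs = load_pairs()` then gives you the list of normalized sentence pairs used below.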

The model you will build is a seq2seq model using LSTM. An LSTM is a recurrent neural network that can handle variable-length sequences. It cannot process words directly; the sentences must first be tokenized and encoded into numerical form. You can create a dictionary as a tokenizer to map each word in the vocabulary to a unique integer. You can also use a more advanced technique, such as Byte Pair Encoding (BPE), which handles unknown words more effectively by recognizing subword units. Let’s create separate tokenizers for English and French:
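Below is a sketch using the tokenizers library. The ByteLevel pre-tokenizer with add_prefix_space=True is one way to realize the space-prefix behavior described below; the helper name train_tokenizer() is illustrative:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# one trainer is shared by both languages; the tokenizers are separate
trainer = BpeTrainer(
    vocab_size=8000,
    special_tokens=["[start]", "[end]", "[pad]"],
    initial_alphabet=ByteLevel.alphabet(),
)


def train_tokenizer(sentences, path):
    """Train a BPE tokenizer and save it as a JSON file."""
    tokenizer = Tokenizer(BPE())
    # splits on whitespace/punctuation and prefixes a space to each
    # sentence, so a word maps to the same token at any position
    tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=True)
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    # [pad] fills shorter sentences up to the batch's longest sequence
    tokenizer.enable_padding(
        pad_id=tokenizer.token_to_id("[pad]"), pad_token="[pad]"
    )
    tokenizer.save(path)
    return tokenizer
```

With `text_pairs` from the previous step, you would call `en_tokenizer = train_tokenizer((en for en, fr in text_pairs), "en_tokenizer.json")` and likewise `fr_tokenizer = train_tokenizer((fr for en, fr in text_pairs), "fr_tokenizer.json")`.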

Here, the BPE tokenizer comes from the tokenizers library. The trained tokenizers are saved to en_tokenizer.json and fr_tokenizer.json for future use. To train a BPE tokenizer, you need to specify the maximum vocabulary size. The code above sets it to 8000, which is a small number (consider that this dataset has around 15,000 unique words in English and 30,000 unique words in French). You can increase the vocabulary size if you think the model is not performing the translation well. There is some special handling in the BPE setup above:

  • The pre-tokenizer splits the text on whitespace and punctuation by default. But a space is also added at the beginning of each sentence so that all words are prefixed by a space. This helps reuse vocabulary entries regardless of a word’s position in the sentence.
  • Three special tokens are added to the vocabulary: [start], [end], and [pad]. These tokens are added before the tokenizer is trained. The [pad] token, in particular, is set as the padding token used to fill shorter sentences up to a common sequence length.

The BPE tokenizers are trained from the dataset, which is stored as the list of string pairs text_pairs. The same trainer is used for both languages in the code above, but the tokenizers are separate.

After the tokenizers are trained, you can test them on a few sentences:
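As a sketch, a small helper (the name show_tokens() is an assumption) can reload a saved tokenizer and print how a sentence is split:

```python
from tokenizers import Tokenizer


def show_tokens(tokenizer_path, sentence):
    """Reload a saved tokenizer and display tokens and IDs for a sentence."""
    tokenizer = Tokenizer.from_file(tokenizer_path)
    enc = tokenizer.encode(sentence)
    print(sentence)
    print(enc.tokens)
    print(enc.ids)
    return enc
```

For example, `show_tokens("en_tokenizer.json", "i am learning a new language")` prints the sentence, its subword tokens, and the corresponding integer IDs.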

The output would be like the following:

Implementing the Seq2Seq Model with LSTM

Traditionally, handling a sequence of arbitrary length using a neural network requires a recurrent neural network (RNN) architecture. It is a type of neural network where a module maintains a hidden state and updates it as it processes the sequence.

Several modules can be used to implement an RNN. LSTM is one of them. Building a simple LSTM encoder for the input sequence is straightforward:
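A sketch of the encoder; the class name EncoderLSTM and the constructor arguments are illustrative:

```python
import torch.nn as nn


class EncoderLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True: tensors are (batch_size, seq_len, features)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, input_seq):
        # input_seq: (batch_size, seq_len) of token IDs
        embedded = self.embedding(input_seq)  # (batch, seq_len, embedding_dim)
        # outputs: (batch, seq_len, hidden_dim)
        # hidden, cell: (num_layers=1, batch, hidden_dim)
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell
```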

LSTM is special because it maintains two internal states, named hidden and cell in the code above. In PyTorch, you don’t need to implement the recurrent structure yourself: the nn.LSTM module handles it for you.

In the implementation above, the encoder part of the seq2seq model is a class derived from nn.Module. It expects a 2D tensor of integer IDs as a batch of input sequences. This input is converted into a 3D tensor by replacing each token ID with an embedding vector. In the forward() function, the variable embedded is a 3D tensor of shape (batch_size, seq_len, embedding_dim). It is then processed by the LSTM module, whose output is a 3D tensor of shape (batch_size, seq_len, hidden_dim), corresponding to the hidden states of the LSTM at each step of the input sequence. The final hidden state and cell state are also returned.

Note that the LSTM module is created with batch_first=True, which means that the first dimension of the input tensor is the batch size. This is a common convention for language data. The num_layers argument of LSTM is left at its default of 1. A multi-layer LSTM is generally more powerful, but it also makes the model larger and take longer to train.

Creating the decoder part of the seq2seq model is similar, except that you also need to produce the output:
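A sketch of the decoder along the same lines; the class name DecoderLSTM is illustrative:

```python
import torch.nn as nn


class DecoderLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # project each hidden state to a logit vector over the vocabulary
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, target_seq, hidden=None, cell=None):
        # target_seq: (batch_size, seq_len) of token IDs (partial target)
        embedded = self.embedding(target_seq)
        # if no encoder state is given, the LSTM starts from zeros
        state = None if hidden is None else (hidden, cell)
        outputs, (hidden, cell) = self.lstm(embedded, state)
        logits = self.out(outputs)  # (batch_size, seq_len, vocab_size)
        return logits, hidden, cell
```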

The decoder LSTM is similar to the encoder LSTM. In the forward() method, the input sequence is the partial target sequence, and the hidden and cell states are the final hidden and cell states from the encoder’s LSTM module. If they are not provided, the hidden and cell states are initialized to zeros, as in the encoder.

The input to the forward method is a 2D tensor of token IDs. This needs to be converted into a 3D tensor by the embedding layer before the LSTM module can consume it. The output of the LSTM module is a sequence of hidden states. They should be converted into a logit vector by a linear layer to predict the next token.

The design of the decoder module expects you to pass a partial target sequence of shape (batch_size, seq_len). The forward() method returns a predicted sequence of shape (batch_size, seq_len, vocab_size), which is the output of the LSTM module transformed by the linear layer. You take the last position along the sequence dimension as the prediction for the next token. You need to call the decoder module multiple times to generate the entire target sequence.

To build a complete seq2seq model, you need to connect the encoder and decoder modules. This is how you can do it:
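A sketch of the combined model; the class name Seq2SeqLSTM and the greedy argmax() feedback are assumptions consistent with the description that follows:

```python
import torch
import torch.nn as nn


class Seq2SeqLSTM(nn.Module):
    def __init__(self, encoder, decoder, start_id):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.start_id = start_id

    def forward(self, input_seq, target_seq):
        # encode the source; the per-step outputs are not used,
        # only the final hidden and cell states (the context vector)
        _, hidden, cell = self.encoder(input_seq)
        batch_size = input_seq.size(0)
        # the decoder begins with the special [start] token
        pred_seq = torch.full((batch_size, 1), self.start_id,
                              dtype=torch.long, device=input_seq.device)
        step_logits = []
        # only the target's length is used, never its content
        for _ in range(target_seq.size(1)):
            # re-run the whole partial sequence from the encoder's state
            logits, _, _ = self.decoder(pred_seq, hidden, cell)
            last = logits[:, -1:, :]  # prediction for the next token
            step_logits.append(last)
            next_token = last.argmax(dim=-1)  # greedy choice, fed back in
            pred_seq = torch.cat([pred_seq, next_token], dim=1)
        # (batch_size, target_len, vocab_size), comparable to the target
        return torch.cat(step_logits, dim=1)
```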

This module simply connects the encoder and decoder modules. The forward() method is designed to help train the model. It takes the input sequence (in English) and the target sequence (in French) as input. The English sentence is converted into a “context vector” by the encoder. The encoder also outputs a processed sequence, but it is not used.

The decoder uses the context vector provided by the encoder as the initial state of its LSTM module, then processes a partial target sequence to produce a next-token prediction. Initially, the decoder begins with only the special token [start]. Iteratively, it produces one more token at a time until the length of the target sequence is filled.

Note that the model above does not read the content of the target sequence, but uses its length to control the number of iterations. Also, note that the same decoder is called multiple times within a single call to the forward() method.

Training the Seq2Seq Model

To train the above model for English-to-French translation, you need to create a dataset object such that you can iterate over the dataset in batches and in random order. You already collected the data in the previous section and stored it as text_pairs. PyTorch provides a Dataset class to help you shuffle and batch the data. This is how you can create a dataset object:

You can try to print one sample from the dataset:

The dataloader object is an iterable that scans the entire dataset in random order. It returns a tuple of two tensors, each of shape (batch_size, seq_len). The two tensors you printed contain integers, as token IDs are represented by integers.

The dataloader is created with the collate_fn() function. A PyTorch dataloader merely collects elements from the dataset object into a list, and each element in this case is a tuple of two strings. The collate function converts the strings into token IDs using the BPE tokenizers and then builds the PyTorch tensors.

The next step is to create a model. It is straightforward:

This will print:

So you can see that the model is very simple. In fact, there are only 7 million parameters, but it is large enough to require a sizable amount of time to train.

The code for training is as follows:
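A sketch of the training loop; the Adam optimizer, learning rate, epoch count, and checkpoint filename are all assumptions, and the periodic evaluation is omitted for brevity:

```python
import torch
import torch.nn as nn


def train(model, dataloader, pad_id, epochs=30, lr=1e-3, device="cpu"):
    """Minimal training loop; optimizer, learning rate, epoch count,
    and checkpoint filename are all assumptions."""
    model.to(device)
    # padding positions are excluded from the loss via ignore_index
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for en_ids, fr_ids in dataloader:
            en_ids, fr_ids = en_ids.to(device), fr_ids.to(device)
            optimizer.zero_grad()
            # logits: (batch_size, seq_len, vocab_size)
            logits = model(en_ids, fr_ids)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           fr_ids.reshape(-1))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1}: loss {total_loss/len(dataloader):.4f}")
        # checkpoint at the end of every epoch
        torch.save(model.state_dict(), "seq2seq.pth")
```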

This is a simple training loop; many techniques for better training are not implemented. For example, a train-test split of the data, early stopping, and gradient clipping are not used. It simply reads the dataset in batches, runs the forward and backward passes, and updates the model parameters.

The loss function used is cross-entropy, since the model predicts the next token over the vocabulary. When the overall model is invoked, it generates an entire output sequence that matches the length of the target sequence. Therefore, the loss function can compare the sequences in one shot rather than computing the loss token by token. However, in this application, the tensors are batches of sequences, and the sequences are padded to match the longest one in the batch. A sequence should be terminated with the [end] token, and the positions of padding tokens should be excluded from the overall loss calculation. That’s why the ignore_index parameter is used when creating the loss function with nn.CrossEntropyLoss().

If you have a separate test set, you can use that for evaluation. In the above, you reused the training data for evaluation once every 5 epochs in the latter half of the for-loop. Remember to toggle the model between model.train() and model.eval() for the correct training/inference behavior.

Using the Seq2Seq Model

In the code above, you saved the model at the end of each epoch using torch.save(). When you have the model file, you can load it back using:
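A sketch of reloading the checkpoint, assuming the weights were saved as a state_dict under the filename seq2seq.pth (an assumption carried over from the training loop above):

```python
import torch


def load_model(model, path="seq2seq.pth", device="cpu"):
    """Restore weights saved with torch.save(model.state_dict(), path)."""
    model.load_state_dict(torch.load(path, map_location=device))
    model.eval()
    return model
```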

With a trained model, you can use it to generate translations. However, you do not use the same forward() method as in the training. Instead, you use a loop to call the decoder multiple times until the target sequence is generated.

Below is an implementation of how to translate a few random sentences from the original dataset:
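A sketch of the generation loop; the helper name translate(), the greedy decoding, and the maximum length of 50 are assumptions:

```python
import torch


def translate(model, sentence, en_tok, fr_tok, max_len=50, device="cpu"):
    """Greedily decode one sentence; max_len is an assumption."""
    model.eval()
    start_id = fr_tok.token_to_id("[start]")
    end_id = fr_tok.token_to_id("[end]")
    with torch.no_grad():
        # a batch of one sequence: shape (1, seq_len)
        en_ids = torch.tensor([en_tok.encode(sentence).ids], device=device)
        # the final encoder states serve as the context vector
        _, hidden, cell = model.encoder(en_ids)
        pred_ids = [start_id]
        for _ in range(max_len):
            dec_input = torch.tensor([pred_ids], device=device)
            logits, _, _ = model.decoder(dec_input, hidden, cell)
            # the last position of the output predicts the next token
            next_id = int(logits[0, -1].argmax())
            if next_id == end_id:
                break
            pred_ids.append(next_id)
    # drop the leading [start] token before decoding back to text
    return fr_tok.decode(pred_ids[1:])
```

You can then draw a few pairs with `random.sample(text_pairs, 5)` and print each English sentence alongside `translate(model, en, en_tokenizer, fr_tokenizer)`.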

First, switch the model into evaluation mode and run it under the torch.no_grad() context. This saves time and memory.

You pick a few samples from the dataset using random.sample(). The input sentence (English) is tokenized and encoded into the tensor en_ids. It is a 2D tensor of shape (1, seq_len), as the model always expects a batch of sequences, even if the batch size is 1.

You run the English sentence through the model’s encoder to extract the context vector, which represents the final state of the LSTM module. Then, you start with the special token [start] and generate the French sentence in a loop.

This is the typical loop for using a seq2seq model. You expect the model to generate the [end] token eventually; otherwise, generation stops when the sequence reaches the maximum length. In each iteration of the loop, you create a new input tensor for the decoder, and the decoder generates one extra token as the last position of its output sequence. This output is a logit vector over the vocabulary, and you take the token with the highest probability as the next token via PyTorch’s argmax() method.

The list pred_ids accumulates the generated token IDs. Each iteration of the loop builds the decoder’s input tensor from this list. When the loop terminates, you run the tokenizer again to convert the token IDs back into a sentence.

When you run the code above, you may see the following output:

Improving the Seq2Seq Model

The above outlines how you can build a plain seq2seq model with LSTM for translation. As you can see from above, the output is not perfect. There are several ways to improve it:

  • Improve the tokenizer: The vocabulary size used is small, which may limit the model’s ability to understand word meanings. You can improve the model by incorporating a larger vocabulary. But this may require more training data.
  • Use a larger model: One layer of LSTM is used above, and you may see an improvement if you use more layers. You can also add dropout to the LSTM module to prevent overfitting when more layers are used.
  • Improve the training: Split the dataset into training and test sets, and use the test set to evaluate the model. This makes it easier to determine which epoch produced the best model, allowing you to use it for inference or to stop training early. You can also tell whether the model has converged by monitoring the loss on the test set.
  • Experiment with a different decoder design: The decoder above re-runs the entire partial target sequence with the encoder’s state as the initial state. Alternatively, you can pass only the last generated token to the decoder to produce the next token. The difference is that the latter uses the carried-over state directly to generate the next token, while the former mutates the states by re-scanning the previously generated sequence. A recurrent neural network tends to “forget” the initial state (i.e., the context vector) when the sequence is long.

For completeness, below is the complete code you created in this post:

Further Readings

Below are some resources that you may find useful:

Summary

In this post, you learned about building and training a seq2seq model with LSTM for English to French translation. Specifically, you learned about:

  • How encoder-decoder architectures work with LSTM cells
  • How to prepare the dataset for training a seq2seq model
  • How to implement and train the complete translation model in PyTorch

The implementation is straightforward, but it outlines the general mechanism for the seq2seq model.

