Plotting the Training and Validation Loss Curves for the Transformer Model

We have previously seen how to train the Transformer model for neural machine translation. Before moving on to inferencing with the trained model, let us first explore how to modify the training code slightly so that we can plot the training and validation loss curves generated during the learning process. 

The training and validation loss values provide important information because they give us a better insight into how the learning performance changes over the number of epochs and help us diagnose any problems with learning that can lead to an underfit or an overfit model. They will also inform us about the epoch at which to use the trained model weights at the inferencing stage.

In this tutorial, you will discover how to plot the training and validation loss curves for the Transformer model. 

After completing this tutorial, you will know:

  • How to modify the training code to include validation and test splits, in addition to a training split of the dataset
  • How to modify the training code to store the computed training and validation loss values, as well as the trained model weights
  • How to plot the saved training and validation loss curves

Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...

Let’s get started.

Plotting the training and validation loss curves for the Transformer model
Photo by Jack Anstey, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  • Recap of the Transformer Architecture
  • Preparing the Training, Validation, and Testing Splits of the Dataset
  • Training the Transformer Model
  • Plotting the Training and Validation Loss Curves

Prerequisites

For this tutorial, we assume that you are already familiar with:

Recap of the Transformer Architecture

Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need”

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen how to train the complete Transformer model, and you shall now see how to generate and plot the training and validation loss values that will help you diagnose the model’s learning performance. 

Want to Get Started With Building Transformer Models with Attention?

Take my free 12-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Preparing the Training, Validation, and Testing Splits of the Dataset

In order to be able to include validation and test splits of the data, you will modify the code that prepares the dataset by introducing the following lines of code, which:

  • Specify the size of the validation data split. This, in turn, determines the size of the training and test splits of the data, which we will divide in a ratio of 80:10:10 for the training, validation, and test sets, respectively:

  • Split the dataset into validation and test sets in addition to the training set:

  • Prepare the validation data by tokenizing, padding, and converting to a tensor. For this purpose, you will collect these operations into a function called encode_pad, as shown in the complete code listing below. This will avoid excessive repetition of code when performing these operations on the training data as well:

  • Save the encoder and decoder tokenizers into pickle files and the test dataset into a text file to be used later during the inferencing stage:

The complete code listing is now updated as follows:
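The full listing builds on the dataset-preparation code from the earlier posts in this series, which is too long to reproduce here. The condensed sketch below isolates the new steps listed above, using a tiny in-memory stand-in for the cleaned sentence-pair dataset (the tutorial loads its pairs from a pickle file instead); the file names `enc_tokenizer.pkl` and `test_dataset.txt` are illustrative, and only the encoder-side tokenizer is shown (the decoder tokenizer is handled the same way):

```python
from pickle import dump, HIGHEST_PROTOCOL
from numpy import array, savetxt
from numpy.random import shuffle
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow import convert_to_tensor, int64

# Ratios for an 80:10:10 split into training, validation and test sets
train_split = 0.8
val_split = 0.1

# Tiny in-memory stand-in for the cleaned sentence pairs
# (the tutorial loads these from a pickle file of English-German pairs)
dataset = array([["<START> hello <EOS>", "<START> hallo <EOS>"]] * 10)

# Random shuffle, then split the dataset into training, validation and test sets
shuffle(dataset)
n_sentences = dataset.shape[0]
train = dataset[:int(n_sentences * train_split)]
val = dataset[int(n_sentences * train_split):int(n_sentences * (1 - val_split))]
test = dataset[int(n_sentences * (1 - val_split)):]

# Fit a tokenizer on the source-language training sentences
enc_tokenizer = Tokenizer()
enc_tokenizer.fit_on_texts(train[:, 0])
enc_seq_length = max(len(s.split()) for s in train[:, 0])

def encode_pad(sentences, tokenizer, seq_length):
    """Tokenize, pad and convert a batch of sentences to a tensor."""
    x = tokenizer.texts_to_sequences(sentences)
    x = pad_sequences(x, maxlen=seq_length, padding='post')
    return convert_to_tensor(x, dtype=int64)

# Prepare the training and validation data with the same helper,
# avoiding repetition of the tokenize-pad-convert steps
trainX = encode_pad(train[:, 0], enc_tokenizer, enc_seq_length)
valX = encode_pad(val[:, 0], enc_tokenizer, enc_seq_length)

# Save the tokenizer into a pickle file and the test split into a text file
# for later use at the inferencing stage
with open('enc_tokenizer.pkl', 'wb') as f:
    dump(enc_tokenizer, f, protocol=HIGHEST_PROTOCOL)
savetxt('test_dataset.txt', test, fmt='%s')
```

With ten sentence pairs, the 80:10:10 split leaves eight pairs for training and one each for validation and testing.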

Training the Transformer Model

We shall introduce similar modifications to the code that trains the Transformer model to:

  • Prepare the validation dataset batches:

  • Monitor the validation loss metric:

  • Initialize dictionaries to store the training and validation losses and eventually store the loss values in the respective dictionaries:

  • Compute the validation loss:

  • Save the trained model weights at every epoch. You will use these at the inferencing stage to investigate the differences in results that the model produces at different epochs.  In practice, it would be more efficient to include a callback method that halts the training process based on the metrics that are being monitored during training and only then save the model weights:

  • Finally, save the training and validation loss values into pickle files:

The modified code listing now becomes:
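The complete training script relies on the Transformer implementation from the earlier posts, so the sketch below shows only the modified training loop, with a toy Keras model standing in for the Transformer. The names `training_model`, `loss_fcn`, `train_dataset`, and `val_dataset` mirror the tutorial, but the model, data, and weight-file names here are dummies:

```python
from os import makedirs
from pickle import dump
import tensorflow as tf
from tensorflow.keras.metrics import Mean

# Toy stand-ins for the Transformer, its loss and its data pipeline
training_model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
loss_fcn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam()

# Prepare the training and validation dataset batches
train_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((32, 8)), tf.random.normal((32, 4)))).batch(8)
val_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((8, 8)), tf.random.normal((8, 4)))).batch(8)

# Monitor the training and validation loss metrics
train_loss = Mean(name='train_loss')
val_loss = Mean(name='val_loss')

# Dictionaries that accumulate the per-epoch loss values
train_loss_dict = {}
val_loss_dict = {}

makedirs('weights', exist_ok=True)
epochs = 2
for epoch in range(epochs):
    train_loss.reset_state()
    val_loss.reset_state()
    for x, y in train_dataset:
        with tf.GradientTape() as tape:
            loss = loss_fcn(y, training_model(x, training=True))
        grads = tape.gradient(loss, training_model.trainable_weights)
        optimizer.apply_gradients(zip(grads, training_model.trainable_weights))
        train_loss(loss)
    # Compute the validation loss (no gradient updates)
    for x, y in val_dataset:
        val_loss(loss_fcn(y, training_model(x, training=False)))
    # Store the loss values in the respective dictionaries
    # (converted to plain numpy values so they pickle cleanly)
    train_loss_dict[epoch] = train_loss.result().numpy()
    val_loss_dict[epoch] = val_loss.result().numpy()
    # Save the trained model weights at every epoch
    training_model.save_weights('weights/wghts' + str(epoch + 1) + '.weights.h5')

# Finally, save the training and validation losses into pickle files
with open('train_loss.pkl', 'wb') as f:
    dump(train_loss_dict, f)
with open('val_loss.pkl', 'wb') as f:
    dump(val_loss_dict, f)
```

The per-epoch checkpointing here is deliberately naive; as noted above, a callback that monitors the validation loss and saves weights only when it improves would be more efficient in practice.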

Plotting the Training and Validation Loss Curves

In order to be able to plot the training and validation loss curves, you will first load the pickle files containing the training and validation loss dictionaries that you saved when training the Transformer model earlier. 

Then you will retrieve the training and validation loss values from the respective dictionaries and graph them on the same plot.

The code listing is as follows; you should save it into a separate Python script:
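A sketch of such a script is below. So that it also runs standalone, it falls back to a small pair of synthetic loss dictionaries when the pickle files from training are not found, and it writes the figure to a hypothetical `loss_curves.png` (replace the `savefig` call with `plt.show()` to display the plot interactively). Note that the dictionary values are cast to plain lists before plotting:

```python
from os.path import exists
from pickle import load, dump
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed when running interactively
import matplotlib.pyplot as plt

# Fall back to synthetic loss dictionaries so the script also runs standalone
if not exists('train_loss.pkl'):
    dump({e: 4.0 / (e + 1) for e in range(20)}, open('train_loss.pkl', 'wb'))
    dump({e: 4.5 / (e + 1) for e in range(20)}, open('val_loss.pkl', 'wb'))

# Load the training and validation loss dictionaries saved during training
train_loss = load(open('train_loss.pkl', 'rb'))
val_loss = load(open('val_loss.pkl', 'rb'))

# Retrieve each dictionary's values, cast to plain lists
train_values = list(train_loss.values())
val_values = list(val_loss.values())

# Generate a sequence of integers to represent the epoch numbers
epochs = range(1, len(train_values) + 1)

# Plot and label the training and validation loss values on the same axes
plt.plot(epochs, train_values, label='Training Loss')
plt.plot(epochs, val_values, label='Validation Loss')

# Add a title, axis labels and a legend
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc='best')

plt.savefig('loss_curves.png')  # use plt.show() instead when running interactively
```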

Running the code above generates a plot of the training and validation loss curves similar to the one below:

Line plots of the training and validation loss values over several training epochs

Note that although you might see similar loss curves, they might not necessarily be identical to the ones above. This is because you are training the Transformer model from scratch, and the resulting training and validation loss values depend on the random initialization of the model weights. 

Nonetheless, these loss curves give us a better insight into how the learning performance changes over the number of epochs and help us diagnose any problems with learning that can lead to an underfit or an overfit model. 

For more details on using the training and validation loss curves to diagnose the learning performance of a model, you can refer to this tutorial by Jason Brownlee. 

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Papers

Websites

Summary

In this tutorial, you discovered how to plot the training and validation loss curves for the Transformer model. 

Specifically, you learned:

  • How to modify the training code to include validation and test splits, in addition to a training split of the dataset
  • How to modify the training code to store the computed training and validation loss values, as well as the trained model weights
  • How to plot the saved training and validation loss curves

Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.

Learn Transformers and Attention!

Building Transformer Models with Attention

Teach your deep learning model to read a sentence

...using transformer models with attention

Discover how in my new Ebook:
Building Transformer Models with Attention

It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...

Give magical power of understanding human language for
Your Projects


See What's Inside


7 Responses to Plotting the Training and Validation Loss Curves for the Transformer Model

  1. Brett November 3, 2022 at 7:11 am #

    To get this to work, I had to cast the two items to a list (lines 6 & 7), like this:

    # Retrieve each dictionary’s values
    train_values = list(train_loss.values())
    val_values = list(val_loss.values())

    Great series, thanks!

  2. khatija March 9, 2023 at 3:11 am #

    How can I plot the accuracy for each individual category in image classification after training?

  3. Olufunke May 6, 2023 at 12:01 am #

    Hi, thanks so much. I am trying to adapt this for PyTorch. Please, how do I define the weights?

  4. Oliver January 11, 2024 at 9:28 am #

    Thank you for the excellent series! It has been very helpful.

    I think there are two minor typos in this post:
    – Line 70 in the code for prepare-dataset should have “dataset” instead of “train”.
    – Line 51 in the code for training is missing “-both” in the name for the utilized dataset.

    After making these modifications I got the same validation loss curve.

    Thank you again and all the best!

    • James Carmichael January 11, 2024 at 9:38 am #

      Hi Oliver…You are very welcome! Thank you for your feedback!
