Inferencing the Transformer Model

We have seen how to train the Transformer model on a dataset of English and German sentence pairs, and how to plot the training and validation loss curves to diagnose the model’s learning performance and decide at which epoch to run inference. We are now ready to run inference on the trained Transformer model to translate an input sentence.

In this tutorial, you will discover how to run inference on the trained Transformer model for neural machine translation. 

After completing this tutorial, you will know:

  • How to run inference on the trained Transformer model
  • How to generate text translations

Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...

Let’s get started. 

Inferencing the Transformer model
Photo by Karsten Würth, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  • Recap of the Transformer Architecture
  • Inferencing the Transformer Model
  • Testing Out the Code

Prerequisites

For this tutorial, we assume that you are already familiar with:

  • The theory behind the Transformer model
  • An implementation of the Transformer model
  • Training the Transformer model
  • Plotting the training and validation loss curves for the Transformer model

Recap of the Transformer Architecture

Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need”

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen how to implement the complete Transformer model and subsequently train it on a dataset of English and German sentence pairs. Let’s now proceed to run inference on the trained model for neural machine translation. 

Inferencing the Transformer Model

Let’s start by creating a new instance of the TransformerModel class that was implemented in a previous tutorial. 

You will feed into it the relevant input arguments as specified in the paper of Vaswani et al. (2017) and the relevant information about the dataset in use: 
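A minimal sketch of this step is shown below. The model hyperparameters follow Vaswani et al. (2017), and the dataset parameters must match the values obtained when your training data was prepared; the constructor signature shown here (with the dropout rate as the last argument) is an assumption that should mirror your own TransformerModel implementation.

from model import TransformerModel

# Model hyperparameters, as specified in Vaswani et al. (2017)
h = 8            # Number of self-attention heads
d_k = 64         # Dimensionality of the linearly projected queries and keys
d_v = 64         # Dimensionality of the linearly projected values
d_model = 512    # Dimensionality of the model sub-layer outputs
d_ff = 2048      # Dimensionality of the inner fully connected layer
n = 6            # Number of layers in the encoder and decoder stacks

# Dataset parameters (these must match how the dataset was prepared for training)
enc_seq_length = 7     # Encoder sequence length
dec_seq_length = 12    # Decoder sequence length
enc_vocab_size = 2405  # Encoder vocabulary size
dec_vocab_size = 3858  # Decoder vocabulary size

# The Dropout layers are not used during inference, so the rate can safely be 0
dropout_rate = 0.0

# Create a new instance of the trained model architecture
inferencing_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length,
                                     dec_seq_length, h, d_k, d_v, d_model, d_ff, n,
                                     dropout_rate)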

Here, note that the last input being fed into the TransformerModel corresponds to the dropout rate for each of the Dropout layers in the Transformer model. These Dropout layers will not be used during model inferencing (you will eventually set the training argument to False), so you may safely set the dropout rate to 0.

Furthermore, the TransformerModel class was already saved into a separate script named model.py. Hence, to be able to use the TransformerModel class, you need to include from model import TransformerModel.

Next, let’s create a class, Translate, that inherits from the Module base class in Keras and assign the initialized inferencing model to the variable transformer:
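A sketch of this class is shown below, assuming the Module base class imported from TensorFlow; keep the base class consistent with whatever you used elsewhere in this series.

from tensorflow import Module

class Translate(Module):
    def __init__(self, inferencing_model, **kwargs):
        super().__init__(**kwargs)
        # Assign the initialized inferencing model to the transformer attribute
        self.transformer = inferencing_model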

When you trained the Transformer model, you saw that you first needed to tokenize the sequences of text that were to be fed into both the encoder and decoder. You achieved this by creating a vocabulary of words and replacing each word with its corresponding vocabulary index. 

You will need to implement a similar process during the inferencing stage before feeding the sequence of text to be translated into the Transformer model. 

For this purpose, you will include within the class the following load_tokenizer method, which will serve to load the encoder and decoder tokenizers that you would have generated and saved during the training stage:
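The sketch below assumes that the tokenizers were serialized with pickle at the training stage; the method simply reloads whichever pickle file name it is given.

from pickle import load

# load_tokenizer method of the Translate class
def load_tokenizer(self, name):
    # Load a tokenizer that was fitted on the training data and saved to a pickle file
    with open(name, 'rb') as handle:
        return load(handle)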

It is important that you tokenize the input text at the inferencing stage using the same tokenizers generated at the training stage of the Transformer model since these tokenizers would have already been trained on text sequences similar to your testing data. 

The next step is to create the class method, call(), that will take care of the following steps, all of which are brought together in the complete listing after this list:

  • Append the start (<START>) and end-of-string (<EOS>) tokens to the input sentence:

  • Load the encoder and decoder tokenizers (in this case, saved in the enc_tokenizer.pkl and dec_tokenizer.pkl pickle files, respectively):

  • Prepare the input sentence by tokenizing it first, then padding it to the maximum phrase length, and subsequently converting it to a tensor:

  • Repeat a similar tokenization and tensor conversion procedure for the <START> and <EOS> tokens at the output:

  • Prepare the output array that will contain the translated text. Since you do not know the length of the translated sentence in advance, you will initialize the size of the output array to 0, but set its dynamic_size parameter to True so that it may grow past its initial size. You will then set the first value in this output array to the <START> token:

  • Iterate up to the decoder sequence length, each time calling the Transformer model to predict an output token. Here, the training argument, which is then passed on to each of the Transformer’s Dropout layers, is set to False so that no values are dropped during inference. The prediction with the highest score is then selected and written at the next available index of the output array. The for loop is terminated with a break statement as soon as an <EOS> token is predicted:

  • Decode the predicted tokens into an output list and return it:

The complete code listing, so far, is as follows:
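What follows is a minimal sketch of that listing. It assumes Keras Tokenizer objects serialized with pickle, the pad_sequences utility from keras.preprocessing.sequence (adjust the import to your Keras version if needed), and the TransformerModel call signature used during training; the enc_seq_length and dec_seq_length variables are the dataset parameters defined earlier, which are assumed to be in scope together with the inferencing model.

from pickle import load
from keras.preprocessing.sequence import pad_sequences
from tensorflow import Module, TensorArray, argmax, convert_to_tensor, int64, newaxis, transpose


class Translate(Module):
    def __init__(self, inferencing_model, **kwargs):
        super().__init__(**kwargs)
        self.transformer = inferencing_model

    def load_tokenizer(self, name):
        # Load a tokenizer that was fitted and saved at the training stage
        with open(name, 'rb') as handle:
            return load(handle)

    def __call__(self, sentence):
        # Append the start and end-of-string tokens to the input sentence
        sentence[0] = "<START> " + sentence[0] + " <EOS>"

        # Load the encoder and decoder tokenizers generated at the training stage
        enc_tokenizer = self.load_tokenizer('enc_tokenizer.pkl')
        dec_tokenizer = self.load_tokenizer('dec_tokenizer.pkl')

        # Tokenize the input sentence, pad it to the maximum phrase length,
        # and convert it to a tensor
        encoder_input = enc_tokenizer.texts_to_sequences(sentence)
        encoder_input = pad_sequences(encoder_input, maxlen=enc_seq_length, padding='post')
        encoder_input = convert_to_tensor(encoder_input, dtype=int64)

        # Repeat the tokenization and tensor conversion for the <START> and <EOS> tokens
        output_start = dec_tokenizer.texts_to_sequences(["<START>"])
        output_start = convert_to_tensor(output_start[0], dtype=int64)
        output_end = dec_tokenizer.texts_to_sequences(["<EOS>"])
        output_end = convert_to_tensor(output_end[0], dtype=int64)

        # Prepare a dynamically sized output array, seeded with the <START> token
        decoder_output = TensorArray(dtype=int64, size=0, dynamic_size=True)
        decoder_output = decoder_output.write(0, output_start)

        for i in range(dec_seq_length):
            # Predict an output sequence; training=False disables the Dropout layers
            prediction = self.transformer(encoder_input,
                                          transpose(decoder_output.stack()),
                                          training=False)

            # Keep only the prediction for the last token in the sequence
            prediction = prediction[:, -1, :]

            # Select the token with the highest score and write it at the next
            # available index of the output array
            predicted_id = argmax(prediction, axis=-1)
            predicted_id = predicted_id[0][newaxis]
            decoder_output = decoder_output.write(i + 1, predicted_id)

            # Terminate the loop as soon as an <EOS> token is predicted
            if predicted_id == output_end:
                break

        # Decode the predicted token indices into words and return them as a list
        output = transpose(decoder_output.stack())[0]
        output = output.numpy()

        output_str = []
        for j in range(output.shape[0]):
            output_str.append(dec_tokenizer.index_word[output[j]])

        return output_str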

Want to Get Started With Building Transformer Models with Attention?

Take my free 12-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Testing Out the Code

In order to test out the code, let’s have a look at the test_dataset.txt file that you would have saved when preparing the dataset for training. This text file contains a set of English-German sentence pairs that have been reserved for testing, from which you can select a couple of sentences to test.

Let’s start with the first sentence:

The corresponding ground truth translation in German for this sentence, including the <START> and <EOS> decoder tokens, should be: <START> ich bin durstig <EOS>.

If you have a look at the plotted training and validation loss curves for this model (here, you are training for 20 epochs), you may notice that the decrease in validation loss slows down considerably and the curve starts plateauing at around epoch 16. 

So let’s proceed to load the saved model’s weights at the 16th epoch and check out the prediction that is generated by the model:
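A sketch of these lines is shown below. The checkpoint path (weights/wghts16.ckpt) and the spelled-out test sentence are assumptions: use the naming convention you adopted when saving the per-epoch weights, and the exact sentence you picked from test_dataset.txt.

# Sentence to translate, taken from the test dataset (the exact wording here is an
# assumption; use the first sentence from your test_dataset.txt file)
sentence = ['i am thirsty']

# Load the model weights that were saved at the 16th training epoch
inferencing_model.load_weights('weights/wghts16.ckpt')

# Create a new instance of the Translate class and generate the translation
translator = Translate(inferencing_model)
print(translator(sentence))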

Running the lines of code above produces a translated list of words that is equivalent to the ground truth German sentence that was expected (always keep in mind that since you are training the Transformer model from scratch, you may arrive at different results depending on the random initialization of the model weights). 

Let’s check out what would have happened if you had, instead, loaded a set of weights corresponding to a much earlier epoch, such as the 4th epoch.
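Doing so only requires loading a different weights file before calling the translator; the epoch-4 checkpoint path below is, again, an assumption that should match your own naming convention.

# Sentence to translate (redefined so the start/end tokens are appended afresh)
sentence = ['i am thirsty']

# Load the model weights that were saved at the 4th training epoch
inferencing_model.load_weights('weights/wghts4.ckpt')

# Generate a translation with the much less trained weights
translator = Translate(inferencing_model)
print(translator(sentence))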

In English, the generated translation reads something like: I in not not, which is clearly far off from the input English sentence. This is, however, expected since, at this epoch, the learning process of the Transformer model is still at the very early stages. 

Let’s try again with a second sentence from the test dataset:

The corresponding ground truth translation in German for this sentence, including the <START> and <EOS> decoder tokens, should be: <START> sind wir dann durch <EOS>.

The model’s translation for this sentence, using the weights saved at epoch 16, instead translates in English to: I was ready. While this is not equal to the ground truth, it is close to it in meaning. 

What the last test suggests, however, is that the Transformer model would have required many more data samples to train effectively. This is also corroborated by the fact that the validation loss curve plateaus at a relatively high value. 

Indeed, Transformer models are notorious for being very data hungry. Vaswani et al. (2017), for example, trained their English-to-German translation model using a dataset containing around 4.5 million sentence pairs. 

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs…For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences…

Attention Is All You Need, 2017.

They reported that it took them 3.5 days on 8 P100 GPUs to train the English-to-German translation model. 

In comparison, you have only trained on a dataset comprising 10,000 data samples here, split between training, validation, and test sets. 

So the next task is actually for you. If you have the computational resources available, try to train the Transformer model on a much larger set of sentence pairs and see if you can obtain better results than the translations obtained here with a limited amount of data. 

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Building Transformer Models with Attention

Papers

  • Attention Is All You Need, 2017

Summary

In this tutorial, you discovered how to run inference on the trained Transformer model for neural machine translation.

Specifically, you learned:

  • How to run inference on the trained Transformer model
  • How to generate text translations

Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.

Learn Transformers and Attention!

Building Transformer Models with Attention

Teach your deep learning model to read a sentence

...using transformer models with attention

Discover how in my new Ebook:
Building Transformer Models with Attention

It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...

Give your projects the magical power of understanding human language


See What's Inside


17 Responses to Inferencing the Transformer Model

  1. Jerzy October 21, 2022 at 8:11 pm #

    @ Jason Brownlee and @ Stefania Cristina do you plan to release book about transformers?

    • James Carmichael October 22, 2022 at 6:15 am #

      Great suggestion Jerzy! We appreciate the recommendation.

  2. Helen October 22, 2022 at 5:16 pm #

    Thanks for the great tutorial!
    Some errors happened when I ran the code. The traceback is as below.
    I am still struggling to find the bugs. I did not change any parameters in this tutorial.

    Traceback (most recent call last):
    File “E:\code\transformer\inference_trans.py”, line 101, in
    print(translator(sentence))
    File “E:\code\transformer\inference_trans.py”, line 45, in __call__
    prediction = self.transformer(encoder_input, transpose(decoder_output.stack()), training=False)
    File “C:\Anaconda3\envs\ML\lib\site-packages\keras\utils\traceback_utils.py”, line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
    File “E:\code\transformer\transformer.py”, line 198, in call
    decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)
    File “E:\code\transformer\transformer.py”, line 158, in call
    pos_encoding_output = self.pos_encoding(output_target)
    File “E:\code\transformer\positional_encoding.py”, line 47, in call
    embedded_words = self.word_embedding_layer(inputs)
    ValueError: Exception encountered when calling layer “position_embedding_fixed_weights_1” (type PositionEmbeddingFixedWeights).

    In this tf.Variable creation, the initial value’s shape ((2404, 512)) is not compatible with the explicitly supplied shape argument ((2405, 512)).

    Call arguments received by layer “position_embedding_fixed_weights” (type PositionEmbeddingFixedWeights):
    • inputs=tf.Tensor(shape=(1, 7), dtype=int64)

    • Stefania Cristina October 22, 2022 at 7:25 pm #

      Hi Helen, thank you for your message!

      When you inference the Transformer model, you need to make sure that you set these parameter values according to how your dataset was prepared at the training stage:

      # Define the dataset parameters
      enc_seq_length = 7 # Encoder sequence length
      dec_seq_length = 12 # Decoder sequence length
      enc_vocab_size = 2405 # Encoder vocabulary size
      dec_vocab_size = 3858 # Decoder vocabulary size

      From your error, I suspect that (at least) the value of the enc_vocab_size needs to change to 2404. Can you, please, check if your error is originating from here?

  3. Helen October 23, 2022 at 5:49 pm #

    Thanks for your help!
    It turned out that both enc_vocab_size and dec_vocab_size are set wrong.

  4. Alex October 25, 2022 at 7:06 pm #

    Hi! This is an excellent post, thanks for the efforts!

    I have the following doubt: during inference, the decoder is fed the token “START” from which it predicts “dec_seq_length”, 12 in this case. The shape of the decoder thus would be [batch_size, 12, d_model], from which only the last prediction is taken (prediction = prediction[:, -1, :]).

    My question is, do the remaining 11 predictions have any meaning? As the Transformer is trained with the values shifted to the right one unit I understand that those 11 are the previous words during training but in inference, I´m having a hard time understanding what it is predicting or if these values should just be omitted because they don´t have any meaning at all. From the forecasting point of view, I guess you can just omit them but I´m just curious.

    Thanks in advance!

  5. Lokesh January 21, 2023 at 2:43 am #

    Hi,

    Nice explanation. I created a hindi to english transliteration model using transformer in keras. The model is working really well. The problem I am facing is with inference time. Do you have any suggestions to reduce inference time?

    • James Carmichael January 21, 2023 at 8:48 am #

      Hi Lokesh…You may find value in using Google Colab with a GPU option.

  6. Gabriel August 17, 2023 at 7:27 am #

    Hi,

    As always great explanation and clean code! Thank you very much for such a great place to learn.

    Working on an implementation of Decision Transformers (DT) I realized that the authors don’t pad the inputs during inference, like you are doing here.

    It got me wondering why there is no padding for inference. Could this be just because there are only decoders? What would happen if we didn’t use padding during inference?

    • James Carmichael August 17, 2023 at 9:54 am #

      Hi Gabriel…You are very welcome! This is a great question. Can you give it a try so that we can learn from your results?

  7. Raja Ahsan July 29, 2024 at 10:18 pm #

    Hi,

    In the previous tutorial the method for saving the weights of model for every epoch, tensorflow raises an exception that the weight must be saved as “__.weights.h5” . Now during inferencing I have loaded the model instance first as shown in this tutorial and the arguments [enc_vocab_size = 2189 and dec_vocab_size=3447], i have trained the model for 10000 datasets.
    But for loading the weights, following error is raised

    Traceback (most recent call last):
    File “”, line 1, in
    training_model.load_weights(“wght0.weights.h5”)
    File “C:\Users\ABC\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\utils\traceback_utils.py”, line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
    File “C:\Users\ABC\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\saving\saving_lib.py”, line 456, in _raise_loading_failure
    raise ValueError(msg)
    ValueError: A total of 1 objects could not be loaded. Example error message for object :

    Layer ’embedding_2′ expected 1 variables, but received 0 variables during loading. Expected: [’embeddings’]

    List of objects that could not be loaded:
    []

    • James Carmichael July 31, 2024 at 8:23 am #

      Hi Raja…It appears that there might be a mismatch between the saved model weights and the model architecture during the loading process. This can happen if the model’s architecture has changed between saving and loading or if the weights are not properly aligned with the model layers.

      Here’s a step-by-step approach to troubleshoot and resolve this issue:

      ### Step-by-Step Guide to Load Model Weights Correctly

      #### 1. **Ensure Consistency in Model Architecture**
      – Ensure that the model architecture is exactly the same when you load the weights as it was when you saved them. Any changes in the model structure can cause issues during weight loading.

      #### 2. **Save Model Weights Correctly**
      – Use a consistent naming convention and ensure the file paths are correct.
      – Example:
       # Saving weights after each epoch
       checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
           filepath='model_weights_epoch_{epoch:02d}.weights.h5',
           save_weights_only=True,
           save_freq='epoch'
       )

      #### 3. **Load Model and Weights**
      – Define your model architecture before loading the weights.
      – Load the weights using the correct method.
      – Example:
      # Define model architecture
      model = create_model(enc_vocab_size=2189, dec_vocab_size=3447) # Ensure this matches the training model architecture

      # Load weights
      model.load_weights('model_weights_epoch_10.weights.h5')

      #### 4. **Example Code**

      Here is a complete example showing how to define, save, and load model weights:

       import tensorflow as tf
       from tensorflow.keras.models import Model
       from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

       def create_model(enc_vocab_size, dec_vocab_size, embedding_dim=256, units=512):
           # Define the model architecture
           encoder_inputs = Input(shape=(None,), name='encoder_inputs')
           encoder_embedding = Embedding(input_dim=enc_vocab_size, output_dim=embedding_dim, name='encoder_embedding')(encoder_inputs)
           encoder_lstm = LSTM(units, return_state=True, name='encoder_lstm')
           encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
           encoder_states = [state_h, state_c]

           decoder_inputs = Input(shape=(None,), name='decoder_inputs')
           decoder_embedding = Embedding(input_dim=dec_vocab_size, output_dim=embedding_dim, name='decoder_embedding')(decoder_inputs)
           decoder_lstm = LSTM(units, return_sequences=True, return_state=True, name='decoder_lstm')
           decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
           decoder_dense = Dense(dec_vocab_size, activation='softmax', name='decoder_dense')
           decoder_outputs = decoder_dense(decoder_outputs)

           model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
           return model

       # Instantiate the model
       model = create_model(enc_vocab_size=2189, dec_vocab_size=3447)

       # Compile the model (if necessary)
       model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

       # Train the model and save weights
       # Assume training_data and validation_data are defined
       # model.fit(training_data, epochs=10, validation_data=validation_data, callbacks=[checkpoint_callback])

       # Load weights
       try:
           model.load_weights('model_weights_epoch_10.weights.h5')
           print("Weights loaded successfully.")
       except ValueError as e:
           print("Error loading weights:", e)

      ### Key Points
      1. **Consistent Model Architecture**: Ensure the model architecture is the same during both saving and loading weights.
      2. **File Naming and Paths**: Double-check the file paths and names to ensure they match what was saved.
      3. **Model Compilation**: Sometimes, compiling the model before loading weights can resolve issues.

      By following these steps, you should be able to correctly load your model weights and avoid the ValueError related to mismatched variables. If you continue to face issues, ensure that the model architecture and weights files are compatible and correctly aligned.

      • Raja Ahsan August 4, 2024 at 10:36 pm #

        Thanks for this guide, I have resolved this issue as I have provided incorrect parameters while creating the instance of transformer, which causes the above-mentioned error.

        Now while inferencing the model predicts only the “eos” token instead of predicting the translation of the sentence. Any help in this regard is appreciated.

        Thanks

        • Raja Ahsan August 5, 2024 at 1:59 am #

          Hi ,

          while training the model, I added a cross-check for every epoch to see the trained model prediction to the input sentence. for every epoch iteration i have called the Translator and passed the training model instance to its argument to initiate translator instance and passed the input sentence to that instance but for every epoch the prediction of the model is “”. token

          INput and output of Translator

          1. tokenized input sentence:
          [[1, 3, 151, 1336, 2]]

          2. padded tokens:
          tf.Tensor([[ 1 3 151 1336 2 0 0]], shape=(1, 7), dtype=int64)

          3. Decoder output PREDICTION scores for each tokens in the vocabulary
          tf.Tensor([[-12.633082 -12.154094 5.066319 … -3.5435305 -3.546515
          -3.5011318]], shape=(1, 3474), dtype=float32)
          4. get the token with maximum score using tensorflow argmax
          tf.Tensor([2], shape=(1,), dtype=int64)
          5. Decoder output stack: tf.Tensor([1 2], shape=(2,), dtype=int64)

          Input sentence: [” i made cookies “]

          Result:
          translation: [‘start’, ‘eos’]

          • Markus Eder November 19, 2024 at 12:47 am #

            I have exactly the same problem. Did you find out what causes this problem?

      • Raja ahsan August 8, 2024 at 9:58 pm #

        Thanks for the tutorial, accomplished seq2seq translation from English to German…

