Inferencing the Transformer Model

Last Updated on January 6, 2023

We have seen how to train the Transformer model on a dataset of English and German sentence pairs, and how to plot the training and validation loss curves to diagnose the model’s learning performance and decide at which epoch to run inference. We are now ready to run inference on the trained model to translate an input sentence.

In this tutorial, you will discover how to run inference on the trained Transformer model for neural machine translation. 

After completing this tutorial, you will know:

  • How to run inference on the trained Transformer model
  • How to generate text translations

Let’s get started. 

Tutorial Overview

This tutorial is divided into three parts; they are:

  • Recap of the Transformer Architecture
  • Inferencing the Transformer Model
  • Testing Out the Code


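For this tutorial, we assume that you are already familiar with the Transformer architecture, its implementation, how to train it, and how to plot its training and validation loss curves.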

Recap of the Transformer Architecture

Recall that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need”

In generating an output sequence, the Transformer does not rely on recurrence or convolutions.

You have seen how to implement the complete Transformer model and subsequently train it on a dataset of English and German sentence pairs. Let’s now proceed to run inference on the trained model for neural machine translation. 

Inferencing the Transformer Model

Let’s start by creating a new instance of the TransformerModel class that was implemented in an earlier tutorial.

You will feed into it the relevant input arguments as specified in the paper of Vaswani et al. (2017) and the relevant information about the dataset in use: 
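A minimal sketch of these arguments is given below. The model values are those of the base model reported by Vaswani et al. (2017); the dataset values are only examples, since they must match how your own dataset was prepared at the training stage. The variable names are assumptions, and so is the argument order in the commented-out constructor call, which you should adapt to your own TransformerModel class:

```python
# Model parameters, following the base model of Vaswani et al. (2017)
h = 8           # Number of self-attention heads
d_k = 64        # Dimensionality of the linearly projected queries and keys
d_v = 64        # Dimensionality of the linearly projected values
d_model = 512   # Dimensionality of the model layers' outputs
d_ff = 2048     # Dimensionality of the inner fully connected layer
n = 6           # Number of layers in the encoder and decoder stacks

# Dataset parameters (example values; use the ones from your training stage)
enc_seq_length = 7     # Encoder sequence length
dec_seq_length = 12    # Decoder sequence length
enc_vocab_size = 2405  # Encoder vocabulary size
dec_vocab_size = 3858  # Decoder vocabulary size

dropout_rate = 0  # Dropout is inactive at inference, so the rate can be 0

# In the paper's base model, each head works on d_model / h dimensions
assert d_k == d_v == d_model // h

# Assumed call; adapt the argument order to your own TransformerModel class:
# inferencing_model = TransformerModel(enc_vocab_size, dec_vocab_size,
#                                      enc_seq_length, dec_seq_length,
#                                      h, d_k, d_v, d_model, d_ff, n, dropout_rate)
```

If the vocabulary sizes or sequence lengths differ from those used during training, the saved weights will not fit the layers of the new model instance.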

Here, note that the last input being fed into the TransformerModel corresponds to the dropout rate for each of the Dropout layers in the Transformer model. These Dropout layers will not be used during inference (you will eventually set the training argument to False), so you may safely set the dropout rate to 0.
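To see why the rate no longer matters, here is a framework-free sketch of inverted dropout, assuming the usual behaviour in which the layer returns its input unchanged when training is False. The dropout function below is a hypothetical stand-in written for illustration, not the Keras Dropout layer itself:

```python
import random

def dropout(values, rate, training):
    """Inverted dropout on a list of floats; a no-op when training is False."""
    if not training or rate == 0:
        return list(values)
    keep = 1.0 - rate
    # Zero each value with probability `rate` and rescale the survivors
    return [v / keep if random.random() < keep else 0.0 for v in values]

activations = [0.5, -1.2, 3.0, 0.7]

# At inference, the input passes through unchanged whatever the rate is
assert dropout(activations, rate=0.1, training=False) == activations
# A rate of 0 also leaves the input unchanged, even in training mode
assert dropout(activations, rate=0.0, training=True) == activations
```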

Furthermore, the TransformerModel class was already saved into a separate script. Hence, to be able to use the TransformerModel class, you need to include from model import TransformerModel.

Next, let’s create a class, Translate, that inherits from the Module base class in TensorFlow and assign the initialized inferencing model to the variable transformer:

When you trained the Transformer model, you saw that you first needed to tokenize the sequences of text that were to be fed into both the encoder and decoder. You achieved this by creating a vocabulary of words and replacing each word with its corresponding vocabulary index. 
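As a reminder of what that involves, here is a framework-free sketch that assumes a toy vocabulary built in order of first appearance, with index 0 reserved for padding. The helper names build_vocabulary and encode are hypothetical; the tokenizers used in this tutorial are the ones saved at training time, not this code:

```python
def build_vocabulary(sentences):
    """Map each distinct word to an integer index, reserving 0 for padding."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary

def encode(sentence, vocabulary):
    """Replace each word with its vocabulary index."""
    return [vocabulary[word] for word in sentence.split()]

corpus = ["<START> i am thirsty <EOS>", "<START> i am ready <EOS>"]
vocabulary = build_vocabulary(corpus)

print(encode("<START> i am ready <EOS>", vocabulary))  # [1, 2, 3, 6, 5]
```

A word that never appeared in the corpus has no index, which is one reason the same vocabulary has to be reused at inference time.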

You will need to implement a similar process during the inferencing stage before feeding the sequence of text to be translated into the Transformer model. 

For this purpose, you will include within the class the following load_tokenizer method, which will serve to load the encoder and decoder tokenizers that you would have generated and saved during the training stage:

It is important that you tokenize the input text at the inferencing stage using the same tokenizers generated at the training stage of the Transformer model since these tokenizers would have already been trained on text sequences similar to your testing data. 
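A minimal sketch of such a loader, written here as a standalone function rather than a class method and assuming that the tokenizers were saved with pickle (as the .pkl file names used in this tutorial suggest). A plain dictionary stands in for the real tokenizer object so that the example runs on its own:

```python
import os
import pickle
import tempfile

def load_tokenizer(name):
    """Load a pickled tokenizer object from the file called `name`."""
    with open(name, "rb") as handle:
        return pickle.load(handle)

# Stand-in for a tokenizer saved at training time (here, a plain dict)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "enc_tokenizer.pkl")
    with open(path, "wb") as handle:
        pickle.dump({"i": 1, "am": 2, "thirsty": 3}, handle)

    enc_tokenizer = load_tokenizer(path)

print(enc_tokenizer["thirsty"])  # 3
```

Since unpickling can execute arbitrary code, only load pickle files that you created yourself.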

The next step is to create the class method, __call__(), that will:

  • Add the start (<START>) token to the beginning and the end-of-string (<EOS>) token to the end of the input sentence.

  • Load the encoder and decoder tokenizers (in this case, saved in the enc_tokenizer.pkl and dec_tokenizer.pkl pickle files, respectively).

  • Prepare the input sentence by tokenizing it first, then padding it to the maximum phrase length, and subsequently converting it to a tensor.

  • Repeat a similar tokenization and tensor conversion procedure for the <START> and <EOS> tokens at the output.

  • Prepare the output array that will contain the translated text. Since you do not know the length of the translated sentence in advance, you will initialize the size of the output array to 0, but set its dynamic_size parameter to True so that it may grow past its initial size. You will then set the first value in this output array to the <START> token.

  • Iterate, up to the decoder sequence length, each time calling the Transformer model to predict an output token. Here, the training argument, which is passed on to each of the Transformer’s Dropout layers, is set to False so that no values are dropped during inference. The prediction with the highest score is then selected and written at the next available index of the output array. The for loop is terminated with a break statement as soon as an <EOS> token is predicted.

  • Decode the predicted tokens into an output list and return it.
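The steps above can be sketched without TensorFlow as follows. This is an illustration under stated assumptions, not the tutorial’s implementation: toy_model is a hypothetical stand-in that emits a fixed sequence of scores in place of the trained Transformer, the two vocabularies are toy dictionaries, and a plain Python list plays the role of the dynamically sized output array:

```python
def pad(tokens, length, pad_id=0):
    """Right-pad (or truncate) a token list to a fixed length."""
    return (tokens + [pad_id] * length)[:length]

def toy_model(encoder_input, decoder_input):
    """Stand-in for the trained Transformer: returns one list of scores
    (over a 6-token vocabulary) per decoder position."""
    canned = [3, 4, 5, 2]  # Token ids this toy emits, ending with <EOS>
    scores = []
    for position in range(len(decoder_input)):
        row = [0.0] * 6
        row[canned[min(position, len(canned) - 1)]] = 1.0
        scores.append(row)
    return scores

def translate(sentence, enc_vocab, dec_vocab, enc_seq_length, dec_seq_length):
    # Wrap the input in start and end-of-string tokens, tokenize, then pad
    words = ["<START>"] + sentence.split() + ["<EOS>"]
    encoder_input = pad([enc_vocab[w] for w in words], enc_seq_length)

    start_id, eos_id = dec_vocab["<START>"], dec_vocab["<EOS>"]
    output = [start_id]  # Grows by one token per iteration

    for _ in range(dec_seq_length):
        scores = toy_model(encoder_input, output)
        last = scores[-1]  # Only the final position predicts the next token
        next_id = max(range(len(last)), key=last.__getitem__)  # Greedy choice
        output.append(next_id)
        if next_id == eos_id:
            break  # Stop as soon as the end-of-string token is predicted

    # Decode the predicted token ids back into words
    id_to_word = {index: word for word, index in dec_vocab.items()}
    return [id_to_word[token] for token in output]

enc_vocab = {"<START>": 1, "<EOS>": 2, "i": 3, "am": 4, "thirsty": 5}
dec_vocab = {"<START>": 1, "<EOS>": 2, "ich": 3, "bin": 4, "durstig": 5}

print(translate("i am thirsty", enc_vocab, dec_vocab, 7, 12))
# ['<START>', 'ich', 'bin', 'durstig', '<EOS>']
```

Picking the highest-scoring token at every step is a greedy strategy: it is simple and fast, but it never revisits an earlier choice.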

The complete code listing, so far, is as follows:

Testing Out the Code

In order to test out the code, let’s have a look at the test_dataset.txt file that you would have saved when preparing the dataset for training. This text file contains a set of English-German sentence pairs that have been reserved for testing, from which you can select a couple of sentences to test.

Let’s start with the first sentence:

The corresponding ground truth translation in German for this sentence, including the <START> and <EOS> decoder tokens, should be: <START> ich bin durstig <EOS>.

If you have a look at the plotted training and validation loss curves for this model (here, you are training for 20 epochs), you may notice that the validation loss curve slows down considerably and starts plateauing at around epoch 16. 

So let’s proceed to load the saved model’s weights at the 16th epoch and check out the prediction that is generated by the model:

Running the lines of code above produces the following translated list of words:

This output is equivalent to the ground truth German sentence that was expected. Always keep in mind that, since you are training the Transformer model from scratch, you may arrive at different results depending on the random initialization of the model weights.

Let’s check out what would have happened if you had, instead, loaded a set of weights corresponding to a much earlier epoch, such as the 4th epoch. In this case, the generated translation is the following:

In English, this translates to: I in not not, which is clearly far off from the input English sentence, but which is expected since, at this epoch, the learning process of the Transformer model is still at the very early stages. 

Let’s try again with a second sentence from the test dataset:

The corresponding ground truth translation in German for this sentence, including the <START> and <EOS> decoder tokens, should be: <START> sind wir dann durch <EOS>.

The model’s translation for this sentence, using the weights saved at epoch 16, is:

This, instead, translates to: I was ready. While this is not equal to the ground truth either, it is close to its meaning.

What the last test suggests, however, is that the Transformer model might have required many more data samples to train effectively. This is also corroborated by the fact that the value at which the validation loss curve plateaus remains relatively high.

Indeed, Transformer models are notorious for being very data hungry. Vaswani et al. (2017), for example, trained their English-to-German translation model using a dataset containing around 4.5 million sentence pairs. 

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs…For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences…

Attention Is All You Need, 2017.

They reported that it took them 3.5 days on 8 P100 GPUs to train the English-to-German translation model. 

In comparison, you have only trained on a dataset comprising 10,000 data samples here, split between training, validation, and test sets. 

So the next task is actually for you. If you have the computational resources available, try to train the Transformer model on a much larger set of sentence pairs and see if you can obtain better results than the translations obtained here with a limited amount of data. 

Summary

In this tutorial, you discovered how to run inference on the trained Transformer model for neural machine translation.

Specifically, you learned:

  • How to run inference on the trained Transformer model
  • How to generate text translations

Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.

11 Responses to Inferencing the Transformer Model

  1. Jerzy October 21, 2022 at 8:11 pm

    @ Jason Brownlee and @ Stefania Cristina do you plan to release book about transformers?

    • James Carmichael October 22, 2022 at 6:15 am

      Great suggestion Jerzy! We appreciate the recommendation.

  2. Helen October 22, 2022 at 5:16 pm

    Thanks for the great tutorial!
    Some errors happened when I ran the code. The traceback is as below.
    I am still struggling to find the bugs. I did not change any parameters in this tutorial.

    Traceback (most recent call last):
    File “E:\code\transformer\”, line 101, in
    File “E:\code\transformer\”, line 45, in __call__
    prediction = self.transformer(encoder_input, transpose(decoder_output.stack()), training=False)
    File “C:\Anaconda3\envs\ML\lib\site-packages\keras\utils\”, line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
    File “E:\code\transformer\”, line 198, in call
    decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)
    File “E:\code\transformer\”, line 158, in call
    pos_encoding_output = self.pos_encoding(output_target)
    File “E:\code\transformer\”, line 47, in call
    embedded_words = self.word_embedding_layer(inputs)
    ValueError: Exception encountered when calling layer “position_embedding_fixed_weights_1″ ” f”(type PositionEmbeddingFixedWeights).

    In this tf.Variable creation, the initial value’s shape ((2404, 512)) is not compatible with the explicitly supplied shape argument ((2405, 512)).

    Call arguments received by layer “position_embedding_fixed_weights” ” f”(type PositionEmbeddingFixedWeights):
    \u2022 inputs=tf.Tensor(shape=(1, 7), dtype=int64)

    • Stefania Cristina October 22, 2022 at 7:25 pm

      Hi Helen, thank you for your message!

      When you inference the Transformer model, you need to make sure that you set these parameter values according to how your dataset was prepared at the training stage:

      # Define the dataset parameters
      enc_seq_length = 7 # Encoder sequence length
      dec_seq_length = 12 # Decoder sequence length
      enc_vocab_size = 2405 # Encoder vocabulary size
      dec_vocab_size = 3858 # Decoder vocabulary size

      From your error, I suspect that (at least) the value of the enc_vocab_size needs to change to 2404. Can you, please, check if your error is originating from here?

  3. Helen October 23, 2022 at 5:49 pm

    Thanks for your help!
    It turned out that both enc_vocab_size and dec_vocab_size are set wrong.

  4. Alex October 25, 2022 at 7:06 pm

    Hi! This is an excellent post, thanks for the efforts!

    I have the following doubt: during inference, the decoder is fed the token “START” from which it predicts “dec_seq_length”, 12 in this case. The shape of the decoder thus would be [batch_size, 12, d_model], from which only the last prediction is taken (prediction = prediction[:, -1, :]).

    My question is, do the remaining 11 predictions have any meaning? As the Transformer is trained with the values shifted to the right one unit I understand that those 11 are the previous words during training but in inference, I’m having a hard time understanding what it is predicting or if these values should just be omitted because they don’t have any meaning at all. From the forecasting point of view, I guess you can just omit them but I’m just curious.

    Thanks in advance!

  5. Lokesh January 21, 2023 at 2:43 am

    Nice explanation. I created a hindi to english transliteration model using transformer in keras. The model is working really well. The problem I am facing is with inference time. Do you have any suggestions to reduce inference time?

    • James Carmichael January 21, 2023 at 8:48 am

      Hi Lokesh…You may find value in using Google Colab with a GPU option.

  6. Gabriel August 17, 2023 at 7:27 am

    As always great explanation and clean code! Thank you very much for such a great place to learn.

    Working on an implementation of Decision Transformers (DT) I realized that the authors don’t pad the inputs during inference, like you are doing here.

    It got me wondering why there is no padding for inference. Could this be just because there are only decoders? What would happen if we didn’t use padding during inference?

    • James Carmichael August 17, 2023 at 9:54 am

      Hi Gabriel…You are very welcome! This is a great question. Can you give it a try so that we can learn from your results?
