A Gentle Introduction to Positional Encoding in Transformer Models, Part 1

Last Updated on January 6, 2023

In languages, the order of words and their positions in a sentence really matter. The meaning of the entire sentence can change if the words are re-ordered. When implementing NLP solutions, recurrent neural networks have an inbuilt mechanism that deals with the order of sequences. The transformer model, however, does not use recurrence or convolution and treats each data point as independent of the others. Hence, positional information is added to the model explicitly to retain the information regarding the order of words in a sentence. Positional encoding is the scheme through which the knowledge of the order of objects in a sequence is maintained.

For this tutorial, we’ll simplify the notations used in this remarkable paper, Attention Is All You Need by Vaswani et al. After completing this tutorial, you will know:

  • What is positional encoding, and why it’s important
  • Positional encoding in transformers
  • Code and visualize a positional encoding matrix in Python using NumPy

Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...

Let’s get started.

A gentle introduction to positional encoding in transformer models
Photo by Muhammad Murtaza Ghani on Unsplash, some rights reserved

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. What is positional encoding
  2. Mathematics behind positional encoding in transformers
  3. Implementing the positional encoding matrix using NumPy
  4. Understanding and visualizing the positional encoding matrix

What Is Positional Encoding?

Positional encoding describes the location or position of an entity in a sequence so that each position is assigned a unique representation. There are many reasons why a single number, such as the index value, is not used to represent an item’s position in transformer models. For long sequences, the indices can grow large in magnitude. If you normalize the index value to lie between 0 and 1, it can create problems for variable length sequences as they would be normalized differently.

Transformers use a smart positional encoding scheme, where each position/index is mapped to a vector. Hence, the output of the positional encoding layer is a matrix, where each row of the matrix represents an encoded object of the sequence summed with its positional information. An example of the matrix that encodes only the positional information is shown in the figure below.

A Quick Run-Through of the Trigonometric Sine Function

This is a quick recap of sine functions; you can work equivalently with cosine functions. The function’s range is [-1,+1]. The frequency of this waveform is the number of cycles completed in one second. The wavelength is the distance over which the waveform repeats itself. The wavelength and frequency for different waveforms are shown below:
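
As a quick illustration, here is a minimal matplotlib sketch that plots two sine waves of different frequencies so you can compare their wavelengths (the particular waveforms are chosen purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 4, 400)   # horizontal axis

plt.figure(figsize=(10, 3))
plt.plot(t, np.sin(2 * np.pi * t), label="sin(2*pi*t): wavelength 1")
plt.plot(t, np.sin(4 * np.pi * t), label="sin(4*pi*t): wavelength 0.5")
plt.xlabel("t")
plt.ylabel("amplitude")
plt.legend()
plt.show()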


Want to Get Started With Building Transformer Models with Attention?

Take my free 12-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Positional Encoding Layer in Transformers

Let’s dive straight into this. Suppose you have an input sequence of length $L$ and you want to encode the position of the $k^{th}$ object within this sequence. The positional encoding is given by sine and cosine functions of varying frequencies:

\begin{eqnarray}
P(k, 2i) &=& \sin\Big(\frac{k}{n^{2i/d}}\Big)\\
P(k, 2i+1) &=& \cos\Big(\frac{k}{n^{2i/d}}\Big)
\end{eqnarray}

Here:

$k$: Position of an object in the input sequence, $0 \leq k < L$

$d$: Dimension of the output embedding space

$P(k, j)$: Position function for mapping a position $k$ in the input sequence to index $(k,j)$ of the positional matrix

$n$: User-defined scalar, set to 10,000 by the authors of Attention Is All You Need.

$i$: Used for mapping to column indices $0 \leq i < d/2$; a single value of $i$ maps to one sine column and one cosine column

In the above expression, you can see that the even columns of the encoding (index $2i$) use the sine function and the odd columns (index $2i+1$) use the cosine function.

Example

To understand the above expression, let’s take an example of the phrase “I am a robot,” with n=100 and d=4. The following table shows the positional encoding matrix for this phrase. In fact, the positional encoding matrix would be the same for any four-word phrase with n=100 and d=4.
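
To see where such numbers come from, here is a minimal sketch that evaluates one row of that matrix by hand (the choice of position k = 2, the token “a”, is arbitrary and only for illustration):

import numpy as np

n, d = 100, 4

# Encoding of the token at position k = 2 ("a") in "I am a robot":
# columns alternate sin(k / n^(2i/d)) and cos(k / n^(2i/d)) for i = 0, 1
k = 2
row = [np.sin(k / n ** (2 * 0 / d)), np.cos(k / n ** (2 * 0 / d)),
       np.sin(k / n ** (2 * 1 / d)), np.cos(k / n ** (2 * 1 / d))]
print(np.round(row, 3))   # approximately [ 0.909 -0.416  0.199  0.98 ]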

Coding the Positional Encoding Matrix from Scratch

Here is a short Python code to implement positional encoding using NumPy. The code is simplified to make the understanding of positional encoding easier.
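
The listing below is a minimal sketch of such an implementation; the getPositionEncoding name and the for k in range(seq_len) loop follow the snippets quoted in the comments further down, so treat it as a close approximation rather than the exact original listing:

import numpy as np

def getPositionEncoding(seq_len, d, n=10000):
    """Return a (seq_len, d) matrix of sinusoidal positional encodings."""
    P = np.zeros((seq_len, d))
    for k in range(seq_len):              # position in the sequence
        for i in np.arange(int(d / 2)):   # index of the sine/cosine pair
            denominator = np.power(n, 2 * i / d)
            P[k, 2 * i] = np.sin(k / denominator)
            P[k, 2 * i + 1] = np.cos(k / denominator)
    return P

P = getPositionEncoding(seq_len=4, d=4, n=100)
print(P)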

Understanding the Positional Encoding Matrix

To understand the positional encoding, let’s start by looking at the sine wave for different positions with n=10,000 and d=512.
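
The exact plotting code is omitted here; a sketch along the following lines draws the encoding values of a few positions against the dimension-pair index i (the positions k = 0, 4, 8, 12 and the range of i are illustrative assumptions):

import numpy as np
import matplotlib.pyplot as plt

def plot_sinusoid(k, d=512, n=10000):
    # Values sin(k / n^(2i/d)) of position k across the first 100 sine columns
    i = np.arange(0, 100)
    denominator = np.power(n, 2 * i / d)
    plt.plot(i, np.sin(k / denominator), label="k = " + str(k))

plt.figure(figsize=(12, 4))
for k in [0, 4, 8, 12]:
    plot_sinusoid(k)
plt.xlabel("i")
plt.ylabel("sin(k / n^(2i/d))")
plt.legend()
plt.show()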

The following figure is the output of the above code:

Sine wave for different position indices

You can see that each position $k$ corresponds to a different sinusoid, which encodes a single position into a vector. If you look closely at the positional encoding function, you can see that the wavelength for a fixed $i$ is given by:

$$
\lambda_{i} = 2 \pi n^{2i/d}
$$

Hence, the wavelengths of the sinusoids form a geometric progression and vary from $2\pi$ to $2\pi n$. The scheme for positional encoding has a number of advantages.

  1. The sine and cosine functions have values in [-1, 1], which keeps the values of the positional encoding matrix in a normalized range.
  2. As the sinusoid for each position is different, you have a unique way of encoding each position.
  3. You have a way of measuring or quantifying the similarity between different positions, hence enabling you to encode the relative positions of words (see the dot-product sketch below).
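
To illustrate the third point, the dot product between two rows of the positional encoding matrix depends only on how far apart the two positions are, not on where they sit in the sequence, and it is typically larger for nearby positions. A minimal sketch, reusing the getPositionEncoding function defined above:

import numpy as np

# Assumes the getPositionEncoding function defined earlier in this post.
P = getPositionEncoding(seq_len=100, d=512, n=10000)

# Same distance (5) between positions -> (nearly) the same dot product
print(np.dot(P[10], P[15]))
print(np.dot(P[40], P[45]))

# A much larger distance (50) -> a smaller dot product
print(np.dot(P[10], P[60]))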

Visualizing the Positional Matrix

Let’s visualize the positional matrix on bigger values, using the matshow() function from the matplotlib library and setting n=10,000 as done in the original paper.
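
Here is a sketch of that visualization, assuming the getPositionEncoding function defined earlier:

import matplotlib.pyplot as plt

# Compute the full matrix and display it as an image
P = getPositionEncoding(seq_len=100, d=512, n=10000)
cax = plt.matshow(P)
plt.gcf().colorbar(cax)
plt.show()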

The positional encoding matrix for n=10,000, d=512, sequence length=100

What Is the Final Output of the Positional Encoding Layer?

The positional encoding layer sums the positional vector with the word encoding and outputs this matrix for the subsequent layers. The entire process is shown below.

The positional encoding layer in the transformer
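
A minimal sketch of this final step is shown below; the word embeddings are random placeholders standing in for the output of a learned embedding layer (and, in the original paper, the embeddings are also scaled by $\sqrt{d}$ before the sum):

import numpy as np

seq_len, d = 4, 512

# Placeholder for the output of a learned word embedding layer
word_embeddings = np.random.rand(seq_len, d)

# Positional information from the getPositionEncoding function defined earlier
P = getPositionEncoding(seq_len=seq_len, d=d, n=10000)

# The positional encoding layer adds the two, element-wise
encoder_input = word_embeddings + P
print(encoder_input.shape)   # (4, 512)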

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Attention Is All You Need, by Vaswani et al., 2017. https://arxiv.org/abs/1706.03762

Summary

In this tutorial, you discovered positional encoding in transformers.

Specifically, you learned:

  • What is positional encoding, and why it is needed.
  • How to implement positional encoding in Python using NumPy
  • How to visualize the positional encoding matrix

Do you have any questions about positional encoding discussed in this post? Ask your questions in the comments below, and I will do my best to answer.

Learn Transformers and Attention!

Building Transformer Models with Attention

Teach your deep learning model to read a sentence

...using transformer models with attention

Discover how in my new Ebook:
Building Transformer Models with Attention

It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can
translate sentences from one language to another...

Give your projects the magical power of understanding human language


See What's Inside


29 Responses to A Gentle Introduction to Positional Encoding in Transformer Models, Part 1

  1. yuanmu, April 13, 2022 at 11:42 pm

    Thanks for the great explanation!
    Should the range of k be [0, L) instead of [0, L/2)?
    Since the code: for k in range(seq_len)

      • seth G, June 22, 2022 at 11:05 am

        Sorry but I skimmed through the link and I still don’t see why k<L/2.

    • Lucas Thimoteo, October 10, 2022 at 11:23 am

      Hello, I read both links and I think k < L/2 is a mistake. It should be k < L, since k is the index corresponding to the token in the sequence.

      On the other hand, i < d/2 makes total sense because the progression of the dimension is built on 2i and 2i+1.

      • Anonymous, January 30, 2023 at 8:47 pm

        Completely agree with you!

      • Andrei Serebro, April 3, 2023 at 2:43 am

        Yes, that’s right. James, seems you do not try to understand what people in comments are saying to you, I had a feeling that there is no good understanding from your side, or it was just lack of time you were ready to invest into the material you prepared.

  2. Yoan B., June 4, 2022 at 4:26 am

    Thanks Jason great tutorials !

    I think there’re errors in the trigonometric table section.
    The graphs for sin(2 * 2Pi) and sin(t) go beyond the range [-1:1], either the graph is wrong or the formulas on the left are not the corresponding one.

    • James Carmichael, June 4, 2022 at 10:15 am

      Thank you for the feedback Yoan B!

  3. Tom O., June 4, 2022 at 1:56 pm

    Very good! Note that I would add plt.show() to avoid head scratching when pasting the examples into ipython.

    • James Carmichael, June 5, 2022 at 10:20 am

      Great feedback Tom!

  4. Shrikant Malviya, August 15, 2022 at 11:21 pm

    “In the above expression we can see that even positions correspond to sine function and odd positions correspond to even positions.”

    Something is wrong or missing in the above statement.

    • James Carmichael, August 16, 2022 at 9:47 am

      Hi Shrikant…Thank you for the feedback! We will review statement in question.

  5. Noman Saleem, August 26, 2022 at 10:15 pm

    Very Nicely Explained. Thanks 🙂

    • James Carmichael, August 27, 2022 at 6:08 am

      You are very welcome! Thank you for your feedback and support Noman!

  6. abraham, September 24, 2022 at 12:26 am

    Hi,
    Is it plausible to use positional encoding for time series prediction with LSTM and Conv1D?

  7. Lucas Thimoteo, October 10, 2022 at 11:23 am

    Hello, I read both links and I think k < L/2 is a mistake. It should be k < L, since k is the index corresponding to the token in the sequence.

    On the other hand, i < d/2 makes total sense because the progression of the dimension is built on 2i and 2i+1.

    • James Carmichael, October 11, 2022 at 6:57 am

      Hi Luca…Thank you for your support and feedback! We will review the content.

  8. Mayank, November 6, 2022 at 7:26 am

    Hi, I have two questions

    1. Do positional embeddings learn just like word embeddings or the embedding values are assigned just based on sine and the cosine graph?

    2. Are positional embedding and word embedding values independent of each other?

  9. A, March 10, 2023 at 1:44 am

    The code doesnt work for an odd embedding vector dimension, the last position would always be left without any assign, could easily be solved with a if statement, but I wonder if odd dimensions for embeddings are even used.

    • James Carmichael, March 10, 2023 at 8:05 am

      Thank you for your feedback A!

  10. Abdi, April 7, 2023 at 10:17 pm

    Excuse me if, for time series forecasting with transformer encoder positional encoding and input, masking is necessary. Because I think in chronicle arranged time series, inputs are ordered in time, and there is no any displacement, also when we use walk forward valodation.

  11. sergiu, June 29, 2023 at 2:32 am

    The positional encoding for example “I am a robot” looks strange for me, same as the output of “getPositionEncoding” function and many questions arise, which makes me more confused. First question, Values are not unique : P01 and P03 or P03 and P13, which one should contain unique values: rows or cols? As I understood the row size = embedding size, and if a row represents the positional encoding for one token, then we can use same values for row. Second question, values are not linear increasing: P00, P10, P20, P30. There is no order, how do we know which one is first, second, … last one if values are not increasing linearly?

    • James Carmichael, June 29, 2023 at 8:52 am

      Hi sergiu…Please rephrase and/or simplify your query if possible so that we may better assist you.

  12. Ryan, July 5, 2023 at 10:30 am

    I think this article is not as good as the others on this website. The concept is not clearly explained. Suggest adding more details and having a good logic between the contents.

    • James Carmichael, July 6, 2023 at 8:33 am

      Thank you Ryan for your feedback and suggestions!
