The post The Transformer Positional Encoding Layer in Keras, Part 2 appeared first on Machine Learning Mastery.

After completing this tutorial, you will know:

- Text vectorization in Keras
- Embedding layer in Keras
- How to subclass the embedding layer and write your own positional encoding layer.

Let’s get started.

This tutorial is divided into three parts; they are:

- Text vectorization and embedding layer in Keras
- Writing your own positional encoding layer in Keras
  - Randomly initialized and tunable embeddings
  - Fixed weight embeddings from Attention Is All You Need
- Graphical view of the output of the positional encoding layer

First, let’s write the code to import all the required libraries:

```python
import tensorflow as tf
from tensorflow import convert_to_tensor, string
from tensorflow.keras.layers import TextVectorization, Embedding, Layer
from tensorflow.data import Dataset
import numpy as np
import matplotlib.pyplot as plt
```

We’ll start with a set of English phrases, which are already preprocessed and cleaned. The text vectorization layer creates a dictionary of words and replaces each word by its corresponding index in the dictionary. Let’s see how we can map these two sentences using the text vectorization layer:

- I am a robot
- you too robot

Note that we have already converted the text to lowercase and removed all the punctuation and noise in the text. We’ll convert these two phrases to vectors of a fixed length of 5. The `TextVectorization` layer of Keras requires a maximum vocabulary size and the required length of the output sequence for initialization. The output of the layer is a tensor of shape:

`(number of sentences, output sequence length)`

The following code snippet uses the `adapt` method to generate a vocabulary. It then creates a vectorized representation of the text.

```python
output_sequence_length = 5
vocab_size = 10
sentences = [["I am a robot"], ["you too robot"]]
sentence_data = Dataset.from_tensor_slices(sentences)
# Create the TextVectorization layer
vectorize_layer = TextVectorization(
                  output_sequence_length=output_sequence_length,
                  max_tokens=vocab_size)
# Train the layer to create a dictionary
vectorize_layer.adapt(sentence_data)
# Convert all sentences to tensors
word_tensors = convert_to_tensor(sentences, dtype=tf.string)
# Use the word tensors to get vectorized phrases
vectorized_words = vectorize_layer(word_tensors)
print("Vocabulary: ", vectorize_layer.get_vocabulary())
print("Vectorized words: ", vectorized_words)
```

```
Vocabulary:  ['', '[UNK]', 'robot', 'you', 'too', 'i', 'am', 'a']
Vectorized words:  tf.Tensor(
[[5 6 7 2 0]
 [3 4 2 0 0]], shape=(2, 5), dtype=int64)
```

The Keras `Embedding` layer converts integers to dense vectors. This layer maps these integers to random numbers, which are later tuned during the training phase. However, you also have the option to set the mapping to some predefined weight values (shown later). To initialize this layer, we need to specify the maximum value of an integer to map, along with the length of the output sequence.

Let’s see how the layer converts our `vectorized_words` to tensors.

```python
output_length = 6
word_embedding_layer = Embedding(vocab_size, output_length)
embedded_words = word_embedding_layer(vectorized_words)
print(embedded_words)
```

I have annotated the output with my comments below. Note that you will see a different output every time you run this code because the weights are initialized randomly.

We also need the embeddings for the corresponding positions. The maximum number of positions corresponds to the output sequence length of the `TextVectorization` layer.

```python
position_embedding_layer = Embedding(output_sequence_length, output_length)
position_indices = tf.range(output_sequence_length)
embedded_indices = position_embedding_layer(position_indices)
print(embedded_indices)
```

The output is shown below:

In a transformer model, the final output is the sum of both the word embeddings and the position embeddings. Hence, when you set up both embedding layers, you need to make sure that the `output_length` is the same for both.

```python
final_output_embedding = embedded_words + embedded_indices
print("Final output: ", final_output_embedding)
```

The output is shown below, annotated with my comments. Again, your output will differ because of the random weight initialization.

When implementing a transformer model, you’ll have to write your own positional encoding layer. This is quite simple, as the basic functionality is already provided for you. This Keras example shows how you can subclass the `Embedding` layer to implement your own functionality. You can add more methods to it as you require.

```python
class PositionEmbeddingLayer(Layer):
    def __init__(self, sequence_length, vocab_size, output_dim, **kwargs):
        super(PositionEmbeddingLayer, self).__init__(**kwargs)
        self.word_embedding_layer = Embedding(
            input_dim=vocab_size, output_dim=output_dim
        )
        self.position_embedding_layer = Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )

    def call(self, inputs):
        position_indices = tf.range(tf.shape(inputs)[-1])
        embedded_words = self.word_embedding_layer(inputs)
        embedded_indices = self.position_embedding_layer(position_indices)
        return embedded_words + embedded_indices
```

Let’s run this layer.

```python
my_embedding_layer = PositionEmbeddingLayer(output_sequence_length,
                                            vocab_size, output_length)
embedded_layer_output = my_embedding_layer(vectorized_words)
print("Output from my_embedded_layer: ", embedded_layer_output)
```

```
Output from my_embedded_layer:  tf.Tensor(
[[[ 0.06798736 -0.02821309  0.00571618  0.00314623 -0.03060734  0.01111387]
  [-0.06097465  0.03966043 -0.05164248  0.06578685  0.03638128 -0.03397174]
  [ 0.06715029 -0.02453769  0.02205854  0.01110986  0.02345785  0.05879898]
  [-0.04625867  0.07500569 -0.05690887 -0.07615659  0.01962536  0.00035865]
  [ 0.01423577 -0.03938593 -0.08625181  0.04841495  0.06951572  0.08811047]]

 [[ 0.0163899   0.06895607 -0.01131684  0.01810524 -0.05857501  0.01811318]
  [ 0.01915303 -0.0163289  -0.04133433  0.06810946  0.03736673  0.04218033]
  [ 0.00795418 -0.00143972 -0.01627307 -0.00300788 -0.02759011  0.09251165]
  [ 0.0028762   0.04526488 -0.05222676 -0.02007698  0.07879823  0.00541583]
  [ 0.01423577 -0.03938593 -0.08625181  0.04841495  0.06951572  0.08811047]]], shape=(2, 5, 6), dtype=float32)
```

Note that the above class creates an embedding layer with trainable weights. Hence, the weights are initialized randomly and tuned during the training phase.
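As a quick check (my own sketch, not from the original post), you can confirm that a default `Embedding` layer exposes trainable weights that the optimizer would update:

```python
import tensorflow as tf
from tensorflow.keras.layers import Embedding

# Build a default Embedding layer by calling it on some token indices
layer = Embedding(input_dim=10, output_dim=6)
_ = layer(tf.constant([[1, 2, 3, 4, 0]]))  # forces creation of the weight matrix

print(len(layer.trainable_weights))      # one trainable weight matrix
print(layer.trainable_weights[0].shape)  # (vocab_size, output_dim) = (10, 6)
```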

The authors of Attention is All You Need have specified a positional encoding scheme as shown below. You can read the full details in part 1 of this tutorial:

\begin{eqnarray}
P(k, 2i) &=& \sin\Big(\frac{k}{n^{2i/d}}\Big)\\
P(k, 2i+1) &=& \cos\Big(\frac{k}{n^{2i/d}}\Big)
\end{eqnarray}

If you want to use the same positional encoding scheme, you can specify your own embedding matrix, as discussed in part 1, which shows how to create your own embeddings in NumPy. When specifying the `Embedding` layer, you need to provide the positional encoding matrix as weights along with `trainable=False`. Let’s create another positional embedding class that does exactly this.

```python
class PositionEmbeddingFixedWeights(Layer):
    def __init__(self, sequence_length, vocab_size, output_dim, **kwargs):
        super(PositionEmbeddingFixedWeights, self).__init__(**kwargs)
        word_embedding_matrix = self.get_position_encoding(vocab_size, output_dim)
        position_embedding_matrix = self.get_position_encoding(sequence_length, output_dim)
        self.word_embedding_layer = Embedding(
            input_dim=vocab_size, output_dim=output_dim,
            weights=[word_embedding_matrix],
            trainable=False
        )
        self.position_embedding_layer = Embedding(
            input_dim=sequence_length, output_dim=output_dim,
            weights=[position_embedding_matrix],
            trainable=False
        )

    def get_position_encoding(self, seq_len, d, n=10000):
        P = np.zeros((seq_len, d))
        for k in range(seq_len):
            for i in np.arange(int(d/2)):
                denominator = np.power(n, 2*i/d)
                P[k, 2*i] = np.sin(k/denominator)
                P[k, 2*i+1] = np.cos(k/denominator)
        return P

    def call(self, inputs):
        position_indices = tf.range(tf.shape(inputs)[-1])
        embedded_words = self.word_embedding_layer(inputs)
        embedded_indices = self.position_embedding_layer(position_indices)
        return embedded_words + embedded_indices
```

Next, we set up everything to run this layer.

```python
attnisallyouneed_embedding = PositionEmbeddingFixedWeights(output_sequence_length,
                                                           vocab_size, output_length)
attnisallyouneed_output = attnisallyouneed_embedding(vectorized_words)
print("Output from my_embedded_layer: ", attnisallyouneed_output)
```

```
Output from my_embedded_layer:  tf.Tensor(
[[[-0.9589243   1.2836622   0.23000172  1.9731903   0.01077196  1.9999421 ]
  [ 0.56205547  1.5004725   0.3213085   1.9603932   0.01508068  1.9999142 ]
  [ 1.566284    0.3377554   0.41192317  1.9433732   0.01938933  1.999877  ]
  [ 1.0504174  -1.4061394   0.2314966   1.9860148   0.01077211  1.9999698 ]
  [-0.7568025   0.3463564   0.18459873  1.982814    0.00861763  1.9999628 ]]

 [[ 0.14112     0.0100075   0.1387981   1.9903207   0.00646326  1.9999791 ]
  [ 0.08466846 -0.11334133  0.23099795  1.9817369   0.01077207  1.9999605 ]
  [ 1.8185948  -0.8322937   0.185397    1.9913884   0.00861771  1.9999814 ]
  [ 0.14112     0.0100075   0.1387981   1.9903207   0.00646326  1.9999791 ]
  [-0.7568025   0.3463564   0.18459873  1.982814    0.00861763  1.9999628 ]]], shape=(2, 5, 6), dtype=float32)
```

In order to visualize the embeddings, let’s take two bigger sentences: one technical, the other just a quote. We’ll set up the `TextVectorization` layer along with the positional encoding layer and see what the final output looks like.

```python
technical_phrase = "to understand machine learning algorithms you need" +\
                   " to understand concepts such as gradient of a function "+\
                   "Hessians of a matrix and optimization etc"
wise_phrase = "patrick henry said give me liberty or give me death "+\
              "when he addressed the second virginia convention in march"

total_vocabulary = 200
sequence_length = 20
final_output_len = 50
phrase_vectorization_layer = TextVectorization(
                  output_sequence_length=sequence_length,
                  max_tokens=total_vocabulary)
# Learn the dictionary
phrase_vectorization_layer.adapt([technical_phrase, wise_phrase])
# Convert all sentences to tensors
phrase_tensors = convert_to_tensor([technical_phrase, wise_phrase],
                                   dtype=tf.string)
# Use the word tensors to get vectorized phrases
vectorized_phrases = phrase_vectorization_layer(phrase_tensors)

random_weights_embedding_layer = PositionEmbeddingLayer(sequence_length,
                                                        total_vocabulary,
                                                        final_output_len)
fixed_weights_embedding_layer = PositionEmbeddingFixedWeights(sequence_length,
                                                              total_vocabulary,
                                                              final_output_len)
random_embedding = random_weights_embedding_layer(vectorized_phrases)
fixed_embedding = fixed_weights_embedding_layer(vectorized_phrases)
```

Now let’s see what the random embeddings look like for both phrases.

```python
fig = plt.figure(figsize=(15, 5))
title = ["Tech Phrase", "Wise Phrase"]
for i in range(2):
    ax = plt.subplot(1, 2, 1+i)
    matrix = tf.reshape(random_embedding[i, :, :], (sequence_length, final_output_len))
    cax = ax.matshow(matrix)
    plt.gcf().colorbar(cax)
    plt.title(title[i], y=1.2)
fig.suptitle("Random Embedding")
plt.show()
```

The embeddings from the fixed weights layer are visualized below.

```python
fig = plt.figure(figsize=(15, 5))
title = ["Tech Phrase", "Wise Phrase"]
for i in range(2):
    ax = plt.subplot(1, 2, 1+i)
    matrix = tf.reshape(fixed_embedding[i, :, :], (sequence_length, final_output_len))
    cax = ax.matshow(matrix)
    plt.gcf().colorbar(cax)
    plt.title(title[i], y=1.2)
fig.suptitle("Fixed Weight Embedding from Attention is All You Need")
plt.show()
```

We can see that the embedding layer initialized with default parameters outputs random values. On the other hand, the fixed weights generated using sinusoids create a unique signature for every phrase, with information about each word’s position encoded within it.

You can experiment with both tunable and fixed weight implementations for your particular application.

This section provides more resources on the topic if you are looking to go deeper.

- Transformers for natural language processing, by Denis Rothman.

- Attention Is All You Need, 2017.

- The Transformer Attention Mechanism
- The Transformer Model
- Transformer Model for Language Understanding
- Using Pre-Trained Word Embeddings in a Keras Model
- English-to-Spanish translation with a sequence-to-sequence Transformer
- A Gentle Introduction to Positional Encoding in Transformer Models, Part 1

In this tutorial, you discovered the implementation of the positional encoding layer in Keras.

Specifically, you learned:

- Text vectorization layer in Keras
- Positional encoding layer in Keras
- Creating your own class for positional encoding
- Setting your own weights for the positional encoding layer in Keras

Do you have any questions about positional encoding discussed in this post? Ask your questions in the comments below and I will do my best to answer.


The post A Gentle Introduction to Positional Encoding In Transformer Models, Part 1 appeared first on Machine Learning Mastery.

For this tutorial, we’ll simplify the notation used in this awesome paper, Attention Is All You Need, by Vaswani et al. After completing this tutorial, you will know:

- What is positional encoding and why it’s important
- Positional encoding in transformers
- How to code and visualize a positional encoding matrix in Python using NumPy

Let’s get started.

This tutorial is divided into four parts; they are:

- What is positional encoding
- Mathematics behind positional encoding in transformers
- Implementing the positional encoding matrix using NumPy
- Understanding and visualizing the positional encoding matrix

Positional encoding describes the location or position of an entity in a sequence so that each position is assigned a unique representation. There are many reasons why a single number such as the index value is not used to represent an item’s position in transformer models. For long sequences, the indices can grow large in magnitude. If you normalize the index value to lie between 0 and 1, it can create problems for variable length sequences as they would be normalized differently.
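As a small illustration of the normalization problem (my own example, not from the post), the same position receives a different encoding depending on the sequence length:

```python
# Normalizing the index to [0, 1] ties the encoding to the sequence length.
def normalized_positions(length):
    return [k / (length - 1) for k in range(length)]

short = normalized_positions(4)
long = normalized_positions(10)
# Position 1 encodes as a different value in each sequence:
print(short[1], long[1])
```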

Transformers use a smart positional encoding scheme, where each position/index is mapped to a vector. Hence, the output of the positional encoding layer is a matrix, where each row of the matrix represents an encoded object of the sequence summed with its positional information. An example of the matrix that encodes only the positional information is shown in the figure below.

Here is a quick recap of sine functions; you can work equivalently with cosine functions. The function’s range is [-1, +1]. The frequency of a waveform is the number of cycles completed in one second. The wavelength is the distance over which the waveform repeats itself. The wavelengths and frequencies of different waveforms are shown below:

Let’s dive straight into this. Suppose we have an input sequence of length $L$ and we require the position of the $k^{th}$ object within this sequence. The positional encoding is given by sine and cosine functions of varying frequencies:

\begin{eqnarray}
P(k, 2i) &=& \sin\Big(\frac{k}{n^{2i/d}}\Big)\\
P(k, 2i+1) &=& \cos\Big(\frac{k}{n^{2i/d}}\Big)
\end{eqnarray}

Here:

- $k$: Position of an object in the input sequence, $0 \leq k < L$
- $d$: Dimension of the output embedding space
- $P(k, j)$: Position function for mapping a position $k$ in the input sequence to index $(k, j)$ of the positional matrix
- $n$: User-defined scalar, set to 10,000 by the authors of Attention is all You Need
- $i$: Used for mapping to column indices, $0 \leq i < d/2$; a single value of $i$ maps to both sine and cosine functions

In the above expression, we can see that even positions correspond to the sine function and odd positions correspond to the cosine function.

To understand the above expression, let’s take the example of the phrase “I am a robot,” with n=100 and d=4. The following table shows the positional encoding matrix for this phrase. In fact, the positional encoding matrix would be the same for any four-word phrase with n=100 and d=4.

Here is a short Python code to implement positional encoding using NumPy. The code is simplified to make the understanding of positional encoding easier.

```python
import numpy as np
import matplotlib.pyplot as plt

def getPositionEncoding(seq_len, d, n=10000):
    P = np.zeros((seq_len, d))
    for k in range(seq_len):
        for i in np.arange(int(d/2)):
            denominator = np.power(n, 2*i/d)
            P[k, 2*i] = np.sin(k/denominator)
            P[k, 2*i+1] = np.cos(k/denominator)
    return P

P = getPositionEncoding(seq_len=4, d=4, n=100)
print(P)
```

```
[[ 0.          1.          0.          1.        ]
 [ 0.84147098  0.54030231  0.09983342  0.99500417]
 [ 0.90929743 -0.41614684  0.19866933  0.98006658]
 [ 0.14112001 -0.9899925   0.29552021  0.95533649]]
```

To understand the positional encoding, let’s start by looking at the sine wave for different positions with n=10,000 and d=512.

```python
def plotSinusoid(k, d=512, n=10000):
    x = np.arange(0, 100, 1)
    denominator = np.power(n, 2*x/d)
    y = np.sin(k/denominator)
    plt.plot(x, y)
    plt.title('k = ' + str(k))

fig = plt.figure(figsize=(15, 4))
for i in range(4):
    plt.subplot(141 + i)
    plotSinusoid(i*4)
```

The following figure is the output of the above code:

We can see that each position $k$ corresponds to a different sinusoid, which encodes a single position into a vector. If we look closely at the positional encoding function, we can see that the wavelength for a fixed $i$ is given by:

$$
\lambda_{i} = 2 \pi n^{2i/d}
$$

Hence, the wavelengths of the sinusoids form a geometric progression and vary from $2\pi$ to $2\pi n$. The scheme for positional encoding has a number of advantages.

- The sine and cosine functions have values in [-1, 1], which keeps the values of the positional encoding matrix in a normalized range.
- As the sinusoid for each position is different, we have a unique way of encoding each position.
- We have a way of measuring or quantifying the similarity between different positions, hence enabling us to encode relative positions of words.
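Two quick numerical checks of these claims (my own sketch, reusing the sinusoidal scheme from the tutorial):

```python
import numpy as np

def get_position_encoding(seq_len, d, n=10000):
    # Same sinusoidal scheme as in the code listing above
    P = np.zeros((seq_len, d))
    for k in range(seq_len):
        for i in np.arange(int(d/2)):
            denominator = np.power(n, 2*i/d)
            P[k, 2*i] = np.sin(k/denominator)
            P[k, 2*i+1] = np.cos(k/denominator)
    return P

# 1. The wavelength for a fixed i is 2*pi*n^(2i/d): the sinusoid
#    repeats after exactly one wavelength.
n, d, i, k = 10000, 128, 3, 7.0
wavelength = 2 * np.pi * n ** (2 * i / d)
denominator = n ** (2 * i / d)
print(np.isclose(np.sin(k / denominator),
                 np.sin((k + wavelength) / denominator)))  # True

# 2. Dot products between rows of the encoding matrix quantify similarity:
#    nearby positions score higher than distant ones.
P = get_position_encoding(seq_len=50, d=128)
sim = P @ P.T
print(sim[10, 11] > sim[10, 30])  # True
```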

Let’s visualize the positional matrix for bigger values. We’ll use the `matshow()` method from Python’s `matplotlib` library. Setting n=10,000 as done in the original paper, we get the following:

```python
P = getPositionEncoding(seq_len=100, d=512, n=10000)
cax = plt.matshow(P)
plt.gcf().colorbar(cax)
```

The positional encoding layer sums the positional vector with the word encoding and outputs this matrix for the next layers. The entire process is shown below.

This section provides more resources on the topic if you are looking to go deeper.

- Transformers for natural language processing, by Denis Rothman.

- Attention Is All You Need, 2017.

- The Transformer Attention Mechanism
- The Transformer Model
- Transformer model for language understanding

In this tutorial, you discovered positional encoding in transformers.

Specifically, you learned:

- What is positional encoding and why it is needed.
- How to implement positional encoding in Python using NumPy
- How to visualize the positional encoding matrix

Do you have any questions about positional encoding discussed in this post? Ask your questions in the comments below and I will do my best to answer.


The post The Transformer Model appeared first on Machine Learning Mastery.

In this tutorial, you will discover the network architecture of the Transformer model.

After completing this tutorial, you will know:

- How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions.
- How the Transformer encoder and decoder work.
- How the Transformer self-attention compares to the use of recurrent and convolutional layers.

Let’s get started.

This tutorial is divided into three parts; they are:

- The Transformer Architecture
  - The Encoder
  - The Decoder
- Sum Up: The Transformer Model
- Comparison to Recurrent and Convolutional Layers

For this tutorial, we assume that you are already familiar with:

The Transformer architecture follows an encoder-decoder structure, but does not rely on recurrence and convolutions in order to generate an output.

In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder.

The decoder, on the right half of the architecture, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence.

At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

–Attention Is All You Need, 2017.

The encoder consists of a stack of $N$ = 6 identical layers, where each layer is composed of two sublayers:

- The first sublayer implements a multi-head self-attention mechanism. We have seen that the multi-head mechanism implements $h$ heads that each receive a (different) linearly projected version of the queries, keys, and values, to produce $h$ outputs in parallel that are then used to generate a final result.

- The second sublayer is a fully connected feed-forward network, consisting of two linear transformations with Rectified Linear Unit (ReLU) activation in between:

$$\text{FFN}(x) = \text{ReLU}(\mathbf{W}_1 x + b_1) \mathbf{W}_2 + b_2$$
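A minimal Keras sketch of this feed-forward sublayer (my own illustration; the inner dimension of 2048 and $d_{\text{model}} = 512$ are the values used in the paper):

```python
import tensorflow as tf

d_model, d_ff = 512, 2048
ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(d_ff, activation="relu"),  # first linear transformation + ReLU
    tf.keras.layers.Dense(d_model),                  # second linear transformation
])
x = tf.random.normal((2, 10, d_model))   # (batch, sequence length, d_model)
print(ffn(x).shape)                      # (2, 10, 512)
```

The same two dense layers are applied to every position in the sequence independently, which is why this sublayer is called position-wise.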

The six layers of the Transformer encoder apply the same linear transformations to all of the words in the input sequence, but *each* layer employs different weight ($\mathbf{W}_1, \mathbf{W}_2$) and bias ($b_1, b_2$) parameters to do so.

Furthermore, each of these two sublayers has a residual connection around it.

Each sublayer is also succeeded by a normalization layer, $\text{layernorm}(.)$, which normalizes the sum computed between the sublayer input, $x$, and the output generated by the sublayer itself, $\text{sublayer}(x)$:

$$\text{layernorm}(x + \text{sublayer}(x))$$
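This Add & Norm step can be sketched in Keras as follows (my own illustration, with a dense layer standing in for the attention or feed-forward sublayer):

```python
import tensorflow as tf

d_model = 512
sublayer = tf.keras.layers.Dense(d_model)    # stand-in for an actual sublayer
layernorm = tf.keras.layers.LayerNormalization()
x = tf.random.normal((2, 10, d_model))
out = layernorm(x + sublayer(x))             # layernorm(x + sublayer(x))
print(out.shape)  # (2, 10, 512)
```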

An important consideration to keep in mind is that the Transformer architecture cannot inherently capture any information about the relative positions of the words in the sequence, since it does not make use of recurrence. This information has to be injected by introducing *positional encodings* to the input embeddings.

The positional encoding vectors are of the same dimension as the input embeddings, and are generated using sine and cosine functions of different frequencies. Then, they are simply summed to the input embeddings in order to *inject* the positional information.

The decoder shares several similarities with the encoder.

The decoder also consists of a stack of $N$ = 6 identical layers that are, each, composed of three sublayers:

- The first sublayer receives the previous output of the decoder stack, augments it with positional information, and implements multi-head self-attention over it. While the encoder is designed to attend to all words in the input sequence, *regardless* of their position in the sequence, the decoder is modified to attend *only* to the preceding words. Hence, the prediction for a word at position, $i$, can only depend on the known outputs for the words that come before it in the sequence. In the multi-head attention mechanism (which implements multiple, single attention functions in parallel), this is achieved by introducing a mask over the values produced by the scaled multiplication of matrices $\mathbf{Q}$ and $\mathbf{K}$. This masking is implemented by suppressing the matrix values that would, otherwise, correspond to illegal connections:

$$
\text{mask}(\mathbf{QK}^T) =
\text{mask} \left( \begin{bmatrix}
e_{11} & e_{12} & \dots & e_{1n} \\
e_{21} & e_{22} & \dots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn} \\
\end{bmatrix} \right) =
\begin{bmatrix}
e_{11} & -\infty & \dots & -\infty \\
e_{21} & e_{22} & \dots & -\infty \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn} \\
\end{bmatrix}
$$

The masking makes the decoder unidirectional (unlike the bidirectional encoder).

–Advanced Deep Learning with Python, 2019.

- The second sublayer implements a multi-head attention mechanism, similar to the one implemented in the first sublayer of the encoder. On the decoder side, this multi-head mechanism receives the queries from the previous decoder sublayer, and the keys and values from the output of the encoder. This allows the decoder to attend to all of the words in the input sequence.

- The third sublayer implements a fully connected feed-forward network, similar to the one implemented in the second sublayer of the encoder.
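The look-ahead masking described for the first sublayer can be sketched in NumPy (my own illustration, not from the paper):

```python
import numpy as np

# Entries corresponding to illegal connections (a position attending to
# later positions) are set to -inf, so they get zero weight after softmax.
scores = np.ones((4, 4))                            # stand-in for the Q K^T scores
illegal = np.triu(np.ones((4, 4), dtype=bool), k=1) # strictly upper triangle
scores[illegal] = -np.inf
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights[0])  # [1. 0. 0. 0.] -- the first position attends only to itself
```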

Furthermore, the three sublayers on the decoder side also have residual connections around them, and are succeeded by a normalization layer.

Positional encodings are also added to the input embeddings of the decoder, in the same manner as previously explained for the encoder.

The Transformer model runs as follows:

- Each word forming an input sequence is transformed into a $d_{\text{model}}$-dimensional embedding vector.

- Each embedding vector representing an input word is augmented by summing it (element-wise) to a positional encoding vector of the same $d_{\text{model}}$ length, hence introducing positional information into the input.

- The augmented embedding vectors are fed into the encoder block, consisting of the two sublayers explained above. Since the encoder attends to all words in the input sequence, irrespective of whether they precede or succeed the word under consideration, the Transformer encoder is *bidirectional*.

- The decoder receives as input its own predicted output word at time-step $t - 1$.

- The input to the decoder is also augmented by positional encoding, in the same manner as this is done on the encoder side.

- The augmented decoder input is fed into the three sublayers comprising the decoder block explained above. Masking is applied in the first sublayer, in order to stop the decoder from attending to succeeding words. At the second sublayer, the decoder also receives the output of the encoder, which now allows the decoder to attend to all of the words in the input sequence.

- The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.

Vaswani et al. (2017) explain that their motivation for abandoning the use of recurrence and convolutions was based on several factors:

- Self-attention layers were found to be faster than recurrent layers for shorter sequence lengths, and can be restricted to consider only a neighbourhood in the input sequence for very long sequence lengths.

- The number of sequential operations required by a recurrent layer is based upon the sequence length, whereas this number remains constant for a self-attention layer.

- In convolutional neural networks, the kernel width directly affects the long-term dependencies that can be established between pairs of input and output positions. Tracking long-term dependencies would require the use of large kernels, or stacks of convolutional layers that could increase the computational cost.

This section provides more resources on the topic if you are looking to go deeper.

- Attention Is All You Need, 2017.

In this tutorial, you discovered the network architecture of the Transformer model.

Specifically, you learned:

- How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions.
- How the Transformer encoder and decoder work.
- How the Transformer self-attention compares to recurrent and convolutional layers.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.


The post The Transformer Attention Mechanism appeared first on Machine Learning Mastery.

We will first focus on the Transformer attention mechanism in this tutorial and subsequently review the Transformer model in a separate one.

In this tutorial, you will discover the Transformer attention mechanism for neural machine translation.

After completing this tutorial, you will know:

- How the Transformer attention differed from its predecessors.
- How the Transformer computes scaled dot-product attention.
- How the Transformer computes multi-head attention.

Let’s get started.

This tutorial is divided into two parts; they are:

- Introduction to the Transformer Attention
- The Transformer Attention
  - Scaled Dot-Product Attention
  - Multi-Head Attention

For this tutorial, we assume that you are already familiar with:

- The concept of attention
- The attention mechanism
- The Bahdanau attention mechanism
- The Luong attention mechanism

We have, thus far, familiarised ourselves with the use of an attention mechanism in conjunction with an RNN-based encoder-decoder architecture. We have seen that two of the most popular models that implement attention in this manner have been those proposed by Bahdanau et al. (2014) and Luong et al. (2015).

The Transformer architecture revolutionized the use of attention by dispensing with recurrence and convolutions, on which the former had extensively relied.

… the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

–Attention Is All You Need, 2017.

In their paper, Attention Is All You Need, Vaswani et al. (2017) explain that the Transformer model, alternatively, relies solely on the use of self-attention, where the representation of a sequence (or sentence) is computed by relating different words in the same sequence.

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

–Attention Is All You Need, 2017.

**The Transformer Attention**

The main components in use by the Transformer attention are the following:

- $\mathbf{q}$ and $\mathbf{k}$ denoting vectors of dimension, $d_k$, containing the queries and keys, respectively.
- $\mathbf{v}$ denoting a vector of dimension, $d_v$, containing the values.
- $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ denoting matrices packing together sets of queries, keys and values, respectively.
- $\mathbf{W}^Q$, $\mathbf{W}^K$ and $\mathbf{W}^V$ denoting projection matrices that are used in generating different subspace representations of the query, key and value matrices.
- $\mathbf{W}^O$ denoting a projection matrix for the multi-head output.

In essence, the attention function can be considered as a mapping between a query and a set of key-value pairs, to an output.

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

–Attention Is All You Need, 2017.

Vaswani et al. propose a *scaled dot-product attention*, and then build on it to propose *multi-head attention*. Within the context of neural machine translation, the queries, keys, and values that are used as inputs to these attention mechanisms are different projections of the same input sentence.

Intuitively, therefore, the proposed attention mechanisms implement self-attention by capturing the relationships between the different elements (in this case, the words) of the same sentence.

The Transformer implements a scaled dot-product attention, which follows the procedure of the general attention mechanism that we had previously seen.

As the name suggests, the scaled dot-product attention first computes a *dot product* for each query, $\mathbf{q}$, with all of the keys, $\mathbf{k}$. It, subsequently, divides each result by $\sqrt{d_k}$ and proceeds to apply a softmax function. In doing so, it obtains the weights that are used to *scale* the values, $\mathbf{v}$.

In practice, the computations performed by the scaled dot-product attention can be efficiently applied on the entire set of queries simultaneously. In order to do so, the matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$, are supplied as inputs to the attention function:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V}$$

Vaswani et al. explain that their scaled dot-product attention is identical to the multiplicative attention of Luong et al. (2015), except for the added scaling factor of $\tfrac{1}{\sqrt{d_k}}$.

This scaling factor was introduced to counteract the effect of having the dot products grow large in magnitude for large values of $d_k$, where the application of the softmax function would then return extremely small gradients that would lead to the infamous vanishing gradients problem. The scaling factor, therefore, serves to pull the results generated by the dot product multiplication down, hence preventing this problem.
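This effect is easy to see numerically. The following is a minimal NumPy sketch (illustrative only; the dimension $d_k$ and the number of keys are arbitrary choices): dot products of random $d_k$-dimensional vectors grow like $\sqrt{d_k}$, so an unscaled softmax collapses toward a near one-hot distribution, while the scaled version stays smoother:

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 512

q = rng.standard_normal(d_k)            # one query
K_mat = rng.standard_normal((10, d_k))  # ten keys
scores = K_mat @ q                      # raw dot products, std ~ sqrt(d_k)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

w_raw = softmax(scores)                    # close to one-hot
w_scaled = softmax(scores / np.sqrt(d_k))  # noticeably flatter
print(w_raw.max(), w_scaled.max())
```

Dividing the scores by a constant greater than 1 strictly flattens the softmax output, which is exactly the behavior the scaling factor is meant to produce.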

Vaswani et al. further explain that their choice of multiplicative attention over the additive attention of Bahdanau et al. (2014) was based on the computational efficiency associated with the former.

… dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

–Attention Is All You Need, 2017.

The step-by-step procedure for computing the scaled-dot product attention is, therefore, the following:

- Compute the alignment scores by multiplying the set of queries packed in matrix, $\mathbf{Q}$, with the keys in matrix, $\mathbf{K}$. If matrix, $\mathbf{Q}$, is of size $m \times d_k$ and matrix, $\mathbf{K}$, is of size, $n \times d_k$, then the resulting matrix will be of size $m \times n$:

$$
\mathbf{Q}\mathbf{K}^T =
\begin{bmatrix}
e_{11} & e_{12} & \dots & e_{1n} \\
e_{21} & e_{22} & \dots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{bmatrix}
$$

- Scale each of the alignment scores by $\tfrac{1}{\sqrt{d_k}}$:

$$
\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} =
\begin{bmatrix}
\tfrac{e_{11}}{\sqrt{d_k}} & \tfrac{e_{12}}{\sqrt{d_k}} & \dots & \tfrac{e_{1n}}{\sqrt{d_k}} \\
\tfrac{e_{21}}{\sqrt{d_k}} & \tfrac{e_{22}}{\sqrt{d_k}} & \dots & \tfrac{e_{2n}}{\sqrt{d_k}} \\
\vdots & \vdots & \ddots & \vdots \\
\tfrac{e_{m1}}{\sqrt{d_k}} & \tfrac{e_{m2}}{\sqrt{d_k}} & \dots & \tfrac{e_{mn}}{\sqrt{d_k}}
\end{bmatrix}
$$

- Follow the scaling step by applying a softmax operation to each row, in order to obtain a set of weights:

$$
\text{softmax} \left( \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} \right) =
\begin{bmatrix}
\text{softmax} \left( \tfrac{e_{11}}{\sqrt{d_k}}, \tfrac{e_{12}}{\sqrt{d_k}}, \dots, \tfrac{e_{1n}}{\sqrt{d_k}} \right) \\
\text{softmax} \left( \tfrac{e_{21}}{\sqrt{d_k}}, \tfrac{e_{22}}{\sqrt{d_k}}, \dots, \tfrac{e_{2n}}{\sqrt{d_k}} \right) \\
\vdots \\
\text{softmax} \left( \tfrac{e_{m1}}{\sqrt{d_k}}, \tfrac{e_{m2}}{\sqrt{d_k}}, \dots, \tfrac{e_{mn}}{\sqrt{d_k}} \right)
\end{bmatrix}
$$

- Finally, apply the resulting weights to the values in matrix, $\mathbf{V}$, of size, $n \times d_v$:

$$
\begin{aligned}
& \text{softmax} \left( \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} \right) \cdot \mathbf{V} \\
=&
\begin{bmatrix}
\text{softmax} \left( \tfrac{e_{11}}{\sqrt{d_k}}, \tfrac{e_{12}}{\sqrt{d_k}}, \dots, \tfrac{e_{1n}}{\sqrt{d_k}} \right) \\
\text{softmax} \left( \tfrac{e_{21}}{\sqrt{d_k}}, \tfrac{e_{22}}{\sqrt{d_k}}, \dots, \tfrac{e_{2n}}{\sqrt{d_k}} \right) \\
\vdots \\
\text{softmax} \left( \tfrac{e_{m1}}{\sqrt{d_k}}, \tfrac{e_{m2}}{\sqrt{d_k}}, \dots, \tfrac{e_{mn}}{\sqrt{d_k}} \right)
\end{bmatrix}
\cdot
\begin{bmatrix}
v_{11} & v_{12} & \dots & v_{1d_v} \\
v_{21} & v_{22} & \dots & v_{2d_v} \\
\vdots & \vdots & \ddots & \vdots \\
v_{n1} & v_{n2} & \dots & v_{nd_v}
\end{bmatrix}
\end{aligned}
$$
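The four steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not library code from any framework, and the matrix sizes are arbitrary example values:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (m, n) scaled alignment scores
    weights = softmax(scores, axis=-1)  # row-wise softmax
    return weights @ V                  # (m, d_v) weighted sum of values

rng = np.random.default_rng(0)
m, n, d_k, d_v = 4, 6, 8, 5             # arbitrary example sizes
Q = rng.standard_normal((m, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)
```

The output has one row per query and $d_v$ columns, matching the $m \times d_v$ result of the matrix product above.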

Building on their single attention function that takes matrices, $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, as input, as we have just reviewed, Vaswani et al. also propose a multi-head attention mechanism.

Their multi-head attention mechanism linearly projects the queries, keys and values $h$ times, each time using a different learned projection. The single attention mechanism is then applied to each of these $h$ projections in parallel, to produce $h$ outputs, which in turn are concatenated and projected again to produce a final result.

The idea behind multi-head attention is to allow the attention function to extract information from different representation subspaces, which would, otherwise, not be possible with a single attention head.

The multi-head attention function can be represented as follows:

$$\text{multihead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{concat}(\text{head}_1, \dots, \text{head}_h) \mathbf{W}^O$$

Here, each $\text{head}_i$, $i = 1, \dots, h$, implements a single attention function characterized by its own learned projection matrices:

$$\text{head}_i = \text{attention}(\mathbf{QW}^Q_i, \mathbf{KW}^K_i, \mathbf{VW}^V_i)$$

The step-by-step procedure for computing multi-head attention is, therefore, the following:

- Compute the linearly projected versions of the queries, keys and values through a multiplication with the respective weight matrices, $\mathbf{W}^Q_i$, $\mathbf{W}^K_i$ and $\mathbf{W}^V_i$, one for each $\text{head}_i$.

- Apply the single attention function for each head by (1) multiplying the queries and keys matrices, (2) applying the scaling and softmax operations, and (3) weighting the values matrix, to generate an output for each head.

- Concatenate the outputs of the heads, $\text{head}_i$, $i = 1, \dots, h$.

- Apply a linear projection to the concatenated output through a multiplication with the weight matrix, $\mathbf{W}^O$, to generate the final result.
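The multi-head procedure above can be sketched by reusing a single attention function per head. Again, this is a NumPy illustration with arbitrary example sizes; the head count and all weight matrices are placeholder assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Single scaled dot-product attention
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # One learned projection triple (W_q[i], W_k[i], W_v[i]) per head
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(len(W_q))]
    # Concatenate the head outputs and apply the final projection W_o
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
h, m, n, d_model, d_k, d_v = 2, 4, 6, 16, 8, 8  # arbitrary example sizes
Q = rng.standard_normal((m, d_model))
K = rng.standard_normal((n, d_model))
V = rng.standard_normal((n, d_model))
W_q = rng.standard_normal((h, d_model, d_k))
W_k = rng.standard_normal((h, d_model, d_k))
W_v = rng.standard_normal((h, d_model, d_v))
W_o = rng.standard_normal((h * d_v, d_model))
out = multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o)
print(out.shape)
```

Each head works in its own $d_k$/$d_v$-dimensional subspace, and $\mathbf{W}^O$ maps the concatenation back to the model dimension.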

This section provides more resources on the topic if you are looking to go deeper.

- Attention Is All You Need, 2017.
- Neural Machine Translation by Jointly Learning to Align and Translate, 2014.
- Effective Approaches to Attention-based Neural Machine Translation, 2015.

In this tutorial, you discovered the Transformer attention mechanism for neural machine translation.

Specifically, you learned:

- How the Transformer attention differed from its predecessors.
- How the Transformer computes a scaled-dot product attention.
- How the Transformer computes multi-head attention.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post The Transformer Attention Mechanism appeared first on Machine Learning Mastery.


In this tutorial, you will discover the Luong attention mechanism for neural machine translation.

After completing this tutorial, you will know:

- The operations performed by the Luong attention algorithm.
- How the global and local attentional models work.
- How the Luong attention compares to the Bahdanau attention.

Let’s get started.

This tutorial is divided into five parts; they are:

- Introduction to the Luong Attention
- The Luong Attention Algorithm
- The Global Attentional Model
- The Local Attentional Model
- Comparison to the Bahdanau Attention

For this tutorial, we assume that you are already familiar with:

Luong et al. (2015) draw inspiration from previous attention models to propose two attention mechanisms:

In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time.

–Effective Approaches to Attention-based Neural Machine Translation, 2015.

The *global* attentional model resembles the model of Bahdanau et al. (2014) in attending to *all* source words, but aims to simplify it architecturally.

The *local* attentional model is inspired from the hard and soft attention models of Xu et al. (2016), and attends to *only a few* of the source positions.

The two attentional models share many of the steps in their prediction of the current word, but differ mainly in their computation of the context vector.

Let’s first take a look at the overarching Luong attention algorithm, and then delve into the differences between the global and local attentional models afterwards.

The attention algorithm of Luong et al. performs the following operations:

- The encoder generates a set of annotations, $H = \mathbf{h}_i, i = 1, \dots, T$, from the input sentence.

- The current decoder hidden state is computed as: $\mathbf{s}_t = \text{RNN}_\text{decoder}(\mathbf{s}_{t-1}, y_{t-1})$. Here, $\mathbf{s}_{t-1}$ denotes the previous hidden decoder state, and $y_{t-1}$ the previous decoder output.

- An alignment model, $a(.)$, uses the annotations and the current decoder hidden state to compute the alignment scores: $e_{t,i} = a(\mathbf{s}_t, \mathbf{h}_i)$.

- A softmax function is applied to the alignment scores, effectively normalizing them into weight values in a range between 0 and 1: $\alpha_{t,i} = \text{softmax}(e_{t,i})$.

- These weights together with the previously computed annotations are used to generate a context vector through a weighted sum of the annotations: $\mathbf{c}_t = \sum^T_{i=1} \alpha_{t,i} \mathbf{h}_i$.

- An attentional hidden state is computed based on a weighted concatenation of the context vector and the current decoder hidden state: $\widetilde{\mathbf{s}}_t = \tanh(\mathbf{W_c} [\mathbf{c}_t \; ; \; \mathbf{s}_t])$.

- The decoder produces a final output by feeding it a weighted attentional hidden state: $y_t = \text{softmax}(\mathbf{W}_y \widetilde{\mathbf{s}}_t)$.

- Steps 2-7 are repeated until the end of the sequence.
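Steps 3 to 6 of the algorithm above (scoring, softmax, context vector, and attentional hidden state) can be sketched in NumPy. This sketch uses the multiplicative "general" score from the global attentional model described below; all names, weights, and dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_attention_step(s_t, H, W_a, W_c):
    # e_{t,i} = s_t^T W_a h_i  (the 'general' multiplicative score)
    scores = H @ (W_a.T @ s_t)
    alpha = softmax(scores)          # attention weights
    c_t = alpha @ H                  # context vector: weighted sum of annotations
    # Attentional hidden state from the concatenated context and decoder state
    s_tilde = np.tanh(W_c @ np.concatenate([c_t, s_t]))
    return s_tilde, alpha

rng = np.random.default_rng(7)
T, d = 5, 4                          # source length and hidden size (arbitrary)
H = rng.standard_normal((T, d))      # encoder annotations h_i
s_t = rng.standard_normal(d)         # current decoder hidden state
W_a = rng.standard_normal((d, d))    # placeholder trainable weights
W_c = rng.standard_normal((d, 2 * d))
s_tilde, alpha = luong_attention_step(s_t, H, W_a, W_c)
print(s_tilde.shape, alpha.sum())
```

The attention weights sum to 1, and the attentional hidden state $\widetilde{\mathbf{s}}_t$ has the same size as the decoder state.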

The global attentional model considers all of the source words in the input sentence when generating the alignment scores and, eventually, when computing the context vector.

The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector, $\mathbf{c}_t$.

–Effective Approaches to Attention-based Neural Machine Translation, 2015.

In order to do so, Luong et al. propose three alternative approaches for computing the alignment scores. The first approach is similar to Bahdanau’s and is based upon the concatenation of $\mathbf{s}_t$ and $\mathbf{h}_i$, while the second and third approaches implement *multiplicative* attention (in contrast to Bahdanau’s *additive* attention):

- $$a(\mathbf{s}_t, \mathbf{h}_i) = \mathbf{v}_a^T \tanh(\mathbf{W}_a [\mathbf{s}_t \; ; \; \mathbf{h}_i])$$

- $$a(\mathbf{s}_t, \mathbf{h}_i) = \mathbf{s}^T_t \mathbf{h}_i$$

- $$a(\mathbf{s}_t, \mathbf{h}_i) = \mathbf{s}^T_t \mathbf{W}_a \mathbf{h}_i$$

Here, $\mathbf{W}_a$ is a trainable weight matrix and, similarly, $\mathbf{v}_a$ is a weight vector.
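As an illustration, the three score functions can be computed side by side for a single decoder state and annotation. All weights here are random placeholders, not learned parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4                                  # hidden size (arbitrary)
s_t = rng.standard_normal(d)           # decoder hidden state
h_i = rng.standard_normal(d)           # one encoder annotation
v_a = rng.standard_normal(d)           # placeholder weight vector
W_a = rng.standard_normal((d, 2 * d))  # for the concat variant
W_m = rng.standard_normal((d, d))      # for the 'general' variant

# 1. concat (additive, Bahdanau-style)
score_concat = v_a @ np.tanh(W_a @ np.concatenate([s_t, h_i]))
# 2. dot (multiplicative)
score_dot = s_t @ h_i
# 3. general (multiplicative, with a learned matrix)
score_general = s_t @ W_m @ h_i
print(score_concat, score_dot, score_general)
```

Each variant produces a single scalar score per (decoder state, annotation) pair.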

Intuitively, the use of the dot product in *multiplicative* attention can be interpreted as providing a similarity measure between the vectors, $\mathbf{s}_t$ and $\mathbf{h}_i$, under consideration.

… if the vectors are similar (that is, aligned), the result of the multiplication will be a large value and the attention will be focused on the current $t,i$ relationship.

–Advanced Deep Learning with Python, 2019.

The resulting alignment vector, $\mathbf{e}_t$, is of variable-length according to the number of source words.

In attending to all source words, the global attentional model is computationally expensive and could potentially become impractical for translating longer sentences.

The local attentional model seeks to address these limitations by focusing on a smaller subset of the source words to generate each target word. In order to do so, it takes inspiration from the *hard* and *soft* attention models of the image caption generation work of Xu et al. (2016):

- *Soft* attention is equivalent to the global attention approach, where weights are softly placed over all the source image patches. Hence, soft attention considers the source image in its entirety.

- *Hard* attention attends to a single image patch at a time.

The local attentional model of Luong et al. generates a context vector by computing a weighted average over the set of annotations, $\mathbf{h}_i$, within a window centered over an aligned position, $p_t$:

$$[p_t - D, p_t + D]$$

While a value for $D$ is selected empirically, Luong et al. consider two approaches in computing a value for $p_t$:

- *Monotonic* alignment: where the source and target sentences are assumed to be monotonically aligned and, hence, $p_t = t$.

- *Predictive* alignment: where a prediction of the aligned position is based upon trainable model parameters, $\mathbf{W}_p$ and $\mathbf{v}_p$, and the source sentence length, $S$:

$$p_t = S \cdot \text{sigmoid}(\mathbf{v}^T_p \tanh(\mathbf{W}_p \mathbf{s}_t))$$

To favour source words nearer to the window centre, a Gaussian distribution is centered around $p_t$ when computing the alignment weights.

This time round, the resulting alignment vector, $\mathbf{e}_t$, has a fixed length of $2D + 1$.
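A rough NumPy sketch of the local attentional model with predictive alignment might look as follows. The dot-product score, the window size $D$, and all weights are illustrative assumptions, and the Gaussian uses $\sigma = D/2$ as in the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(5)
S, d, D = 10, 4, 2                 # source length, hidden size, half-window (arbitrary)
H = rng.standard_normal((S, d))    # encoder annotations
s_t = rng.standard_normal(d)       # current decoder hidden state
W_p = rng.standard_normal((d, d))  # placeholder trainable parameters
v_p = rng.standard_normal(d)

# Predictive alignment: p_t lies in [0, S]
p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ s_t))))

# Keep only the positions inside the window [p_t - D, p_t + D]
pos = np.arange(S)
in_window = np.abs(pos - p_t) <= D
scores = H[in_window] @ s_t        # dot-product alignment scores
alpha = softmax(scores)
# Favour positions near the window centre with a Gaussian, sigma = D / 2
sigma = D / 2.0
alpha = alpha * np.exp(-((pos[in_window] - p_t) ** 2) / (2 * sigma ** 2))
c_t = alpha @ H[in_window]         # context vector
print(c_t.shape)
```

Only the $2D + 1$ (or fewer, at sentence boundaries) positions inside the window contribute to the context vector.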

The Bahdanau model and the global attention approach of Luong et al. are mostly similar, but there are key differences between the two:

While our global attention approach is similar in spirit to the model proposed by Bahdanau et al. (2015), there are several key differences which reflect how we have both simplified and generalized from the original model.

–Effective Approaches to Attention-based Neural Machine Translation, 2015.

- Most notably, the computation of the alignment scores, $e_t$, in the Luong global attentional model depends on the current decoder hidden state, $\mathbf{s}_t$, rather than on the previous hidden state, $\mathbf{s}_{t-1}$, as in the Bahdanau attention.

- Luong et al. drop the bidirectional encoder in use by the Bahdanau model, and instead utilize the hidden states at the top LSTM layers for both encoder and decoder.

- The global attentional model of Luong et al. investigates the use of multiplicative attention, as an alternative to the Bahdanau additive attention.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the Luong attention mechanism for neural machine translation.

Specifically, you learned:

- The operations performed by the Luong attention algorithm.
- How the global and local attentional models work.
- How the Luong attention compares to the Bahdanau attention.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post The Luong Attention Mechanism appeared first on Machine Learning Mastery.


The Bahdanau attention was proposed to address the performance bottleneck of conventional encoder-decoder architectures, achieving significant improvements over the conventional approach.

In this tutorial, you will discover the Bahdanau attention mechanism for neural machine translation.

After completing this tutorial, you will know:

- Where the Bahdanau attention derives its name from, and the challenge it addresses.
- The role of the different components that form part of the Bahdanau encoder-decoder architecture.
- The operations performed by the Bahdanau attention algorithm.

Let’s get started.

This tutorial is divided into two parts; they are:

- Introduction to the Bahdanau Attention
- The Bahdanau Architecture
- The Encoder
- The Decoder
- The Bahdanau Attention Algorithm

For this tutorial, we assume that you are already familiar with:

- Recurrent Neural Networks (RNNs)
- The encoder-decoder RNN architecture
- The concept of attention
- The attention mechanism

The Bahdanau attention mechanism has inherited its name from the first author of the paper in which it was published.

It follows the work of Cho et al. (2014) and Sutskever et al. (2014), who had also employed an RNN encoder-decoder framework for neural machine translation, specifically by encoding a variable-length source sentence into a fixed-length vector. The latter would then be decoded into a variable-length target sentence.

Bahdanau et al. (2014) argue that this encoding of a variable-length input into a fixed-length vector *squashes* the information of the source sentence, irrespective of its length, causing the performance of a basic encoder-decoder model to deteriorate rapidly with an increasing length of the input sentence. The approach they propose, on the other hand, replaces the fixed-length vector with a variable-length one, to improve the translation performance of the basic encoder-decoder model.

The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.

–Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

The main components in use by the Bahdanau encoder-decoder architecture are the following:

- $\mathbf{s}_{t-1}$ is the *hidden decoder state* at the previous time step, $t-1$.
- $\mathbf{c}_t$ is the *context vector* at time step, $t$. It is uniquely generated at each decoder step to generate a target word, $y_t$.
- $\mathbf{h}_i$ is an *annotation* that captures the information contained in the words forming the entire input sentence, $\{ x_1, x_2, \dots, x_T \}$, with strong focus around the $i$-th word out of $T$ total words.
- $\alpha_{t,i}$ is a *weight* value assigned to each annotation, $\mathbf{h}_i$, at the current time step, $t$.
- $e_{t,i}$ is an *attention score* generated by an alignment model, $a(.)$, that scores how well $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ match.

These components find their use at different stages of the Bahdanau architecture, which employs a bidirectional RNN as an encoder and an RNN decoder, with an attention mechanism in between:

The role of the encoder is to generate an annotation, $\mathbf{h}_i$, for every word, $x_i$, in an input sentence of length $T$ words.

For this purpose, Bahdanau et al. employ a bidirectional RNN, which reads the input sentence in the forward direction to produce a forward hidden state, $\overrightarrow{\mathbf{h}_i}$, and then reads the input sentence in the reverse direction to produce a backward hidden state, $\overleftarrow{\mathbf{h}_i}$. The annotation for some particular word, $x_i$, concatenates the two states:

$$\mathbf{h}_i = \left[ \overrightarrow{\mathbf{h}_i^T} \; ; \; \overleftarrow{\mathbf{h}_i^T} \right]^T$$

The idea behind generating each annotation in this manner was to capture a summary of both preceding and succeeding words.

In this way, the annotation $\mathbf{h}_i$ contains the summaries of both the preceding words and the following words.

–Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

The generated annotations are then passed to the decoder to generate the context vector.

The role of the decoder is to produce the target words by focusing on the most relevant information contained in the source sentence. For this purpose, it makes use of an attention mechanism.

Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

–Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

The decoder takes each annotation and feeds it to an alignment model, $a(.)$, together with the previous hidden decoder state, $\mathbf{s}_{t-1}$. This generates an attention score:

$$e_{t,i} = a(\mathbf{s}_{t-1}, \mathbf{h}_i)$$

The function implemented by the alignment model, here, combines $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ by means of an addition operation. For this reason, the attention mechanism implemented by Bahdanau et al. is referred to as *additive attention*.

This can be implemented in two ways, either (1) by applying a weight matrix, $\mathbf{W}$, over the concatenated vectors, $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$, or (2) by applying the weight matrices, $\mathbf{W}_1$ and $\mathbf{W}_2$, to $\mathbf{s}_{t-1}$ and $\mathbf{h}_i$ separately:

- $$a(\mathbf{s}_{t-1}, \mathbf{h}_i) = \mathbf{v}^T \tanh(\mathbf{W}[\mathbf{h}_i \; ; \; \mathbf{s}_{t-1}])$$
- $$a(\mathbf{s}_{t-1}, \mathbf{h}_i) = \mathbf{v}^T \tanh(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_{t-1})$$

Here, $\mathbf{v}$, is a weight vector.
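A quick numerical check (with random placeholder weights) confirms that the two formulations coincide when $\mathbf{W}$ is the horizontal concatenation of $\mathbf{W}_1$ and $\mathbf{W}_2$:

```python
import numpy as np

rng = np.random.default_rng(9)
d = 4                             # hidden size (arbitrary)
s_prev = rng.standard_normal(d)   # s_{t-1}
h_i = rng.standard_normal(d)      # one annotation
v = rng.standard_normal(d)        # placeholder weight vector
W1 = rng.standard_normal((d, d))  # placeholder weight matrices
W2 = rng.standard_normal((d, d))

# Form (2): separate weight matrices
e_separate = v @ np.tanh(W1 @ h_i + W2 @ s_prev)
# Form (1): a single matrix over the concatenation, with W = [W1 | W2]
W = np.concatenate([W1, W2], axis=1)
e_concat = v @ np.tanh(W @ np.concatenate([h_i, s_prev]))
print(np.isclose(e_separate, e_concat))  # True
```

The two ways of writing the additive score are therefore just different parameterizations of the same function.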

The alignment model is parametrized as a feedforward neural network, and jointly trained with the remaining system components.

Subsequently, a softmax function is applied to each attention score to obtain the corresponding weight value:

$$\alpha_{t,i} = \text{softmax}(e_{t,i})$$

The application of the softmax function essentially normalizes the attention scores to a range between 0 and 1 and, hence, the resulting weights can be considered as probability values. Each probability (or weight) value reflects how important $\mathbf{h}_i$ and $\mathbf{s}_{t-1}$ are in generating the next state, $\mathbf{s}_t$, and the next output, $y_t$.

Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector.

–Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

This is finally followed by the computation of the context vector as a weighted sum of the annotations:

$$\mathbf{c}_t = \sum^T_{i=1} \alpha_{t,i} \mathbf{h}_i$$

In summary, the attention algorithm proposed by Bahdanau et al. performs the following operations:

- The encoder generates a set of annotations, $\mathbf{h}_i$, from the input sentence.
- These annotations are fed to an alignment model together with the previous hidden decoder state. The alignment model uses this information to generate the attention scores, $e_{t,i}$.
- A softmax function is applied to the attention scores, effectively normalizing them into weight values, $\alpha_{t,i}$, in a range between 0 and 1.
- These weights together with the previously computed annotations are used to generate a context vector, $\mathbf{c}_t$, through a weighted sum of the annotations.
- The context vector is fed to the decoder together with the previous hidden decoder state and the previous output, to compute the final output, $y_t$.
- Steps 2-6 are repeated until the end of the sequence.
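The context-vector computation at one decoder step (steps 2 to 4 above) can be sketched in NumPy, using the second form of the additive score. All names, weights, and sizes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention_step(s_prev, H, W1, W2, v):
    # e_{t,i} = v^T tanh(W1 h_i + W2 s_{t-1}), computed for all i at once
    scores = np.tanh(H @ W1.T + s_prev @ W2.T) @ v
    alpha = softmax(scores)        # weights alpha_{t,i}
    c_t = alpha @ H                # context vector: weighted sum of annotations
    return c_t, alpha

rng = np.random.default_rng(11)
T, d = 6, 4                        # source length and hidden size (arbitrary)
H = rng.standard_normal((T, d))    # annotations h_i
s_prev = rng.standard_normal(d)    # previous decoder hidden state s_{t-1}
W1 = rng.standard_normal((d, d))   # placeholder trainable parameters
W2 = rng.standard_normal((d, d))
v = rng.standard_normal(d)
c_t, alpha = bahdanau_attention_step(s_prev, H, W1, W2, v)
print(c_t.shape, alpha.sum())
```

In a full decoder, $\mathbf{c}_t$ would then be fed to the RNN together with $\mathbf{s}_{t-1}$ and the previous output to produce $y_t$.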

Bahdanau et al. tested their architecture on the task of English-to-French translation and reported that their model outperformed the conventional encoder-decoder model significantly, irrespective of the sentence length.

Several improvements over the Bahdanau attention have been proposed since, such as those of Luong et al. (2015), which we shall review in a separate tutorial.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the Bahdanau attention mechanism for neural machine translation.

Specifically, you learned:

- Where the Bahdanau attention derives its name from, and the challenge it addresses.
- The role of the different components that form part of the Bahdanau encoder-decoder architecture.
- The operations performed by the Bahdanau attention algorithm.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post The Bahdanau Attention Mechanism appeared first on Machine Learning Mastery.

The post Adding A Custom Attention Layer To Recurrent Neural Network In Keras appeared first on Machine Learning Mastery.

This tutorial shows how to add a custom attention layer to a network built using a recurrent neural network. We’ll illustrate an end-to-end application of time series forecasting using a very simple dataset. The tutorial is designed for anyone looking for a basic understanding of how to add user-defined layers to a deep learning network and use this simple example to build more complex applications.

After completing this tutorial, you will know:

- Which methods are required to create a custom attention layer in Keras
- How to incorporate the new layer in a network built with SimpleRNN

Let’s get started.

This tutorial is divided into three parts; they are:

- Preparing a simple dataset for time series forecasting
- How to use a network built via SimpleRNN for time series forecasting
- Adding a custom attention layer to the SimpleRNN network

It is assumed that you are familiar with the following topics. You can click the links below for an overview.

- What is Attention?
- The attention mechanism from scratch
- An introduction to RNN and the math that powers them
- Understanding simple recurrent neural networks in Keras

The focus of this article is to gain a basic understanding of how to add a custom attention layer to a deep learning network. For this purpose, we’ll use a very simple example of a Fibonacci sequence, where each number is constructed from the previous two numbers. The first 10 numbers of the sequence are shown below:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …

When given the previous ‘t’ numbers, can we get a machine to accurately reconstruct the next number? This would mean discarding all the previous inputs except the last two and performing the correct operation on the last two numbers.

For this tutorial, we’ll construct the training examples from `t` time steps and use the value at `t+1` as the target. For example, if `t=3`, then the training examples and the corresponding target values would look as follows:
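As a quick illustration (this helper is not part of the tutorial’s code; the `get_fib_XY()` function below does the same job more generally), the windows and targets for `t=3` over the start of the unscaled sequence can be generated as:

```python
import numpy as np

# Start of the (unscaled) Fibonacci sequence used in this tutorial
seq = np.array([1., 2., 3., 5., 8., 13., 21., 34.])
t = 3
X = np.array([seq[i:i + t] for i in range(len(seq) - t)])  # sliding windows
Y = seq[t:]                                                # next value as target
for x, y in zip(X, Y):
    print(x, '->', y)
```

Each row of `X` holds `t` consecutive values, and the corresponding entry of `Y` holds the value that immediately follows.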

In this section, we’ll write the basic code to generate the dataset and use a SimpleRNN network for predicting the next number of the Fibonacci sequence.

Let’s first write the import section:

```python
from pandas import read_csv
import numpy as np
from keras import Model
from keras.layers import Layer
import keras.backend as K
from keras.layers import Input, Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.metrics import mean_squared_error
```

The following function generates a sequence of n Fibonacci numbers (not counting the starting two values). If `scale_data` is set to True, then it would also use the `MinMaxScaler` from scikit-learn to scale the values between 0 and 1. Let’s see its output for `n=10`.

```python
def get_fib_seq(n, scale_data=True):
    # Get the Fibonacci sequence
    seq = np.zeros(n)
    fib_n1 = 0.0
    fib_n = 1.0
    for i in range(n):
        seq[i] = fib_n1 + fib_n
        fib_n1 = fib_n
        fib_n = seq[i]
    scaler = []
    if scale_data:
        scaler = MinMaxScaler(feature_range=(0, 1))
        seq = np.reshape(seq, (n, 1))
        seq = scaler.fit_transform(seq).flatten()
    return seq, scaler

fib_seq = get_fib_seq(10, False)[0]
print(fib_seq)
```

[ 1. 2. 3. 5. 8. 13. 21. 34. 55. 89.]

Next, we need a function `get_fib_XY()` that reformats the sequence into training examples and target values to be used by the Keras input layer. When given `time_steps` as a parameter, `get_fib_XY()` constructs each row of the dataset with `time_steps` number of columns. This function not only constructs the training set and test set from the Fibonacci sequence, but also shuffles the training examples and reshapes them to the required TensorFlow format, i.e., `total_samples x time_steps x features`. Also, the function returns the `scaler` object that scales the values if `scale_data` is set to `True`.

Let’s generate a small training set to see what it looks like. We have set `time_steps=3` and `total_fib_numbers=12`, with approximately 70% of the examples going towards the training points. Note the training and test examples have been shuffled by the `permutation()` function.

```python
def get_fib_XY(total_fib_numbers, time_steps, train_percent, scale_data=True):
    dat, scaler = get_fib_seq(total_fib_numbers, scale_data)
    Y_ind = np.arange(time_steps, len(dat), 1)
    Y = dat[Y_ind]
    rows_x = len(Y)
    X = dat[0:rows_x]
    for i in range(time_steps-1):
        temp = dat[i+1:rows_x+i+1]
        X = np.column_stack((X, temp))
    # random permutation with fixed seed
    rand = np.random.RandomState(seed=13)
    idx = rand.permutation(rows_x)
    split = int(train_percent*rows_x)
    train_ind = idx[0:split]
    test_ind = idx[split:]
    trainX = X[train_ind]
    trainY = Y[train_ind]
    testX = X[test_ind]
    testY = Y[test_ind]
    trainX = np.reshape(trainX, (len(trainX), time_steps, 1))
    testX = np.reshape(testX, (len(testX), time_steps, 1))
    return trainX, trainY, testX, testY, scaler

trainX, trainY, testX, testY, scaler = get_fib_XY(12, 3, 0.7, False)
print('trainX = ', trainX)
print('trainY = ', trainY)
```

```
trainX =  [[[ 8.]
  [13.]
  [21.]]

 [[ 5.]
  [ 8.]
  [13.]]

 [[ 2.]
  [ 3.]
  [ 5.]]

 [[13.]
  [21.]
  [34.]]

 [[21.]
  [34.]
  [55.]]

 [[34.]
  [55.]
  [89.]]]
trainY =  [ 34.  21.   8.  55.  89. 144.]
```

Now let’s set up a small network with two layers, the first one being a `SimpleRNN` layer and the second one a `Dense` layer. Below is a summary of the model.

```python
# Set up parameters
time_steps = 20
hidden_units = 2
epochs = 30

# Create a traditional RNN network
def create_RNN(hidden_units, dense_units, input_shape, activation):
    model = Sequential()
    model.add(SimpleRNN(hidden_units, input_shape=input_shape, activation=activation[0]))
    model.add(Dense(units=dense_units, activation=activation[1]))
    model.compile(loss='mse', optimizer='adam')
    return model

model_RNN = create_RNN(hidden_units=hidden_units, dense_units=1,
                       input_shape=(time_steps,1), activation=['tanh', 'tanh'])
model_RNN.summary()
```

```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
simple_rnn_3 (SimpleRNN)     (None, 2)                 8
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 3
=================================================================
Total params: 11
Trainable params: 11
Non-trainable params: 0
```

The next step is to add code that generates a dataset, trains the network, and evaluates it. This time around, we’ll scale the data between 0 and 1. We don’t need to pass the `scale_data` parameter as its default value is `True`.

```python
# Generate the dataset
trainX, trainY, testX, testY, scaler = get_fib_XY(1200, time_steps, 0.7)

model_RNN.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

# Evaluate model
train_mse = model_RNN.evaluate(trainX, trainY)
test_mse = model_RNN.evaluate(testX, testY)

# Print error
print("Train set MSE = ", train_mse)
print("Test set MSE = ", test_mse)
```

As output, you’ll see the progress of training and the following values of mean squared error:

```
Train set MSE =  5.631405292660929e-05
Test set MSE =  2.623497312015388e-05
```

In Keras, it is easy to create a custom layer that implements attention by subclassing the `Layer` class. The Keras guide lists down clear steps for creating a new layer via subclassing. We’ll use those guidelines here. All the weights and biases corresponding to a single layer are encapsulated by this class. We need to write the `__init__` method as well as override the following methods:

- `build()`: The Keras guide recommends adding weights in this method once the size of the inputs is known. This method ‘lazily’ creates weights. The built-in function `add_weight()` can be used to add the weights and biases of the attention layer.
- `call()`: The `call()` method implements the mapping of inputs to outputs. It should implement the forward pass during training.

The `call()` method of the attention layer has to compute the alignment scores, weights, and context. You can go through the details of these parameters in Stefania’s excellent article on The Attention Mechanism from Scratch. We’ll implement Bahdanau attention in our `call()` method.
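Before turning to Keras, it may help to see these three quantities computed directly. The following is a minimal NumPy sketch (with made-up shapes and random values, not the tutorial’s trained weights) of the alignment scores, attention weights, and context vector that `call()` will produce:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative shapes: 4 time steps of 3-dimensional RNN hidden states
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # hidden states, one row per time step
W = rng.normal(size=(3, 1))   # attention weight: maps a state to a scalar
b = np.zeros((4, 1))          # attention bias, one per time step

e = np.tanh(x @ W + b).squeeze(-1)          # alignment scores, shape (4,)
alpha = softmax(e)                          # attention weights, sum to 1
context = (alpha[:, None] * x).sum(axis=0)  # weighted sum of hidden states

print(alpha.sum())    # ~1.0
print(context.shape)  # (3,)
```

This is exactly the sequence of operations the `call()` method below performs with Keras backend functions.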

The good thing about inheriting a layer from the Keras `Layer` class and adding the weights via the `add_weight()` method is that the weights are automatically tuned. Keras does an equivalent of ‘reverse engineering’ the operations/computations of the `call()` method and calculates the gradients during training. It is important to specify `trainable=True` when adding the weights. You can also add a `train_step()` method to your custom layer and specify your own method for weight training if needed.

The code below implements our custom attention layer.

```python
# Add attention layer to the deep learning network
class attention(Layer):
    def __init__(self, **kwargs):
        super(attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(name='attention_weight',
                                 shape=(input_shape[-1], 1),
                                 initializer='random_normal',
                                 trainable=True)
        self.b = self.add_weight(name='attention_bias',
                                 shape=(input_shape[1], 1),
                                 initializer='zeros',
                                 trainable=True)
        super(attention, self).build(input_shape)

    def call(self, x):
        # Alignment scores. Pass them through tanh function
        e = K.tanh(K.dot(x, self.W) + self.b)
        # Remove dimension of size 1
        e = K.squeeze(e, axis=-1)
        # Compute the weights
        alpha = K.softmax(e)
        # Reshape to tensorFlow format
        alpha = K.expand_dims(alpha, axis=-1)
        # Compute the context vector
        context = x * alpha
        context = K.sum(context, axis=1)
        return context
```

Let’s now add an attention layer to the RNN network we created earlier. The function `create_RNN_with_attention()` specifies an RNN layer, an attention layer, and a Dense layer in the network. Make sure to set `return_sequences=True` when specifying the SimpleRNN; this returns the output of the hidden units for all time steps.

Let’s look at a summary of our model with attention.

```python
def create_RNN_with_attention(hidden_units, dense_units, input_shape, activation):
    x = Input(shape=input_shape)
    RNN_layer = SimpleRNN(hidden_units, return_sequences=True,
                          activation=activation)(x)
    attention_layer = attention()(RNN_layer)
    outputs = Dense(dense_units, trainable=True, activation=activation)(attention_layer)
    model = Model(x, outputs)
    model.compile(loss='mse', optimizer='adam')
    return model

model_attention = create_RNN_with_attention(hidden_units=hidden_units, dense_units=1,
                                            input_shape=(time_steps, 1), activation='tanh')
model_attention.summary()
```

```
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 20, 1)]           0
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 20, 2)             8
_________________________________________________________________
attention_1 (attention)      (None, 2)                 22
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 3
=================================================================
Total params: 33
Trainable params: 33
Non-trainable params: 0
_________________________________________________________________
```

It’s time to train and test our model and see how it performs on predicting the next Fibonacci number of a sequence.

```python
model_attention.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

# Evaluate model
train_mse_attn = model_attention.evaluate(trainX, trainY)
test_mse_attn = model_attention.evaluate(testX, testY)

# Print error
print("Train set MSE with attention = ", train_mse_attn)
print("Test set MSE with attention = ", test_mse_attn)
```

You’ll see the training progress as output and the following:

```
Train set MSE with attention =  5.3511179430643097e-05
Test set MSE with attention =  9.053358553501312e-06
```

We can see that even for this simple example, the mean square error on the test set is lower with the attention layer. You can achieve better results with hyperparameter tuning and model selection. Do try this out on more complex problems, and try adding more layers to the network. You can also use the `scaler` object to scale the numbers back to their original values.
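The `scaler` returned by `get_fib_XY()` is a scikit-learn `MinMaxScaler`, whose `inverse_transform()` undoes the [0, 1] scaling. A minimal NumPy stand-in (using a few hard-coded Fibonacci values rather than the tutorial’s dataset) illustrates what that inversion does:

```python
import numpy as np

# Hypothetical stand-in for the fitted MinMaxScaler: scaling to [0, 1]
# stores the data minimum and range, and the inverse undoes it
data = np.array([1.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0])  # a few Fibonacci numbers
lo, span = data.min(), data.max() - data.min()

scaled = (data - lo) / span        # what scaler.fit_transform produces
restored = scaled * span + lo      # what scaler.inverse_transform does

print(np.allclose(restored, data))  # True
```

In the tutorial's code, the equivalent call would be `scaler.inverse_transform()` on the (reshaped) predictions.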

You can take this example one step further by using an LSTM instead of a SimpleRNN, or you can build a network with convolution and pooling layers. You can also change this to an encoder-decoder network if you like.

The entire code for this tutorial is pasted below if you would like to try it. Note that your outputs will differ from the ones given in this tutorial because of the stochastic nature of this algorithm.

```python
from pandas import read_csv
import numpy as np
from keras import Model
from keras.layers import Layer
import keras.backend as K
from keras.layers import Input, Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.metrics import mean_squared_error

# Prepare data
def get_fib_seq(n, scale_data=True):
    # Get the Fibonacci sequence
    seq = np.zeros(n)
    fib_n1 = 0.0
    fib_n = 1.0
    for i in range(n):
        seq[i] = fib_n1 + fib_n
        fib_n1 = fib_n
        fib_n = seq[i]
    scaler = []
    if scale_data:
        scaler = MinMaxScaler(feature_range=(0, 1))
        seq = np.reshape(seq, (n, 1))
        seq = scaler.fit_transform(seq).flatten()
    return seq, scaler

def get_fib_XY(total_fib_numbers, time_steps, train_percent, scale_data=True):
    dat, scaler = get_fib_seq(total_fib_numbers, scale_data)
    Y_ind = np.arange(time_steps, len(dat), 1)
    Y = dat[Y_ind]
    rows_x = len(Y)
    X = dat[0:rows_x]
    for i in range(time_steps-1):
        temp = dat[i+1:rows_x+i+1]
        X = np.column_stack((X, temp))
    # Random permutation with fixed seed
    rand = np.random.RandomState(seed=13)
    idx = rand.permutation(rows_x)
    split = int(train_percent*rows_x)
    train_ind = idx[0:split]
    test_ind = idx[split:]
    trainX = X[train_ind]
    trainY = Y[train_ind]
    testX = X[test_ind]
    testY = Y[test_ind]
    trainX = np.reshape(trainX, (len(trainX), time_steps, 1))
    testX = np.reshape(testX, (len(testX), time_steps, 1))
    return trainX, trainY, testX, testY, scaler

# Set up parameters
time_steps = 20
hidden_units = 2
epochs = 30

# Create a traditional RNN network
def create_RNN(hidden_units, dense_units, input_shape, activation):
    model = Sequential()
    model.add(SimpleRNN(hidden_units, input_shape=input_shape, activation=activation[0]))
    model.add(Dense(units=dense_units, activation=activation[1]))
    model.compile(loss='mse', optimizer='adam')
    return model

model_RNN = create_RNN(hidden_units=hidden_units, dense_units=1,
                       input_shape=(time_steps, 1), activation=['tanh', 'tanh'])

# Generate the dataset for the network
trainX, trainY, testX, testY, scaler = get_fib_XY(1200, time_steps, 0.7)

# Train the network
model_RNN.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

# Evaluate model
train_mse = model_RNN.evaluate(trainX, trainY)
test_mse = model_RNN.evaluate(testX, testY)

# Print error
print("Train set MSE = ", train_mse)
print("Test set MSE = ", test_mse)

# Add attention layer to the deep learning network
class attention(Layer):
    def __init__(self, **kwargs):
        super(attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(name='attention_weight',
                                 shape=(input_shape[-1], 1),
                                 initializer='random_normal',
                                 trainable=True)
        self.b = self.add_weight(name='attention_bias',
                                 shape=(input_shape[1], 1),
                                 initializer='zeros',
                                 trainable=True)
        super(attention, self).build(input_shape)

    def call(self, x):
        # Alignment scores. Pass them through tanh function
        e = K.tanh(K.dot(x, self.W) + self.b)
        # Remove dimension of size 1
        e = K.squeeze(e, axis=-1)
        # Compute the weights
        alpha = K.softmax(e)
        # Reshape to tensorFlow format
        alpha = K.expand_dims(alpha, axis=-1)
        # Compute the context vector
        context = x * alpha
        context = K.sum(context, axis=1)
        return context

def create_RNN_with_attention(hidden_units, dense_units, input_shape, activation):
    x = Input(shape=input_shape)
    RNN_layer = SimpleRNN(hidden_units, return_sequences=True,
                          activation=activation)(x)
    attention_layer = attention()(RNN_layer)
    outputs = Dense(dense_units, trainable=True, activation=activation)(attention_layer)
    model = Model(x, outputs)
    model.compile(loss='mse', optimizer='adam')
    return model

# Create the model with attention, train and evaluate
model_attention = create_RNN_with_attention(hidden_units=hidden_units, dense_units=1,
                                            input_shape=(time_steps, 1), activation='tanh')
model_attention.summary()
model_attention.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

# Evaluate model
train_mse_attn = model_attention.evaluate(trainX, trainY)
test_mse_attn = model_attention.evaluate(testX, testY)

# Print error
print("Train set MSE with attention = ", train_mse_attn)
print("Test set MSE with attention = ", test_mse_attn)
```

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning Essentials, by Wei Di, Anurag Bhardwaj and Jianing Wei.
- Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

- A Tour of Recurrent Neural Network Algorithms for Deep Learning.
- What is Attention?
- The attention mechanism from scratch.
- An introduction to RNN and the math that powers them.
- Understanding simple recurrent neural networks in Keras.
- How to Develop an Encoder-Decoder Model with Attention in Keras

In this tutorial, you discovered how to add a custom attention layer to a deep learning network using Keras.

Specifically, you learned:

- How to override the Keras `Layer` class.
- The method `build()` is required to add weights to the attention layer.
- The method `call()` is required for specifying the mapping of inputs to outputs of the attention layer.
- How to add a custom attention layer to the deep learning network built using SimpleRNN.

Do you have any questions about RNNs discussed in this post? Ask your questions in the comments below and I will do my best to answer.

The post Adding A Custom Attention Layer To Recurrent Neural Network In Keras appeared first on Machine Learning Mastery.

]]>The post A Tour of Attention-Based Architectures appeared first on Machine Learning Mastery.

]]>In this tutorial, you will discover the salient neural architectures that have been used in conjunction with attention.

After completing this tutorial, you will gain a better understanding of how the attention mechanism is incorporated into different neural architectures and for which purpose.

Let’s get started.

This tutorial is divided into four parts; they are:

- The Encoder-Decoder Architecture
- The Transformer
- Graph Neural Networks
- Memory-Augmented Neural Networks

The encoder-decoder architecture has been extensively applied to sequence-to-sequence (seq2seq) tasks for language processing. Examples of such tasks, within the domain of language processing, include machine translation and image captioning.

The earliest use of attention was as part of an RNN-based encoder-decoder framework to encode long input sentences [Bahdanau et al. 2015]. Consequently, attention has been most widely used with this architecture.

Within the context of machine translation, such a seq2seq task would involve the translation of an input sequence, $I = \{ A, B, C, <EOS> \}$, into an output sequence, $O = \{ W, X, Y, Z, <EOS> \}$, of a different length.

For an RNN-based encoder-decoder architecture *without* attention, unrolling each RNN would produce the following graph:

Here, the encoder reads the input sequence one word at a time, each time updating its internal state. It stops when it encounters the <EOS> symbol, which signals that the *end of sequence* has been reached. The hidden state generated by the encoder essentially contains a vector representation of the input sequence, which will then be processed by the decoder.

The decoder generates the output sequence one word at a time, taking the word generated at the previous time step, $t-1$, as input to generate the next word in the output sequence. An <EOS> symbol at the decoding side signals that the decoding process has ended.

As we have previously mentioned, the problem with the encoder-decoder architecture without attention arises when sequences of different length and complexity are represented by a fixed-length vector, potentially resulting in the decoder missing important information.

To circumvent this problem, an attention-based architecture introduces an attention mechanism between the encoder and decoder.

Here, the attention mechanism ($\phi$) learns a set of attention weights that capture the relationship between the encoded vectors (v) and the hidden state of the decoder (h), to generate a context vector (c) through a weighted sum of all the hidden states of the encoder. In doing so, the decoder would have access to the entire input sequence, with specific focus on the input information that is most relevant for generating the output.
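As a toy illustration of the mechanism just described (illustrative shapes and random values, with a simple dot-product alignment standing in for the learned score function $\phi$):

```python
import numpy as np

# 5 encoded vectors v of dimension 4, and one decoder hidden state h
rng = np.random.default_rng(1)
v = rng.normal(size=(5, 4))   # encoder hidden states, one per input word
h = rng.normal(size=(4,))     # current decoder hidden state

scores = v @ h                                 # alignment of h with each v
alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights (softmax)
c = alpha @ v                                  # context: weighted sum of encoder states

print(alpha.shape, c.shape)  # (5,) (4,)
```

The context vector `c` changes at every decoding step because `h` changes, which is how the decoder shifts its focus across the input.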

The architecture of the transformer also implements an encoder and decoder, however, as opposed to the architectures that we have reviewed above, it does not rely on the use of recurrent neural networks. For this reason, we shall be reviewing this architecture and its variants separately.

The transformer architecture dispenses with recurrence entirely and instead relies solely on a *self-attention* (or intra-attention) mechanism.

In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d …– Advanced Deep Learning with Python, 2019.

The self-attention mechanism relies on the use of *queries*, *keys* and *values*, which are generated by multiplying the encoder’s representation of the same input sequence with different weight matrices. The transformer uses dot-product (or *multiplicative*) attention, where each query is matched against a database of keys by a dot-product operation in the process of generating the attention weights. These weights are then applied to the values to generate a final attention vector.
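A minimal NumPy sketch of scaled dot-product attention (illustrative shapes and random values; the weight matrices here are hypothetical, not from a trained model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # weights = softmax(Q K^T / sqrt(d_k)); output = weights applied to V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(2)
d_k = d_v = 4
X = rng.normal(size=(3, 8))  # encoder representation: 3 elements of dimension 8

# queries, keys and values come from the SAME input, via different projections
Wq, Wk, Wv = (rng.normal(size=(8, d)) for d in (d_k, d_k, d_v))
out, weights = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)

print(out.shape)  # (3, 4)
```

Each row of `weights` sums to 1 and tells us how much each element of the sequence attends to every other element.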

Intuitively, since all queries, keys and values originate from the same input sequence, the self-attention mechanism captures the relationship between the different elements of the same sequence, highlighting those that are most relevant to one another.

Since the transformer does not rely on RNNs, the positional information of each element in the sequence can be preserved by augmenting the encoder’s representation of each element with positional encoding. This means that the transformer architecture may also be applied to tasks where the information may not necessarily be related sequentially, such as for the computer vision tasks of image classification, segmentation or captioning.
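As a concrete illustration, the sinusoidal positional encoding used by the original transformer can be sketched in a few lines of NumPy (the dimensions below are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model, n=10000):
    # PE[pos, 2i]   = sin(pos / n^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / n^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / n ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
print(pe.shape)  # (50, 8)
print(pe[0])     # position 0 encodes as [0. 1. 0. 1. 0. 1. 0. 1.]
```

These vectors are simply added to the representation of each sequence element, giving each position a distinct, bounded signature.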

Transformers can capture global/long range dependencies between input and output, support parallel processing, require minimal inductive biases (prior knowledge), demonstrate scalability to large sequences and datasets, and allow domain-agnostic processing of multiple modalities (text, images, speech) using similar processing blocks.

Furthermore, several attention layers can be stacked in parallel in what has been termed as *multi-head attention*. Each head works in parallel over different linear transformations of the same input, and the outputs of the heads are then concatenated to produce the final attention result. The benefit of having a multi-head model is that each head can attend to different elements of the sequence.
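A minimal sketch of this idea (random, untrained projections; dot-product attention per head, outputs concatenated):

```python
import numpy as np

def attention(Q, K, V):
    # single-head scaled dot-product attention
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(3)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(5, d_model))  # 5 sequence elements

heads = []
for _ in range(n_heads):
    # each head works over its own linear transformations of the same input
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))

out = np.concatenate(heads, axis=-1)  # concatenate head outputs
print(out.shape)  # (5, 8)
```

In the actual transformer, a final learned projection is applied after the concatenation; it is omitted here for brevity.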

Some variants of the transformer architecture that address limitations of the vanilla model, are:

- Transformer-XL: Introduces recurrence so that it can learn longer-term dependency beyond the fixed length of the fragmented sequences that are typically used during training.

- XLNet: A bidirectional transformer that builds on Transformer-XL by introducing a permutation-based mechanism, where training is carried out not only on the original order of the elements comprising the input sequence but also over different permutations of the input sequence order.

A graph can be defined as a set of *nodes* (or vertices) that are linked by means of *connections* (or edges).

A graph is a versatile data structure that lends itself well to the way data is organized in many real-world scenarios.– Advanced Deep Learning with Python, 2019.

Take, for example, a social network where users can be represented by nodes in a graph, and their relationships with friends by edges. Or a molecule, where the nodes would be the atoms, and the edges would represent the chemical bonds between them.

We can think of an image as a graph, where each pixel is a node, directly connected to its neighboring pixels …– Advanced Deep Learning with Python, 2019.

Of particular interest are the *Graph Attention Networks* (GAT) that employ a self-attention mechanism within a graph convolutional network (GCN), where the latter updates the state vectors by performing a convolution over the nodes of the graph. The convolution operation is applied to the central node and the neighboring nodes by means of a weighted filter, to update the representation of the central node. The filter weights in a GCN can be fixed or learnable.

A GAT, in comparison, assigns weights to the neighboring nodes using attention scores.

The computation of these attention scores follows a similar procedure as in the methods for seq2seq tasks reviewed above: (1) alignment scores are first computed between the feature vectors of two neighboring nodes, from which (2) attention scores are computed by applying a softmax operation, and finally (3) an output feature vector for each node (equivalent to the context vector in a seq2seq task) can be computed by a weighted combination of the feature vectors of all neighbors.
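These three steps can be sketched for a single node and its neighbours. Note that a real GAT uses a shared linear transformation and a learned scoring function (with a LeakyReLU); the simple dot-product score below is a stand-in for illustration:

```python
import numpy as np

# Toy neighbourhood: node 0 plus two neighbours, 4-dimensional features
rng = np.random.default_rng(4)
feats = rng.normal(size=(3, 4))

# (1) alignment scores between node 0 and each neighbour (including itself)
scores = feats @ feats[0]
# (2) attention scores via a softmax over the neighbourhood
alpha = np.exp(scores) / np.exp(scores).sum()
# (3) new output feature vector for node 0: weighted combination of neighbours
h0_new = alpha @ feats

print(alpha.sum())   # ~1.0
print(h0_new.shape)  # (4,)
```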

Multi-head attention can be applied here too, in a very similar manner as to how it was proposed in the transformer architecture that we have previously seen. Each node in the graph would be assigned multiple heads, and their outputs averaged in the final layer.

Once the final output has been produced, it can be used as input to a subsequent task-specific layer. Tasks that can be solved by graphs include the classification of individual nodes between different groups (for example, predicting which of several clubs a person will decide to become a member of); the classification of individual edges to determine whether an edge exists between two nodes (for example, predicting whether two people in a social network might be friends); or even the classification of a full graph (for example, predicting whether a molecule is toxic).

In the encoder-decoder attention-based architectures that we have reviewed so far, the set of vectors that encode the input sequence can be considered as external memory, to which the encoder writes and from which the decoder reads. However, a limitation arises because the encoder can only write to this memory, and the decoder can only read.

Memory-Augmented Neural Networks (MANNs) are recent algorithms that aim to address this limitation.

The Neural Turing Machine (NTM) is one type of MANN. It consists of a neural network controller that takes an input to produce an output, and performs read and write operations to memory.

The operation performed by the read head is similar to the attention mechanism employed for seq2seq tasks, where an attention weight indicates the importance of the vector under consideration in forming the output.

A read head always reads the full memory matrix, but it does so by attending to different memory vectors with different intensities.– Advanced Deep Learning with Python, 2019.

The output of a read operation is then defined by a weighted sum of the memory vectors.

The write head also makes use of an attention vector, together with an erase and add vectors. A memory location is erased based on the values in the attention and erase vectors, and information is written via the add vector.
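A toy NumPy sketch of these read and write operations (random memory contents and weights, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
memory = rng.normal(size=(6, 4))  # 6 memory locations, vectors of size 4

# Read: attend over all locations; output is the weighted sum of memory rows
w = np.exp(rng.normal(size=6))
w /= w.sum()                      # attention (addressing) weights, sum to 1
read_vector = w @ memory

# Write: erase each location in proportion to the attention and erase vectors,
# then add new information via the add vector
erase = rng.uniform(size=4)       # values in [0, 1]
add = rng.normal(size=4)
memory = memory * (1 - np.outer(w, erase)) + np.outer(w, add)

print(read_vector.shape, memory.shape)  # (4,) (6, 4)
```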

Examples of applications for MANNs include question answering and chat bots, where an external memory stores a large database of sequences (or facts) that the neural network taps into. The role of the attention mechanism is crucial in selecting facts from the database that are more relevant than others for the task at hand.

This section provides more resources on the topic if you are looking to go deeper.

- Advanced Deep Learning with Python, 2019.
- Deep Learning Essentials, 2018.

- An Attentive Survey of Attention Models, 2021.
- Attention in Psychology, Neuroscience, and Machine Learning, 2020.

In this tutorial, you discovered the salient neural architectures that have been used in conjunction with attention.

Specifically, you gained a better understanding of how the attention mechanism is incorporated into different neural architectures and for which purpose.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Tour of Attention-Based Architectures appeared first on Machine Learning Mastery.

]]>The post Understanding Simple Recurrent Neural Networks In Keras appeared first on Machine Learning Mastery.

]]>After completing this tutorial, you will know:

- The structure of RNN
- How RNN computes the output when given an input
- How to prepare data for a SimpleRNN in Keras
- How to train a SimpleRNN model

Let’s get started.

This tutorial is divided into two parts; they are:

- The structure of the RNN
- Different weights and biases associated with different layers of the RNN.
- How computations are performed to compute the output when given an input.

- A complete application for time series prediction.

It is assumed that you have a basic understanding of RNNs before you start implementing them. An Introduction To Recurrent Neural Networks And The Math That Powers Them gives you a quick overview of RNNs.

Let’s now get right down to the implementation part.

To start the implementation of RNNs, let’s add the import section.

```python
from pandas import read_csv
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import math
import matplotlib.pyplot as plt
```

The function below returns a model that includes a `SimpleRNN` layer and a `Dense` layer for learning sequential data. The `input_shape` parameter specifies `(time_steps, features)`. We’ll simplify everything and use univariate data, i.e., one feature only; the time steps are discussed below.

```python
def create_RNN(hidden_units, dense_units, input_shape, activation):
    model = Sequential()
    model.add(SimpleRNN(hidden_units, input_shape=input_shape, activation=activation[0]))
    model.add(Dense(units=dense_units, activation=activation[1]))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

demo_model = create_RNN(2, 1, (3,1), activation=['linear', 'linear'])
```

The object `demo_model` is returned with 2 hidden units created via the `SimpleRNN` layer and 1 dense unit created via the `Dense` layer. The `input_shape` is set to 3×1, and a `linear` activation function is used in both layers for simplicity. Just to recall, the linear activation function $f(x) = x$ makes no change to the input. The network looks as follows:

If we have $m$ hidden units ($m=2$ in the above case), then:

- Input: $x \in R$
- Hidden unit: $h \in R^m$
- Weights for input units: $w_x \in R^m$
- Weights for hidden units: $w_h \in R^{m \times m}$
- Bias for hidden units: $b_h \in R^m$
- Weight for the dense layer: $w_y \in R^m$
- Bias for the dense layer: $b_y \in R$

Let’s look at the above weights. Note: As the weights are initialized randomly, the results pasted here will be different from yours. The important thing is to learn what the structure of each object being used looks like and how it interacts with others to produce the final output.

```python
wx = demo_model.get_weights()[0]
wh = demo_model.get_weights()[1]
bh = demo_model.get_weights()[2]
wy = demo_model.get_weights()[3]
by = demo_model.get_weights()[4]

print('wx = ', wx, ' wh = ', wh, ' bh = ', bh, ' wy =', wy, 'by = ', by)
```

```
wx =  [[ 0.18662322 -1.2369459 ]]
wh =  [[ 0.86981213 -0.49338293]
       [ 0.49338293  0.8698122 ]]
bh =  [0. 0.]
wy =  [[-0.4635998]
       [ 0.6538409]]
by =  [0.]
```

Now let’s do a simple experiment to see how the layers from a SimpleRNN and Dense layer produce an output. Keep this figure in view.

We’ll input `x` for three time steps and let the network generate an output. The values of the hidden units at time steps 1, 2 and 3 will be computed; $h_0$ is initialized to the zero vector. The output $o_3$ is computed from $h_3$ and $w_y$. An activation function is not required as we are using linear units.

```python
x = np.array([1, 2, 3])
# Reshape the input to the required sample_size x time_steps x features
x_input = np.reshape(x, (1, 3, 1))
y_pred_model = demo_model.predict(x_input)

m = 2
h0 = np.zeros(m)
h1 = np.dot(x[0], wx) + h0 + bh
h2 = np.dot(x[1], wx) + np.dot(h1, wh) + bh
h3 = np.dot(x[2], wx) + np.dot(h2, wh) + bh
o3 = np.dot(h3, wy) + by

print('h1 = ', h1, 'h2 = ', h2, 'h3 = ', h3)
print("Prediction from network ", y_pred_model)
print("Prediction from our computation ", o3)
```

```
h1 =  [[ 0.18662322 -1.23694587]]
h2 =  [[-0.07471441 -3.64187904]]
h3 =  [[-1.30195881 -6.84172557]]
Prediction from network  [[-3.8698118]]
Prediction from our computation  [[-3.86981216]]
```

Now that we understand how the SimpleRNN and Dense layers are put together, let’s run a complete RNN on a simple time series dataset. We’ll need to follow these steps:

- Read the dataset from a given URL
- Split the data into training and test set
- Prepare the input to the required Keras format
- Create an RNN model and train it
- Make the predictions on training and test sets and print the root mean square error on both sets
- View the result

The following function reads the train and test data from a given URL and splits it into a given percentage of train and test data. It returns single-dimensional arrays for train and test data after scaling the data between 0 and 1 using `MinMaxScaler` from scikit-learn.

```python
# Parameter split_percent defines the ratio of training examples
def get_train_test(url, split_percent=0.8):
    df = read_csv(url, usecols=[1], engine='python')
    data = np.array(df.values.astype('float32'))
    scaler = MinMaxScaler(feature_range=(0, 1))
    data = scaler.fit_transform(data).flatten()
    n = len(data)
    # Point for splitting data into train and test
    split = int(n*split_percent)
    train_data = data[range(split)]
    test_data = data[split:]
    return train_data, test_data, data

sunspots_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-sunspots.csv'
train_data, test_data, data = get_train_test(sunspots_url)
```

The next step is to prepare the data for Keras model training. The input array should be shaped as `total_samples x time_steps x features`.

There are many ways of preparing time series data for training. We’ll create input rows with non-overlapping time steps. An example for time_steps = 2 is shown in the figure below. Here time_steps denotes the number of previous time steps to use for predicting the next value of the time series data.

The following function, `get_XY()`, takes a one-dimensional array as input and converts it to the required input `X` and target `Y` arrays. We’ll use 12 `time_steps` for the sunspots dataset, as the sunspots generally have a cycle of 12 months. You can experiment with other values of `time_steps`.

```python
# Prepare the input X and target Y
def get_XY(dat, time_steps):
    # Indices of target array
    Y_ind = np.arange(time_steps, len(dat), time_steps)
    Y = dat[Y_ind]
    # Prepare X
    rows_x = len(Y)
    X = dat[range(time_steps*rows_x)]
    X = np.reshape(X, (rows_x, time_steps, 1))
    return X, Y

time_steps = 12
trainX, trainY = get_XY(train_data, time_steps)
testX, testY = get_XY(test_data, time_steps)
```
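To see what this windowing produces, apply the same logic as `get_XY()` to a toy sequence of the numbers 0 through 9 with `time_steps = 2` (the case shown in the figure): each non-overlapping window of two values is paired with the value that immediately follows it.

```python
import numpy as np

dat = np.arange(10, dtype=float)  # toy stand-in for the scaled sunspots data
time_steps = 2

Y_ind = np.arange(time_steps, len(dat), time_steps)  # [2 4 6 8]
Y = dat[Y_ind]                                       # targets
rows_x = len(Y)
X = dat[range(time_steps * rows_x)].reshape(rows_x, time_steps, 1)

print(X[:, :, 0])  # [[0. 1.] [2. 3.] [4. 5.] [6. 7.]]
print(Y)           # [2. 4. 6. 8.]
```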

For this step, we can reuse the `create_RNN()` function that was defined above.

```python
model = create_RNN(hidden_units=3, dense_units=1, input_shape=(time_steps,1),
                   activation=['tanh', 'tanh'])
model.fit(trainX, trainY, epochs=20, batch_size=1, verbose=2)
```

The function `print_error()` computes the root mean square error (RMSE) between the actual and predicted values.

```python
def print_error(trainY, testY, train_predict, test_predict):
    # Error of predictions
    train_rmse = math.sqrt(mean_squared_error(trainY, train_predict))
    test_rmse = math.sqrt(mean_squared_error(testY, test_predict))
    # Print RMSE
    print('Train RMSE: %.3f RMSE' % (train_rmse))
    print('Test RMSE: %.3f RMSE' % (test_rmse))

# Make predictions
train_predict = model.predict(trainX)
test_predict = model.predict(testX)

# Root mean square error
print_error(trainY, testY, train_predict, test_predict)
```

```
Train RMSE: 0.058 RMSE
Test RMSE: 0.077 RMSE
```

The following function plots the actual target values and the predicted values. The red line separates the training and test data points.

```python
# Plot the result
def plot_result(trainY, testY, train_predict, test_predict):
    actual = np.append(trainY, testY)
    predictions = np.append(train_predict, test_predict)
    rows = len(actual)
    plt.figure(figsize=(15, 6), dpi=80)
    plt.plot(range(rows), actual)
    plt.plot(range(rows), predictions)
    plt.axvline(x=len(trainY), color='r')
    plt.legend(['Actual', 'Predictions'])
    plt.xlabel('Observation number after given time steps')
    plt.ylabel('Sunspots scaled')
    plt.title('Actual and Predicted Values. The Red Line Separates The Training And Test Examples')

plot_result(trainY, testY, train_predict, test_predict)
```

The following plot is generated:

Given below is the entire code for this tutorial. Do try this out at your end and experiment with different hidden units and time steps. You can add a second `SimpleRNN` layer to the network and see how it behaves. You can also use the `scaler` object to rescale the data back to its normal range.

```python
from pandas import read_csv
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import math
import matplotlib.pyplot as plt

# Parameter split_percent defines the ratio of training examples
def get_train_test(url, split_percent=0.8):
    df = read_csv(url, usecols=[1], engine='python')
    data = np.array(df.values.astype('float32'))
    scaler = MinMaxScaler(feature_range=(0, 1))
    data = scaler.fit_transform(data).flatten()
    n = len(data)
    # Point for splitting data into train and test
    split = int(n*split_percent)
    train_data = data[range(split)]
    test_data = data[split:]
    return train_data, test_data, data

# Prepare the input X and target Y
def get_XY(dat, time_steps):
    Y_ind = np.arange(time_steps, len(dat), time_steps)
    Y = dat[Y_ind]
    rows_x = len(Y)
    X = dat[range(time_steps*rows_x)]
    X = np.reshape(X, (rows_x, time_steps, 1))
    return X, Y

def create_RNN(hidden_units, dense_units, input_shape, activation):
    model = Sequential()
    model.add(SimpleRNN(hidden_units, input_shape=input_shape, activation=activation[0]))
    model.add(Dense(units=dense_units, activation=activation[1]))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

def print_error(trainY, testY, train_predict, test_predict):
    # Error of predictions
    train_rmse = math.sqrt(mean_squared_error(trainY, train_predict))
    test_rmse = math.sqrt(mean_squared_error(testY, test_predict))
    # Print RMSE
    print('Train RMSE: %.3f RMSE' % (train_rmse))
    print('Test RMSE: %.3f RMSE' % (test_rmse))

# Plot the result
def plot_result(trainY, testY, train_predict, test_predict):
    actual = np.append(trainY, testY)
    predictions = np.append(train_predict, test_predict)
    rows = len(actual)
    plt.figure(figsize=(15, 6), dpi=80)
    plt.plot(range(rows), actual)
    plt.plot(range(rows), predictions)
    plt.axvline(x=len(trainY), color='r')
    plt.legend(['Actual', 'Predictions'])
    plt.xlabel('Observation number after given time steps')
    plt.ylabel('Sunspots scaled')
    plt.title('Actual and Predicted Values. The Red Line Separates The Training And Test Examples')

sunspots_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-sunspots.csv'
time_steps = 12
train_data, test_data, data = get_train_test(sunspots_url)
trainX, trainY = get_XY(train_data, time_steps)
testX, testY = get_XY(test_data, time_steps)

# Create model and train
model = create_RNN(hidden_units=3, dense_units=1, input_shape=(time_steps,1),
                   activation=['tanh', 'tanh'])
model.fit(trainX, trainY, epochs=20, batch_size=1, verbose=2)

# Make predictions
train_predict = model.predict(trainX)
test_predict = model.predict(testX)

# Print error
print_error(trainY, testY, train_predict, test_predict)

# Plot result
plot_result(trainY, testY, train_predict, test_predict)
```

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning Essentials, by Wei Di, Anurag Bhardwaj and Jianing Wei.
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

- Wikipedia article on BPTT
- A Tour of Recurrent Neural Network Algorithms for Deep Learning
- A Gentle Introduction to Backpropagation Through Time
- How to Prepare Univariate Time Series Data for Long Short-Term Memory Networks

In this tutorial, you discovered simple recurrent neural networks and how to use them for time series prediction in Keras.

Specifically, you learned:

- The structure of RNNs
- How the RNN computes an output from previous inputs
- How to implement an end-to-end system for time series forecasting using an RNN

Do you have any questions about RNNs discussed in this post? Ask your questions in the comments below and I will do my best to answer.

The post Understanding Simple Recurrent Neural Networks In Keras appeared first on Machine Learning Mastery.

The post An Introduction To Recurrent Neural Networks And The Math That Powers Them appeared first on Machine Learning Mastery.

After completing this tutorial, you will know:

- Recurrent neural networks
- What is meant by unfolding an RNN
- How weights are updated in an RNN
- Various RNN architectures

Let’s get started.

This tutorial is divided into two parts; they are:

- The working of an RNN
- Unfolding in time
- Backpropagation through time algorithm

- Different RNN architectures and variants

For this tutorial, it is assumed that you are already familiar with artificial neural networks and the backpropagation algorithm. If not, you can go through the tutorial Calculus in Action: Neural Networks by Stefania Cristina, which also explains how the gradient-based backpropagation algorithm is used to train a neural network.

A recurrent neural network (RNN) is a special type of artificial neural network adapted to work with time series data or data that involves sequences. Ordinary feedforward neural networks are only meant for data points that are independent of each other. However, if we have data in a sequence such that one data point depends upon the previous data point, we need to modify the neural network to incorporate the dependencies between these data points. RNNs have the concept of ‘memory’ that helps them store the states or information of previous inputs to generate the next output of the sequence.

A simple RNN has a feedback loop, as shown in the first diagram of the above figure. The feedback loop shown in the gray rectangle can be unrolled for three time steps to produce the second network of the above figure. Of course, you can vary the architecture so that the network unrolls for $k$ time steps. In the figure, the following notation is used:

- $x_t \in R$ is the input at time step $t$. To keep things simple, we assume that $x_t$ is a scalar value with a single feature. You can extend this idea to a $d$-dimensional feature vector.
- $y_t \in R$ is the output of the network at time step $t$. The network can produce multiple outputs, but for this example we assume there is one output.
- $h_t \in R^m$ is the vector storing the values of the hidden units/states at time $t$. This is also called the current context. $m$ is the number of hidden units, and the $h_0$ vector is initialized to zero.
- $w_x \in R^{m}$ are the weights associated with inputs in the recurrent layer
- $w_h \in R^{m \times m}$ are the weights associated with hidden units in the recurrent layer
- $w_y \in R^m$ are the weights connecting the hidden units to the output
- $b_h \in R^m$ is the bias associated with the recurrent layer
- $b_y \in R$ is the bias associated with the feedforward layer

At every time step, we can unfold the network for $k$ time steps to get the output at time step $k+1$. The unfolded network is very similar to a feedforward neural network. The rectangle in the unfolded network shows an operation taking place. So, for example, with an activation function $f$:

$$h_{t+1} = f(x_t, h_t, w_x, w_h, b_h) = f(w_{x} x_t + w_{h} h_t + b_h)$$

The output $y$ at time $t$ is computed as:

$$y_{t} = f(h_t, w_y) = f(w_y \cdot h_t + b_y)$$

Here, $\cdot$ is the dot product.
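The feedforward pass described by these equations can be traced with a few lines of NumPy. The weight values below are made up purely for illustration, with $m = 2$ hidden units, tanh as the activation, and an identity activation on the output to keep the arithmetic easy to follow:

```python
import numpy as np

def rnn_forward(x_seq, w_x, w_h, b_h, w_y, b_y):
    h = np.zeros(w_h.shape[0])              # h_0 is the zero vector
    for x_t in x_seq:
        # h_{t+1} = f(w_x x_t + w_h h_t + b_h)
        h = np.tanh(w_x * x_t + w_h @ h + b_h)
    # y = w_y . h + b_y (identity output activation for simplicity)
    return w_y @ h + b_y

# Illustrative weights: m = 2 hidden units, scalar inputs, k = 3 time steps
w_x = np.array([0.5, -0.3])
w_h = np.array([[0.1, 0.2],
                [0.0, 0.4]])
b_h = np.zeros(2)
w_y = np.array([1.0, -1.0])
b_y = 0.0

y = rnn_forward([1.0, 0.5, -1.0], w_x, w_h, b_h, w_y, b_y)
print(y)
```

Note that the same `w_x`, `w_h`, and `b_h` are applied at every iteration of the loop, which is exactly what temporal weight sharing means.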

Hence, in the feedforward pass of an RNN, the network computes the values of the hidden units and the output after $k$ time steps. The weights associated with the network are shared temporally. Each recurrent layer has two sets of weights: one for the input and one for the hidden units. The last feedforward layer, which computes the final output for the $k$th time step, is just like an ordinary layer of a traditional feedforward network.

We can use any activation function we like in the recurrent neural network. Common choices are:

- Sigmoid function: $\frac{1}{1+e^{-x}}$
- Tanh function: $\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$
- ReLU function: $\max(0, x)$
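These three activation functions translate directly into code. A minimal NumPy sketch, writing each one out from its formula above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))   # values in (0, 1), sigmoid(0) = 0.5
print(tanh(x))      # values in (-1, 1), odd function
print(relu(x))      # negative inputs clipped to 0
```

In practice you would use `np.tanh` directly; the explicit version is shown only to match the formula.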

The backpropagation algorithm of an artificial neural network is modified to include the unfolding in time to train the weights of the network. This algorithm is based on computing the gradient vector and is called backpropagation through time, or the BPTT algorithm for short. The pseudo-code for training is given below. The value of $k$ can be selected by the user for training. In the pseudo-code below, $p_t$ is the target value at time step $t$:

- Repeat till the stopping criterion is met:
  - Set all $h$ to zero.
  - Repeat for $t = 0$ to $n - k$:
    - Forward propagate the network over the unfolded network for $k$ time steps to compute all $h$ and $y$.
    - Compute the error as $e = y_{t+k} - p_{t+k}$.
    - Backpropagate the error across the unfolded network and update the weights.
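The pseudo-code above can be sketched as a runnable toy. The sketch below uses a single hidden unit ($m = 1$) so every weight is a scalar, a made-up sine-wave dataset, and central finite differences in place of the analytic BPTT derivatives; it is meant to show the loop structure, not an efficient implementation:

```python
import numpy as np

# Toy data; theta packs the scalar weights (w_x, w_h, b_h, w_y, b_y)
data = np.sin(0.3 * np.arange(40))
k = 4                                          # unfold for k time steps
theta = np.array([0.1, 0.1, 0.0, 0.1, 0.0])

def forward(theta, x_seq):
    w_x, w_h, b_h, w_y, b_y = theta
    h = 0.0                                    # set h to zero
    for x_t in x_seq:                          # forward propagate k steps
        h = np.tanh(w_x * x_t + w_h * h + b_h)
    return w_y * h + b_y

def grad(theta, x_seq, target, eps=1e-6):
    # Approximate d(e^2)/d(theta) with finite differences; BPTT computes
    # the same gradient analytically through the unfolded network
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        ep = forward(theta + d, x_seq) - target
        em = forward(theta - d, x_seq) - target
        g[i] = (ep * ep - em * em) / (2 * eps)
    return g

def mean_abs_error(theta):
    return np.mean([abs(forward(theta, data[t:t + k]) - data[t + k])
                    for t in range(len(data) - k)])

before = mean_abs_error(theta)
for epoch in range(100):                       # repeat till stopping criterion
    for t in range(len(data) - k):             # repeat for t = 0 to n - k
        x_seq, target = data[t:t + k], data[t + k]
        theta -= 0.05 * grad(theta, x_seq, target)
print(before, mean_abs_error(theta))
```

The learning rate, epoch count, and dataset here are arbitrary choices for the demonstration; the point is that the error after training is lower than before.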

There are different types of recurrent neural networks with varying architectures. Some examples are:

In a one-to-one architecture, there is a single $(x_t, y_t)$ pair. Traditional neural networks employ a one-to-one architecture.

In one-to-many networks, a single input $x_t$ can produce multiple outputs, e.g., $(y_{t0}, y_{t1}, y_{t2})$. Music generation is an example area where one-to-many networks are employed.

In a many-to-one architecture, many inputs from different time steps produce a single output. For example, $(x_t, x_{t+1}, x_{t+2})$ can produce a single output $y_t$. Such networks are employed in sentiment analysis or emotion detection, where the class label depends upon a sequence of words.

There are many possibilities for many-to-many networks. An example is shown above, where two inputs produce three outputs. Many-to-many networks are applied in machine translation, e.g., English-to-French translation systems or vice versa.
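The difference between many-to-one and many-to-many comes down to which of the per-step outputs are kept. A minimal NumPy sketch with made-up weights (in Keras, the same switch is the `return_sequences` argument of the recurrent layers):

```python
import numpy as np

def rnn_outputs(x_seq, w_x, w_h, b_h, w_y, b_y, return_sequences=False):
    # Many-to-many: collect y_t at every step; many-to-one: keep only the last
    h = np.zeros(w_h.shape[0])
    ys = []
    for x_t in x_seq:
        h = np.tanh(w_x * x_t + w_h @ h + b_h)
        ys.append(w_y @ h + b_y)
    return np.array(ys) if return_sequences else ys[-1]

# Illustrative weights, m = 2 hidden units
w_x = np.array([0.5, -0.3])
w_h = np.eye(2) * 0.2
b_h = np.zeros(2)
w_y = np.array([1.0, -1.0])
b_y = 0.0
x_seq = [1.0, 0.5, -1.0]

y_last = rnn_outputs(x_seq, w_x, w_h, b_h, w_y, b_y)                        # many-to-one
y_all = rnn_outputs(x_seq, w_x, w_h, b_h, w_y, b_y, return_sequences=True)  # many-to-many
print(y_all.shape)  # (3,)
```

Both variants run the same recurrence; they only differ in whether the intermediate outputs are discarded.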

RNNs have various advantages such as:

- Ability to handle sequence data.
- Ability to handle inputs of varying lengths.
- Ability to store or ‘memorize’ historical information.

The disadvantages are:

- The computation can be very slow.
- The network does not take into account future inputs to make decisions.
- Vanishing gradient problem, where the gradients used to compute the weight update may get very close to zero, preventing the network from learning new weights. The deeper the network, the more pronounced this problem becomes.
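The vanishing gradient problem is easy to see numerically. Backpropagating through each step of the unfolded network multiplies the gradient by roughly $w_h \, f'(z)$; whenever that factor is below 1 in magnitude, the product shrinks geometrically with the number of steps. The values of $w_h$ and $z$ below are illustrative:

```python
import numpy as np

w_h = 0.9                                # illustrative hidden weight
z = 1.0                                  # illustrative pre-activation value
factor = w_h * (1.0 - np.tanh(z) ** 2)   # tanh'(z) = 1 - tanh(z)^2
for k in [5, 20, 50]:
    print(k, factor ** k)                # shrinks geometrically with k
```

After 50 steps the factor is vanishingly small, so the earliest inputs contribute almost nothing to the weight update.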

There are different variations of RNNs that are being applied practically in machine learning problems:

In a bidirectional recurrent neural network (BRNN), inputs from future time steps are used to improve the accuracy of the network. It is like having knowledge of the first and last words of a sentence to predict the middle words.

Gated recurrent units (GRUs) are designed to handle the vanishing gradient problem. They have a reset gate and an update gate. These gates determine which information is to be retained for future predictions.

Long short-term memory networks (LSTMs) were also designed to address the vanishing gradient problem in RNNs. An LSTM uses three gates, called the input, output, and forget gates. Similar to GRUs, these gates determine which information to retain.
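The gating idea can be made concrete with a single step of an LSTM cell. The sketch below follows the standard LSTM equations but uses scalar (one-unit) states and made-up weight values so the arithmetic stays visible:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One step of a scalar LSTM cell; W, U, b each hold four rows, one per gate
def lstm_step(x_t, h_prev, c_prev, W, U, b):
    i = sigmoid(W[0] * x_t + U[0] * h_prev + b[0])        # input gate
    f = sigmoid(W[1] * x_t + U[1] * h_prev + b[1])        # forget gate
    o = sigmoid(W[2] * x_t + U[2] * h_prev + b[2])        # output gate
    c_tilde = np.tanh(W[3] * x_t + U[3] * h_prev + b[3])  # candidate state
    c = f * c_prev + i * c_tilde   # forget gate keeps old state, input gate adds new
    h = o * np.tanh(c)             # output gate exposes the cell state
    return h, c

# Made-up weights; the large forget bias biases the cell toward retaining state
W = np.array([0.5, 0.5, 0.5, 0.5])
U = np.array([0.1, 0.1, 0.1, 0.1])
b = np.array([0.0, 2.0, 0.0, 0.0])
h, c = 0.0, 0.0
for x_t in [1.0, -0.5, 0.2]:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h, c)
```

Because the forget gate sits close to 1 here, the cell state `c` accumulates information additively rather than being squashed through a tanh at every step, which is what lets gradients survive over long sequences.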

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning Essentials, by Wei Di, Anurag Bhardwaj and Jianing Wei.
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

- Wikipedia article on BPTT
- A Tour of Recurrent Neural Network Algorithms for Deep Learning
- A Gentle Introduction to Backpropagation Through Time

In this tutorial, you discovered recurrent neural networks and their various architectures.

Specifically, you learned:

- How a recurrent neural network handles sequential data
- Unfolding in time in a recurrent neural network
- What backpropagation through time is
- Advantages and disadvantages of RNNs
- Various architectures and variants of RNN

Do you have any questions about RNNs discussed in this post? Ask your questions in the comments below and I will do my best to answer.

The post An Introduction To Recurrent Neural Networks And The Math That Powers Them appeared first on Machine Learning Mastery.
