The post Application of differentiations in neural networks appeared first on Machine Learning Mastery.

]]>Last Updated on November 26, 2021

Differential calculus is an important tool in machine learning algorithms. Neural networks in particular, the gradient descent algorithm depends on the gradient, which is a quantity computed by differentiation.

In this tutorial, we will see how the back-propagation technique is used in finding the gradients in neural networks.

After completing this tutorial, you will know

- What is a total differential and total derivative
- How to compute the total derivatives in neural networks
- How back-propagation helped in computing the total derivatives

Let’s get started

This tutorial is divided into 5 parts; they are:

- Total differential and total derivatives
- Algebraic representation of a multilayer perceptron model
- Finding the gradient by back-propagation
- Matrix form of gradient equations
- Implementing back-propagation

For a function such as $f(x)$, we call denote its derivative as $f'(x)$ or $\frac{df}{dx}$. But for a multivariate function, such as $f(u,v)$, we have a partial derivative of $f$ with respect to $u$ denoted as $\frac{\partial f}{\partial u}$, or sometimes written as $f_u$. A partial derivative is obtained by differentiation of $f$ with respect to $u$ while assuming the other variable $v$ is a constant. Therefore, we use $\partial$ instead of $d$ as the symbol for differentiation to signify the difference.

However, what if the $u$ and $v$ in $f(u,v)$ are both function of $x$? In other words, we can write $u(x)$ and $v(x)$ and $f(u(x), v(x))$. So $x$ determines the value of $u$ and $v$ and in turn, determines $f(u,v)$. In this case, it is perfectly fine to ask what is $\frac{df}{dx}$, as $f$ is eventually determined by $x$.

This is the concept of total derivatives. In fact, for a multivariate function $f(t,u,v)=f(t(x),u(x),v(x))$, we always have

$$

\frac{df}{dx} = \frac{\partial f}{\partial t}\frac{dt}{dx} + \frac{\partial f}{\partial u}\frac{du}{dx} + \frac{\partial f}{\partial v}\frac{dv}{dx}

$$

The above notation is called the total derivative because it is sum of the partial derivatives. In essence, it is applying chain rule to find the differentiation.

If we take away the $dx$ part in the above equation, what we get is an approximate change in $f$ with respect to $x$, i.e.,

$$

df = \frac{\partial f}{\partial t}dt + \frac{\partial f}{\partial u}du + \frac{\partial f}{\partial v}dv

$$

We call this notation the total differential.

Consider the network:

This is a simple, fully-connected, 4-layer neural network. Let’s call the input layer as layer 0, the two hidden layers the layer 1 and 2, and the output layer as layer 3. In this picture, we see that we have $n_0=3$ input units, and $n_1=4$ units in the first hidden layer and $n_2=2$ units in the second input layer. There are $n_3=2$ output units.

If we denote the input to the network as $x_i$ where $i=1,\cdots,n_0$ and the network’s output as $\hat{y}_i$ where $i=1,\cdots,n_3$. Then we can write

$$

\begin{aligned}

h_{1i} &= f_1(\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i) & \text{for } i &= 1,\cdots,n_1\\

h_{2i} &= f_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) & i &= 1,\cdots,n_2\\

\hat{y}_i &= f_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i) & i &= 1,\cdots,n_3

\end{aligned}

$$

Here the activation function at layer $i$ is denoted as $f_i$. The outputs of first hidden layer are denoted as $h_{1i}$ for the $i$-th unit. Similarly, the outputs of second hidden layer are denoted as $h_{2i}$. The weights and bias of unit $i$ in layer $k$ are denoted as $w^{(k)}_{ij}$ and $b^{(k)}_i$ respectively.

In the above, we can see that the output of layer $k-1$ will feed into layer $k$. Therefore, while $\hat{y}_i$ is expressed as a function of $h_{2j}$, but $h_{2i}$ is also a function of $h_{1j}$ and in turn, a function of $x_j$.

The above describes the construction of a neural network in terms of algebraic equations. Training a neural network would need to specify a *loss function* as well so we can minimize it in the training loop. Depends on the application, we commonly use cross entropy for categorization problems or mean squared error for regression problems. With the target variables as $y_i$, the mean square error loss function is specified as

$$

L = \sum_{i=1}^{n_3} (y_i-\hat{y}_i)^2

$$

In the above construct, $x_i$ and $y_i$ are from the dataset. The parameters to the neural network are $w$ and $b$. While the activation functions $f_i$ are by design the outputs at each layer $h_{1i}$, $h_{2i}$, and $\hat{y}_i$ are dependent variables. In training the neural network, our goal is to update $w$ and $b$ in each iteration, namely, by the gradient descent update rule:

$$

\begin{aligned}

w^{(k)}_{ij} &= w^{(k)}_{ij} – \eta \frac{\partial L}{\partial w^{(k)}_{ij}} \\

b^{(k)}_{i} &= b^{(k)}_{i} – \eta \frac{\partial L}{\partial b^{(k)}_{i}}

\end{aligned}

$$

where $\eta$ is the learning rate parameter to gradient descent.

From the equation of $L$ we know that $L$ is not dependent on $w^{(k)}_{ij}$ or $b^{(k)}_i$ but on $\hat{y}_i$. However, $\hat{y}_i$ can be written as function of $w^{(k)}_{ij}$ or $b^{(k)}_i$ eventually. Let’s see one by one how the weights and bias at layer $k$ can be connected to $\hat{y}_i$ at the output layer.

We begin with the loss metric. If we consider the loss of a single data point, we have

$$

\begin{aligned}

L &= \sum_{i=1}^{n_3} (y_i-\hat{y}_i)^2\\

\frac{\partial L}{\partial \hat{y}_i} &= 2(y_i – \hat{y}_i) & \text{for } i &= 1,\cdots,n_3

\end{aligned}

$$

Here we see that the loss function depends on all outputs $\hat{y}_i$ and therefore we can find a partial derivative $\frac{\partial L}{\partial \hat{y}_i}$.

Now let’s look at the output layer:

$$

\begin{aligned}

\hat{y}_i &= f_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i) & \text{for }i &= 1,\cdots,n_3 \\

\frac{\partial L}{\partial w^{(3)}_{ij}} &= \frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial w^{(3)}_{ij}} & i &= 1,\cdots,n_3;\ j=1,\cdots,n_2 \\

&= \frac{\partial L}{\partial \hat{y}_i} f’_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i)h_{2j} \\

\frac{\partial L}{\partial b^{(3)}_i} &= \frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial b^{(3)}_i} & i &= 1,\cdots,n_3 \\

&= \frac{\partial L}{\partial \hat{y}_i}f’_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i)

\end{aligned}

$$

Because the weight $w^{(3)}_{ij}$ at layer 3 applies to input $h_{2j}$ and affects output $\hat{y}_i$ only. Hence we can write the derivative $\frac{\partial L}{\partial w^{(3)}_{ij}}$ as the product of two derivatives $\frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial w^{(3)}_{ij}}$. Similar case for the bias $b^{(3)}_i$ as well. In the above, we make use of $\frac{\partial L}{\partial \hat{y}_i}$, which we already derived previously.

But in fact, we can also write the partial derivative of $L$ with respect to output of second layer $h_{2j}$. It is not used for the update of weights and bias on layer 3 but we will see its importance later:

$$

\begin{aligned}

\frac{\partial L}{\partial h_{2j}} &= \sum_{i=1}^{n_3}\frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial h_{2j}} & \text{for }j &= 1,\cdots,n_2 \\

&= \sum_{i=1}^{n_3}\frac{\partial L}{\partial \hat{y}_i}f’_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i)w^{(3)}_{ij}

\end{aligned}

$$

This one is the interesting one and different from the previous partial derivatives. Note that $h_{2j}$ is an output of layer 2. Each and every output in layer 2 will affect the output $\hat{y}_i$ in layer 3. Therefore, to find $\frac{\partial L}{\partial h_{2j}}$ we need to add up every output at layer 3. Thus the summation sign in the equation above. And we can consider $\frac{\partial L}{\partial h_{2j}}$ as the total derivative, in which we applied the chain rule $\frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial h_{2j}}$ for every output $i$ and then sum them up.

If we move back to layer 2, we can derive the derivatives similarly:

$$

\begin{aligned}

h_{2i} &= f_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) & \text{for }i &= 1,\cdots,n_2\\

\frac{\partial L}{\partial w^{(2)}_{ij}} &= \frac{\partial L}{\partial h_{2i}}\frac{\partial h_{2i}}{\partial w^{(2)}_{ij}} & i&=1,\cdots,n_2;\ j=1,\cdots,n_1 \\

&= \frac{\partial L}{\partial h_{2i}}f’_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i)h_{1j} \\

\frac{\partial L}{\partial b^{(2)}_i} &= \frac{\partial L}{\partial h_{2i}}\frac{\partial h_{2i}}{\partial b^{(2)}_i} & i &= 1,\cdots,n_2 \\

&= \frac{\partial L}{\partial h_{2i}}f’_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) \\

\frac{\partial L}{\partial h_{1j}} &= \sum_{i=1}^{n_2}\frac{\partial L}{\partial h_{2i}}\frac{\partial h_{2i}}{\partial h_{1j}} & j&= 1,\cdots,n_1 \\

&= \sum_{i=1}^{n_2}\frac{\partial L}{\partial h_{2i}}f’_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) w^{(2)}_{ij}

\end{aligned}

$$

In the equations above, we are reusing $\frac{\partial L}{\partial h_{2i}}$ that we derived earlier. Again, this derivative is computed as a sum of several products from the chain rule. Also similar to the previous, we derived $\frac{\partial L}{\partial h_{1j}}$ as well. It is not used to train $w^{(2)}_{ij}$ nor $b^{(2)}_i$ but will be used for the layer prior. So for layer 1, we have

$$

\begin{aligned}

h_{1i} &= f_1(\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i) & \text{for } i &= 1,\cdots,n_1\\

\frac{\partial L}{\partial w^{(1)}_{ij}} &= \frac{\partial L}{\partial h_{1i}}\frac{\partial h_{1i}}{\partial w^{(1)}_{ij}} & i&=1,\cdots,n_1;\ j=1,\cdots,n_0 \\

&= \frac{\partial L}{\partial h_{1i}}f’_1(\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i)x_j \\

\frac{\partial L}{\partial b^{(1)}_i} &= \frac{\partial L}{\partial h_{1i}}\frac{\partial h_{1i}}{\partial b^{(1)}_i} & i&=1,\cdots,n_1 \\

&= \frac{\partial L}{\partial h_{1i}}f’_1(\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i)

\end{aligned}

$$

and this completes all the derivatives needed for training of the neural network using gradient descent algorithm.

Recall how we derived the above: We first start from the loss function $L$ and find the derivatives one by one in the reverse order of the layers. We write down the derivatives on layer $k$ and reuse it for the derivatives on layer $k-1$. While computing the output $\hat{y}_i$ from input $x_i$ starts from layer 0 forward, computing gradients are in the reversed order. Hence the name “back-propagation”.

While we did not use it above, it is cleaner to write the equations in vectors and matrices. We can rewrite the layers and the outputs as:

$$

\mathbf{a}_k = f_k(\mathbf{z}_k) = f_k(\mathbf{W}_k\mathbf{a}_{k-1}+\mathbf{b}_k)

$$

where $\mathbf{a}_k$ is a vector of outputs of layer $k$, and assume $\mathbf{a}_0=\mathbf{x}$ is the input vector and $\mathbf{a}_3=\hat{\mathbf{y}}$ is the output vector. Also denote $\mathbf{z}_k = \mathbf{W}_k\mathbf{a}_{k-1}+\mathbf{b}_k$ for convenience of notation.

Under such notation, we can represent $\frac{\partial L}{\partial\mathbf{a}_k}$ as a vector (so as that of $\mathbf{z}_k$ and $\mathbf{b}_k$) and $\frac{\partial L}{\partial\mathbf{W}_k}$ as a matrix. And then if $\frac{\partial L}{\partial\mathbf{a}_k}$ is known, we have

$$

\begin{aligned}

\frac{\partial L}{\partial\mathbf{z}_k} &= \frac{\partial L}{\partial\mathbf{a}_k}\odot f_k'(\mathbf{z}_k) \\

\frac{\partial L}{\partial\mathbf{W}_k} &= \left(\frac{\partial L}{\partial\mathbf{z}_k}\right)^\top \cdot \mathbf{a}_k \\

\frac{\partial L}{\partial\mathbf{b}_k} &= \frac{\partial L}{\partial\mathbf{z}_k} \\

\frac{\partial L}{\partial\mathbf{a}_{k-1}} &= \left(\frac{\partial\mathbf{z}_k}{\partial\mathbf{a}_{k-1}}\right)^\top\cdot\frac{\partial L}{\partial\mathbf{z}_k} = \mathbf{W}_k^\top\cdot\frac{\partial L}{\partial\mathbf{z}_k}

\end{aligned}

$$

where $\frac{\partial\mathbf{z}_k}{\partial\mathbf{a}_{k-1}}$ is a Jacobian matrix as both $\mathbf{z}_k$ and $\mathbf{a}_{k-1}$ are vectors, and this Jacobian matrix happens to be $\mathbf{W}_k$.

We need the matrix form of equations because it will make our code simpler and avoided a lot of loops. Let’s see how we can convert these equations into code and make a multilayer perceptron model for classification from scratch using numpy.

The first thing we need to implement the activation function and the loss function. Both need to be differentiable functions or otherwise our gradient descent procedure would not work. Nowadays, it is common to use ReLU activation in the hidden layers and sigmoid activation in the output layer. We define them as a function (which assumes the input as numpy array) as well as their differentiation:

import numpy as np # Find a small float to avoid division by zero epsilon = np.finfo(float).eps # Sigmoid function and its differentiation def sigmoid(z): return 1/(1+np.exp(-z.clip(-500, 500))) def dsigmoid(z): s = sigmoid(z) return 2 * s * (1-s) # ReLU function and its differentiation def relu(z): return np.maximum(0, z) def drelu(z): return (z > 0).astype(float)

We deliberately clip the input of the sigmoid function to between -500 to +500 to avoid overflow. Otherwise, these functions are trivial. Then for classification, we care about accuracy but the accuracy function is not differentiable. Therefore, we use the cross entropy function as loss for training:

# Loss function L(y, yhat) and its differentiation def cross_entropy(y, yhat): """Binary cross entropy function L = - y log yhat - (1-y) log (1-yhat) Args: y, yhat (np.array): 1xn matrices which n are the number of data instances Returns: average cross entropy value of shape 1x1, averaging over the n instances """ return -(y.T @ np.log(yhat.clip(epsilon)) + (1-y.T) @ np.log((1-yhat).clip(epsilon))) / y.shape[1] def d_cross_entropy(y, yhat): """ dL/dyhat """ return - np.divide(y, yhat.clip(epsilon)) + np.divide(1-y, (1-yhat).clip(epsilon))

In the above, we assume the output and the target variables are row matrices in numpy. Hence we use the dot product operator `@`

to compute the sum and divide by the number of elements in the output. Note that this design is to compute the **average cross entropy** over a **batch** of samples.

Then we can implement our multilayer perceptron model. To make it easier to read, we want to create the model by providing the number of neurons at each layer as well as the activation function at the layers. But at the same time, we would also need the differentiation of the activation functions as well as the differentiation of the loss function for the training. The loss function itself, however, is not required but useful for us to track the progress. We create a class to ensapsulate the entire model, and define each layer $k$ according to the formula:

$$

\mathbf{a}_k = f_k(\mathbf{z}_k) = f_k(\mathbf{a}_{k-1}\mathbf{W}_k+\mathbf{b}_k)

$

class mlp: '''Multilayer perceptron using numpy ''' def __init__(self, layersizes, activations, derivatives, lossderiv): """remember config, then initialize array to hold NN parameters without init""" # hold NN config self.layersizes = layersizes self.activations = activations self.derivatives = derivatives self.lossderiv = lossderiv # parameters, each is a 2D numpy array L = len(self.layersizes) self.z = [None] * L self.W = [None] * L self.b = [None] * L self.a = [None] * L self.dz = [None] * L self.dW = [None] * L self.db = [None] * L self.da = [None] * L def initialize(self, seed=42): np.random.seed(seed) sigma = 0.1 for l, (insize, outsize) in enumerate(zip(self.layersizes, self.layersizes[1:]), 1): self.W[l] = np.random.randn(insize, outsize) * sigma self.b[l] = np.random.randn(1, outsize) * sigma def forward(self, x): self.a[0] = x for l, func in enumerate(self.activations, 1): # z = W a + b, with `a` as output from previous layer # `W` is of size rxs and `a` the size sxn with n the number of data instances, `z` the size rxn # `b` is rx1 and broadcast to each column of `z` self.z[l] = (self.a[l-1] @ self.W[l]) + self.b[l] # a = g(z), with `a` as output of this layer, of size rxn self.a[l] = func(self.z[l]) return self.a[-1]

The variables in this class `z`

, `W`

, `b`

, and `a`

are for the forward pass and the variables `dz`

, `dW`

, `db`

, and `da`

are their respective gradients that to be computed in the back-propagation. All these variables are presented as numpy arrays.

As we will see later, we are going to test our model using data generated by scikit-learn. Hence we will see our data in numpy array of shape “(number of samples, number of features)”. Therefore, each sample is presented as a row on a matrix, and in function `forward()`

, the weight matrix is right-multiplied to each input `a`

to the layer. While the activation function and dimension of each layer can be different, the process is the same. Thus we transform the neural network’s input `x`

to its output by a loop in the `forward()`

function. The network’s output is simply the output of the last layer.

To train the network, we need to run the back-propagation after each forward pass. The back-propagation is to compute the gradient of the weight and bias of each layer, starting from the output layer to the input layer. With the equations we derived above, the back-propagation function is implemented as:

class mlp: ... def backward(self, y, yhat): # first `da`, at the output self.da[-1] = self.lossderiv(y, yhat) for l, func in reversed(list(enumerate(self.derivatives, 1))): # compute the differentials at this layer self.dz[l] = self.da[l] * func(self.z[l]) self.dW[l] = self.a[l-1].T @ self.dz[l] self.db[l] = np.mean(self.dz[l], axis=0, keepdims=True) self.da[l-1] = self.dz[l] @ self.W[l].T def update(self, eta): for l in range(1, len(self.W)): self.W[l] -= eta * self.dW[l] self.b[l] -= eta * self.db[l]

The only difference here is that we compute `db`

not for one training sample, but for the entire batch. Since the loss function is the cross entropy averaged across the batch, we compute `db`

also by averaging across the samples.

Up to here, we completed our model. The `update()`

function simply applies the gradients found by the back-propagation to the parameters `W`

and `b`

using the gradient descent update rule.

To test out our model, we make use of scikit-learn to generate a classification dataset:

from sklearn.datasets import make_circles from sklearn.metrics import accuracy_score # Make data: Two circles on x-y plane as a classification problem X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1) y = y.reshape(-1,1) # our model expects a 2D array of (n_sample, n_dim)

and then we build our model: Input is two-dimensional and output is one dimensional (logistic regression). We make two hidden layers of 4 and 3 neurons respectively:

# Build a model model = mlp(layersizes=[2, 4, 3, 1], activations=[relu, relu, sigmoid], derivatives=[drelu, drelu, dsigmoid], lossderiv=d_cross_entropy) model.initialize() yhat = model.forward(X) loss = cross_entropy(y, yhat) print("Before training - loss value {} accuracy {}".format(loss, accuracy_score(y, (yhat > 0.5))))

We see that, under random weight, the accuracy is 50%:

Before training - loss value [[693.62972747]] accuracy 0.5

Now we train our network. To make things simple, we perform full-batch gradient descent with fixed learning rate:

# train for each epoch n_epochs = 150 learning_rate = 0.005 for n in range(n_epochs): model.forward(X) yhat = model.a[-1] model.backward(y, yhat) model.update(learning_rate) loss = cross_entropy(y, yhat) print("Iteration {} - loss value {} accuracy {}".format(n, loss, accuracy_score(y, (yhat > 0.5))))

and the output is:

Iteration 0 - loss value [[693.62972747]] accuracy 0.5 Iteration 1 - loss value [[693.62166655]] accuracy 0.5 Iteration 2 - loss value [[693.61534159]] accuracy 0.5 Iteration 3 - loss value [[693.60994018]] accuracy 0.5 ... Iteration 145 - loss value [[664.60120828]] accuracy 0.818 Iteration 146 - loss value [[697.97739669]] accuracy 0.58 Iteration 147 - loss value [[681.08653776]] accuracy 0.642 Iteration 148 - loss value [[665.06165774]] accuracy 0.71 Iteration 149 - loss value [[683.6170298]] accuracy 0.614

Although not perfect, we see the improvement by training. At least in the example above, we can see the accuracy was up to more than 80% at iteration 145, but then we saw the model diverged. That can be improved by reducing the learning rate, which we didn’t implement above. Nonetheless, this shows how we computed the gradients by back-propagations and chain rules.

The complete code is as follows:

from sklearn.datasets import make_circles from sklearn.metrics import accuracy_score import numpy as np np.random.seed(0) # Find a small float to avoid division by zero epsilon = np.finfo(float).eps # Sigmoid function and its differentiation def sigmoid(z): return 1/(1+np.exp(-z.clip(-500, 500))) def dsigmoid(z): s = sigmoid(z) return 2 * s * (1-s) # ReLU function and its differentiation def relu(z): return np.maximum(0, z) def drelu(z): return (z > 0).astype(float) # Loss function L(y, yhat) and its differentiation def cross_entropy(y, yhat): """Binary cross entropy function L = - y log yhat - (1-y) log (1-yhat) Args: y, yhat (np.array): nx1 matrices which n are the number of data instances Returns: average cross entropy value of shape 1x1, averaging over the n instances """ return -(y.T @ np.log(yhat.clip(epsilon)) + (1-y.T) @ np.log((1-yhat).clip(epsilon))) / y.shape[1] def d_cross_entropy(y, yhat): """ dL/dyhat """ return - np.divide(y, yhat.clip(epsilon)) + np.divide(1-y, (1-yhat).clip(epsilon)) class mlp: '''Multilayer perceptron using numpy ''' def __init__(self, layersizes, activations, derivatives, lossderiv): """remember config, then initialize array to hold NN parameters without init""" # hold NN config self.layersizes = tuple(layersizes) self.activations = tuple(activations) self.derivatives = tuple(derivatives) self.lossderiv = lossderiv assert len(self.layersizes)-1 == len(self.activations), \ "number of layers and the number of activation functions does not match" assert len(self.activations) == len(self.derivatives), \ "number of activation functions and number of derivatives does not match" assert all(isinstance(n, int) and n >= 1 for n in layersizes), \ "Only positive integral number of perceptons is allowed in each layer" # parameters, each is a 2D numpy array L = len(self.layersizes) self.z = [None] * L self.W = [None] * L self.b = [None] * L self.a = [None] * L self.dz = [None] * L self.dW = [None] * L self.db = [None] * L self.da = [None] * L def initialize(self, seed=42): """initialize the value of weight matrices and bias vectors with small random numbers.""" np.random.seed(seed) sigma = 0.1 for l, (insize, outsize) in enumerate(zip(self.layersizes, self.layersizes[1:]), 1): self.W[l] = np.random.randn(insize, outsize) * sigma self.b[l] = np.random.randn(1, outsize) * sigma def forward(self, x): """Feed forward using existing `W` and `b`, and overwrite the result variables `a` and `z` Args: x (numpy.ndarray): Input data to feed forward """ self.a[0] = x for l, func in enumerate(self.activations, 1): # z = W a + b, with `a` as output from previous layer # `W` is of size rxs and `a` the size sxn with n the number of data instances, `z` the size rxn # `b` is rx1 and broadcast to each column of `z` self.z[l] = (self.a[l-1] @ self.W[l]) + self.b[l] # a = g(z), with `a` as output of this layer, of size rxn self.a[l] = func(self.z[l]) return self.a[-1] def backward(self, y, yhat): """back propagation using NN output yhat and the reference output y, generates dW, dz, db, da """ assert y.shape[1] == self.layersizes[-1], "Output size doesn't match network output size" assert y.shape == yhat.shape, "Output size doesn't match reference" # first `da`, at the output self.da[-1] = self.lossderiv(y, yhat) for l, func in reversed(list(enumerate(self.derivatives, 1))): # compute the differentials at this layer self.dz[l] = self.da[l] * func(self.z[l]) self.dW[l] = self.a[l-1].T @ self.dz[l] self.db[l] = np.mean(self.dz[l], axis=0, keepdims=True) self.da[l-1] = self.dz[l] @ self.W[l].T assert self.z[l].shape == self.dz[l].shape assert self.W[l].shape == self.dW[l].shape assert self.b[l].shape == self.db[l].shape assert self.a[l].shape == self.da[l].shape def update(self, eta): """Updates W and b Args: eta (float): Learning rate """ for l in range(1, len(self.W)): self.W[l] -= eta * self.dW[l] self.b[l] -= eta * self.db[l] # Make data: Two circles on x-y plane as a classification problem X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1) y = y.reshape(-1,1) # our model expects a 2D array of (n_sample, n_dim) print(X.shape) print(y.shape) # Build a model model = mlp(layersizes=[2, 4, 3, 1], activations=[relu, relu, sigmoid], derivatives=[drelu, drelu, dsigmoid], lossderiv=d_cross_entropy) model.initialize() yhat = model.forward(X) loss = cross_entropy(y, yhat) print("Before training - loss value {} accuracy {}".format(loss, accuracy_score(y, (yhat > 0.5)))) # train for each epoch n_epochs = 150 learning_rate = 0.005 for n in range(n_epochs): model.forward(X) yhat = model.a[-1] model.backward(y, yhat) model.update(learning_rate) loss = cross_entropy(y, yhat) print("Iteration {} - loss value {} accuracy {}".format(n, loss, accuracy_score(y, (yhat > 0.5))))

The back-propagation algorithm is the center of all neural network training, regardless of what variation of gradient descent algorithms you used. Textbook such as this one covered it:

*Deep Learning*, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016.

(https://www.amazon.com/dp/0262035618)

Previously also implemented the neural network from scratch without discussing the math, it explained the steps in greater detail:

In this tutorial, you learned how differentiation is applied to training a neural network.

Specifically, you learned:

- What is a total differential and how it is expressed as a sum of partial differentials
- How to express a neural network as equations and derive the gradients by differentiation
- How back-propagation helped us to express the gradients of each layer in the neural network
- How to convert the gradients into code to make a neural network model

The post Application of differentiations in neural networks appeared first on Machine Learning Mastery.

]]>The post Method of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 1: The Separable Case) appeared first on Machine Learning Mastery.

]]>Last Updated on November 26, 2021

This tutorial is designed for anyone looking for a deeper understanding of how Lagrange multipliers are used in building up the model for support vector machines (SVMs). SVMs were initially designed to solve binary classification problems and later extended and applied to regression and unsupervised learning. They have shown their success in solving many complex machine learning classification problems.

In this tutorial, we’ll look at the simplest SVM that assumes that the positive and negative examples can be completely separated via a linear hyperplane.

After completing this tutorial, you will know:

- How the hyperplane acts as the decision boundary
- Mathematical constraints on the positive and negative examples
- What is the margin and how to maximize the margin
- Role of Lagrange multipliers in maximizing the margin
- How to determine the separating hyperplane for the separable case

Let’s get started.

This tutorial is divided into three parts; they are:

- Formulation of the mathematical model of SVM
- Solution of finding the maximum margin hyperplane via the method of Lagrange multipliers
- Solved example to demonstrate all concepts

- $x$: Data point, which is an n-dimensional vector.
- $x^+$: Data point labelled as +1.
- $x^-$: Data point labelled as -1.
- $i$: Subscript used to index the training points.
- $t$: Label of a data point.
- T: Transpose operator.
- $w$: Weight vector denoting the coefficients of the hyperplane. It is also an n-dimensional vector.
- $\alpha$: Lagrange multipliers, one per each training point. This is also an n-dimensional vector.
- $d$: Perpendicular distance of a data point from the decision boundary.

The support vector machine is designed to discriminate data points belonging to two different classes. One set of points is labelled as +1 also called the positive class. The other set of points is labeled as -1 also called the negative class. For now, we’ll make a simplifying assumption that points from both classes can be discriminated via linear hyperplane.

The SVM assumes a linear decision boundary between the two classes and the goal is to find a hyperplane that gives the maximum separation between the two classes. For this reason, the alternate term `maximum margin classifier`

is also sometimes used to refer to an SVM. The perpendicular distance between the closest data point and the decision boundary is referred to as the `margin`

. As the margin completely separates the positive and negative examples and does not tolerate any errors, it is also called the `hard margin`

.

The mathematical expression for a hyperplane is given below with \(w_i\) being the coefficients and \(w_0\) being the arbitrary constant that determines the distance of the hyperplane from the origin:

$$

w^T x_i + w_0 = 0

$$

For the ith 2-dimensional point $(x_{i1}, x_{i2})$ the above expression is reduced to:

$$

w_1x_{i1} + w_2 x_{i2} + w_0 = 0

$$

As we are looking to maximize the margin between positive and negative data points, we would like the positive data points to satisfy the following constraint:

$$

w^T x_i^+ + w_0 \geq +1

$$

Similarly, the negative data points should satisfy:

$$

w^T x_i^- + w_0 \leq -1

$$

We can use a neat trick to write a uniform equation for both set of points by using $t_i \in \{-1,+1\}$ to denote the class label of data point $x_i$:

$$

t_i(w^T x_i + w_0) \geq +1

$$

The perpendicular distance $d_i$ of a data point $x_i$ from the margin is given by:

$$

d_i = \frac{|w^T x_i + w_0|}{||w||}

$$

To maximize this distance, we can minimize the square of the denominator to give us a quadratic programming problem given by:

$$

\min \frac{1}{2}||w||^2 \;\text{ subject to } t_i(w^Tx_i+w_0) \geq +1, \forall i

$$

To solve the above quadratic programming problem with inequality constraints, we can use the method of Lagrange multipliers. The Lagrange function is therefore:

$$

L(w, w_0, \alpha) = \frac{1}{2}||w||^2 + \sum_i \alpha_i\big(t_i(w^Tx_i+w_0) – 1\big)

$$

To solve the above, we set the following:

\begin{equation}

\frac{\partial L}{ \partial w} = 0, \\

\frac{\partial L}{ \partial \alpha} = 0, \\

\frac{\partial L}{ \partial w_0} = 0 \\

\end{equation}

Plugging above in the Lagrange function gives us the following optimization problem, also called the dual:

$$

L_d = -\frac{1}{2} \sum_i \sum_k \alpha_i \alpha_k t_i t_k (x_i)^T (x_k) + \sum_i \alpha_i

$$

We have to maximize the above subject to the following:

$$

w = \sum_i \alpha_i t_i x_i

$$

and

$$

0=\sum_i \alpha_i t_i

$$

The nice thing about the above is that we have an expression for \(w\) in terms of Lagrange multipliers. The objective function involves no $w$ term. There is a Lagrange multiplier associated with each data point. The computation of $w_0$ is also explained later.

The classification of any test point $x$ can be determined using this expression:

$$

y(x) = \sum_i \alpha_i t_i x^T x_i + w_0

$$

A positive value of $y(x)$ implies $x\in+1$ and a negative value means $x\in-1$

Also, Karush-Kuhn-Tucker (KKT) conditions are satisfied by the above constrained optimization problem as given by:

\begin{eqnarray}

\alpha_i &\geq& 0 \\

t_i y(x_i) -1 &\geq& 0 \\

\alpha_i(t_i y(x_i) -1) &=& 0

\end{eqnarray}

The KKT conditions dictate that for each data point one of the following is true:

- The Lagrange multiplier is zero, i.e., \(\alpha_i=0\). This point, therefore, plays no role in classification

OR

- $ t_i y(x_i) = 1$ and $\alpha_i > 0$: In this case, the data point has a role in deciding the value of $w$. Such a point is called a support vector.

For $w_0$, we can select any support vector $x_s$ and solve

$$

t_s y(x_s) = 1

$$

giving us:

$$

t_s(\sum_i \alpha_i t_i x_s^T x_i + w_0) = 1

$$

To help you understand the above concepts, here is a simple arbitrarily solved example. Of course, for a large number of points you would use an optimization software to solve this. Also, this is one possible solution that satisfies all the constraints. The objective function can be maximized further but the slope of the hyperplane will remain the same for an optimal solution. Also, for this example, $w_0$ was computed by taking the average of $w_0$ from all three support vectors.

This example will show you that the model is not as complex as it looks.

For the above set of points, we can see that (1,2), (2,1) and (0,0) are points closest to the separating hyperplane and hence, act as support vectors. Points far away from the boundary (e.g. (-3,1)) do not play any role in determining the classification of the points.

This section provides more resources on the topic if you are looking to go deeper.

- Pattern Recognition and Machine Learning by Christopher M. Bishop
- Thomas’ Calculus, 14th edition, 2017 (based on the original works of George B. Thomas, revised by Joel Hass, Christopher Heil, Maurice Weir)

- Support Vector Machines for Machine Learning
- A Tutorial on Support Vector Machines for Pattern Recognition by Christopher J.C. Burges

In this tutorial, you discovered how to use the method of Lagrange multipliers to solve the problem of maximizing the margin via a quadratic programming problem with inequality constraints.

Specifically, you learned:

- The mathematical expression for a separating linear hyperplane
- The maximum margin as a solution of a quadratic programming problem with inequality constraint
- How to find a linear hyperplane between positive and negative examples using the method of Lagrange multipliers

Do you have any questions about the SVM discussed in this post? Ask your questions in the comments below and I will do my best to answer.

The post Method of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 1: The Separable Case) appeared first on Machine Learning Mastery.

]]>The post Visualizing the vanishing gradient problem appeared first on Machine Learning Mastery.

]]>Last Updated on November 26, 2021

Deep learning was a recent invention. Partially, it is due to improved computation power that allows us to use more layers of perceptrons in a neural network. But at the same time, we can train a deep network only after we know how to work around the vanishing gradient problem.

In this tutorial, we visually examine why vanishing gradient problem exists.

After completing this tutorial, you will know

- What is a vanishing gradient
- Which configuration of neural network will be susceptible to vanishing gradient
- How to run manual training loop in Keras
- How to extract weights and gradients from Keras model

Let’s get started

This tutorial is divided into 5 parts; they are:

- Configuration of multilayer perceptron models
- Example of vanishing gradient problem
- Looking at the weights of each layer
- Looking at the gradients of each layer
- The Glorot initialization

Because neural networks are trained by gradient descent, people believed that a differentiable function is required to be the activation function in neural networks. This caused us to conventionally use sigmoid function or hyperbolic tangent as activation.

For a binary classification problem, if we want to do logistic regression such that 0 and 1 are the ideal output, sigmoid function is preferred as it is in this range:

$$

\sigma(x) = \frac{1}{1+e^{-x}}

$$

and if we need sigmoidal activation at the output, it is natural to use it in all layers of the neural network. Additionally, each layer in a neural network has a weight parameter. Initially, the weights have to be randomized and naturally we would use some simple way to do it, such as using uniform random or normal distribution.

To illustrate the problem of vanishing gradient, let’s try with an example. Neural network is a nonlinear function. Hence it should be most suitable for classification of nonlinear dataset. We make use of scikit-learn’s `make_circle()`

function to generate some data:

from sklearn.datasets import make_circles import matplotlib.pyplot as plt # Make data: Two circles on x-y plane as a classification problem X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1) plt.figure(figsize=(8,6)) plt.scatter(X[:,0], X[:,1], c=y) plt.show()

This is not difficult to classify. A naive way is to build a 3-layer neural network, which can give a quite good result:

from tensorflow.keras.layers import Dense, Input from tensorflow.keras import Sequential model = Sequential([ Input(shape=(2,)), Dense(5, "relu"), Dense(1, "sigmoid") ]) model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=32, epochs=100, verbose=0) print(model.evaluate(X,y))

32/32 [==============================] - 0s 1ms/step - loss: 0.2404 - acc: 0.9730 [0.24042171239852905, 0.9729999899864197]

Note that we used rectified linear unit (ReLU) in the hidden layer above. By default, the dense layer in Keras will be using linear activation (i.e. no activation) which mostly is not useful. We usually use ReLU in modern neural networks. But we can also try the old school way as everyone does two decades ago:

model = Sequential([ Input(shape=(2,)), Dense(5, "sigmoid"), Dense(1, "sigmoid") ]) model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=32, epochs=100, verbose=0) print(model.evaluate(X,y))

32/32 [==============================] - 0s 1ms/step - loss: 0.6927 - acc: 0.6540 [0.6926590800285339, 0.6539999842643738]

The accuracy is much worse. It turns out, it is even worse by adding more layers (at least in my experiment):

model = Sequential([ Input(shape=(2,)), Dense(5, "sigmoid"), Dense(5, "sigmoid"), Dense(5, "sigmoid"), Dense(1, "sigmoid") ]) model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=32, epochs=100, verbose=0) print(model.evaluate(X,y))

32/32 [==============================] - 0s 1ms/step - loss: 0.6922 - acc: 0.5330 [0.6921834349632263, 0.5329999923706055]

Your result may vary given the stochastic nature of the training algorithm. You may see the 5-layer sigmoidal network performing much worse than 3-layer or not. But the idea here is you can’t get back the high accuracy as we can achieve with rectified linear unit activation by merely adding layers.

Shouldn’t we get a more powerful neural network with more layers?

Yes, it should be. But it turns out as we adding more layers, we triggered the vanishing gradient problem. To illustrate what happened, let’s see how are the weights look like as we trained our network.

In Keras, we are allowed to plug-in a callback function to the training process. We are going create our own callback object to intercept and record the weights of each layer of our multilayer perceptron (MLP) model at the end of each epoch.

from tensorflow.keras.callbacks import Callback class WeightCapture(Callback): "Capture the weights of each layer of the model" def __init__(self, model): super().__init__() self.model = model self.weights = [] self.epochs = [] def on_epoch_end(self, epoch, logs=None): self.epochs.append(epoch) # remember the epoch axis weight = {} for layer in model.layers: if not layer.weights: continue name = layer.weights[0].name.split("/")[0] weight[name] = layer.weights[0].numpy() self.weights.append(weight)

We derive the `Callback`

class and define the `on_epoch_end()`

function. This class will need the created model to initialize. At the end of each epoch, it will read each layer and save the weights into numpy array.

For the convenience of experimenting different ways of creating a MLP, we make a helper function to set up the neural network model:

def make_mlp(activation, initializer, name): "Create a model with specified activation and initalizer" model = Sequential([ Input(shape=(2,), name=name+"0"), Dense(5, activation=activation, kernel_initializer=initializer, name=name+"1"), Dense(5, activation=activation, kernel_initializer=initializer, name=name+"2"), Dense(5, activation=activation, kernel_initializer=initializer, name=name+"3"), Dense(5, activation=activation, kernel_initializer=initializer, name=name+"4"), Dense(1, activation="sigmoid", kernel_initializer=initializer, name=name+"5") ]) return model

We deliberately create a neural network with 4 hidden layers so we can see how each layer respond to the training. We will vary the activation function of each hidden layer as well as the weight initialization. To make things easier to tell, we are going to name each layer instead of letting Keras to assign a name. The input is a coordinate on the xy-plane hence the input shape is a vector of 2. The output is binary classification. Therefore we use sigmoid activation to make the output fall in the range of 0 to 1.

Then we can `compile()`

the model to provide the evaluation metrics and pass on the callback in the `fit()`

call to train the model:

initializer = RandomNormal(mean=0.0, stddev=1.0) batch_size = 32 n_epochs = 100 model = make_mlp("sigmoid", initializer, "sigmoid") capture_cb = WeightCapture(model) capture_cb.on_epoch_end(-1) model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=1)

Here we create the neural network by calling `make_mlp()`

first. Then we set up our callback object. Since the weights of each layer in the neural network are initialized at creation, we deliberately call the callback function to remember what they are initialized to. Then we call the `compile()`

and `fit()`

from the model as usual, with the callback object provided.

After we fit the model, we can evaluate it with the entire dataset:

... print(model.evaluate(X,y))

[0.6649572253227234, 0.5879999995231628]

Here it means the log-loss is 0.665 and the accuracy is 0.588 for this model of having all layers using sigmoid activation.

What we can further look into is how the weight behaves along the iterations of training. All the layers except the first and the last are having their weight as a 5×5 matrix. We can check the mean and standard deviation of the weights to get a sense of how the weights look like:

def plotweight(capture_cb): "Plot the weights' mean and s.d. across epochs" fig, ax = plt.subplots(2, 1, sharex=True, constrained_layout=True, figsize=(8, 10)) ax[0].set_title("Mean weight") for key in capture_cb.weights[0]: ax[0].plot(capture_cb.epochs, [w[key].mean() for w in capture_cb.weights], label=key) ax[0].legend() ax[1].set_title("S.D.") for key in capture_cb.weights[0]: ax[1].plot(capture_cb.epochs, [w[key].std() for w in capture_cb.weights], label=key) ax[1].legend() plt.show() plotweight(capture_cb)

This results in the following figure:

We see the mean weight moved quickly only in first 10 iterations or so. Only the weights of the first layer getting more diversified as its standard deviation is moving up.

We can restart with the hyperbolic tangent (tanh) activation on the same process:

# tanh activation, large variance gaussian initialization model = make_mlp("tanh", initializer, "tanh") capture_cb = WeightCapture(model) capture_cb.on_epoch_end(-1) model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0) print(model.evaluate(X,y)) plotweight(capture_cb)

[0.012918001972138882, 0.9929999709129333]

The log-loss and accuracy are both improved. If we look at the plot, we don’t see the abrupt change in the mean and standard deviation in the weights but instead, that of all layers are slowly converged.

Similar case can be seen in ReLU activation:

# relu activation, large variance gaussian initialization model = make_mlp("relu", initializer, "relu") capture_cb = WeightCapture(model) capture_cb.on_epoch_end(-1) model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0) print(model.evaluate(X,y)) plotweight(capture_cb)

[0.016895903274416924, 0.9940000176429749]

We see the effect of different activation function in the above. But indeed, what matters is the gradient as we are running gradient decent during training. The paper by Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks”, suggested to look at the gradient of each layer in each training iteration as well as the standard deviation of it.

Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network

— “Understanding the difficulty of training deep feedforward neural networks” (2010)

To understand how the activation function related to the gradient as perceived during training, we need to run the training loop manually.

In Tensorflow-Keras, a training loop can be run by turning on the gradient tape, and then make the neural network model produce an output, which afterwards we can obtain the gradient by automatic differentiation from the gradient tape. Subsequently we can update the parameters (weights and biases) according to the gradient descent update rule.

Because the gradient is readily obtained in this loop, we can make a copy of it. The following is how we implement the training loop and at the same time, keep a copy of the gradients:

optimizer = tf.keras.optimizers.RMSprop() loss_fn = tf.keras.losses.BinaryCrossentropy() def train_model(X, y, model, n_epochs=n_epochs, batch_size=batch_size): "Run training loop manually" train_dataset = tf.data.Dataset.from_tensor_slices((X, y)) train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size) gradhistory = [] losshistory = [] def recordweight(): data = {} for g,w in zip(grads, model.trainable_weights): if '/kernel:' not in w.name: continue # skip bias name = w.name.split("/")[0] data[name] = g.numpy() gradhistory.append(data) losshistory.append(loss_value.numpy()) for epoch in range(n_epochs): for step, (x_batch_train, y_batch_train) in enumerate(train_dataset): with tf.GradientTape() as tape: y_pred = model(x_batch_train, training=True) loss_value = loss_fn(y_batch_train, y_pred) grads = tape.gradient(loss_value, model.trainable_weights) optimizer.apply_gradients(zip(grads, model.trainable_weights)) if step == 0: recordweight() # After all epochs, record again recordweight() return gradhistory, losshistory

The key in the function above is the nested for-loop. In which, we launch `tf.GradientTape()`

and pass in a batch of data to the model to get a prediction, which is then evaluated using the loss function. Afterwards, we can pull out the gradient from the tape by comparing the loss with the trainable weight from the model. Next, we update the weights using the optimizer, which will handle the learning weights and momentums in the gradient descent algorithm implicitly.

As a refresh, the gradient here means the following. For a loss value $L$ computed and a layer with weights $W=[w_1, w_2, w_3, w_4, w_5]$ (e.g., on the output layer) then the gradient is the matrix

$$

\frac{\partial L}{\partial W} = \Big[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \frac{\partial L}{\partial w_3}, \frac{\partial L}{\partial w_4}, \frac{\partial L}{\partial w_5}\Big]

$$

But before we start the next iteration of training, we have a chance to further manipulate the gradient: We match the gradient with the weights, to get the name of each, then save a copy of the gradient as numpy array. We sample the weight and loss only once per epoch, but you can change that to sample in a higher frequency.

With these, we can plot the gradient across epochs. In the following, we create the model (but not calling `compile()`

because we would not call `fit()`

afterwards) and run the manual training loop, then plot the gradient as well as the standard deviation of the gradient:

from sklearn.metrics import accuracy_score def plot_gradient(gradhistory, losshistory): "Plot gradient mean and sd across epochs" fig, ax = plt.subplots(3, 1, sharex=True, constrained_layout=True, figsize=(8, 12)) ax[0].set_title("Mean gradient") for key in gradhistory[0]: ax[0].plot(range(len(gradhistory)), [w[key].mean() for w in gradhistory], label=key) ax[0].legend() ax[1].set_title("S.D.") for key in gradhistory[0]: ax[1].semilogy(range(len(gradhistory)), [w[key].std() for w in gradhistory], label=key) ax[1].legend() ax[2].set_title("Loss") ax[2].plot(range(len(losshistory)), losshistory) plt.show() model = make_mlp("sigmoid", initializer, "sigmoid") print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5))) gradhistory, losshistory = train_model(X, y, model) print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5))) plot_gradient(gradhistory, losshistory)

It reported a weak classification result:

Before training: Accuracy 0.5 After training: Accuracy 0.652

and the plot we obtained shows vanishing gradient:

From the plot, the loss is not significantly decreased. The mean of gradient (i.e., mean of all elements in the gradient matrix) has noticeable value only for the last layer while all other layers are virtually zero. The standard deviation of the gradient is at the level of between 0.01 and 0.001 approximately.

Repeat this with tanh activation, we see a different result, which explains why the performance is better:

model = make_mlp("tanh", initializer, "tanh") print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5))) gradhistory, losshistory = train_model(X, y, model) print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5))) plot_gradient(gradhistory, losshistory)

Before training: Accuracy 0.502 After training: Accuracy 0.994

From the plot of the mean of the gradients, we see the gradients from every layer are wiggling equally. The standard deviation of the gradient are also an order of magnitude larger than the case of sigmoid activation, at around 0.1 to 0.01.

Finally, we can also see the similar in rectified linear unit (ReLU) activation. And in this case the loss dropped quickly, hence we see it as the more efficient activation to use in neural networks:

model = make_mlp("relu", initializer, "relu") print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5))) gradhistory, losshistory = train_model(X, y, model) print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5))) plot_gradient(gradhistory, losshistory)

Before training: Accuracy 0.503 After training: Accuracy 0.995

The following is the complete code:

import numpy as np import tensorflow as tf from tensorflow.keras.callbacks import Callback from tensorflow.keras.layers import Dense, Input from tensorflow.keras import Sequential from tensorflow.keras.initializers import RandomNormal import matplotlib.pyplot as plt from sklearn.datasets import make_circles from sklearn.metrics import accuracy_score tf.random.set_seed(42) np.random.seed(42) # Make data: Two circles on x-y plane as a classification problem X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1) plt.figure(figsize=(8,6)) plt.scatter(X[:,0], X[:,1], c=y) plt.show() # Test performance with 3-layer binary classification network model = Sequential([ Input(shape=(2,)), Dense(5, "relu"), Dense(1, "sigmoid") ]) model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=32, epochs=100, verbose=0) print(model.evaluate(X,y)) # Test performance with 3-layer network with sigmoid activation model = Sequential([ Input(shape=(2,)), Dense(5, "sigmoid"), Dense(1, "sigmoid") ]) model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=32, epochs=100, verbose=0) print(model.evaluate(X,y)) # Test performance with 5-layer network with sigmoid activation model = Sequential([ Input(shape=(2,)), Dense(5, "sigmoid"), Dense(5, "sigmoid"), Dense(5, "sigmoid"), Dense(1, "sigmoid") ]) model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"]) model.fit(X, y, batch_size=32, epochs=100, verbose=0) print(model.evaluate(X,y)) # Illustrate weights across epochs class WeightCapture(Callback): "Capture the weights of each layer of the model" def __init__(self, model): super().__init__() self.model = model self.weights = [] self.epochs = [] def on_epoch_end(self, epoch, logs=None): self.epochs.append(epoch) # remember the epoch axis weight = {} for layer in model.layers: if not layer.weights: continue name = layer.weights[0].name.split("/")[0] weight[name] = layer.weights[0].numpy() self.weights.append(weight) def make_mlp(activation, initializer, name): "Create a model with specified activation and initalizer" model = Sequential([ Input(shape=(2,), name=name+"0"), Dense(5, activation=activation, kernel_initializer=initializer, name=name+"1"), Dense(5, activation=activation, kernel_initializer=initializer, name=name+"2"), Dense(5, activation=activation, kernel_initializer=initializer, name=name+"3"), Dense(5, activation=activation, kernel_initializer=initializer, name=name+"4"), Dense(1, activation="sigmoid", kernel_initializer=initializer, name=name+"5") ]) return model def plotweight(capture_cb): "Plot the weights' mean and s.d. across epochs" fig, ax = plt.subplots(2, 1, sharex=True, constrained_layout=True, figsize=(8, 10)) ax[0].set_title("Mean weight") for key in capture_cb.weights[0]: ax[0].plot(capture_cb.epochs, [w[key].mean() for w in capture_cb.weights], label=key) ax[0].legend() ax[1].set_title("S.D.") for key in capture_cb.weights[0]: ax[1].plot(capture_cb.epochs, [w[key].std() for w in capture_cb.weights], label=key) ax[1].legend() plt.show() initializer = RandomNormal(mean=0, stddev=1) batch_size = 32 n_epochs = 100 # Sigmoid activation model = make_mlp("sigmoid", initializer, "sigmoid") capture_cb = WeightCapture(model) capture_cb.on_epoch_end(-1) model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"]) print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int))) model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0) print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int))) print(model.evaluate(X,y)) plotweight(capture_cb) # tanh activation model = make_mlp("tanh", initializer, "tanh") capture_cb = WeightCapture(model) capture_cb.on_epoch_end(-1) model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"]) print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int))) model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0) print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int))) print(model.evaluate(X,y)) plotweight(capture_cb) # relu activation model = make_mlp("relu", initializer, "relu") capture_cb = WeightCapture(model) capture_cb.on_epoch_end(-1) model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"]) print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int))) model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0) print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int))) print(model.evaluate(X,y)) plotweight(capture_cb) # Show gradient across epochs optimizer = tf.keras.optimizers.RMSprop() loss_fn = tf.keras.losses.BinaryCrossentropy() def train_model(X, y, model, n_epochs=n_epochs, batch_size=batch_size): "Run training loop manually" train_dataset = tf.data.Dataset.from_tensor_slices((X, y)) train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size) gradhistory = [] losshistory = [] def recordweight(): data = {} for g,w in zip(grads, model.trainable_weights): if '/kernel:' not in w.name: continue # skip bias name = w.name.split("/")[0] data[name] = g.numpy() gradhistory.append(data) losshistory.append(loss_value.numpy()) for epoch in range(n_epochs): for step, (x_batch_train, y_batch_train) in enumerate(train_dataset): with tf.GradientTape() as tape: y_pred = model(x_batch_train, training=True) loss_value = loss_fn(y_batch_train, y_pred) grads = tape.gradient(loss_value, model.trainable_weights) optimizer.apply_gradients(zip(grads, model.trainable_weights)) if step == 0: recordweight() # After all epochs, record again recordweight() return gradhistory, losshistory def plot_gradient(gradhistory, losshistory): "Plot gradient mean and sd across epochs" fig, ax = plt.subplots(3, 1, sharex=True, constrained_layout=True, figsize=(8, 12)) ax[0].set_title("Mean gradient") for key in gradhistory[0]: ax[0].plot(range(len(gradhistory)), [w[key].mean() for w in gradhistory], label=key) ax[0].legend() ax[1].set_title("S.D.") for key in gradhistory[0]: ax[1].semilogy(range(len(gradhistory)), [w[key].std() for w in gradhistory], label=key) ax[1].legend() ax[2].set_title("Loss") ax[2].plot(range(len(losshistory)), losshistory) plt.show() model = make_mlp("sigmoid", initializer, "sigmoid") print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5))) gradhistory, losshistory = train_model(X, y, model) print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5))) plot_gradient(gradhistory, losshistory) model = make_mlp("tanh", initializer, "tanh") print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5))) gradhistory, losshistory = train_model(X, y, model) print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5))) plot_gradient(gradhistory, losshistory) model = make_mlp("relu", initializer, "relu") print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5))) gradhistory, losshistory = train_model(X, y, model) print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5))) plot_gradient(gradhistory, losshistory)

We didn’t demonstrate in the code above, but the most famous outcome from the paper by Glorot and Bengio is the Glorot initialization. Which suggests to initialize the weights of a layer of the neural network with uniform distribution:

The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network. We call it the normalized initialization:

$$

W \sim U\Big[-\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}\Big]

$$

— “Understanding the difficulty of training deep feedforward neural networks” (2010)

This is derived from the linear activation on the condition that the standard deviation of the gradient is keeping consistent across the layers. In the sigmoid and tanh activation, the linear region is narrow. Therefore we can understand why ReLU is the key to workaround the vanishing gradient problem. Comparing to replacing the activation function, changing the weight initialization is less pronounced in helping to resolve the vanishing gradient problem. But this can be an exercise for you to explore to see how this can help improving the result.

The Glorot and Bengio paper is available at:

- “Understanding the difficulty of training deep feedforward neural networks”, by Xavier Glorot and Yoshua Bengio, 2010.

(https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)

The vanishing gradient problem is well known enough in machine learning that many books covered it. For example,

*Deep Learning*, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016.

(https://www.amazon.com/dp/0262035618)

Previously we have posts about vanishing and exploding gradients:

- How to fix vanishing gradients using the rectified linear activation function
- Exploding gradients in neural networks

You may also find the following documentation helpful to explain some syntax we used above:

- Writings a training loop from scratch in Keras: https://keras.io/guides/writing_a_training_loop_from_scratch/
- Writing your own callbacks in Keras: https://keras.io/guides/writing_your_own_callbacks/

In this tutorial, you visually saw how a rectified linear unit (ReLU) can help resolving the vanishing gradient problem.

Specifically, you learned:

- How the problem of vanishing gradient impact the performance of a neural network
- Why ReLU activation is the solution to vanishing gradient problem
- How to use a custom callback to extract data in the middle of training loop in Keras
- How to write a custom training loop
- How to read the weight and gradient from a layer in the neural network

The post Visualizing the vanishing gradient problem appeared first on Machine Learning Mastery.

]]>The post Using CNN for financial time series prediction appeared first on Machine Learning Mastery.

]]>Last Updated on November 20, 2021

Convolutional neural networks have their roots in image processing. It was first published in LeNet to recognize the MNIST handwritten digits. However, convolutional neural networks are not limited to handling images.

In this tutorial, we are going to look at an example of using CNN for time series prediction with an application from financial markets. By way of this example, we are going to explore some techniques in using Keras for model training as well.

After completing this tutorial, you will know

- What a typical multidimensional financial data series looks like?
- How can CNN applied to time series in a classification problem
- How to use generators to feed data to train a Keras model
- How to provide a custom metric for evaluating a Keras model

Let’s get started

This tutorial is divided into 7 parts; they are:

- Background of the idea
- Preprocessing of data
- Data generator
- The model
- Training, validation, and test
- Extensions
- Does it work?

In this tutorial we are following the paper titled “CNNpred: CNN-based stock market prediction using a iverse set of variables” by Ehsan Hoseinzade and Saman Haratizadeh. The data file and sample code from the author are available in github:

The goal of the paper is simple: To predict the next day’s direction of the stock market (i.e., up or down compared to today), hence it is a binary classification problem. However, it is interesting to see how this problem are formulated and solved.

We have seen the examples on using CNN for sequence prediction. If we consider Dow Jones Industrial Average (DJIA) as an example, we may build a CNN with 1D convolution for prediction. This makes sense because a 1D convolution on a time series is roughly computing its moving average or using digital signal processing terms, applying a filter to the time series. It should provide some clues about the trend.

However, when we look at financial time series, it is quite a common sense that some derived signals are useful for predictions too. For example, price and volume together can provide a better clue. Also some other technical indicators such as the moving average of different window size are useful too. If we put all these align together, we will have a table of data, which each time instance has multiple **features**, and the goal is still to predict the direction of **one** time series.

In the CNNpred paper, 82 such features are prepared for the DJIA time series:

Unlike LSTM, which there is an explicit concept of time steps applied, we present data as a matrix in CNN models. As shown in the table below, the features across multiple time steps are presented as a 2D array.

In the following, we try to implement the idea of the CNNpred from scratch using Tensorflow’s keras API. While there is a reference implementation from the author in the github link above, we reimplement it differently to illustrate some Keras techniques.

Firstly the data are five CSV files, each for a different market index, under the `Dataset`

directory from github repository above, or we can also get a copy here:

The input data has a date column and a name column to identify the ticker symbol for the market index. We can leave the date column as time index and remove the name column. The rest are all numerical.

As we are going to predict the market direction, we first try to create the classification label. The market direction is defined as the closing index of tomorrow compared to today. If we have read the data into a pandas DataFrame, we can use `X["Close"].pct_change()`

to find the percentage change, which a positive change for the market goes up. So we can shift this to one time step back as our label:

... X["Target"] = (X["Close"].pct_change().shift(-1) > 0).astype(int)

The line of code above is to compute the percentage change of the closing index and align the data with the previous day. Then convert the data into either 1 or 0 for whether the percentage change is positive.

For five data file in the directory, we read each of them as a separate pandas DataFrame and keep them in a Python dictionary:

... data = {} for filename in os.listdir(DATADIR): if not filename.lower().endswith(".csv"): continue # read only the CSV files filepath = os.path.join(DATADIR, filename) X = pd.read_csv(filepath, index_col="Date", parse_dates=True) # basic preprocessing: get the name, the classification # Save the target variable as a column in dataframe for easier dropna() name = X["Name"][0] del X["Name"] cols = X.columns X["Target"] = (X["Close"].pct_change().shift(-1) > 0).astype(int) X.dropna(inplace=True) # Fit the standard scaler using the training dataset index = X.index[X.index > TRAIN_TEST_CUTOFF] index = index[:int(len(index) * TRAIN_VALID_RATIO)] scaler = StandardScaler().fit(X.loc[index, cols]) # Save scale transformed dataframe X[cols] = scaler.transform(X[cols]) data[name] = X

The result of the above code is a DataFrame for each index, which the classification label is the column “Target” while all other columns are input features. We also normalize the data with a standard scaler.

In time series problems, it is generally reasonable not to split the data into training and test sets randomly, but to set up a cutoff point in which the data before the cutoff is training set while that afterwards is the test set. The scaling above are based on the training set but applied to the entire dataset.

We are not going to use all time steps at once, but instead, we use a fixed length of N time steps to predict the market direction at step N+1. In this design, the window of N time steps can start from anywhere. We can just create a large number of DataFrames with large amount of overlaps with one another. To save memory, we are going to build a data generator for training and validation, as follows:

... TRAIN_TEST_CUTOFF = '2016-04-21' TRAIN_VALID_RATIO = 0.75 def datagen(data, seq_len, batch_size, targetcol, kind): "As a generator to produce samples for Keras model" batch = [] while True: # Pick one dataframe from the pool key = random.choice(list(data.keys())) df = data[key] input_cols = [c for c in df.columns if c != targetcol] index = df.index[df.index < TRAIN_TEST_CUTOFF] split = int(len(index) * TRAIN_VALID_RATIO) if kind == 'train': index = index[:split] # range for the training set elif kind == 'valid': index = index[split:] # range for the validation set # Pick one position, then clip a sequence length while True: t = random.choice(index) # pick one time step n = (df.index == t).argmax() # find its position in the dataframe if n-seq_len+1 < 0: continue # can't get enough data for one sequence length frame = df.iloc[n-seq_len+1:n+1] batch.append([frame[input_cols].values, df.loc[t, targetcol]]) break # if we get enough for a batch, dispatch if len(batch) == batch_size: X, y = zip(*batch) X, y = np.expand_dims(np.array(X), 3), np.array(y) yield X, y batch = []

Generator is a special function in Python that does not `return`

a value but to `yield`

in iterations, such that a sequence of data are produced from it. For a generator to be used in Keras training, it is expected to `yield`

a batch of input data and target. This generator supposed to run indefinitely. Hence the generator function above is created with an infinite loop starts with `while True`

.

In each iteration, it randomly pick one DataFrame from the Python dictionary, then within the range of time steps of the training set (i.e., the beginning portion), we start from a random point and take N time steps using the pandas `iloc[start:end]`

syntax to create a input under the variable `frame`

. This DataFrame will be a 2D array. The target label is that of the last time step. The input data and the label are then appended to the list `batch`

. Until we accumulated for one batch’s size, we dispatch it from the generator.

The last four lines at the code snippet above is to dispatch a batch for training or validation. We collect the list of input data (each a 2D array) as well as a list of target label into variables `X`

and `y`

, then convert them into numpy array so it can work with our Keras model. We need to add one more dimension to the numpy array `X`

using `np.expand_dims()`

because of the design of the network model, as explained below.

The 2D CNN model presented in the original paper accepts an input tensor of shape $N\times m \times 1$ for N the number of time steps and m the number of features in each time step. The paper assumes $N=60$ and $m=82$.

The model comprises of three convolutional layers, as described as follows:

... def cnnpred_2d(seq_len=60, n_features=82, n_filters=(8,8,8), droprate=0.1): "2D-CNNpred model according to the paper" model = Sequential([ Input(shape=(seq_len, n_features, 1)), Conv2D(n_filters[0], kernel_size=(1, n_features), activation="relu"), Conv2D(n_filters[1], kernel_size=(3,1), activation="relu"), MaxPool2D(pool_size=(2,1)), Conv2D(n_filters[2], kernel_size=(3,1), activation="relu"), MaxPool2D(pool_size=(2,1)), Flatten(), Dropout(droprate), Dense(1, activation="sigmoid") ]) return model

and the model is presented by the following:

Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= conv2d (Conv2D) (None, 60, 1, 8) 664 _________________________________________________________________ conv2d_1 (Conv2D) (None, 58, 1, 8) 200 _________________________________________________________________ max_pooling2d (MaxPooling2D) (None, 29, 1, 8) 0 _________________________________________________________________ conv2d_2 (Conv2D) (None, 27, 1, 8) 200 _________________________________________________________________ max_pooling2d_1 (MaxPooling2 (None, 13, 1, 8) 0 _________________________________________________________________ flatten (Flatten) (None, 104) 0 _________________________________________________________________ dropout (Dropout) (None, 104) 0 _________________________________________________________________ dense (Dense) (None, 1) 105 ================================================================= Total params: 1,169 Trainable params: 1,169 Non-trainable params: 0

The first convolutional layer has 8 units, and is applied across all features in each time step. It is followed by a second convolutional layer to consider three consecutive days at once, for it is a common belief that three days can make a trend in the stock market. It is then applied to a max pooling layer and another convolutional layer before it is flattened into a one-dimensional array and applied to a fully-connected layer with sigmoid activation for binary classification.

That’s it for the model. The paper used MAE as the loss metric and also monitor for accuracy and F1 score to determine the quality of the model. We should point out that F1 score depends on precision and recall ratios, which are both considering the positive classification. The paper, however, consider the average of the F1 from positive and negative classification. Explicitly, it is the F1-macro metric:

$$

F_1 = \frac{1}{2}\left(

\frac{2\cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}

+

\frac{2\cdot \frac{TN}{TN+FN} \cdot \frac{TN}{TN+FP}}{\frac{TN}{TN+FN} + \frac{TN}{TN+FP}}

\right)

$$

The fraction $\frac{TP}{TP+FP}$ is the precision with TP and FP the number of true positive and false positive. Similarly $\frac{TP}{TP+FN}$ is the recall. The first term in the big parenthesis above is the normal F1 metric that considered positive classifications. And the second term is the reverse, which considered the negative classifications.

While this metric is available in scikit-learn as `sklearn.metrics.f1_score()`

there is no equivalent in Keras. Hence we would create our own by borrowing code from this stackexchange question:

from tensorflow.keras import backend as K def recall_m(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) recall = true_positives / (possible_positives + K.epsilon()) return recall def precision_m(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1))) precision = true_positives / (predicted_positives + K.epsilon()) return precision def f1_m(y_true, y_pred): precision = precision_m(y_true, y_pred) recall = recall_m(y_true, y_pred) return 2*((precision*recall)/(precision+recall+K.epsilon())) def f1macro(y_true, y_pred): f_pos = f1_m(y_true, y_pred) # negative version of the data and prediction f_neg = f1_m(1-y_true, 1-K.clip(y_pred,0,1)) return (f_pos + f_neg)/2

The training process can take hours to complete. Hence we want to save the model in the middle of the training so that we may interrupt and resume it. We can make use of checkpoint features in Keras:

checkpoint_path = "./cp2d-{epoch}-{val_f1macro:.2f}.h5" callbacks = [ ModelCheckpoint(checkpoint_path, monitor='val_f1macro', mode="max", verbose=0, save_best_only=True, save_weights_only=False, save_freq="epoch") ]

We set up a filename template `checkpoint_path`

and ask Keras to fill in the epoch number as well as validation F1 score into the filename. We save it by monitoring the validation’s F1 metric, and this metric is supposed to increase when the model gets better. Hence we pass in the `mode="max"`

to it.

It should now be trivial to train our model, as follows:

seq_len = 60 batch_size = 128 n_epochs = 20 n_features = 82 model = cnnpred_2d(seq_len, n_features) model.compile(optimizer="adam", loss="mae", metrics=["acc", f1macro]) model.fit(datagen(data, seq_len, batch_size, "Target", "train"), validation_data=datagen(data, seq_len, batch_size, "Target", "valid"), epochs=n_epochs, steps_per_epoch=400, validation_steps=10, verbose=1, callbacks=callbacks)

Two points to note in the above snippets. We supplied `"acc"`

as the accuracy as well as the function `f1macro`

defined above as the `metrics`

parameter to the `compile()`

function. Hence these two metrics will be monitored during training. Because the function is named `f1macro`

, we refer to this metric in the checkpoint’s `monitor`

parameter as `val_f1macro`

.

Separately, in the `fit()`

function, we provided the input data through the `datagen()`

generator as defined above. Calling this function will produce a generator, which during the training loop, batches are fetched from it one after another. Similarly, validation data are also provided by the generator.

Because the nature of a generator is to dispatch data indefinitely. We need to tell the training process on how to define a epoch. Recall that in Keras terms, a batch is one iteration of doing gradient descent update. An epoch is supposed to be one cycle through all data in the dataset. At the end of an epoch is the time to run validation. It is also the opportunity for running the checkpoint we defined above. As Keras has no way to infer the size of the dataset from a generator, we need to tell how many batch it should process in one epoch using the `steps_per_epoch`

parameter. Similarly, it is the `validation_steps`

parameter to tell how many batch are used in each validation step. The validation does not affect the training, but it will report to us the metrics we are interested. Below is a screenshot of what we will see in the middle of training, which we will see that the metric for training set are updated on each batch but that for validation set is provided only at the end of epoch:

Epoch 1/20 400/400 [==============================] - 43s 106ms/step - loss: 0.4062 - acc: 0.6184 - f1macro: 0.5237 - val_loss: 0.4958 - val_acc: 0.4969 - val_f1macro: 0.4297 Epoch 2/20 400/400 [==============================] - 44s 111ms/step - loss: 0.2760 - acc: 0.7489 - f1macro: 0.7304 - val_loss: 0.5007 - val_acc: 0.4984 - val_f1macro: 0.4833 Epoch 3/20 60/400 [===>..........................] - ETA: 39s - loss: 0.2399 - acc: 0.7783 - f1macro: 0.7643

After the model finished training, we can test it with unseen data, i.e., the test set. Instead of generating the test set randomly, we create it from the dataset in a deterministic way:

def testgen(data, seq_len, targetcol): "Return array of all test samples" batch = [] for key, df in data.items(): input_cols = [c for c in df.columns if c != targetcol] # find the start of test sample t = df.index[df.index >= TRAIN_TEST_CUTOFF][0] n = (df.index == t).argmax() for i in range(n+1, len(df)+1): frame = df.iloc[i-seq_len:i] batch.append([frame[input_cols].values, frame[targetcol][-1]]) X, y = zip(*batch) return np.expand_dims(np.array(X),3), np.array(y) # Prepare test data test_data, test_target = testgen(data, seq_len, "Target") # Test the model test_out = model.predict(test_data) test_pred = (test_out > 0.5).astype(int) print("accuracy:", accuracy_score(test_pred, test_target)) print("MAE:", mean_absolute_error(test_pred, test_target)) print("F1:", f1_score(test_pred, test_target))

The structure of the function `testgen()`

is resembling that of `datagen()`

we defined above. Except in `datagen()`

the output data’s first dimension is the number of samples in a batch but in `testgen()`

is the the entire test samples.

Using the model for prediction will produce a floating point between 0 and 1 as we are using the sigmoid activation function. We will convert this into 0 or 1 by using the threshold at 0.5. Then we use the functions from scikit-learn to compute the accuracy, mean absolute error and F1 score (which accuracy is just one minus the MAE).

Tying all these together, the complete code is as follows:

import os import random import numpy as np import pandas as pd import tensorflow as tf from tensorflow.keras import backend as K from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D, Input from tensorflow.keras.models import Sequential, load_model from tensorflow.keras.callbacks import ModelCheckpoint from sklearn.preprocessing import StandardScaler from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error DATADIR = "./Dataset" TRAIN_TEST_CUTOFF = '2016-04-21' TRAIN_VALID_RATIO = 0.75 # https://datascience.stackexchange.com/questions/45165/how-to-get-accuracy-f1-precision-and-recall-for-a-keras-model # to implement F1 score for validation in a batch def recall_m(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) recall = true_positives / (possible_positives + K.epsilon()) return recall def precision_m(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1))) precision = true_positives / (predicted_positives + K.epsilon()) return precision def f1_m(y_true, y_pred): precision = precision_m(y_true, y_pred) recall = recall_m(y_true, y_pred) return 2*((precision*recall)/(precision+recall+K.epsilon())) def f1macro(y_true, y_pred): f_pos = f1_m(y_true, y_pred) # negative version of the data and prediction f_neg = f1_m(1-y_true, 1-K.clip(y_pred,0,1)) return (f_pos + f_neg)/2 def cnnpred_2d(seq_len=60, n_features=82, n_filters=(8,8,8), droprate=0.1): "2D-CNNpred model according to the paper" model = Sequential([ Input(shape=(seq_len, n_features, 1)), Conv2D(n_filters[0], kernel_size=(1, n_features), activation="relu"), Conv2D(n_filters[1], kernel_size=(3,1), activation="relu"), MaxPool2D(pool_size=(2,1)), Conv2D(n_filters[2], kernel_size=(3,1), activation="relu"), MaxPool2D(pool_size=(2,1)), Flatten(), Dropout(droprate), Dense(1, activation="sigmoid") ]) return model def datagen(data, seq_len, batch_size, targetcol, kind): "As a generator to produce samples for Keras model" batch = [] while True: # Pick one dataframe from the pool key = random.choice(list(data.keys())) df = data[key] input_cols = [c for c in df.columns if c != targetcol] index = df.index[df.index < TRAIN_TEST_CUTOFF] split = int(len(index) * TRAIN_VALID_RATIO) assert split > seq_len, "Training data too small for sequence length {}".format(seq_len) if kind == 'train': index = index[:split] # range for the training set elif kind == 'valid': index = index[split:] # range for the validation set else: raise NotImplementedError # Pick one position, then clip a sequence length while True: t = random.choice(index) # pick one time step n = (df.index == t).argmax() # find its position in the dataframe if n-seq_len+1 < 0: continue # this sample is not enough for one sequence length frame = df.iloc[n-seq_len+1:n+1] batch.append([frame[input_cols].values, df.loc[t, targetcol]]) break # if we get enough for a batch, dispatch if len(batch) == batch_size: X, y = zip(*batch) X, y = np.expand_dims(np.array(X), 3), np.array(y) yield X, y batch = [] def testgen(data, seq_len, targetcol): "Return array of all test samples" batch = [] for key, df in data.items(): input_cols = [c for c in df.columns if c != targetcol] # find the start of test sample t = df.index[df.index >= TRAIN_TEST_CUTOFF][0] n = (df.index == t).argmax() # extract sample using a sliding window for i in range(n+1, len(df)+1): frame = df.iloc[i-seq_len:i] batch.append([frame[input_cols].values, frame[targetcol][-1]]) X, y = zip(*batch) return np.expand_dims(np.array(X),3), np.array(y) # Read data into pandas DataFrames data = {} for filename in os.listdir(DATADIR): if not filename.lower().endswith(".csv"): continue # read only the CSV files filepath = os.path.join(DATADIR, filename) X = pd.read_csv(filepath, index_col="Date", parse_dates=True) # basic preprocessing: get the name, the classification # Save the target variable as a column in dataframe for easier dropna() name = X["Name"][0] del X["Name"] cols = X.columns X["Target"] = (X["Close"].pct_change().shift(-1) > 0).astype(int) X.dropna(inplace=True) # Fit the standard scaler using the training dataset index = X.index[X.index < TRAIN_TEST_CUTOFF] index = index[:int(len(index) * TRAIN_VALID_RATIO)] scaler = StandardScaler().fit(X.loc[index, cols]) # Save scale transformed dataframe X[cols] = scaler.transform(X[cols]) data[name] = X seq_len = 60 batch_size = 128 n_epochs = 20 n_features = 82 # Produce CNNpred as a binary classification problem model = cnnpred_2d(seq_len, n_features) model.compile(optimizer="adam", loss="mae", metrics=["acc", f1macro]) model.summary() # print model structure to console # Set up callbacks and fit the model # We use custom validation score f1macro() and hence monitor for "val_f1macro" checkpoint_path = "./cp2d-{epoch}-{val_f1macro:.2f}.h5" callbacks = [ ModelCheckpoint(checkpoint_path, monitor='val_f1macro', mode="max", verbose=0, save_best_only=True, save_weights_only=False, save_freq="epoch") ] model.fit(datagen(data, seq_len, batch_size, "Target", "train"), validation_data=datagen(data, seq_len, batch_size, "Target", "valid"), epochs=n_epochs, steps_per_epoch=400, validation_steps=10, verbose=1, callbacks=callbacks) # Prepare test data test_data, test_target = testgen(data, seq_len, "Target") # Test the model test_out = model.predict(test_data) test_pred = (test_out > 0.5).astype(int) print("accuracy:", accuracy_score(test_pred, test_target)) print("MAE:", mean_absolute_error(test_pred, test_target)) print("F1:", f1_score(test_pred, test_target))

The original paper called the above model “2D-CNNpred” and there is a version called “3D-CNNpred”. The idea is not only consider the many features of one stock market index but cross compare with many market indices to help prediction on **one index**. Refer to the table of features and time steps above, the data for one market index is presented as 2D array. If we stack up multiple such data from different indices, we constructed a 3D array. While the target label is the same, but allowing us to look at a different market may provide some additional information to help prediction.

Because the shape of the data changed, the convolutional network also defined slightly different, and the data generators need some modification accordingly as well. Below is the complete code of the 3D version, which the change from the previous 2d version should be self-explanatory:

import os import random import numpy as np import pandas as pd import tensorflow as tf from tensorflow.keras import backend as K from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D, Input from tensorflow.keras.models import Sequential, load_model from tensorflow.keras.callbacks import ModelCheckpoint from sklearn.preprocessing import StandardScaler from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error DATADIR = "./Dataset" TRAIN_TEST_CUTOFF = '2016-04-21' TRAIN_VALID_RATIO = 0.75 # https://datascience.stackexchange.com/questions/45165/how-to-get-accuracy-f1-precision-and-recall-for-a-keras-model # to implement F1 score for validation in a batch def recall_m(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) recall = true_positives / (possible_positives + K.epsilon()) return recall def precision_m(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1))) precision = true_positives / (predicted_positives + K.epsilon()) return precision def f1_m(y_true, y_pred): precision = precision_m(y_true, y_pred) recall = recall_m(y_true, y_pred) return 2*((precision*recall)/(precision+recall+K.epsilon())) def f1macro(y_true, y_pred): f_pos = f1_m(y_true, y_pred) # negative version of the data and prediction f_neg = f1_m(1-y_true, 1-K.clip(y_pred,0,1)) return (f_pos + f_neg)/2 def cnnpred_3d(seq_len=60, n_stocks=5, n_features=82, n_filters=(8,8,8), droprate=0.1): "3D-CNNpred model according to the paper" model = Sequential([ Input(shape=(n_stocks, seq_len, n_features)), Conv2D(n_filters[0], kernel_size=(1,1), activation="relu", data_format="channels_last"), Conv2D(n_filters[1], kernel_size=(n_stocks,3), activation="relu"), MaxPool2D(pool_size=(1,2)), Conv2D(n_filters[2], kernel_size=(1,3), activation="relu"), MaxPool2D(pool_size=(1,2)), Flatten(), Dropout(droprate), Dense(1, activation="sigmoid") ]) return model def datagen(data, seq_len, batch_size, target_index, targetcol, kind): "As a generator to produce samples for Keras model" # Learn about the data's features and time axis input_cols = [c for c in data.columns if c[0] != targetcol] tickers = sorted(set(c for _,c in input_cols)) n_features = len(input_cols) // len(tickers) index = data.index[data.index < TRAIN_TEST_CUTOFF] split = int(len(index) * TRAIN_VALID_RATIO) assert split > seq_len, "Training data too small for sequence length {}".format(seq_len) if kind == "train": index = index[:split] # range for the training set elif kind == 'valid': index = index[split:] # range for the validation set else: raise NotImplementedError # Infinite loop to generate a batch batch = [] while True: # Pick one position, then clip a sequence length while True: t = random.choice(index) n = (data.index == t).argmax() if n-seq_len+1 < 0: continue # this sample is not enough for one sequence length frame = data.iloc[n-seq_len+1:n+1][input_cols] # convert frame with two level of indices into 3D array shape = (len(tickers), len(frame), n_features) X = np.full(shape, np.nan) for i,ticker in enumerate(tickers): X[i] = frame.xs(ticker, axis=1, level=1).values batch.append([X, data[targetcol][target_index][t]]) break # if we get enough for a batch, dispatch if len(batch) == batch_size: X, y = zip(*batch) yield np.array(X), np.array(y) batch = [] def testgen(data, seq_len, target_index, targetcol): "Return array of all test samples" input_cols = [c for c in data.columns if c[0] != targetcol] tickers = sorted(set(c for _,c in input_cols)) n_features = len(input_cols) // len(tickers) t = data.index[data.index >= TRAIN_TEST_CUTOFF][0] n = (data.index == t).argmax() batch = [] for i in range(n+1, len(data)+1): # Clip a window of seq_len ends at row position i-1 frame = data.iloc[i-seq_len:i] target = frame[targetcol][target_index][-1] frame = frame[input_cols] # convert frame with two level of indices into 3D array shape = (len(tickers), len(frame), n_features) X = np.full(shape, np.nan) for i,ticker in enumerate(tickers): X[i] = frame.xs(ticker, axis=1, level=1).values batch.append([X, target]) X, y = zip(*batch) return np.array(X), np.array(y) # Read data into pandas DataFrames data = {} for filename in os.listdir(DATADIR): if not filename.lower().endswith(".csv"): continue # read only the CSV files filepath = os.path.join(DATADIR, filename) X = pd.read_csv(filepath, index_col="Date", parse_dates=True) # basic preprocessing: get the name, the classification # Save the target variable as a column in dataframe for easier dropna() name = X["Name"][0] del X["Name"] cols = X.columns X["Target"] = (X["Close"].pct_change().shift(-1) > 0).astype(int) X.dropna(inplace=True) # Fit the standard scaler using the training dataset index = X.index[X.index < TRAIN_TEST_CUTOFF] index = index[:int(len(index) * TRAIN_VALID_RATIO)] scaler = StandardScaler().fit(X.loc[index, cols]) # Save scale transformed dataframe X[cols] = scaler.transform(X[cols]) data[name] = X # Transform data into 3D dataframe (multilevel columns) for key, df in data.items(): df.columns = pd.MultiIndex.from_product([df.columns, [key]]) data = pd.concat(data.values(), axis=1) seq_len = 60 batch_size = 128 n_epochs = 20 n_features = 82 n_stocks = 5 # Produce CNNpred as a binary classification problem model = cnnpred_3d(seq_len, n_stocks, n_features) model.compile(optimizer="adam", loss="mae", metrics=["acc", f1macro]) model.summary() # print model structure to console # Set up callbacks and fit the model # We use custom validation score f1macro() and hence monitor for "val_f1macro" checkpoint_path = "./cp3d-{epoch}-{val_f1macro:.2f}.h5" callbacks = [ ModelCheckpoint(checkpoint_path, monitor='val_f1macro', mode="max", verbose=0, save_best_only=True, save_weights_only=False, save_freq="epoch") ] model.fit(datagen(data, seq_len, batch_size, "DJI", "Target", "train"), validation_data=datagen(data, seq_len, batch_size, "DJI", "Target", "valid"), epochs=n_epochs, steps_per_epoch=400, validation_steps=10, verbose=1, callbacks=callbacks) # Prepare test data test_data, test_target = testgen(data, seq_len, "DJI", "Target") # Test the model test_out = model.predict(test_data) test_pred = (test_out > 0.5).astype(int) print("accuracy:", accuracy_score(test_pred, test_target)) print("MAE:", mean_absolute_error(test_pred, test_target)) print("F1:", f1_score(test_pred, test_target))

While the model above is for next-step prediction, it does not stop you from making prediction for k steps ahead if you replace the target label to a different calculation. This may be an exercise for you.

As in all prediction projects in the financial market, it is always unrealistic to expect a high accuracy. The training parameter in the code above can produce slightly more than 50% accuracy in the testing set. While the number of epochs and batch size are deliberately set smaller to save time, there should not be much room for improvement.

In the original paper, it is reported that the 3D-CNNpred performed better than 2D-CNNpred but only attaining the F1 score of less than 0.6. This is already doing better than three baseline models mentioned in the paper. It may be of some use, but not a magic that can help you make money quick.

From machine learning technique perspective, here we classify a panel of data into whether the market direction is up or down the next day. Hence while the data is not an image, it resembles one since both are presented in the form of a 2D array. The technique of convolutional layers can therefore applied, but we may use a different filter size to match the intuition we usually have for financial time series.

The original paper is available at:

- “CNNPred: CNN-based stock market prediction using several data sources”, by Ehsan Hoseinzade, Saman Haratizadeh, 2019.

(https://arxiv.org/abs/1810.08923)

If you are new to finance application and want to build the connection between machine learning techniques and finance, you may find this book useful:

*Machine Learning in Finance: From Theory to Practice*, by Matthew F. Dixon, Igor Halperin, and Paul Bilokon. 2000.

(https://www.amazon.com/dp/3030410676/)

On the similar topic, we have a previous post on using CNN for time series, but using 1D convolutional layers;

You may also find the following documentation helpful to explain some syntax we used above:

- Panads user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
- Keras model training API: https://keras.io/api/models/model_training_apis/
- Keras callbacks API: https://keras.io/api/callbacks/

In this tutorial, you discovered how a CNN model can be built for prediction in financial time series.

Specifically, you learned:

- How to create 2D convolutional layers to process the time series
- How to present the time series data in a multidimensional array so that the convolutional layers can be applied
- What is a data generator for Keras model training and how to use it
- How to monitor the performance of model training with a custom metric
- What to expect in predicting financial market

The post Using CNN for financial time series prediction appeared first on Machine Learning Mastery.

]]>The post The Transformer Model appeared first on Machine Learning Mastery.

]]>We have already familiarized ourselves with the concept of self-attention as implemented by the Transformer attention mechanism for neural machine translation. We will now be shifting our focus on the details of the Transformer architecture itself, to discover how self-attention can be implemented without relying on the use of recurrence and convolutions.

In this tutorial, you will discover the network architecture of the Transformer model.

After completing this tutorial, you will know:

- How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions.
- How the Transformer encoder and decoder work.
- How the Transformer self-attention compares to the use of recurrent and convolutional layers.

Let’s get started.

This tutorial is divided into three parts; they are:

- The Transformer Architecture
- The Encoder
- The Decoder

- Sum Up: The Transformer Model
- Comparison to Recurrent and Convolutional Layers

For this tutorial, we assume that you are already familiar with:

The Transformer architecture follows an encoder-decoder structure, but does not rely on recurrence and convolutions in order to generate an output.

In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder.

The decoder, on the right half of the architecture, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence.

At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

–Attention Is All You Need, 2017.

The encoder consists of a stack of $N$ = 6 identical layers, where each layer is composed of two sublayers:

- The first sublayer implements a multi-head self-attention mechanism. We had seen that the multi-head mechanism implements $h$ heads that receive a (different) linearly projected version of the queries, keys and values each, to produce $h$ outputs in parallel that are then used to generate a final result.

- The second sublayer is a fully connected feed-forward network, consisting of two linear transformations with Rectified Linear Unit (ReLU) activation in between:

$$\text{FFN}(x) = \text{ReLU}(\mathbf{W}_1 x + b_1) \mathbf{W}_2 + b_2$$

The six layers of the Transformer encoder apply the same linear transformations to all of the words in the input sequence, but *each* layer employs different weight ($\mathbf{W}_1, \mathbf{W}_2$) and bias ($b_1, b_2$) parameters to do so.

Furthermore, each of these two sublayers has a residual connection around it.

Each sublayer is also succeeded by a normalization layer, $\text{layernorm}(.)$, which normalizes the sum computed between the sublayer input, $x$, and the output generated by the sublayer itself, $\text{sublayer}(x)$:

$$\text{layernorm}(x + \text{sublayer}(x))$$

An important consideration to keep in mind is that the Transformer architecture cannot inherently capture any information about the relative positions of the words in the sequence, since it does not make use of recurrence. This information has to be injected by introducing *positional encodings* to the input embeddings.

The positional encoding vectors are of the same dimension as the input embeddings, and are generated using sine and cosine functions of different frequencies. Then, they are simply summed to the input embeddings in order to *inject* the positional information.

The decoder shares several similarities with the encoder.

The decoder also consists of a stack of $N$ = 6 identical layers that are, each, composed of three sublayers:

- The first sublayer receives the previous output of the decoder stack, augments it with positional information, and implements multi-head self-attention over it. While the encoder is designed to attend to all words in the input sequence,
*regardless*of their position in the sequence, the decoder is modified to attend*only*to the preceding words. Hence, the prediction for a word at position, $i$, can only depend on the known outputs for the words that come before it in the sequence. In the multi-head attention mechanism (which implements multiple, single attention functions in parallel), this is achieved by introducing a mask over the values produced by the scaled multiplication of matrices $\mathbf{Q}$ and $\mathbf{K}$. This masking is implemented by suppressing the matrix values that would, otherwise, correspond to illegal connections:

$$

\text{mask}(\mathbf{QK}^T) =

\text{mask} \left( \begin{bmatrix}

e_{11} & e_{12} & \dots & e_{1n} \\

e_{21} & e_{22} & \dots & e_{2n} \\

\vdots & \vdots & \ddots & \vdots \\

e_{m1} & e_{m2} & \dots & e_{mn} \\

\end{bmatrix} \right) =

\begin{bmatrix}

e_{11} & -\infty & \dots & -\infty \\

e_{21} & e_{22} & \dots & -\infty \\

\vdots & \vdots & \ddots & \vdots \\

e_{m1} & e_{m2} & \dots & e_{mn} \\

\end{bmatrix}

$$

The masking makes the decoder unidirectional (unlike the bidirectional encoder).

–Advanced Deep Learning with Python, 2019.

- The second layer implements a multi-head self-attention mechanism, which is similar to the one implemented in the first sublayer of the encoder. On the decoder side, this multi-head mechanism receives the queries from the previous decoder sublayer, and the keys and values from the output of the encoder. This allows the decoder to attend to all of the words in the input sequence.

- The third layer implements a fully connected feed-forward network, which is similar to the one implemented in the second sublayer of the encoder.

Furthermore, the three sublayers on the decoder side also have residual connections around them, and are succeeded by a normalization layer.

Positional encodings are also added to the input embeddings of the decoder, in the same manner as previously explained for the encoder.

The Transformer model runs as follows:

- Each word forming an input sequence is transformed into a $d_{\text{model}}$-dimensional embedding vector.

- Each embedding vector representing an input word is augmented by summing it (element-wise) to a positional encoding vector of the same $d_{\text{model}}$ length, hence introducing positional information into the input.

- The augmented embedding vectors are fed into the encoder block, consisting of the two sublayers explained above. Since the encoder attends to all words in the input sequence, irrespective if they precede or succeed the word under consideration, then the Transformer encoder is
*bidirectional*.

- The decoder receives as input its own predicted output word at time-step, $t – 1$.

- The input to the decoder is also augmented by positional encoding, in the same manner as this is done on the encoder side.

- The augmented decoder input is fed into the three sublayers comprising the decoder block explained above. Masking is applied in the first sublayer, in order to stop the decoder from attending to succeeding words. At the second sublayer, the decoder also receives the output of the encoder, which now allows the decoder to attend to all of the words in the input sequence.

- The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.

Vaswani et al. (2017) explain that their motivation for abandoning the use of recurrence and convolutions was based on several factors:

- Self-attention layers were found to be faster than recurrent layers for shorter sequence lengths, and can be restricted to consider only a neighbourhood in the input sequence for very long sequence lengths.

- The number of sequential operations required by a recurrent layer is based upon the sequence length, whereas this number remains constant for a self-attention layer.

- In convolutional neural networks, the kernel width directly affects the long-term dependencies that can be established between pairs of input and output positions. Tracking long-term dependencies would require the use of large kernels, or stacks of convolutional layers that could increase the computational cost.

This section provides more resources on the topic if you are looking to go deeper.

- Attention Is All You Need, 2017.

In this tutorial, you discovered the network architecture of the Transformer model.

Specifically, you learned:

- How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions.
- How the Transformer encoder and decoder work.
- How the Transformer self-attention compares to recurrent and convolutional layers.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post The Transformer Model appeared first on Machine Learning Mastery.

]]>The post The Transformer Attention Mechanism appeared first on Machine Learning Mastery.

]]>Before the introduction of the Transformer model, the use of attention for neural machine translation was being implemented by RNN-based encoder-decoder architectures. The Transformer model revolutionized the implementation of attention by dispensing of recurrence and convolutions and, alternatively, relying solely on a self-attention mechanism.

We will first be focusing on the Transformer attention mechanism in this tutorial, and subsequently reviewing the Transformer model in a separate one.

In this tutorial, you will discover the Transformer attention mechanism for neural machine translation.

After completing this tutorial, you will know:

- How the Transformer attention differed from its predecessors.
- How the Transformer computes a scaled-dot product attention.
- How the Transformer computes multi-head attention.

Let’s get started.

This tutorial is divided into two parts; they are:

- Introduction to the Transformer Attention
- The Transformer Attention
- Scaled-Dot Product Attention
- Multi-Head Attention

For this tutorial, we assume that you are already familiar with:

- The concept of attention
- The attention mechanism
- The Bahdanau attention mechanism
- The Luong attention mechanism

We have, thus far, familiarised ourselves with the use of an attention mechanism in conjunction with an RNN-based encoder-decoder architecture. We have seen that two of the most popular models that implement attention in this manner have been those proposed by Bahdanau et al. (2014) and Luong et al. (2015).

The Transformer architecture revolutionized the use of attention by dispensing of recurrence and convolutions, on which the formers had extensively relied.

… the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

–Attention Is All You Need, 2017.

In their paper, Attention Is All You Need, Vaswani et al. (2017) explain that the Transformer model, alternatively, relies solely on the use of self-attention, where the representation of a sequence (or sentence) is computed by relating different words in the same sequence.

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

–Attention Is All You Need, 2017.

**The Transformer Attention**

The main components in use by the Transformer attention are the following:

- $\mathbf{q}$ and $\mathbf{k}$ denoting vectors of dimension, $d_k$, containing the queries and keys, respectively.
- $\mathbf{v}$ denoting a vector of dimension, $d_v$, containing the values.
- $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ denoting matrices packing together sets of queries, keys and values, respectively.
- $\mathbf{W}^Q$, $\mathbf{W}^K$ and $\mathbf{W}^V$ denoting projection matrices that are used in generating different subspace representations of the query, key and value matrices.
- $\mathbf{W}^O$ denoting a projection matrix for the multi-head output.

In essence, the attention function can be considered as a mapping between a query and a set of key-value pairs, to an output.

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

–Attention Is All You Need, 2017.

Vaswani et al. propose a *scaled dot-product attention*, and then build on it to propose *multi-head attention*. Within the context of neural machine translation, the query, keys and values that are used as inputs to the these attention mechanisms, are different projections of the same input sentence.

Intuitively, therefore, the proposed attention mechanisms implement self-attention by capturing the relationships between the different elements (in this case, the words) of the same sentence.

The Transformer implements a scaled dot-product attention, which follows the procedure of the general attention mechanism that we had previously seen.

As the name suggests, the scaled dot-product attention first computes a *dot product* for each query, $\mathbf{q}$, with all of the keys, $\mathbf{k}$. It, subsequently, divides each result by $\sqrt{d_k}$ and proceeds to apply a softmax function. In doing so, it obtains the weights that are used to *scale* the values, $\mathbf{v}$.

In practice, the computations performed by the scaled dot-product attention can be efficiently applied on the entire set of queries simultaneously. In order to do so, the matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$, are supplied as inputs to the attention function:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

Vaswani et al. explain that their scaled dot-product attention is identical to the multiplicative attention of Luong et al. (2015), except for the added scaling factor of $\tfrac{1}{\sqrt{d_k}}$.

This scaling factor was introduced to counteract the effect of having the dot products grow large in magnitude for large values of $d_k$, where the application of the softmax function would then return extremely small gradients that would lead to the infamous vanishing gradients problem. The scaling factor, therefore, serves to pull the results generated by the dot product multiplication down, hence preventing this problem.

Vaswani et al. further explain that their choice of opting for multiplicative attention instead of the additive attention of Bahdanau et al. (2014), was based on the computational efficiency associated with the former.

… dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

–Attention Is All You Need, 2017.

The step-by-step procedure for computing the scaled-dot product attention is, therefore, the following:

- Compute the alignment scores by multiplying the set of queries packed in matrix, $\mathbf{Q}$,with the keys in matrix, $\mathbf{K}$. If matrix, $\mathbf{Q}$, is of size $m \times d_k$ and matrix, $\mathbf{K}$, is of size, $n \times d_k$, then the resulting matrix will be of size $m \times n$:

$$

\mathbf{QK}^T =

\begin{bmatrix}

e_{11} & e_{12} & \dots & e_{1n} \\

e_{21} & e_{22} & \dots & e_{2n} \\

\vdots & \vdots & \ddots & \vdots \\

e_{m1} & e_{m2} & \dots & e_{mn} \\

\end{bmatrix}

$$

- Scale each of the alignment scores by $\tfrac{1}{\sqrt{d_k}}$:

$$

\frac{\mathbf{QK}^T}{\sqrt{d_k}} =

\begin{bmatrix}

\tfrac{e_{11}}{\sqrt{d_k}} & \tfrac{e_{12}}{\sqrt{d_k}} & \dots & \tfrac{e_{1n}}{\sqrt{d_k}} \\

\tfrac{e_{21}}{\sqrt{d_k}} & \tfrac{e_{22}}{\sqrt{d_k}} & \dots & \tfrac{e_{2n}}{\sqrt{d_k}} \\

\vdots & \vdots & \ddots & \vdots \\

\tfrac{e_{m1}}{\sqrt{d_k}} & \tfrac{e_{m2}}{\sqrt{d_k}} & \dots & \tfrac{e_{mn}}{\sqrt{d_k}} \\

\end{bmatrix}

$$

- And follow the scaling process by applying a softmax operation in order to obtain a set of weights:

$$

\text{softmax} \left( \frac{\mathbf{QK}^T}{\sqrt{d_k}} \right) =

\begin{bmatrix}

\text{softmax} \left( \tfrac{e_{11}}{\sqrt{d_k}} \right) & \text{softmax} \left( \tfrac{e_{12}}{\sqrt{d_k}} \right) & \dots & \text{softmax} \left( \tfrac{e_{1n}}{\sqrt{d_k}} \right) \\

\text{softmax} \left( \tfrac{e_{21}}{\sqrt{d_k}} \right) & \text{softmax} \left( \tfrac{e_{22}}{\sqrt{d_k}} \right) & \dots & \text{softmax} \left( \tfrac{e_{2n}}{\sqrt{d_k}} \right) \\

\vdots & \vdots & \ddots & \vdots \\

\text{softmax} \left( \tfrac{e_{m1}}{\sqrt{d_k}} \right) & \text{softmax} \left( \tfrac{e_{m2}}{\sqrt{d_k}} \right) & \dots & \text{softmax} \left( \tfrac{e_{mn}}{\sqrt{d_k}} \right) \\

\end{bmatrix}

$$

- Finally, apply the resulting weights to the values in matrix, $\mathbf{V}$, of size, $n \times d_v$:

$$

\begin{aligned}

& \text{softmax} \left( \frac{\mathbf{QK}^T}{\sqrt{d_k}} \right) \cdot \mathbf{V} \\

=&

\begin{bmatrix}

\text{softmax} \left( \tfrac{e_{11}}{\sqrt{d_k}} \right) & \text{softmax} \left( \tfrac{e_{12}}{\sqrt{d_k}} \right) & \dots & \text{softmax} \left( \tfrac{e_{1n}}{\sqrt{d_k}} \right) \\

\text{softmax} \left( \tfrac{e_{21}}{\sqrt{d_k}} \right) & \text{softmax} \left( \tfrac{e_{22}}{\sqrt{d_k}} \right) & \dots & \text{softmax} \left( \tfrac{e_{2n}}{\sqrt{d_k}} \right) \\

\vdots & \vdots & \ddots & \vdots \\

\text{softmax} \left( \tfrac{e_{m1}}{\sqrt{d_k}} \right) & \text{softmax} \left( \tfrac{e_{m2}}{\sqrt{d_k}} \right) & \dots & \text{softmax} \left( \tfrac{e_{mn}}{\sqrt{d_k}} \right) \\

\end{bmatrix}

\cdot

\begin{bmatrix}

v_{11} & v_{12} & \dots & v_{1d_v} \\

v_{21} & v_{22} & \dots & v_{2d_v} \\

\vdots & \vdots & \ddots & \vdots \\

v_{n1} & v_{n2} & \dots & v_{nd_v} \\

\end{bmatrix}

\end{aligned}

$$

Building on their single attention function that takes matrices, $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, as input, as we have just reviewed, Vaswani et al. also propose a multi-head attention mechanism.

Their multi-head attention mechanism linearly projects the queries, keys and values $h$ times, each time using a different learned projection. The single attention mechanism is then applied to each of these $h$ projections in parallel, to produce $h$ outputs, which in turn are concatenated and projected again to produce a final result.

The idea behind multi-head attention is to allow the attention function to extract information from different representation subspaces, which would, otherwise, not be possible with a single attention head.

The multi-head attention function can be represented as follows:

$$\text{multihead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{concat}(\text{head}_1, \dots, \text{head}_h) \mathbf{W}^O$$

Here, each $\text{head}_i$, $i = 1, \dots, h$, implements a single attention function characterized by its own learned projection matrices:

$$\text{head}_i = \text{attention}(\mathbf{QW}^Q_i, \mathbf{KW}^K_i, \mathbf{VW}^V_i)$$

The step-by-step procedure for computing multi-head attention is, therefore, the following:

- Compute the linearly projected versions of the queries, keys and values through a multiplication with the respective weight matrices, $\mathbf{W}^Q_i$, $\mathbf{W}^K_i$ and $\mathbf{W}^V_i$, one for each $\text{head}_i$.

- Apply the single attention function for each head by (1) multiplying the queries and keys matrices, (2) applying the scaling and softmax operations, and (3) weighting the values matrix, to generate an output for each head.

- Concatenate the outputs of the heads, $\text{head}_i$, $i = 1, \dots, h$.

- Apply a linear projection to the concatenated output through a multiplication with the weight matrix, $\mathbf{W}^O$, to generate the final result.

This section provides more resources on the topic if you are looking to go deeper.

- Attention Is All You Need, 2017.
- Neural Machine Translation by Jointly Learning to Align and Translate, 2014.
- Effective Approaches to Attention-based Neural Machine Translation, 2015.

In this tutorial, you discovered the Transformer attention mechanism for neural machine translation.

Specifically, you learned:

- How the Transformer attention differed from its predecessors.
- How the Transformer computes a scaled-dot product attention.
- How the Transformer computes multi-head attention.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post The Transformer Attention Mechanism appeared first on Machine Learning Mastery.

]]>The post Face Recognition using Principal Component Analysis appeared first on Machine Learning Mastery.

]]>Last Updated on October 30, 2021

Recent advance in machine learning has made face recognition not a difficult problem. But in the previous, researchers have made various attempts and developed various skills to make computer capable of identifying people. One of the early attempt with moderate success is **eigenface**, which is based on linear algebra techniques.

In this tutorial, we will see how we can build a primitive face recognition system with some simple linear algebra technique such as principal component analysis.

After completing this tutorial, you will know:

- The development of eigenface technique
- How to use principal component analysis to extract characteristic images from an image dataset
- How to express any image as a weighted sum of the characteristic images
- How to compare the similarity of images from the weight of principal components

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Image and Face Recognition
- Overview of Eigenface
- Implementing Eigenface

In computer, pictures are represented as a matrix of pixels, with each pixel a particular color coded in some numerical values. It is natural to ask if computer can read the picture and understand what it is, and if so, whether we can describe the logic using matrix mathematics. To be less ambitious, people try to limit the scope of this problem to identifying human faces. An early attempt for face recognition is to consider the matrix as a high dimensional detail and we infer a lower dimension information vector from it, then try to recognize the person in lower dimension. It was necessary in the old time because the computer was not powerful and the amount of memory is very limited. However, by exploring how to **compress** image to a much smaller size, we developed a skill to compare if two images are portraying the same human face even if the pictures are not identical.

In 1987, a paper by Sirovich and Kirby considered the idea that all pictures of human face to be a weighted sum of a few “key pictures”. Sirovich and Kirby called these key pictures the “eigenpictures”, as they are the eigenvectors of the covariance matrix of the mean-subtracted pictures of human faces. In the paper they indeed provided the algorithm of principal component analysis of the face picture dataset in its matrix form. And the weights used in the weighted sum indeed correspond to the projection of the face picture into each eigenpicture.

In 1991, a paper by Turk and Pentland coined the term “eigenface”. They built on top of the idea of Sirovich and Kirby and use the weights and eigenpictures as characteristic features to recognize faces. The paper by Turk and Pentland laid out a memory-efficient way to compute the eigenpictures. It also proposed an algorithm on how the face recognition system can operate, including how to update the system to include new faces and how to combine it with a video capture system. The same paper also pointed out that the concept of eigenface can help reconstruction of partially obstructed picture.

Before we jump into the code, let’s outline the steps in using eigenface for face recognition, and point out how some simple linear algebra technique can help the task.

Assume we have a bunch of pictures of human faces, all in the same pixel dimension (e.g., all are r×c grayscale images). If we get M different pictures and **vectorize** each picture into L=r×c pixels, we can present the entire dataset as a L×M matrix (let’s call it matrix $A$), where each element in the matrix is the pixel’s grayscale value.

Recall that principal component analysis (PCA) can be applied to any matrix, and the result is a number of vectors called the **principal components**. Each principal component has the length same as the column length of the matrix. The different principal components from the same matrix are orthogonal to each other, meaning that the vector dot-product of any two of them is zero. Therefore the various principal components constructed a vector space for which each column in the matrix can be represented as a linear combination (i.e., weighted sum) of the principal components.

The way it is done is to first take $C=A – a$ where $a$ is the mean vector of the matrix $A$. So $C$ is the matrix that subtract each column of $A$ with the mean vector $a$. Then the covariance matrix is

$$S = C\cdot C^T$$

from which we find its eigenvectors and eigenvalues. The principal components are these eigenvectors in decreasing order of the eigenvalues. Because matrix $S$ is a L×L matrix, we may consider to find the eigenvectors of a M×M matrix $C^T\cdot C$ instead as the eigenvector $v$ for $C^T\cdot C$ can be transformed into eigenvector $u$ of $C\cdot C^T$ by $u=C\cdot v$, except we usually prefer to write $u$ as normalized vector (i.e., norm of $u$ is 1).

The physical meaning of the principal component vectors of $A$, or equivalently the eigenvectors of $S=C\cdot C^T$, is that they are the key directions that we can construct the columns of matrix $A$. The relative importance of the different principal component vectors can be inferred from the corresponding eigenvalues. The greater the eigenvalue, the more useful (i.e., holds more information about $A$) the principal component vector. Hence we can keep only the first K principal component vectors. If matrix $A$ is the dataset for face pictures, the first K principal component vectors are the top K most important “face pictures”. We call them the **eigenface** picture.

For any given face picture, we can project its mean-subtracted version onto the eigenface picture using vector dot-product. The result is how close this face picture is related to the eigenface. If the face picture is totally unrelated to the eigenface, we would expect its result is zero. For the K eigenfaces, we can find K dot-product for any given face picture. We can present the result as **weights** of this face picture with respect to the eigenfaces. The weight is usually presented as a vector.

Conversely, if we have a weight vector, we can add up each eigenfaces subjected to the weight and reconstruct a new face. Let’s denote the eigenfaces as matrix $F$, which is a L×K matrix, and the weight vector $w$ is a column vector. Then for any $w$ we can construct the picture of a face as

$$z=F\cdot w$$

which $z$ is resulted as a column vector of length L. Because we are only using the top K principal component vectors, we should expect the resulting face picture is distorted but retained some facial characteristic.

Since the eigenface matrix is constant for the dataset, a varying weight vector $w$ means a varying face picture. Therefore we can expect the pictures of the same person would provide similar weight vectors, even if the pictures are not identical. As a result, we may make use of the distance between two weight vectors (such as the L2-norm) as a metric of how two pictures resemble.

Now we attempt to implement the idea of eigenface with numpy and scikit-learn. We will also make use of OpenCV to read picture files. You may need to install the relevant package with `pip`

command:

pip install opencv-python

The dataset we use are the ORL Database of Faces, which is quite of age but we can download it from Kaggle:

The file is a zip file of around 4MB. It has pictures of 40 persons and each person has 10 pictures. Total to 400 pictures. In the following we assumed the file is downloaded to the local directory and named as `attface.zip`

.

We may extract the zip file to get the pictures, or we can also make use of the `zipfile`

package in Python to read the contents from the zip file directly:

import cv2 import zipfile import numpy as np faces = {} with zipfile.ZipFile("attface.zip") as facezip: for filename in facezip.namelist(): if not filename.endswith(".pgm"): continue # not a face picture with facezip.open(filename) as image: # If we extracted files from zip, we can use cv2.imread(filename) instead faces[filename] = cv2.imdecode(np.frombuffer(image.read(), np.uint8), cv2.IMREAD_GRAYSCALE)

The above is to read every PGM file in the zip. PGM is a grayscale image file format. We extract each PGM file into a byte string through `image.read()`

and convert it into a numpy array of bytes. Then we use OpenCV to decode the byte string into an array of pixels using `cv2.imdecode()`

. The file format will be detected automatically by OpenCV. We save each picture into a Python dictionary `faces`

for later use.

Here we can take a look on these picture of human faces, using matplotlib:

... import matplotlib.pyplot as plt fig, axes = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,10)) faceimages = list(faces.values())[-16:] # take last 16 images for i in range(16): axes[i%4][i//4].imshow(faceimages[i], cmap="gray") plt.show()

We can also find the pixel size of each picture:

... faceshape = list(faces.values())[0].shape print("Face image shape:", faceshape)

Face image shape: (112, 92)

The pictures of faces are identified by their file name in the Python dictionary. We can take a peek on the filenames:

... print(list(faces.keys())[:5])

['s1/1.pgm', 's1/10.pgm', 's1/2.pgm', 's1/3.pgm', 's1/4.pgm']

and therefore we can put faces of the same person into the same class. There are 40 classes and totally 400 pictures:

... classes = set(filename.split("/")[0] for filename in faces.keys()) print("Number of classes:", len(classes)) print("Number of pictures:", len(faces))

Number of classes: 40 Number of pictures: 400

To illustrate the capability of using eigenface for recognition, we want to hold out some of the pictures before we generate our eigenfaces. We hold out all the pictures of one person as well as one picture for another person as our test set. The remaining pictures are vectorized and converted into a 2D numpy array:

... # Take classes 1-39 for eigenfaces, keep entire class 40 and # image 10 of class 39 as out-of-sample test facematrix = [] facelabel = [] for key,val in faces.items(): if key.startswith("s40/"): continue # this is our test set if key == "s39/10.pgm": continue # this is our test set facematrix.append(val.flatten()) facelabel.append(key.split("/")[0]) # Create facematrix as (n_samples,n_pixels) matrix facematrix = np.array(facematrix)

Now we can perform principal component analysis on this dataset matrix. Instead of computing the PCA step by step, we make use of the PCA function in scikit-learn, which we can easily retrieve all results we needed:

... # Apply PCA to extract eigenfaces from sklearn.decomposition import PCA pca = PCA().fit(facematrix)

We can identify how significant is each principal component from the explained variance ratio:

... print(pca.explained_variance_ratio_)

[1.77824822e-01 1.29057925e-01 6.67093882e-02 5.63561346e-02 5.13040312e-02 3.39156477e-02 2.47893586e-02 2.27967054e-02 1.95632067e-02 1.82678428e-02 1.45655853e-02 1.38626271e-02 1.13318896e-02 1.07267786e-02 9.68365599e-03 9.17860717e-03 8.60995215e-03 8.21053028e-03 7.36580634e-03 7.01112888e-03 6.69450840e-03 6.40327943e-03 5.98295099e-03 5.49298705e-03 5.36083980e-03 4.99408106e-03 4.84854321e-03 4.77687371e-03 ... 1.12203331e-04 1.11102187e-04 1.08901471e-04 1.06750318e-04 1.05732991e-04 1.01913786e-04 9.98164783e-05 9.85530209e-05 9.51582720e-05 8.95603083e-05 8.71638147e-05 8.44340263e-05 7.95894118e-05 7.77912922e-05 7.06467912e-05 6.77447444e-05 2.21225931e-32]

or we can simply make up a moderate number, say, 50, and consider these many principal component vectors as the eigenface. For convenience, we extract the eigenface from PCA result and store it as a numpy array. Note that the eigenfaces are stored as rows in a matrix. We can convert it back to 2D if we want to display it. In below, we show some of the eigenfaces to see how they look like:

... # Take the first K principal components as eigenfaces n_components = 50 eigenfaces = pca.components_[:n_components] # Show the first 16 eigenfaces fig, axes = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,10)) for i in range(16): axes[i%4][i//4].imshow(eigenfaces[i].reshape(faceshape), cmap="gray") plt.show()

From this picture, we can see eigenfaces are blurry faces, but indeed each eigenfaces holds some facial characteristics that can be used to build a picture.

Since our goal is to build a face recognition system, we first calculate the weight vector for each input picture:

... # Generate weights as a KxN matrix where K is the number of eigenfaces and N the number of samples weights = eigenfaces @ (facematrix - pca.mean_).T

The above code is using matrix multiplication to replace loops. It is roughly equivalent to the following:

... weights = [] for i in range(facematrix.shape[0]): weight = [] for j in range(n_components): w = eigenfaces[j] @ (facematrix[i] - pca.mean_) weight.append(w) weights.append(weight)

Up to here, our face recognition system has been completed. We used pictures of 39 persons to build our eigenface. We use the test picture that belongs to one of these 39 persons (the one held out from the matrix that trained the PCA model) to see if it can successfully recognize the face:

... # Test on out-of-sample image of existing class query = faces["s39/10.pgm"].reshape(1,-1) query_weight = eigenfaces @ (query - pca.mean_).T euclidean_distance = np.linalg.norm(weights - query_weight, axis=0) best_match = np.argmin(euclidean_distance) print("Best match %s with Euclidean distance %f" % (facelabel[best_match], euclidean_distance[best_match])) # Visualize fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(query.reshape(faceshape), cmap="gray") axes[0].set_title("Query") axes[1].imshow(facematrix[best_match].reshape(faceshape), cmap="gray") axes[1].set_title("Best match") plt.show()

Above, we first subtract the vectorized image by the average vector that retrieved from the PCA result. Then we compute the projection of this mean-subtracted vector to each eigenface and take it as the weight for this picture. Afterwards, we compare the weight vector of the picture in question to that of each existing picture and find the one with the smallest L2 distance as the best match. We can see that it indeed can successfully find the closest match in the same class:

Best match s39 with Euclidean distance 1559.997137

and we can visualize the result by comparing the closest match side by side:

We can try again with the picture of the 40th person that we held out from the PCA. We would never get it correct because it is a new person to our model. However, we want to see how wrong it can be as well as the value in the distance metric:

... # Test on out-of-sample image of new class query = faces["s40/1.pgm"].reshape(1,-1) query_weight = eigenfaces @ (query - pca.mean_).T euclidean_distance = np.linalg.norm(weights - query_weight, axis=0) best_match = np.argmin(euclidean_distance) print("Best match %s with Euclidean distance %f" % (facelabel[best_match], euclidean_distance[best_match])) # Visualize fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(query.reshape(faceshape), cmap="gray") axes[0].set_title("Query") axes[1].imshow(facematrix[best_match].reshape(faceshape), cmap="gray") axes[1].set_title("Best match") plt.show()

We can see that it’s best match has a greater L2 distance:

Best match s5 with Euclidean distance 2690.209330

but we can see that the mistaken result has some resemblance to the picture in question:

In the paper by Turk and Petland, it is suggested that we set up a threshold for the L2 distance. If the best match’s distance is less than the threshold, we would consider the face is recognized to be the same person. If the distance is above the threshold, we claim the picture is someone we never saw even if a best match can be find numerically. In this case, we may consider to include this as a new person into our model by remembering this new weight vector.

Actually, we can do one step further, to generate new faces using eigenfaces, but the result is not very realistic. In below, we generate one using random weight vector and show it side by side with the “average face”:

... # Visualize the mean face and random face fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(pca.mean_.reshape(faceshape), cmap="gray") axes[0].set_title("Mean face") random_weights = np.random.randn(n_components) * weights.std() newface = random_weights @ eigenfaces + pca.mean_ axes[1].imshow(newface.reshape(faceshape), cmap="gray") axes[1].set_title("Random face") plt.show()

How good is eigenface? It is surprisingly overachieved for the simplicity of the model. However, Turk and Pentland tested it with various conditions. It found that its accuracy was “an average of 96% with light variation, 85% with orientation variation, and 64% with size variation.” Hence it may not be very practical as a face recognition system. After all, the picture as a matrix will be distorted a lot in the principal component domain after zoom-in and zoom-out. Therefore the modern alternative is to use convolution neural network, which is more tolerant to various transformations.

Putting everything together, the following is the complete code:

import zipfile import cv2 import numpy as np import matplotlib.pyplot as plt from sklearn.decomposition import PCA # Read face image from zip file on the fly faces = {} with zipfile.ZipFile("attface.zip") as facezip: for filename in facezip.namelist(): if not filename.endswith(".pgm"): continue # not a face picture with facezip.open(filename) as image: # If we extracted files from zip, we can use cv2.imread(filename) instead faces[filename] = cv2.imdecode(np.frombuffer(image.read(), np.uint8), cv2.IMREAD_GRAYSCALE) # Show sample faces using matplotlib fig, axes = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,10)) faceimages = list(faces.values())[-16:] # take last 16 images for i in range(16): axes[i%4][i//4].imshow(faceimages[i], cmap="gray") print("Showing sample faces") plt.show() # Print some details faceshape = list(faces.values())[0].shape print("Face image shape:", faceshape) classes = set(filename.split("/")[0] for filename in faces.keys()) print("Number of classes:", len(classes)) print("Number of images:", len(faces)) # Take classes 1-39 for eigenfaces, keep entire class 40 and # image 10 of class 39 as out-of-sample test facematrix = [] facelabel = [] for key,val in faces.items(): if key.startswith("s40/"): continue # this is our test set if key == "s39/10.pgm": continue # this is our test set facematrix.append(val.flatten()) facelabel.append(key.split("/")[0]) # Create a NxM matrix with N images and M pixels per image facematrix = np.array(facematrix) # Apply PCA and take first K principal components as eigenfaces pca = PCA().fit(facematrix) n_components = 50 eigenfaces = pca.components_[:n_components] # Show the first 16 eigenfaces fig, axes = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,10)) for i in range(16): axes[i%4][i//4].imshow(eigenfaces[i].reshape(faceshape), cmap="gray") print("Showing the eigenfaces") plt.show() # Generate weights as a KxN matrix where K is the number of eigenfaces and N the number of samples weights = eigenfaces @ (facematrix - pca.mean_).T print("Shape of the weight matrix:", weights.shape) # Test on out-of-sample image of existing class query = faces["s39/10.pgm"].reshape(1,-1) query_weight = eigenfaces @ (query - pca.mean_).T euclidean_distance = np.linalg.norm(weights - query_weight, axis=0) best_match = np.argmin(euclidean_distance) print("Best match %s with Euclidean distance %f" % (facelabel[best_match], euclidean_distance[best_match])) # Visualize fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(query.reshape(faceshape), cmap="gray") axes[0].set_title("Query") axes[1].imshow(facematrix[best_match].reshape(faceshape), cmap="gray") axes[1].set_title("Best match") plt.show() # Test on out-of-sample image of new class query = faces["s40/1.pgm"].reshape(1,-1) query_weight = eigenfaces @ (query - pca.mean_).T euclidean_distance = np.linalg.norm(weights - query_weight, axis=0) best_match = np.argmin(euclidean_distance) print("Best match %s with Euclidean distance %f" % (facelabel[best_match], euclidean_distance[best_match])) # Visualize fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(query.reshape(faceshape), cmap="gray") axes[0].set_title("Query") axes[1].imshow(facematrix[best_match].reshape(faceshape), cmap="gray") axes[1].set_title("Best match") plt.show()

This section provides more resources on the topic if you are looking to go deeper.

- L. Sirovich and M. Kirby (1987). “Low-dimensional procedure for the characterization of human faces“.
*Journal of the Optical Society of America**A*. 4(3): 519–524. - M. Turk and A. Pentland (1991). “Eigenfaces for recognition“.
*Journal of Cognitive Neuroscience*. 3(1): 71–86.

- Introduction to Linear Algebra, Fifth Edition, 2016.

In this tutorial, you discovered how to build a face recognition system using eigenface, which is derived from principal component analysis.

Specifically, you learned:

- How to extract characteristic images from the image dataset using principal component analysis
- How to use the set of characteristic images to create a weight vector for any seen or unseen images
- How to use the weight vectors of different images to measure for their similarity, and apply this technique to face recognition
- How to generate a new random image from the characteristic images

The post Face Recognition using Principal Component Analysis appeared first on Machine Learning Mastery.

]]>The post Using Singular Value Decomposition to Build a Recommender System appeared first on Machine Learning Mastery.

]]>Last Updated on October 29, 2021

Singular value decomposition is a very popular linear algebra technique to break down a matrix into the product of a few smaller matrices. In fact, it is a technique that has many uses. One example is that we can use SVD to discover relationship between items. A recommender system can be build easily from this.

In this tutorial, we will see how a recommender system can be build just using linear algebra techniques.

After completing this tutorial, you will know:

- What has singular value decomposition done to a matrix
- How to interpret the result of singular value decomposition
- What data a single recommender system require, and how we can make use of SVD to analyze it
- How we can make use of the result from SVD to make recommendations

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Review of Singular Value Decomposition
- The Meaning of Singular Value Decomposition in Recommender System
- Implementing a Recommender System

Just like a number such as 24 can be decomposed as factors 24=2×3×4, a matrix can also be expressed as multiplication of some other matrices. Because matrices are arrays of numbers, they have their own rules of multiplication. Consequently, they have different ways of factorization, or known as **decomposition**. QR decomposition or LU decomposition are common examples. Another example is **singular value decomposition**, which has no restriction to the shape or properties of the matrix to be decomposed.

Singular value decomposition assumes a matrix $M$ (for example, a $m\times n$ matrix) is decomposed as

$$

M = U\cdot \Sigma \cdot V^T

$$

where $U$ is a $m\times m$ matrix, $\Sigma$ is a diagonal matrix of $m\times n$, and $V^T$ is a $n\times n$ matrix. The diagonal matrix $\Sigma$ is an interesting one, which it can be non-square but only the entries on the diagonal could be non-zero. The matrices $U$ and $V^T$ are **orthonormal** matrices. Meaning the columns of $U$ or rows of $V$ are (1) orthogonal to each other and are (2) unit vectors. Vectors are orthogonal to each other if any two vectors’ dot product is zero. A vector is unit vector if its L2-norm is 1. Orthonormal matrix has the property that its transpose is its inverse. In other words, since $U$ is an orthonormal matrix, $U^T = U^{-1}$ or $U\cdot U^T=U^T\cdot U=I$, where $I$ is the identity matrix.

Singular value decomposition gets its name from the diagonal entries on $\Sigma$, which are called the singular values of matrix $M$. They are in fact, the square root of the eigenvalues of matrix $M\cdot M^T$. Just like a number factorized into primes, the singular value decomposition of a matrix reveals a lot about the structure of that matrix.

But actually what described above is called the **full SVD**. There is another version called **reduced SVD** or **compact SVD**. We still have to write $M = U\cdot\Sigma\cdot V^T$ but we have $\Sigma$ a $r\times r$ square diagonal matrix with $r$ the **rank** of matrix $M$, which is usually less than or equal to the smaller of $m$ and $n$. The matrix $U$ is than a $m\times r$ matrix and $V^T$ is a $r\times n$ matrix. Because matrices $U$ and $V^T$ are non-square, they are called **semi-orthonormal**, meaning $U^T\cdot U=I$ and $V^T\cdot V=I$, with $I$ in both case a $r\times r$ identity matrix.

If the matrix $M$ is rank $r$, than we can prove that the matrices $M\cdot M^T$ and $M^T\cdot M$ are both rank $r$. In singular value decomposition (the reduced SVD), the columns of matrix $U$ are eigenvectors of $M\cdot M^T$ and the rows of matrix $V^T$ are eigenvectors of $M^T\cdot M$. What’s interesting is that $M\cdot M^T$ and $M^T\cdot M$ are potentially in different size (because matrix $M$ can be non-square shape), but they have the same set of eigenvalues, which are the square of values on the diagonal of $\Sigma$.

This is why the result of singular value decomposition can reveal a lot about the matrix $M$.

Imagine we collected some book reviews such that books are columns and people are rows, and the entries are the ratings that a person gave to a book. In that case, $M\cdot M^T$ would be a table of person-to-person which the entries would mean the sum of the ratings one person gave match with another one. Similarly $M^T\cdot M$ would be a table of book-to-book which entries are the sum of the ratings received match with that received by another book. What can be the hidden connection between people and books? That could be the genre, or the author, or something of similar nature.

Let’s see how we can make use of the result from SVD to build a recommender system. Firstly, let’s download the dataset from this link (caution: it is 600MB big)

This dataset is the “Social Recommendation Data” from “Recommender Systems and Personalization Datasets“. It contains the reviews given by users on books on Librarything. What we are interested are the number of “stars” a user given to a book.

If we open up this tar file we will see a large file named “reviews.json”. We can extract it, or read the included file on the fly. First three lines of reviews.json are shown below:

import tarfile # Read downloaded file from: # http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz with tarfile.open("lthing_data.tar.gz") as tar: print("Files in tar archive:") tar.list() with tar.extractfile("lthing_data/reviews.json") as file: count = 0 for line in file: print(line) count += 1 if count > 3: break

The above will print:

Files in tar archive: ?rwxr-xr-x julian/julian 0 2016-09-30 17:58:55 lthing_data/ ?rw-r--r-- julian/julian 4824989 2014-01-02 13:55:12 lthing_data/edges.txt ?rw-rw-r-- julian/julian 1604368260 2016-09-30 17:58:25 lthing_data/reviews.json b"{'work': '3206242', 'flags': [], 'unixtime': 1194393600, 'stars': 5.0, 'nhelpful': 0, 'time': 'Nov 7, 2007', 'comment': 'This a great book for young readers to be introduced to the world of Middle Earth. ', 'user': 'van_stef'}\n" b"{'work': '12198649', 'flags': [], 'unixtime': 1333756800, 'stars': 5.0, 'nhelpful': 0, 'time': 'Apr 7, 2012', 'comment': 'Help Wanted: Tales of On The Job Terror from Evil Jester Press is a fun and scary read. This book is edited by Peter Giglio and has short stories by Joe McKinney, Gary Brandner, Henry Snider and many more. As if work wasnt already scary enough, this book gives you more reasons to be scared. Help Wanted is an excellent anthology that includes some great stories by some master storytellers.\\nOne of the stories includes Agnes: A Love Story by David C. Hayes, which tells the tale of a lawyer named Jack who feels unappreciated at work and by his wife so he starts a relationship with a photocopier. They get along well until the photocopier starts wanting the lawyer to kill for it. The thing I liked about this story was how the author makes you feel sorry for Jack. His two co-workers are happily married and love their jobs while Jack is married to a paranoid alcoholic and he hates and works at a job he cant stand. You completely understand how he can fall in love with a copier because he is a lonely soul that no one understands except the copier of course.\\nAnother story in Help Wanted is Work Life Balance by Jeff Strand. In this story a man works for a company that starts to let their employees do what they want at work. It starts with letting them come to work a little later than usual, then the employees are allowed to hug and kiss on the job. Things get really out of hand though when the company starts letting employees carry knives and stab each other, as long as it doesnt interfere with their job. This story is meant to be more funny then scary but still has its scary moments. Jeff Strand does a great job mixing humor and horror in this story.\\nAnother good story in Help Wanted: On The Job Terror is The Chapel Of Unrest by Stephen Volk. This is a gothic horror story that takes place in the 1800s and has to deal with an undertaker who has the duty of capturing and embalming a ghoul who has been eating dead bodies in a graveyard. Stephen Volk through his use of imagery in describing the graveyard, the chapel and the clothes of the time, transports you into an 1800s gothic setting that reminded me of Bram Stokers Dracula.\\nOne more story in this anthology that I have to mention is Expulsion by Eric Shapiro which tells the tale of a mad man going into a office to kill his fellow employees. This is a very short but very powerful story that gets you into the mind of a disgruntled employee but manages to end on a positive note. Though there were stories I didnt like in Help Wanted, all in all its a very good anthology. I highly recommend this book ', 'user': 'dwatson2'}\n" b"{'work': '12533765', 'flags': [], 'unixtime': 1352937600, 'nhelpful': 0, 'time': 'Nov 15, 2012', 'comment': 'Magoon, K. (2012). Fire in the streets. New York: Simon and Schuster/Aladdin. 336 pp. ISBN: 978-1-4424-2230-8. (Hardcover); $16.99.\\nKekla Magoon is an author to watch (http://www.spicyreads.org/Author_Videos.html- scroll down). One of my favorite books from 2007 is Magoons The Rock and the River. At the time, I mentioned in reviews that we have very few books that even mention the Black Panther Party, let alone deal with them in a careful, thorough way. Fire in the Streets continues the story Magoon began in her debut book. While her familys financial fortunes drip away, not helped by her mothers drinking and assortment of boyfriends, the Panthers provide a very real respite for Maxie. Sam is still dealing with the death of his brother. Maxies relationship with Sam only serves to confuse and upset them both. Her friends, Emmalee and Patrice, are slowly drifting away. The Panther Party is the only thing that seems to make sense and she basks in its routine and consistency. She longs to become a full member of the Panthers and constantly battles with her Panther brother Raheem over her maturity and ability to do more than office tasks. Maxie wants to have her own gun. When Maxie discovers that there is someone working with the Panthers that is leaking information to the government about Panther activity, Maxie investigates. Someone is attempting to destroy the only place that offers her shelter. Maxie is determined to discover the identity of the traitor, thinking that this will prove her worth to the organization. However, the truth is not simple and it is filled with pain. Unfortunately we still do not have many teen books that deal substantially with the Democratic National Convention in Chicago, the Black Panther Party, and the social problems in Chicago that lead to the civil unrest. Thankfully, Fire in the Streets lives up to the standard Magoon set with The Rock and the River. Readers will feel like they have stepped back in time. Magoons factual tidbits add journalistic realism to the story and only improves the atmosphere. Maxie has spunk. Readers will empathize with her Atlas-task of trying to hold onto her world. Fire in the Streets belongs in all middle school and high school libraries. While readers are able to read this story independently of The Rock and the River, I strongly urge readers to read both and in order. Magoons recognition by the Coretta Scott King committee and the NAACP Image awards are NOT mistakes!', 'user': 'edspicer'}\n" b'{\'work\': \'12981302\', \'flags\': [], \'unixtime\': 1364515200, \'stars\': 4.0, \'nhelpful\': 0, \'time\': \'Mar 29, 2013\', \'comment\': "Well, I definitely liked this book better than the last in the series. There was less fighting and more story. I liked both Toni and Ricky Lee and thought they were pretty good together. The banter between the two was sweet and often times funny. I enjoyed seeing some of the past characters and of course it\'s always nice to be introduced to new ones. I just wonder how many more of these books there will be. At least two hopefully, one each for Rory and Reece. ", \'user\': \'amdrane2\'}\n'

Each line in reviews.json is a record. We are going to extract the “user”, “work”, and “stars” field of each record as long as there are no missing data among these three. Despite the name, the records are not well-formed JSON strings (most notably it uses single quote rather than double quote). Therefore, we cannot use `json`

package from Python but to use `ast`

to decode such string:

... import ast reviews = [] with tarfile.open("lthing_data.tar.gz") as tar: with tar.extractfile("lthing_data/reviews.json") as file: for line in file: record = ast.literal_eval(line.decode("utf8")) if any(x not in record for x in ['user', 'work', 'stars']): continue reviews.append([record['user'], record['work'], record['stars']]) print(len(reviews), "records retrieved")

1387209 records retrieved

Now we should make a matrix of how different users rate each book. We make use of the pandas library to help convert the data we collected into a table:

... import pandas as pd reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"]) print(reviews.head())

user work stars 0 van_stef 3206242 5.0 1 dwatson2 12198649 5.0 2 amdrane2 12981302 4.0 3 Lila_Gustavus 5231009 3.0 4 skinglist 184318 2.0

As an example, we try not to use all data in order to save time and memory. Here we consider only those users who reviewed more than 50 books and also those books who are reviewed by more than 50 users. This way, we trimmed our dataset to less than 15% of its original size:

... # Look for the users who reviewed more than 50 books usercount = reviews[["work","user"]].groupby("user").count() usercount = usercount[usercount["work"] >= 50] print(usercount.head())

work user 84 -Eva- 602 06nwingert 370 1983mk 63 1dragones 194

... # Look for the books who reviewed by more than 50 users workcount = reviews[["work","user"]].groupby("work").count() workcount = workcount[workcount["user"] >= 50] print(workcount.head())

user work 10000 106 10001 53 1000167 186 10001797 53 10005525 134

... # Keep only the popular books and active users reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)] print(reviews)

user work stars 0 van_stef 3206242 5.0 6 justine 3067 4.5 18 stephmo 1594925 4.0 19 Eyejaybee 2849559 5.0 35 LisaMaria_C 452949 4.5 ... ... ... ... 1387161 connie53 1653 4.0 1387177 BruderBane 24623 4.5 1387192 StuartAston 8282225 4.0 1387202 danielx 9759186 4.0 1387206 jclark88 8253945 3.0 [205110 rows x 3 columns]

Then we can make use of “pivot table” function in pandas to convert this into a matrix:

... reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0)

The result is a matrix of 5593 rows and 2898 columns

Here we represented 5593 users and 2898 books in a matrix. Then we apply the SVD (this will take a while):

... from numpy.linalg import svd matrix = reviewmatrix.values u, s, vh = svd(matrix, full_matrices=False)

By default, the `svd()`

returns a full singular value decomposition. We choose a reduced version so we can use smaller matrices to save memory. The columns of `vh`

correspond to the books. We can based on vector space model to find which book are most similar to the one we are looking at:

... import numpy as np def cosine_similarity(v,u): return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u)) highest_similarity = -np.inf highest_sim_col = -1 for col in range(1,vh.shape[1]): similarity = cosine_similarity(vh[:,0], vh[:,col]) if similarity > highest_similarity: highest_similarity = similarity highest_sim_col = col print("Column %d is most similar to column 0" % highest_sim_col)

And in the above example, we try to find the book that is best match to to first column. The result is:

Column 906 is most similar to column 0

In a recommendation system, when a user picked a book, we may show her a few other books that are similar to the one she picked based on the cosine distance as calculated above.

Depends on the dataset, we may use truncated SVD to reduce the dimension of matrix `vh`

. In essence, this means we are removing several rows on `vh`

that the corresponding singular values in `s`

are small, before we use it to compute the similarity. This would likely make the prediction more accurate as those less important features of a book are removed from consideration.

Note that, in the decomposition $M=U\cdot\Sigma\cdot V^T$ we know the rows of $U$ are the users and columns of $V^T$ are books, we cannot identify what are the meanings of the columns of $U$ or rows of $V^T$ (an equivalently, that of $\Sigma$). We know they could be genres, for example, that provide some underlying connections between the users and the books but we cannot be sure what exactly are they. However, this does not stop us from using them as **features** in our recommendation system.

Tying all together, the following is the complete code:

import tarfile import ast import pandas as pd import numpy as np # Read downloaded file from: # http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz with tarfile.open("lthing_data.tar.gz") as tar: print("Files in tar archive:") tar.list() print("\nSample records:") with tar.extractfile("lthing_data/reviews.json") as file: count = 0 for line in file: print(line) count += 1 if count > 3: break # Collect records reviews = [] with tarfile.open("lthing_data.tar.gz") as tar: with tar.extractfile("lthing_data/reviews.json") as file: for line in file: try: record = ast.literal_eval(line.decode("utf8")) except: print(line.decode("utf8")) raise if any(x not in record for x in ['user', 'work', 'stars']): continue reviews.append([record['user'], record['work'], record['stars']]) print(len(reviews), "records retrieved") # Print a few sample of what we collected reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"]) print(reviews.head()) # Look for the users who reviewed more than 50 books usercount = reviews[["work","user"]].groupby("user").count() usercount = usercount[usercount["work"] >= 50] # Look for the books who reviewed by more than 50 users workcount = reviews[["work","user"]].groupby("work").count() workcount = workcount[workcount["user"] >= 50] # Keep only the popular books and active users reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)] print("\nSubset of data:") print(reviews) # Convert records into user-book review score matrix reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0) matrix = reviewmatrix.values # Singular value decomposition u, s, vh = np.linalg.svd(matrix, full_matrices=False) # Find the highest similarity def cosine_similarity(v,u): return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u)) highest_similarity = -np.inf highest_sim_col = -1 for col in range(1,vh.shape[1]): similarity = cosine_similarity(vh[:,0], vh[:,col]) if similarity > highest_similarity: highest_similarity = similarity highest_sim_col = col print("Column %d (book id %s) is most similar to column 0 (book id %s)" % (highest_sim_col, reviewmatrix.columns[col], reviewmatrix.columns[0]) )

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Linear Algebra, Fifth Edition, 2016.

In this tutorial, you discovered how to build a recommender system using singular value decomposition.

Specifically, you learned:

- What a singular value decomposition mean to a matrix
- How to interpret the result of a singular value decomposition
- Find similarity from the columns of matrix $V^T$ obtained from singular value decomposition, and make recommendations based on the similarity

The post Using Singular Value Decomposition to Build a Recommender System appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Vector Space Models appeared first on Machine Learning Mastery.

]]>Last Updated on October 23, 2021

Vector space models are to consider the relationship between data that are represented by vectors. It is popular in information retrieval systems but also useful for other purposes. Generally, this allows us to compare the similarity of two vectors from a geometric perspective.

In this tutorial, we will see what is a vector space model and what it can do.

After completing this tutorial, you will know:

- What is a vector space model and the properties of cosine similarity
- How cosine similarity can help you compare two vectors
- What is the difference between cosine similarity and L2 distance

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Vector space and cosine formula
- Using vector space model for similarity
- Common use of vector space models and cosine distance

A vector space is a mathematical term that defines some vector operations. In layman’s term, we can imagine it is a $n$-dimensional metric space where each point is represented by a $n$-dimensional vector. In this space, we can do any vector addition or scalar-vector multiplications.

It is useful to consider a vector space because it is useful to represent things as a vector. For example in machine learning, we usually have a data point with multiple features. Therefore, it is convenient for us to represent a data point as a vector.

With a vector, we can compute its **norm**. The most common one is the L2-norm or the length of the vector. With two vectors in the same vector space, we can find their difference. Assume it is a 3-dimensional vector space, the two vectors are $(x_1, x_2, x_3)$ and $(y_1, y_2, y_3)$. Their difference is the vector $(y_1-x_1, y_2-x_2, y_3-x_3)$, and the L2-norm of the difference is the **distance** or more precisely the Euclidean distance between those two vectors:

$$

\sqrt{(y_1-x_1)^2+(y_2-x_2)^2+(y_3-x_3)^2}

$$

Besides distance, we can also consider the **angle** between two vectors. If we consider the vector $(x_1, x_2, x_3)$ as a line segment from the point $(0,0,0)$ to $(x_1,x_2,x_3)$ in the 3D coordinate system, then there is another line segment from $(0,0,0)$ to $(y_1,y_2, y_3)$. They make an angle at their intersection:

The angle between the two line segments can be found using the cosine formula:

$$

\cos \theta = \frac{a\cdot b} {\lVert a\rVert_2\lVert b\rVert_2}

$$

where $a\cdot b$ is the vector dot-product and $\lVert a\rVert_2$ is the L2-norm of vector $a$. This formula arises from considering the dot-product as the projection of vector $a$ onto the direction as pointed by vector $b$. The nature of cosine tells that, as the angle $\theta$ increases from 0 to 90 degrees, cosine decreases from 1 to 0. Sometimes we would call $1-\cos\theta$ the **cosine distance** because it runs from 0 to 1 as the two vectors are moving further away from each other. This is an important property that we are going to exploit in the vector space model.

Let’s look at an example of how the vector space model is useful.

World Bank collects various data about countries and regions in the world. While every country is different, we can try to compare countries under vector space model. For convenience, we will use the `pandas_datareader`

module in Python to read data from World Bank. You may install `pandas_datareader`

using `pip`

or `conda`

command:

pip install pandas_datareader

The data series collected by World Bank are named by an identifier. For example, “SP.URB.TOTL” is the total urban population of a country. Many of the series are yearly. When we download a series, we have to put in the start and end years. Usually the data are not updated on time. Hence it is best to look at the data a few years back rather than the most recent year to avoid missing data.

In below, we try to collect some economic data of *every* country in 2010:

from pandas_datareader import wb import pandas as pd pd.options.display.width = 0 names = [ "NE.EXP.GNFS.CD", # Exports of goods and services (current US$) "NE.IMP.GNFS.CD", # Imports of goods and services (current US$) "NV.AGR.TOTL.CD", # Agriculture, forestry, and fishing, value added (current US$) "NY.GDP.MKTP.CD", # GDP (current US$) "NE.RSB.GNFS.CD", # External balance on goods and services (current US$) ] df = wb.download(country="all", indicator=names, start=2010, end=2010).reset_index() countries = wb.get_countries() non_aggregates = countries[countries["region"] != "Aggregates"].name df_nonagg = df[df["country"].isin(non_aggregates)].dropna() print(df_nonagg)

country year NE.EXP.GNFS.CD NE.IMP.GNFS.CD NV.AGR.TOTL.CD NY.GDP.MKTP.CD NE.RSB.GNFS.CD 50 Albania 2010 3.337089e+09 5.792189e+09 2.141580e+09 1.192693e+10 -2.455100e+09 51 Algeria 2010 6.197541e+10 5.065473e+10 1.364852e+10 1.612073e+11 1.132067e+10 54 Angola 2010 5.157282e+10 3.568226e+10 5.179055e+09 8.379950e+10 1.589056e+10 55 Antigua and Barbuda 2010 9.142222e+08 8.415185e+08 1.876296e+07 1.148700e+09 7.270370e+07 56 Argentina 2010 8.020887e+10 6.793793e+10 3.021382e+10 4.236274e+11 1.227093e+10 .. ... ... ... ... ... ... ... 259 Venezuela, RB 2010 1.121794e+11 6.922736e+10 2.113513e+10 3.931924e+11 4.295202e+10 260 Vietnam 2010 8.347359e+10 9.299467e+10 2.130649e+10 1.159317e+11 -9.521076e+09 262 West Bank and Gaza 2010 1.367300e+09 5.264300e+09 8.716000e+08 9.681500e+09 -3.897000e+09 264 Zambia 2010 7.503513e+09 6.256989e+09 1.909207e+09 2.026556e+10 1.246524e+09 265 Zimbabwe 2010 3.569254e+09 6.440274e+09 1.157187e+09 1.204166e+10 -2.871020e+09 [174 rows x 7 columns]

In the above we obtained some economic metrics of each country in 2010. The function `wb.download()`

will download the data from World Bank and return a pandas dataframe. Similarly `wb.get_countries()`

will get the name of the countries and regions as identified by World Bank, which we will use this to filter out the non-countries aggregates such as “East Asia” and “World”. Pandas allows filtering rows by boolean indexing, which `df["country"].isin(non_aggregates)`

gives a boolean vector of which row is in the list of `non_aggregates`

and based on that, `df[df["country"].isin(non_aggregates)]`

selects only those. For various reasons not all countries will have all data. Hence we use `dropna()`

to remove those with missing data. In practice, we may want to apply some imputation techniques instead of merely removing them. But as an example, we proceed with the 174 remaining data points.

To better illustrate the idea rather than hiding the actual manipulation in pandas or numpy functions, we first extract the data for each country as a vector:

... vectors = {} for rowid, row in df_nonagg.iterrows(): vectors[row["country"]] = row[names].values print(vectors)

{'Albania': array([3337088824.25553, 5792188899.58985, 2141580308.0144, 11926928505.5231, -2455100075.33431], dtype=object), 'Algeria': array([61975405318.205, 50654732073.2396, 13648522571.4516, 161207310515.42, 11320673244.9655], dtype=object), 'Angola': array([51572818660.8665, 35682259098.1843, 5179054574.41704, 83799496611.2004, 15890559562.6822], dtype=object), ... 'West Bank and Gaza': array([1367300000.0, 5264300000.0, 871600000.0, 9681500000.0, -3897000000.0], dtype=object), 'Zambia': array([7503512538.82554, 6256988597.27752, 1909207437.82702, 20265559483.8548, 1246523941.54802], dtype=object), 'Zimbabwe': array([3569254400.0, 6440274000.0, 1157186600.0, 12041655200.0, -2871019600.0], dtype=object)}

The Python dictionary we created has the name of each country as a key and the economic metrics as a numpy array. There are 5 metrics, hence each is a vector of 5 dimensions.

What this helps us is that, we can use the vector representation of each country to see how similar it is to another. Let’s try both the L2-norm of the difference (the Euclidean distance) and the cosine distance. We pick one country, such as Australia, and compare it to all other countries on the list based on the selected economic metrics.

... import numpy as np euclid = {} cosine = {} target = "Australia" for country in vectors: vecA = vectors[target] vecB = vectors[country] dist = np.linalg.norm(vecA - vecB) cos = (vecA @ vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB)) euclid[country] = dist # Euclidean distance cosine[country] = 1-cos # cosine distance

In the for-loop above, we set `vecA`

as the vector of the target country (i.e., Australia) and `vecB`

as that of the other country. Then we compute the L2-norm of their difference as the Euclidean distance between the two vectors. We also compute the cosine similarity using the formula and minus it from 1 to get the cosine distance. With more than a hundred countries, we can see which one has the shortest Euclidean distance to Australia:

... import pandas as pd df_distance = pd.DataFrame({"euclid": euclid, "cos": cosine}) print(df_distance.sort_values(by="euclid").head())

euclid cos Australia 0.000000e+00 -2.220446e-16 Mexico 1.533802e+11 7.949549e-03 Spain 3.411901e+11 3.057903e-03 Turkey 3.798221e+11 3.502849e-03 Indonesia 4.083531e+11 7.417614e-03

By sorting the result, we can see that Mexico is the closest to Australia under Euclidean distance. However, with cosine distance, it is Colombia the closest to Australia.

... df_distance.sort_values(by="cos").head()

euclid cos Australia 0.000000e+00 -2.220446e-16 Colombia 8.981118e+11 1.720644e-03 Cuba 1.126039e+12 2.483993e-03 Italy 1.088369e+12 2.677707e-03 Argentina 7.572323e+11 2.930187e-03

To understand why the two distances give different result, we can observe how the three countries’ metric compare to each other:

... print(df_nonagg[df_nonagg.country.isin(["Mexico", "Colombia", "Australia"])])

country year NE.EXP.GNFS.CD NE.IMP.GNFS.CD NV.AGR.TOTL.CD NY.GDP.MKTP.CD NE.RSB.GNFS.CD 59 Australia 2010 2.270501e+11 2.388514e+11 2.518718e+10 1.146138e+12 -1.180129e+10 91 Colombia 2010 4.682683e+10 5.136288e+10 1.812470e+10 2.865631e+11 -4.536047e+09 176 Mexico 2010 3.141423e+11 3.285812e+11 3.405226e+10 1.057801e+12 -1.443887e+10

From this table, we see that the metrics of Australia and Mexico are very close to each other in magnitude. However, if you compare the ratio of each metric within the same country, it is Colombia that match Australia better. In fact from the cosine formula, we can see that

$$

\cos \theta = \frac{a\cdot b} {\lVert a\rVert_2\lVert b\rVert_2} = \frac{a}{\lVert a\rVert_2} \cdot \frac{b} {\lVert b\rVert_2}

$$

which means the cosine of the angle between the two vector is the dot-product of the corresponding vectors after they were normalized to length of 1. Hence cosine distance is virtually applying a scaler to the data before computing the distance.

Putting these altogether, the following is the complete code

from pandas_datareader import wb import numpy as np import pandas as pd pd.options.display.width = 0 # Download data from World Bank names = [ "NE.EXP.GNFS.CD", # Exports of goods and services (current US$) "NE.IMP.GNFS.CD", # Imports of goods and services (current US$) "NV.AGR.TOTL.CD", # Agriculture, forestry, and fishing, value added (current US$) "NY.GDP.MKTP.CD", # GDP (current US$) "NE.RSB.GNFS.CD", # External balance on goods and services (current US$) ] df = wb.download(country="all", indicator=names, start=2010, end=2010).reset_index() # We remove aggregates and keep only countries with no missing data countries = wb.get_countries() non_aggregates = countries[countries["region"] != "Aggregates"].name df_nonagg = df[df["country"].isin(non_aggregates)].dropna() # Extract vector for each country vectors = {} for rowid, row in df_nonagg.iterrows(): vectors[row["country"]] = row[names].values # Compute the Euclidean and cosine distances euclid = {} cosine = {} target = "Australia" for country in vectors: vecA = vectors[target] vecB = vectors[country] dist = np.linalg.norm(vecA - vecB) cos = (vecA @ vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB)) euclid[country] = dist # Euclidean distance cosine[country] = 1-cos # cosine distance # Print the results df_distance = pd.DataFrame({"euclid": euclid, "cos": cosine}) print("Closest by Euclidean distance:") print(df_distance.sort_values(by="euclid").head()) print() print("Closest by Cosine distance:") print(df_distance.sort_values(by="cos").head()) # Print the detail metrics print() print("Detail metrics:") print(df_nonagg[df_nonagg.country.isin(["Mexico", "Colombia", "Australia"])])

Vector space models are common in information retrieval systems. We can present documents (e.g., a paragraph, a long passage, a book, or even a sentence) as vectors. This vector can be as simple as counting of the words that the document contains (i.e., a bag-of-word model) or a complicated embedding vector (e.g., Doc2Vec). Then a query to find the most relevant document can be answered by ranking all documents by the cosine distance. Cosine distance should be used because we do not want to favor longer or shorter documents, but to focus on what it contains. Hence we leverage the normalization comes with it to consider how relevant are the documents to the query rather than how many times the words on the query are mentioned in a document.

If we consider each word in a document as a feature and compute the cosine distance, it is the “hard” distance because we do not care about words with similar meanings (e.g. “document” and “passage” have similar meanings but not “distance”). Embedding vectors such as word2vec would allow us to consider the ontology. Computing the cosine distance with the meaning of words considered is the “**soft cosine distance**“. Libraries such as gensim provides a way to do this.

Another use case of the cosine distance and vector space model is in computer vision. Imagine the task of recognizing hand gesture, we can make certain parts of the hand (e.g. five fingers) the key points. Then with the (x,y) coordinates of the key points lay out as a vector, we can compare with our existing database to see which cosine distance is the closest and determine which hand gesture it is. We need cosine distance because everyone’s hand has a different size. We do not want that to affect our decision on what gesture it is showing.

As you may imagine, there are much more examples you can use this technique.

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Linear Algebra, Fifth Edition, 2016.
- Introduction to Information Retrieval, 2008.

In this tutorial, you discovered the vector space model for measuring the similarities of vectors.

Specifically, you learned:

- How to construct a vector space model
- How to compute the cosine similarity and hence the cosine distance between two vectors in the vector space model
- How to interpret the difference between cosine distance and other distance metrics such as Euclidean distance
- What are the use of the vector space model

The post A Gentle Introduction to Vector Space Models appeared first on Machine Learning Mastery.

]]>The post Principal Component Analysis for Visualization appeared first on Machine Learning Mastery.

]]>Last Updated on October 27, 2021

Principal component analysis (PCA) is an unsupervised machine learning technique. Perhaps the most popular use of principal component analysis is dimensionality reduction. Besides using PCA as a data preparation technique, we can also use it to help visualize data. A picture is worth a thousand words. With the data visualized, it is easier for us to get some insights and decide on the next step in our machine learning models.

In this tutorial, you will discover how to visualize data using PCA, as well as using visualization to help determining the parameter for dimensionality reduction.

After completing this tutorial, you will know:

- How to use visualize a high dimensional data
- What is explained variance in PCA
- Visually observe the explained variance from the result of PCA of high dimensional data

Let’s get started.

This tutorial is divided into two parts; they are:

- Scatter plot of high dimensional data
- Visualizing the explained variance

For this tutorial, we assume that you are already familiar with:

- How to Calculate Principal Component Analysis (PCA) from Scratch in Python

- Principal Component Analysis for Dimensionality Reduction in Python

Visualization is a crucial step to get insights from data. We can learn from the visualization that whether a pattern can be observed and hence estimate which machine learning model is suitable.

It is easy to depict things in two dimension. Normally a scatter plot with x- and y-axis are in two dimensional. Depicting things in three dimensional is a bit challenging but not impossible. In matplotlib, for example, can plot in 3D. The only problem is on paper or on screen, we can only look at a 3D plot at one viewport or projection at a time. In matplotlib, this is controlled by the degree of elevation and azimuth. Depicting things in four or five dimensions is impossible because we live in a three-dimensional world and have no idea of how things in such a high dimension would look like.

This is where a dimensionality reduction technique such as PCA comes into play. We can reduce the dimension to two or three so we can visualize it. Let’s start with an example.

We start with the wine dataset, which is a classification dataset with 13 features (i.e., the dataset is 13 dimensional) and 3 classes. There are 178 samples:

from sklearn.datasets import load_wine winedata = load_wine() X, y = winedata['data'], winedata['target'] print(X.shape) print(y.shape)

(178, 13) (178,)

Among the 13 features, we can pick any two and plot with matplotlib (we color-coded the different classes using the `c`

argument):

... import matplotlib.pyplot as plt plt.scatter(X[:,1], X[:,2], c=y) plt.show()

or we can also pick any three and show in 3D:

... ax = fig.add_subplot(projection='3d') ax.scatter(X[:,1], X[:,2], X[:,3], c=y) plt.show()

But this doesn’t reveal much of how the data looks like, because majority of the features are not shown. We now resort to principal component analysis:

... from sklearn.decomposition import PCA pca = PCA() Xt = pca.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names'])) plt.show()

Here we transform the input data `X`

by PCA into `Xt`

. We consider only the first two columns, which contain the most information, and plot it in two dimensional. We can see that the purple class is quite distinctive, but there is still some overlap. If we scale the data before PCA, the result would be different:

... from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline pca = PCA() pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)]) Xt = pipe.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names'])) plt.show()

Because PCA is sensitive to the scale, if we normalized each feature by `StandardScaler`

we can see a better result. Here the different classes are more distinctive. By looking at this plot, we are confident that a simple model such as SVM can classify this dataset in high accuracy.

Putting these together, the following is the complete code to generate the visualizations:

from sklearn.datasets import load_wine from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline import matplotlib.pyplot as plt # Load dataset winedata = load_wine() X, y = winedata['data'], winedata['target'] print("X shape:", X.shape) print("y shape:", y.shape) # Show any two features plt.figure(figsize=(8,6)) plt.scatter(X[:,1], X[:,2], c=y) plt.xlabel(winedata["feature_names"][1]) plt.ylabel(winedata["feature_names"][2]) plt.title("Two particular features of the wine dataset") plt.show() # Show any three features fig = plt.figure(figsize=(10,8)) ax = fig.add_subplot(projection='3d') ax.scatter(X[:,1], X[:,2], X[:,3], c=y) ax.set_xlabel(winedata["feature_names"][1]) ax.set_ylabel(winedata["feature_names"][2]) ax.set_zlabel(winedata["feature_names"][3]) ax.set_title("Three particular features of the wine dataset") plt.show() # Show first two principal components without scaler pca = PCA() plt.figure(figsize=(8,6)) Xt = pca.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names'])) plt.xlabel("PC1") plt.ylabel("PC2") plt.title("First two principal components") plt.show() # Show first two principal components with scaler pca = PCA() pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)]) plt.figure(figsize=(8,6)) Xt = pipe.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names'])) plt.xlabel("PC1") plt.ylabel("PC2") plt.title("First two principal components after scaling") plt.show()

If we apply the same method on a different dataset, such as MINST handwritten digits, the scatterplot is not showing distinctive boundary and therefore it needs a more complicated model such as neural network to classify:

from sklearn.datasets import load_digits from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline import matplotlib.pyplot as plt digitsdata = load_digits() X, y = digitsdata['data'], digitsdata['target'] pca = PCA() pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)]) plt.figure(figsize=(8,6)) Xt = pipe.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(digitsdata['target_names'])) plt.show()

PCA in essence is to rearrange the features by their linear combinations. Hence it is called a feature extraction technique. One characteristic of PCA is that the first principal component holds the most information about the dataset. The second principal component is more informative than the third, and so on.

To illustrate this idea, we can remove the principal components from the original dataset in steps and see how the dataset looks like. Let’s consider a dataset with fewer features, and show two features in a plot:

from sklearn.datasets import load_iris irisdata = load_iris() X, y = irisdata['data'], irisdata['target'] plt.figure(figsize=(8,6)) plt.scatter(X[:,0], X[:,1], c=y) plt.show()

This is the iris dataset which has only four features. The features are in comparable scales and hence we can skip the scaler. With a 4-features data, the PCA can produce at most 4 principal components:

... pca = PCA().fit(X) print(pca.components_)

[[ 0.36138659 -0.08452251 0.85667061 0.3582892 ] [ 0.65658877 0.73016143 -0.17337266 -0.07548102] [-0.58202985 0.59791083 0.07623608 0.54583143] [-0.31548719 0.3197231 0.47983899 -0.75365743]]

For example, the first row is the first principal axis on which the first principal component is created. For any data point $p$ with features $p=(a,b,c,d)$, since the principal axis is denoted by the vector $v=(0.36,-0.08,0.86,0.36)$, the first principal component of this data point has the value $0.36 \times a – 0.08 \times b + 0.86 \times c + 0.36\times d$ on the principal axis. Using vector dot product, this value can be denoted by

$$

p \cdot v

$$

Therefore, with the dataset $X$ as a 150 $\times$ 4 matrix (150 data points, each has 4 features), we can map each data point into to the value on this principal axis by matrix-vector multiplication:

$$

X \cdot v

$$

and the result is a vector of length 150. Now if we remove from each data point the corresponding value along the principal axis vector, that would be

$$

X – (X \cdot v) \cdot v^T

$$

where the transposed vector $v^T$ is a row and $X\cdot v$ is a column. The product $(X \cdot v) \cdot v^T$ follows matrix-matrix multiplication and the result is a $150\times 4$ matrix, same dimension as $X$.

If we plot the first two feature of $(X \cdot v) \cdot v^T$, it looks like this:

... # Remove PC1 Xmean = X - X.mean(axis=0) value = Xmean @ pca.components_[0] pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1) Xremove = X - pc1 plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.show()

The numpy array `Xmean`

is to shift the features of `X`

to centered at zero. This is required for PCA. Then the array `value`

is computed by matrix-vector multiplication.

The array `value`

is the magnitude of each data point mapped on the principal axis. So if we multiply this value to the principal axis vector we get back an array `pc1`

. Removing this from the original dataset `X`

, we get a new array `Xremove`

. In the plot we observed that the points on the scatter plot crumbled together and the cluster of each class is less distinctive than before. This means we removed a lot of information by removing the first principal component. If we repeat the same process again, the points are further crumbled:

... # Remove PC2 value = Xmean @ pca.components_[1] pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1) Xremove = Xremove - pc2 plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.show()

This looks like a straight line but actually not. If we repeat once more, all points collapse into a straight line:

... # Remove PC3 value = Xmean @ pca.components_[2] pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1) Xremove = Xremove - pc3 plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.show()

The points all fall on a straight line because we removed three principal components from the data where there are only four features. Hence our data matrix becomes **rank 1**. You can try repeat once more this process and the result would be all points collapse into a single point. The amount of information removed in each step as we removed the principal components can be found by the corresponding **explained variance ratio** from the PCA:

... print(pca.explained_variance_ratio_)

[0.92461872 0.05306648 0.01710261 0.00521218]

Here we can see, the first component explained 92.5% variance and the second component explained 5.3% variance. If we removed the first two principal components, the remaining variance is only 2.2%, hence visually the plot after removing two components looks like a straight line. In fact, when we check with the plots above, not only we see the points are crumbled, but the range in the x- and y-axes are also smaller as we removed the components.

In terms of machine learning, we can consider using only one single feature for classification in this dataset, namely the first principal component. We should expect to achieve no less than 90% of the original accuracy as using the full set of features:

... from sklearn.model_selection import train_test_split from sklearn.metrics import f1_score from collections import Counter from sklearn.svm import SVC X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train) print("Using all features, accuracy: ", clf.score(X_test, y_test)) print("Using all features, F1: ", f1_score(y_test, clf.predict(X_test), average="macro")) mean = X_train.mean(axis=0) X_train2 = X_train - mean X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1) clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train) X_test2 = X_test - mean X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1) print("Using PC1, accuracy: ", clf.score(X_test2, y_test)) print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2), average="macro"))

Using all features, accuracy: 1.0 Using all features, F1: 1.0 Using PC1, accuracy: 0.96 Using PC1, F1: 0.9645191409897292

The other use of the explained variance is on compression. Given the explained variance of the first principal component is large, if we need to store the dataset, we can store only the the projected values on the first principal axis ($X\cdot v$), as well as the vector $v$ of the principal axis. Then we can approximately reproduce the original dataset by multiplying them:

$$

X \approx (X\cdot v) \cdot v^T

$$

In this way, we need storage for only one value per data point instead of four values for four features. The approximation is more accurate if we store the projected values on multiple principal axes and add up multiple principal components.

Putting these together, the following is the complete code to generate the visualizations:

from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn.decomposition import PCA from sklearn.metrics import f1_score from sklearn.svm import SVC import matplotlib.pyplot as plt # Load iris dataset irisdata = load_iris() X, y = irisdata['data'], irisdata['target'] plt.figure(figsize=(8,6)) plt.scatter(X[:,0], X[:,1], c=y) plt.xlabel(irisdata["feature_names"][0]) plt.ylabel(irisdata["feature_names"][1]) plt.title("Two features from the iris dataset") plt.show() # Show the principal components pca = PCA().fit(X) print("Principal components:") print(pca.components_) # Remove PC1 Xmean = X - X.mean(axis=0) value = Xmean @ pca.components_[0] pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1) Xremove = X - pc1 plt.figure(figsize=(8,6)) plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.xlabel(irisdata["feature_names"][0]) plt.ylabel(irisdata["feature_names"][1]) plt.title("Two features from the iris dataset after removing PC1") plt.show() # Remove PC2 Xmean = X - X.mean(axis=0) value = Xmean @ pca.components_[1] pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1) Xremove = Xremove - pc2 plt.figure(figsize=(8,6)) plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.xlabel(irisdata["feature_names"][0]) plt.ylabel(irisdata["feature_names"][1]) plt.title("Two features from the iris dataset after removing PC1 and PC2") plt.show() # Remove PC3 Xmean = X - X.mean(axis=0) value = Xmean @ pca.components_[2] pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1) Xremove = Xremove - pc3 plt.figure(figsize=(8,6)) plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.xlabel(irisdata["feature_names"][0]) plt.ylabel(irisdata["feature_names"][1]) plt.title("Two features from the iris dataset after removing PC1 to PC3") plt.show() # Print the explained variance ratio print("Explainedd variance ratios:") print(pca.explained_variance_ratio_) # Split data X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) # Run classifer on all features clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train) print("Using all features, accuracy: ", clf.score(X_test, y_test)) print("Using all features, F1: ", f1_score(y_test, clf.predict(X_test), average="macro")) # Run classifier on PC1 mean = X_train.mean(axis=0) X_train2 = X_train - mean X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1) clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train) X_test2 = X_test - mean X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1) print("Using PC1, accuracy: ", clf.score(X_test2, y_test)) print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2), average="macro"))

This section provides more resources on the topic if you are looking to go deeper.

- How to Calculate Principal Component Analysis (PCA) from Scratch in Python
- Principal Component Analysis for Dimensionality Reduction in Python

- scikit-learn toy datasets
- scikit-learn iris dataset
- scikit-learn wine dataset
- matplotlib scatter API
- The mplot3d toolkit

In this tutorial, you discovered how to visualize data using principal component analysis.

Specifically, you learned:

- Visualize a high dimensional dataset in 2D using PCA
- How to use the plot in PCA dimensions to help choosing an appropriate machine learning model
- How to observe the explained variance ratio of PCA
- What the explained variance ratio means for machine learning

The post Principal Component Analysis for Visualization appeared first on Machine Learning Mastery.

]]>