The post Calculus for Machine Learning (7-day mini-course) appeared first on Machine Learning Mastery.

Get familiar with the calculus techniques in machine learning in 7 days.

Calculus is an important mathematical technique behind many machine learning algorithms. You don’t always need to know it to use those algorithms, but when you go deeper, you will see it is ubiquitous in every discussion of the theory behind a machine learning model.

As practitioners, we are most likely not going to encounter very hard calculus problems. If we need to solve one, there are tools such as computer algebra systems to help, or at least to verify our solution. What is more important is understanding the ideas behind calculus and relating the calculus terms to their use in our machine learning algorithms.

In this crash course, you will discover some common calculus ideas used in machine learning. You will learn with exercises in Python in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

This course is for developers who may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end to end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.
- You may know some basic linear algebra.
- You may know some basic machine learning models.

You do NOT need to be:

- A math wiz!
- A machine learning expert!

This crash course will take you from a developer who knows a little machine learning to a developer who can effectively talk about the calculus concepts in machine learning algorithms.

Note: This crash course assumes you have a working Python 3.7 environment with some libraries such as SciPy and SymPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with calculus for machine learning in Python:

- **Lesson 01**: Differential calculus
- **Lesson 02**: Integration
- **Lesson 03**: Gradient of a vector function
- **Lesson 04**: Jacobian
- **Lesson 05**: Backpropagation
- **Lesson 06**: Optimization
- **Lesson 07**: Support vector machine

Each lesson could take you 5 minutes or up to 1 hour. Take your time and complete the lessons at your own pace. Ask questions, and even post results in the comments below.

The lessons might expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help with the algorithms and the best-of-breed tools in Python. (**Hint**: *I have all of the answers on this blog; use the search box*.)

**Post your results in the comments**; I’ll cheer you on!

Hang in there; don’t give up.

In this lesson, you will discover what differential calculus, or differentiation, is.

Differentiation is the operation of transforming one mathematical function into another, called the derivative. The derivative tells us the slope, or the rate of change, of the original function.

For example, if we have a function $f(x)=x^2$, its derivative is a function that tells us the rate of change of this function at $x$. The rate of change is defined as: $$f'(x) = \frac{f(x+\delta x)-f(x)}{\delta x}$$ for a small quantity $\delta x$.

Usually we will define the above in the form of a limit, i.e.,

$$f'(x) = \lim_{\delta x\to 0} \frac{f(x+\delta x)-f(x)}{\delta x}$$

to mean $\delta x$ should be as close to zero as possible.

There are several rules of differentiation that help us find derivatives more easily. One rule that fits the above example is $\frac{d}{dx} x^n = nx^{n-1}$. Hence for $f(x)=x^2$, we have the derivative $f'(x)=2x$.
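If you have SymPy installed (it is mentioned in the setup note above), you can also let it check this rule symbolically. This is a side demonstration, not part of the lesson’s plotting exercise:

```python
import sympy as sp

x = sp.symbols('x')
# Apply the power rule symbolically: d/dx x^2 = 2x
derivative = sp.diff(x**2, x)
print(derivative)  # 2*x
```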

We can confirm this is the case by plotting the function $f'(x)$ computed according to the rate of change together with that computed according to the rule of differentiation. The following uses NumPy and matplotlib in Python:

```python
import numpy as np
import matplotlib.pyplot as plt

# Define function f(x)
def f(x):
    return x**2

# compute f(x) = x^2 for x=-10 to x=10
x = np.linspace(-10, 10, 500)
y = f(x)

# Plot f(x) on left half of the figure
fig = plt.figure(figsize=(12, 5))
ax = fig.add_subplot(121)
ax.plot(x, y)
ax.set_title("y=f(x)")

# f'(x) using the rate of change
delta_x = 0.0001
y1 = (f(x+delta_x) - f(x))/delta_x

# f'(x) using the rule
y2 = 2 * x

# Plot f'(x) on right half of the figure
ax = fig.add_subplot(122)
ax.plot(x, y1, c="r", alpha=0.5, label="rate")
ax.plot(x, y2, c="b", alpha=0.5, label="rule")
ax.set_title("y=f'(x)")
ax.legend()
plt.show()
```

In the plot above, we can see the derivative function found using the rate of change and then using the rule of differentiation coincide perfectly.

We can similarly differentiate other functions. For example, consider $f(x)=x^3 - 2x^2 + 1$. Find the derivative of this function using the rules of differentiation and compare your result with the one found using the limit of the rate of change. If you’re doing it correctly, you should see the following graph:

In the next lesson, you will discover that integration is the reverse of differentiation.

In this lesson, you will discover integration is the reverse of differentiation.

If we consider a function $f(x)=2x$ and evaluate it at steps of $\delta x$ (e.g., $\delta x = 0.1$), we can compute, say, from $x=-10$ to $x=10$:

$$

f(-10), f(-9.9), f(-9.8), \cdots, f(9.8), f(9.9), f(10)

$$

Obviously, if we have a smaller step $\delta x$, there are more terms in the above.

If we multiply each of the above with the step size and then add them up, i.e.,

$$

f(-10)\times 0.1 + f(-9.9)\times 0.1 + \cdots + f(9.8)\times 0.1 + f(9.9)\times 0.1

$$

this sum is called the integral of $f(x)$. In essence, this sum is the **area under the curve** of $f(x)$ from $x=-10$ to $x=10$. The fundamental theorem of calculus says that if we express the area under the curve as a function of its upper limit, its derivative is $f(x)$. Hence we can see integration as the reverse operation of differentiation.

As we saw in Lesson 01, the derivative of $f(x)=x^2$ is $f'(x)=2x$. This means that for $f(x)=2x$, we can write $\int f(x)\, dx = x^2$, or we can say the antiderivative of $f(x)=2x$ is $x^2$. We can confirm this in Python by calculating the area directly:

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 2*x

# Set up x from -10 to 10 with small steps
delta_x = 0.1
x = np.arange(-10, 10, delta_x)

# Find f(x) * delta_x
fx = f(x) * delta_x

# Compute the running sum
y = fx.cumsum()

# Plot
plt.plot(x, y)
plt.show()
```

This plot has the same shape as $f(x)$ in Lesson 01. Indeed, all functions that differ by a constant (e.g., $f(x)$ and $f(x)+5$) have the same derivative. Hence the plot of the computed antiderivative may be the original function shifted vertically.
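We can also ask SymPy for the antiderivative symbolically; note that SymPy returns one antiderivative and omits the arbitrary constant. A minimal sketch:

```python
import sympy as sp

x = sp.symbols('x')
# One antiderivative of 2x; SymPy omits the arbitrary constant
F = sp.integrate(2*x, x)
print(F)  # x**2

# Differentiating the antiderivative recovers the original function
assert sp.diff(F, x) == 2*x
```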

Consider $f(x)=3x^2-4x$. Find the antiderivative of this function and plot it. Also, try to replace the function in the Python code above with this one. If you plot both together, you should see the following:

Post your answer in the comments below. I would love to see what you come up with.

In this lesson, you will learn the concept of the gradient of a multivariate function.

If we have a function of not one variable but two or more, the differentiation is extended naturally to be the differentiation of the function with respect to each variable. For example, if we have the function $f(x,y) = x^2 + y^3$, we can write the differentiation in each variable as:

$$

\begin{aligned}

\frac{\partial f}{\partial x} &= 2x \\

\frac{\partial f}{\partial y} &= 3y^2

\end{aligned}

$$

Here we introduced the notation of the partial derivative, which means differentiating a function with respect to one variable while treating the other variables as constants. Hence in the above, when we compute $\frac{\partial f}{\partial x}$, we ignore the $y^3$ part of the function $f(x,y)$.
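SymPy can compute these partial derivatives for us as a quick check:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + y**3
# diff() with respect to one symbol treats the other symbol as a constant
print(sp.diff(f, x))  # 2*x
print(sp.diff(f, y))  # 3*y**2
```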

A function of two variables can be visualized as a surface over a plane. The above function $f(x,y)$ can be visualized using matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# Define the range for x and y
x = np.linspace(-10, 10, 1000)
xv, yv = np.meshgrid(x, x, indexing='ij')

# Compute f(x,y) = x^2 + y^3
zv = xv**2 + yv**3

# Plot the surface
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection='3d')
ax.plot_surface(xv, yv, zv, cmap="viridis")
plt.show()
```

The gradient of this function is denoted as:

$$\nabla f(x,y) = \Big(\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y}\Big) = (2x,\;3y^2)$$

Therefore, at each coordinate $(x,y)$, the gradient $\nabla f(x,y)$ is a vector. This vector tells us two things:

- The direction of the vector points to where the function $f(x,y)$ is increasing the fastest
- The size of the vector is the rate of change of the function $f(x,y)$ in this direction
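For instance, evaluating the gradient formula above at a particular point gives both pieces of information at once; a small sketch with NumPy:

```python
import numpy as np

def grad_f(x, y):
    # Gradient of f(x,y) = x^2 + y^3, from the partial derivatives above
    return np.array([2*x, 3*y**2])

g = grad_f(2, 3)
print(g)                  # [ 4 27] -- the direction of fastest increase
print(np.linalg.norm(g))  # its length is the rate of change in that direction
```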

One way to visualize the gradient is to consider it as a **vector field**:

```python
import numpy as np
import matplotlib.pyplot as plt

# Define the range for x and y
x = np.linspace(-10, 10, 20)
xv, yv = np.meshgrid(x, x, indexing='ij')

# Compute the gradient of f(x,y) = x^2 + y^3
fx = 2*xv
fy = 3*yv**2

# Convert the vector (fx,fy) into size and direction
size = np.sqrt(fx**2 + fy**2)
dir_x = fx/size
dir_y = fy/size

# Plot the vector field
plt.figure(figsize=(6, 6))
plt.quiver(xv, yv, dir_x, dir_y, size, cmap="viridis")
plt.show()
```

The viridis color map in matplotlib shows larger values in yellow and smaller values in purple. Hence we see the gradient is “steeper” at the edges than in the center of the above plot.

If we consider the coordinate (2,3), we can check which direction $f(x,y)$ will increase the fastest using the following:

```python
import numpy as np

def f(x, y):
    return x**2 + y**3

# 0 to 360 degrees at 0.1-degree steps
angles = np.arange(0, 360, 0.1)

# coordinate to check
x, y = 2, 3

# step size for differentiation
step = 0.0001

# To keep the size and direction of maximum rate of change
maxdf, maxangle = -np.inf, 0
for angle in angles:
    # convert degree to radian
    rad = angle * np.pi / 180
    # delta x and delta y for a fixed step size
    dx, dy = np.sin(rad)*step, np.cos(rad)*step
    # rate of change at a small step
    df = (f(x+dx, y+dy) - f(x, y))/step
    # keep the maximum rate of change
    if df > maxdf:
        maxdf, maxangle = df, angle

# Report the result
dx, dy = np.sin(maxangle*np.pi/180), np.cos(maxangle*np.pi/180)
gradx, grady = dx*maxdf, dy*maxdf
print(f"Max rate of change at {maxangle} degrees")
print(f"Gradient vector at ({x},{y}) is ({gradx},{grady})")
```

Its output is:

```
Max rate of change at 8.4 degrees
Gradient vector at (2,3) is (3.987419245872443,27.002750276227097)
```

The gradient vector according to the formula is $(4, 27)$, to which the numerical result above is close enough.

Consider the function $f(x,y)=x^2+y^2$. What is the gradient vector at $(1,1)$? If you get the answer from partial differentiation, can you modify the above Python code to confirm it by checking the rate of change in different directions?

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover the differentiation of a function that takes vector input and produces vector output.

In this lesson, you will learn about the Jacobian matrix.

The function $f(x,y)=(p(x,y),\, q(x,y))=(2xy,\, x^2y)$ is one with two inputs and two outputs. Sometimes we say this function takes a vector argument and returns a vector value. The derivative of this function is a matrix called the Jacobian. The Jacobian of the above function is:

$$

\mathbf{J} =

\begin{bmatrix}

\frac{\partial p}{\partial x} & \frac{\partial p}{\partial y} \\

\frac{\partial q}{\partial x} & \frac{\partial q}{\partial y}

\end{bmatrix}

=

\begin{bmatrix}

2y & 2x \\

2xy & x^2

\end{bmatrix}

$$

In the Jacobian matrix, each row holds the partial derivatives of one element of the output vector with respect to all input variables, and each column corresponds to one element of the input vector.

We will see the use of Jacobian later. Since finding a Jacobian matrix involves a lot of partial differentiations, it would be great if we could let a computer check our math. In Python, we can verify the above result using SymPy:

```python
from sympy.abc import x, y
from sympy import Matrix, pprint

f = Matrix([2*x*y, x**2*y])
variables = Matrix([x, y])
pprint(f.jacobian(variables))
```

Its output is:

```
⎡ 2⋅y   2⋅x⎤
⎢          ⎥
⎢        2 ⎥
⎣2⋅x⋅y  x  ⎦
```

We asked SymPy to define the symbols `x` and `y` and then defined the vector function `f`. Afterward, the Jacobian can be found by calling the `jacobian()` function.

Consider the function

$$

f(x,y) = \begin{bmatrix}

\frac{1}{1+e^{-(px+qy)}} & \frac{1}{1+e^{-(rx+sy)}} & \frac{1}{1+e^{-(tx+uy)}}

\end{bmatrix}

$$

where $p,q,r,s,t,u$ are constants. What is the Jacobian matrix of $f(x,y)$? Can you verify it with SymPy?

In the next lesson, you will discover the application of the Jacobian matrix in a neural network’s backpropagation algorithm.

In this lesson, you will see how the backpropagation algorithm uses the Jacobian matrix.

If we consider a neural network with one hidden layer, we can represent it as a function:

$$

y = g\Big(\sum_{k=1}^M u_k f_k\big(\sum_{i=1}^N w_{ik}x_i\big)\Big)

$$

The input to the neural network is a vector $\mathbf{x}=(x_1, x_2, \cdots, x_N)$ and each $x_i$ will be multiplied with weight $w_{ik}$ and fed into the hidden layer. The output of neuron $k$ in the hidden layer will be multiplied with weight $u_k$ and fed into the output layer. The activation function of the hidden layer and output layer are $f$ and $g$, respectively.

If we consider

$$z_k = f_k\big(\sum_{i=1}^N w_{ik}x_i\big)$$

then

$$

\frac{\partial y}{\partial x_i} = \sum_{k=1}^M \frac{\partial y}{\partial z_k}\frac{\partial z_k}{\partial x_i}

$$

If we consider the entire layer at once, we have $\mathbf{z}=(z_1, z_2, \cdots, z_M)$ and then

$$

\frac{\partial y}{\partial \mathbf{x}} = \mathbf{W}^\top\frac{\partial y}{\partial \mathbf{z}}

$$

where $\mathbf{W}$ is the $M\times N$ Jacobian matrix, where the element on row $k$ and column $i$ is $\frac{\partial z_k}{\partial x_i}$.
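We can spot-check this matrix form numerically. The sketch below uses a toy linear layer $\mathbf{z}=\mathbf{W}\mathbf{x}$ (whose Jacobian is simply $\mathbf{W}$) and a made-up scalar output $y=\sum_k z_k^2$; neither choice comes from the lesson — they are just simple enough to verify against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # M=3 outputs, N=2 inputs
x = rng.standard_normal(2)

z = W @ x                          # toy linear layer: the Jacobian dz/dx is W itself
dy_dz = 2 * z                      # since y = sum(z^2)
grad_analytic = W.T @ dy_dz        # the chain rule in matrix form

# Compare against a finite-difference approximation of dy/dx
eps = 1e-6
grad_numeric = np.zeros_like(x)
for i in range(len(x)):
    xp = x.copy()
    xp[i] += eps
    grad_numeric[i] = (np.sum((W @ xp)**2) - np.sum(z**2)) / eps

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```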

This is how the backpropagation algorithm works in training a neural network! For a network with multiple hidden layers, we need to compute the Jacobian matrix for each layer.

The code below implements a neural network model that you can try yourself. It has two hidden layers and is a classification network that separates points in two dimensions into two classes. Try to look at the function `backward()` and identify the Jacobian matrix.

If you play with this code, the class `mlp` should not be modified, but you can change the parameters for how a model is created.

```python
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
import numpy as np
np.random.seed(0)

# Find a small float to avoid division by zero
epsilon = np.finfo(float).eps

# Sigmoid function and its differentiation
def sigmoid(z):
    return 1/(1+np.exp(-z.clip(-500, 500)))
def dsigmoid(z):
    # derivative of the sigmoid is s*(1-s)
    s = sigmoid(z)
    return s * (1-s)

# ReLU function and its differentiation
def relu(z):
    return np.maximum(0, z)
def drelu(z):
    return (z > 0).astype(float)

# Loss function L(y, yhat) and its differentiation
def cross_entropy(y, yhat):
    """Binary cross entropy function
        L = - y log yhat - (1-y) log (1-yhat)

    Args:
        y, yhat (np.array): nx1 matrices, where n is the number of data instances
    Returns:
        average cross entropy value of shape 1x1, averaging over the n instances
    """
    return (
        -(y.T @ np.log(yhat.clip(epsilon)) +
          (1-y.T) @ np.log((1-yhat).clip(epsilon))
         ) / y.shape[0]
    )
def d_cross_entropy(y, yhat):
    """ dL/dyhat """
    return (
        - np.divide(y, yhat.clip(epsilon)) +
          np.divide(1-y, (1-yhat).clip(epsilon))
    )

class mlp:
    '''Multilayer perceptron using numpy
    '''
    def __init__(self, layersizes, activations, derivatives, lossderiv):
        """remember config, then initialize array to hold NN parameters without init"""
        # hold NN config
        self.layersizes = tuple(layersizes)
        self.activations = tuple(activations)
        self.derivatives = tuple(derivatives)
        self.lossderiv = lossderiv
        # parameters, each is a 2D numpy array
        L = len(self.layersizes)
        self.z = [None] * L
        self.W = [None] * L
        self.b = [None] * L
        self.a = [None] * L
        self.dz = [None] * L
        self.dW = [None] * L
        self.db = [None] * L
        self.da = [None] * L

    def initialize(self, seed=42):
        """initialize the value of weight matrices and bias vectors with small random numbers."""
        np.random.seed(seed)
        sigma = 0.1
        for l, (n_in, n_out) in enumerate(zip(self.layersizes, self.layersizes[1:]), 1):
            self.W[l] = np.random.randn(n_in, n_out) * sigma
            self.b[l] = np.random.randn(1, n_out) * sigma

    def forward(self, x):
        """Feed forward using existing `W` and `b`, and overwrite the result
        variables `a` and `z`

        Args:
            x (numpy.ndarray): Input data to feed forward
        """
        self.a[0] = x
        for l, func in enumerate(self.activations, 1):
            # z = W a + b, with `a` as output from previous layer
            # `W` is of size rxs and `a` the size sxn with n the number of data
            # instances, `z` the size rxn, `b` is rx1 and broadcast to each
            # column of `z`
            self.z[l] = (self.a[l-1] @ self.W[l]) + self.b[l]
            # a = g(z), with `a` as output of this layer, of size rxn
            self.a[l] = func(self.z[l])
        return self.a[-1]

    def backward(self, y, yhat):
        """back propagation using NN output yhat and the reference output y,
        generates dW, dz, db, da
        """
        # first `da`, at the output
        self.da[-1] = self.lossderiv(y, yhat)
        for l, func in reversed(list(enumerate(self.derivatives, 1))):
            # compute the differentials at this layer
            self.dz[l] = self.da[l] * func(self.z[l])
            self.dW[l] = self.a[l-1].T @ self.dz[l]
            self.db[l] = np.mean(self.dz[l], axis=0, keepdims=True)
            self.da[l-1] = self.dz[l] @ self.W[l].T

    def update(self, eta):
        """Updates W and b

        Args:
            eta (float): Learning rate
        """
        for l in range(1, len(self.W)):
            self.W[l] -= eta * self.dW[l]
            self.b[l] -= eta * self.db[l]

# Make data: Two circles on x-y plane as a classification problem
X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)
y = y.reshape(-1, 1)  # our model expects a 2D array of (n_sample, n_dim)

# Build a model
model = mlp(layersizes=[2, 4, 3, 1],
            activations=[relu, relu, sigmoid],
            derivatives=[drelu, drelu, dsigmoid],
            lossderiv=d_cross_entropy)
model.initialize()
yhat = model.forward(X)
loss = cross_entropy(y, yhat)
score = accuracy_score(y, (yhat > 0.5))
print(f"Before training - loss value {loss} accuracy {score}")

# train for each epoch
n_epochs = 150
learning_rate = 0.005
for n in range(n_epochs):
    model.forward(X)
    yhat = model.a[-1]
    model.backward(y, yhat)
    model.update(learning_rate)
    loss = cross_entropy(y, yhat)
    score = accuracy_score(y, (yhat > 0.5))
    print(f"Iteration {n} - loss value {loss} accuracy {score}")
```

In the next lesson, you will discover the use of differentiation to find the optimal value of a function.

In this lesson, you will learn an important use of differentiation.

Because the differentiation of a function is the rate of change, we can make use of differentiation to find the optimal point of a function.

If a function attains its maximum, we would expect it to rise from a lower point to the maximum, and if we move further, to fall to another lower point. Hence at the point of maximum, the rate of change of the function is zero. The same holds at the point of minimum.

As an example, consider $f(x)=x^3-2x^2+1$. The derivative is $f'(x) = 3x^2-4x$, and $f'(x)=0$ at $x=0$ and $x=4/3$. Hence these values of $x$ are where $f(x)$ attains its local maximum or minimum. We can visually confirm this by plotting $f(x)$ (see the plot in Lesson 01).
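If you have SymPy at hand, the same critical points can be found symbolically; a sketch, not required for the lesson:

```python
import sympy as sp

x = sp.symbols('x')
f = x**3 - 2*x**2 + 1
fprime = sp.diff(f, x)           # 3*x**2 - 4*x
critical = sp.solve(fprime, x)
print(critical)                  # roots are 0 and 4/3

# Second derivative test: negative means a local maximum, positive a local minimum
fpp = sp.diff(fprime, x)
print({c: fpp.subs(x, c) for c in critical})  # f''(0) = -4, f''(4/3) = 4
```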

Consider the function $f(x)=\log x$ and find its derivative. What will be the value of $x$ when $f'(x)=0$? What does it tell you about the maximum or minimum of the log function? Try to plot the function of $\log x$ to visually confirm your answer.

In the next lesson, you will discover the application of this technique in finding the support vector.

In this lesson, you will learn how we can formulate the support vector machine as an optimization problem.

In a two-dimensional plane, any straight line can be represented by the equation:

$$ax+by+c=0$$

in the $xy$-coordinate system. A result from the study of coordinate geometry says that for any point $(x_0,y_0)$, its **distance** to the line $ax+by+c=0$ is:

$$

\frac{\vert ax_0+by_0+c \vert}{\sqrt{a^2+b^2}}

$$
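As a sanity check, this distance formula is easy to code directly; the line and point below are made up for illustration:

```python
import math

def distance_to_line(x0, y0, a, b, c):
    """Distance from the point (x0, y0) to the line ax + by + c = 0."""
    return abs(a*x0 + b*y0 + c) / math.sqrt(a**2 + b**2)

# For example, (0, 0) is 1/sqrt(2) away from the line x + y - 1 = 0
print(distance_to_line(0, 0, 1, 1, -1))  # 0.7071067811865475
```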

Consider the points (0,0), (1,2), and (2,1) in the $xy$-plane, in which the first point and the latter two points are in different classes. What is the line that best separates these two classes? This is the basis of a support vector machine classifier: the separating line with the maximum margin is what we seek, and the data points closest to it are called the support vectors.

To find such a line, we are looking for:

$$

\begin{aligned}

\text{minimize} && a^2 + b^2 \\

\text{subject to} && -1(0a+0b+c) &\ge 1 \\

&& +1(1a+2b+c) &\ge 1 \\

&& +1(2a+1b+c) &\ge 1

\end{aligned}

$$

The objective $a^2+b^2$ is minimized so that the margin, i.e., the distance from the line to the closest data points, is maximized. The condition $-1(0a+0b+c)\ge 1$ means the point (0,0) is of class $-1$; similarly, the other two points are of class $+1$. The straight line should put these two classes on different sides of the plane.

This is a **constrained optimization** problem, and the way to solve it is to use the Lagrange multiplier approach. The first step in using the Lagrange multiplier approach is to find the partial derivatives of the following Lagrange function:

$$

L = a^2+b^2 + \lambda_1(-c-1) + \lambda_2 (a+2b+c-1) + \lambda_3 (2a+b+c-1)

$$

and set the partial derivatives to zero, then solve for $a$, $b$, and $c$. It would be too lengthy to demonstrate here, but we can use SciPy to find a numerical solution:

```python
import numpy as np
from scipy.optimize import minimize

def objective(w):
    return w[0]**2 + w[1]**2

def constraint1(w):
    "Inequality for point (0,0)"
    return -1*w[2] - 1

def constraint2(w):
    "Inequality for point (1,2)"
    return w[0] + 2*w[1] + w[2] - 1

def constraint3(w):
    "Inequality for point (2,1)"
    return 2*w[0] + w[1] + w[2] - 1

# initial guess
w0 = np.array([1, 1, 1])

# optimize
bounds = ((-10, 10), (-10, 10), (-10, 10))
constraints = [
    {"type": "ineq", "fun": constraint1},
    {"type": "ineq", "fun": constraint2},
    {"type": "ineq", "fun": constraint3},
]
solution = minimize(objective, w0, method="SLSQP", bounds=bounds, constraints=constraints)
w = solution.x
print("Objective:", objective(w))
print("Solution:", w)
```

It will print:

```
Objective: 0.8888888888888942
Solution: [ 0.66666667  0.66666667 -1.        ]
```

The above means the line that separates these three points is $0.67x + 0.67y - 1 = 0$. Note that if you provided $N$ data points, there would be $N$ constraints to define.

Let’s consider the points (-1,-1) and (-3,-1) to be in the first class together with (0,0), and the point (3,3) to be in the second class together with (1,2) and (2,1). In this problem of six points, can you modify the above program to find the line that separates the two classes? Don’t be surprised to see the solution remain the same as above. There is a reason for it. Can you tell?

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson.


You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- What is differentiation, and what it means to a function
- What is integration
- How to extend differentiation to a function of vector argument
- How to do differentiation on a vector-valued function
- The role of Jacobian in the backpropagation algorithm in neural networks
- How to use differentiation to find the optimum points of a function
- How the support vector machine is a constrained optimization problem that requires differentiation to solve

**How did you do with the mini-course?**

Did you enjoy this crash course?

**Do you have any questions? Were there any sticking points?**

Let me know. Leave a comment below.


The post Method of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 3: Implementing An SVM From Scratch In Python) appeared first on Machine Learning Mastery.

After completing this tutorial, you will know:

- How to use SciPy’s optimization routines
- How to define the objective function
- How to define bounds and linear constraints
- How to implement your own SVM classifier in Python

Let’s get started.

This tutorial is divided into 2 parts; they are:

- The optimization problem of an SVM
- Solution of the optimization problem in Python
    - Define the objective function
    - Define the bounds and linear constraints
    - Solve the problem with different C values

For this tutorial, it is assumed that you are already familiar with the following topics. You can click on the individual links to get more details.

- A Gentle Introduction to Optimization / Mathematical Programming
- A Gentle Introduction To Method Of Lagrange Multipliers
- Lagrange Multiplier Approach with Inequality Constraints
- Method Of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 1: The Separable Case)
- Method Of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 2: The Non-Separable Case)

A basic SVM assumes a binary classification problem. Suppose we have $m$ training points, each point being an $n$-dimensional vector. We’ll use the following notations:

- $m$: Total training points
- $n$: Dimensionality of each training point
- $x$: Data point, which is an $n$-dimensional vector
- $i$: Subscript used to index the training points. $0 \leq i < m$
- $k$: Subscript used to index the training points. $0 \leq k < m$
- $j$: Subscript used to index each dimension of a training point
- $t$: Label of a data point. It is an $m$-dimensional vector, with $t_i \in \{-1, +1\}$
- $T$: Transpose operator
- $w$: Weight vector denoting the coefficients of the hyperplane. It is also an $n$-dimensional vector
- $\alpha$: Vector of Lagrange multipliers, also an $m$-dimensional vector
- $C$: User defined penalty factor/regularization constant

The SVM classifier maximizes the following Lagrange dual:

$$

L_d = -\frac{1}{2} \sum_i \sum_k \alpha_i \alpha_k t_i t_k (x_i)^T (x_k) + \sum_i \alpha_i

$$

The above function is subject to the following constraints:

\begin{eqnarray}

0 \leq \alpha_i \leq C, & \forall i\\

\sum_i \alpha_i t_i = 0& \\

\end{eqnarray}

All we have to do is find the Lagrange multiplier $\alpha$ associated with each training point, while satisfying the above constraints.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

We’ll use the SciPy optimize package to find the optimal values of Lagrange multipliers, and compute the soft margin and the separating hyperplane.

Let’s write the import section for optimization, plotting and synthetic data generation.

```python
import numpy as np

# For optimization
from scipy.optimize import Bounds, BFGS
from scipy.optimize import LinearConstraint, minimize

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns

# For generating dataset
import sklearn.datasets as dt
```

We also need a constant to detect all alphas that are numerically close to zero, so we define our own threshold for zero:

```python
ZERO = 1e-7
```

Let’s define a very simple dataset, the corresponding labels, and a simple routine for plotting this data. Optionally, if an array of alphas is given to the plotting function, it will also label all support vectors with their corresponding alpha values. Just to recall, support vectors are those points for which $\alpha>0$.

```python
dat = np.array([[0, 3], [-1, 0], [1, 2], [2, 1], [3, 3],
                [0, 0], [-1, -1], [-3, 1], [3, 1]])
labels = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1])

def plot_x(x, t, alpha=[], C=0):
    sns.scatterplot(x=x[:, 0], y=x[:, 1], style=t, hue=t,
                    markers=['s', 'P'], palette=['magenta', 'green'])
    if len(alpha) > 0:
        alpha_str = np.char.mod('%.1f', np.round(alpha, 1))
        ind_sv = np.where(alpha > ZERO)[0]
        for i in ind_sv:
            plt.gca().text(x[i, 0], x[i, 1]-.25, alpha_str[i])

plot_x(dat, labels)
```

Let’s look at the `minimize()` function in the `scipy.optimize` library. It requires the following arguments:

- The objective function to minimize. Lagrange dual in our case.
- The initial values of variables with respect to which the minimization takes place. In this problem, we have to determine the Lagrange multipliers $\alpha$. We’ll initialize all $\alpha$ randomly.
- The method to use for optimization. We’ll use `trust-constr`.
- The linear constraints on $\alpha$.
- The bounds on $\alpha$.

Our objective function is $L_d$ defined above, which has to be maximized. As we are using the `minimize()` function, we multiply $L_d$ by $-1$ and minimize the result. Its implementation is given below. The first parameter of the objective function is the variable with respect to which the optimization takes place. We also need the training points and the corresponding labels as additional arguments.

You can shorten the code for the `lagrange_dual()` function given below by using matrices. However, in this tutorial, it is kept very simple to make everything clear.

```python
# Objective function: the negated Lagrange dual, to be minimized
def lagrange_dual(alpha, x, t):
    result = 0
    ind_sv = np.where(alpha > ZERO)[0]
    for i in ind_sv:
        for k in ind_sv:
            result = result + alpha[i]*alpha[k]*t[i]*t[k]*np.dot(x[i, :], x[k, :])
    result = 0.5*result - sum(alpha)
    return result
```
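For reference, one way to do the matrix shortening mentioned above is to build the Gram matrix of the label-scaled training points. This is a sketch, not the tutorial’s code, and unlike the loop version it does not skip near-zero alphas (their contribution is negligible anyway); the two points at the bottom are made up for a sanity check:

```python
import numpy as np

def lagrange_dual_vectorized(alpha, x, t):
    """Matrix form of the negated Lagrange dual."""
    xt = t[:, None] * x   # each row x_i scaled by its label t_i
    G = xt @ xt.T         # G[i, k] = t_i * t_k * (x_i . x_k)
    return 0.5 * alpha @ G @ alpha - np.sum(alpha)

# Tiny sanity check on two points
alpha = np.array([0.5, 0.5])
x = np.array([[1.0, 0.0], [0.0, 1.0]])
t = np.array([1.0, -1.0])
print(lagrange_dual_vectorized(alpha, x, t))  # -0.75
```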

The linear constraint on $\alpha$ is given by:

$$

\sum_i \alpha_i t_i = 0

$$

We can also write this as:

$$

\alpha_0 t_0 + \alpha_1 t_1 + \ldots + \alpha_{m-1} t_{m-1} = 0

$$

The `LinearConstraint()` method requires all constraints to be written in matrix form, which is:

\begin{equation}
0 =
\begin{bmatrix}
t_0 & t_1 & \ldots & t_{m-1}
\end{bmatrix}
\begin{bmatrix}
\alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_{m-1}
\end{bmatrix}
\end{equation}

The first matrix is the first parameter in the `LinearConstraint()` method. The left and right bounds are the second and third arguments.

```python
linear_constraint = LinearConstraint(labels, [0], [0])
print(linear_constraint)
```

```
<scipy.optimize._constraints.LinearConstraint object at 0x12c87f5b0>
```

The bounds on alpha are defined using the `Bounds()` method. All alphas are constrained to lie between 0 and $C$. Here is an example for $C=10$.

```python
bounds_alpha = Bounds(np.zeros(dat.shape[0]), np.full(dat.shape[0], 10))
print(bounds_alpha)
```

```
Bounds(array([0., 0., 0., 0., 0., 0., 0., 0., 0.]), array([10, 10, 10, 10, 10, 10, 10, 10, 10]))
```

Let’s write the overall routine to find the optimal values of `alpha` when given the parameters `x`, `t`, and `C`. The objective function requires the additional arguments `x` and `t`, which are passed via `args` in `minimize()`.

```python
def optimize_alpha(x, t, C):
    m, n = x.shape
    np.random.seed(1)
    # Initialize alphas to random values
    alpha_0 = np.random.rand(m)*C
    # Define the constraint
    linear_constraint = LinearConstraint(t, [0], [0])
    # Define the bounds
    bounds_alpha = Bounds(np.zeros(m), np.full(m, C))
    # Find the optimal value of alpha
    result = minimize(lagrange_dual, alpha_0, args=(x, t), method='trust-constr',
                      hess=BFGS(), constraints=[linear_constraint],
                      bounds=bounds_alpha)
    # The optimized value of alpha lies in result.x
    alpha = result.x
    return alpha
```

The expression for the hyperplane is given by:

$$

w^T x + w_0 = 0

$$

For the hyperplane, we need the weight vector $w$ and the constant $w_0$. The weight vector is given by:

$$

w = \sum_i \alpha_i t_i x_i

$$

If there are too many training points, it’s best to use only support vectors with $\alpha>0$ to compute the weight vector.

For $w_0$, we’ll compute it from each support vector $s$, for which $\alpha_s < C$, and then take the average. For a single support vector $x_s$, $w_0$ is given by:

$$

w_0 = t_s - w^T x_s

$$

Numerically, a support vector’s alpha may not be exactly equal to $C$. Hence, we subtract a small constant from $C$ to find all support vectors with $\alpha_s < C$. This is done in the `get_w0()` function.

def get_w(alpha, t, x):
    m = len(x)
    # Get all support vectors
    w = np.zeros(x.shape[1])
    for i in range(m):
        w = w + alpha[i]*t[i]*x[i, :]
    return w

def get_w0(alpha, t, x, w, C):
    C_numeric = C-ZERO
    # Indices of support vectors with alpha<C
    ind_sv = np.where((alpha > ZERO)&(alpha < C_numeric))[0]
    w0 = 0.0
    for s in ind_sv:
        w0 = w0 + t[s] - np.dot(x[s, :], w)
    # Take the average
    w0 = w0 / len(ind_sv)
    return w0

To classify a test point $x_{test}$, we use the sign of $y(x_{test})$ as:

$$

\text{label}_{x_{test}} = \text{sign}(y(x_{test})) = \text{sign}(w^T x_{test} + w_0)

$$

Let’s write the corresponding function that can take as argument an array of test points along with $w$ and $w_0$ and classify various points. We have also added a second function for calculating the misclassification rate:

def classify_points(x_test, w, w0):
    # get y(x_test)
    predicted_labels = np.sum(x_test*w, axis=1) + w0
    predicted_labels = np.sign(predicted_labels)
    # Assign a label of +1 arbitrarily if it is zero
    predicted_labels[predicted_labels==0] = 1
    return predicted_labels

def misclassification_rate(labels, predictions):
    total = len(labels)
    errors = sum(labels != predictions)
    return errors/total*100

Let’s also define functions to plot the hyperplane and the soft margin.

def plot_hyperplane(w, w0):
    x_coord = np.array(plt.gca().get_xlim())
    y_coord = -w0/w[1] - w[0]/w[1] * x_coord
    plt.plot(x_coord, y_coord, color='red')

def plot_margin(w, w0):
    x_coord = np.array(plt.gca().get_xlim())
    ypos_coord = 1/w[1] - w0/w[1] - w[0]/w[1] * x_coord
    plt.plot(x_coord, ypos_coord, '--', color='green')
    yneg_coord = -1/w[1] - w0/w[1] - w[0]/w[1] * x_coord
    plt.plot(x_coord, yneg_coord, '--', color='magenta')

It’s now time to run the SVM. The function `display_SVM_result()` will help us visualize everything. We initialize alpha to random values, define $C$, and find the best values of alpha in this function. We also plot the hyperplane, the margin, and the data points. The support vectors are labelled with their corresponding alpha values. The title of the plot shows the percentage of errors and the number of support vectors.

def display_SVM_result(x, t, C):
    # Get the alphas
    alpha = optimize_alpha(x, t, C)
    # Get the weights
    w = get_w(alpha, t, x)
    w0 = get_w0(alpha, t, x, w, C)
    plot_x(x, t, alpha, C)
    xlim = plt.gca().get_xlim()
    ylim = plt.gca().get_ylim()
    plot_hyperplane(w, w0)
    plot_margin(w, w0)
    plt.xlim(xlim)
    plt.ylim(ylim)
    # Get the misclassification error and display it as title
    predictions = classify_points(x, w, w0)
    err = misclassification_rate(t, predictions)
    title = 'C = ' + str(C) + ', Errors: ' + '{:.1f}'.format(err) + '%'
    title = title + ', total SV = ' + str(len(alpha[alpha > ZERO]))
    plt.title(title)

display_SVM_result(dat, labels, 100)
plt.show()

If you change the value of `C` to $\infty$, the soft margin turns into a hard margin, with no tolerance for errors. The problem we defined above is not solvable in this case. Let’s generate an artificial set of points and look at the effect of `C` on classification. To understand the entire problem, we’ll use a simple dataset where the positive and negative examples are separable.

Below are the points generated via `make_blobs()`:

dat, labels = dt.make_blobs(n_samples=[20, 20], cluster_std=1, random_state=0)
labels[labels==0] = -1
plot_x(dat, labels)

Now let’s define different values of C and run the code.

fig = plt.figure(figsize=(8, 25))
i = 0
C_array = [1e-2, 100, 1e5]
for C in C_array:
    fig.add_subplot(311+i)
    display_SVM_result(dat, labels, C)
    i = i + 1

The above is a nice example: it shows that increasing $C$ decreases the margin. A high value of $C$ imposes a stricter penalty on errors. A smaller value allows a wider margin and more misclassification errors. Hence, $C$ defines a tradeoff between margin maximization and classification errors.

Here is the consolidated code that you can paste into your Python file and run at your end. You can experiment with different values of $C$ and try out the different optimization methods given as arguments to the `minimize()` function.

import numpy as np
# For optimization
from scipy.optimize import Bounds, BFGS
from scipy.optimize import LinearConstraint, minimize
# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
# For generating dataset
import sklearn.datasets as dt

ZERO = 1e-7

def plot_x(x, t, alpha=[], C=0):
    sns.scatterplot(x[:, 0], x[:, 1], style=t, hue=t,
                    markers=['s', 'P'], palette=['magenta', 'green'])
    if len(alpha) > 0:
        alpha_str = np.char.mod('%.1f', np.round(alpha, 1))
        ind_sv = np.where(alpha > ZERO)[0]
        for i in ind_sv:
            plt.gca().text(x[i, 0], x[i, 1]-.25, alpha_str[i])

# Objective function
def lagrange_dual(alpha, x, t):
    result = 0
    ind_sv = np.where(alpha > ZERO)[0]
    for i in ind_sv:
        for k in ind_sv:
            result = result + alpha[i]*alpha[k]*t[i]*t[k]*np.dot(x[i, :], x[k, :])
    result = 0.5*result - sum(alpha)
    return result

def optimize_alpha(x, t, C):
    m, n = x.shape
    np.random.seed(1)
    # Initialize alphas to random values
    alpha_0 = np.random.rand(m)*C
    # Define the constraint
    linear_constraint = LinearConstraint(t, [0], [0])
    # Define the bounds
    bounds_alpha = Bounds(np.zeros(m), np.full(m, C))
    # Find the optimal value of alpha
    result = minimize(lagrange_dual, alpha_0, args=(x, t), method='trust-constr',
                      hess=BFGS(), constraints=[linear_constraint],
                      bounds=bounds_alpha)
    # The optimized value of alpha lies in result.x
    alpha = result.x
    return alpha

def get_w(alpha, t, x):
    m = len(x)
    # Get all support vectors
    w = np.zeros(x.shape[1])
    for i in range(m):
        w = w + alpha[i]*t[i]*x[i, :]
    return w

def get_w0(alpha, t, x, w, C):
    C_numeric = C-ZERO
    # Indices of support vectors with alpha<C
    ind_sv = np.where((alpha > ZERO)&(alpha < C_numeric))[0]
    w0 = 0.0
    for s in ind_sv:
        w0 = w0 + t[s] - np.dot(x[s, :], w)
    # Take the average
    w0 = w0 / len(ind_sv)
    return w0

def classify_points(x_test, w, w0):
    # get y(x_test)
    predicted_labels = np.sum(x_test*w, axis=1) + w0
    predicted_labels = np.sign(predicted_labels)
    # Assign a label of +1 arbitrarily if it is zero
    predicted_labels[predicted_labels==0] = 1
    return predicted_labels

def misclassification_rate(labels, predictions):
    total = len(labels)
    errors = sum(labels != predictions)
    return errors/total*100

def plot_hyperplane(w, w0):
    x_coord = np.array(plt.gca().get_xlim())
    y_coord = -w0/w[1] - w[0]/w[1] * x_coord
    plt.plot(x_coord, y_coord, color='red')

def plot_margin(w, w0):
    x_coord = np.array(plt.gca().get_xlim())
    ypos_coord = 1/w[1] - w0/w[1] - w[0]/w[1] * x_coord
    plt.plot(x_coord, ypos_coord, '--', color='green')
    yneg_coord = -1/w[1] - w0/w[1] - w[0]/w[1] * x_coord
    plt.plot(x_coord, yneg_coord, '--', color='magenta')

def display_SVM_result(x, t, C):
    # Get the alphas
    alpha = optimize_alpha(x, t, C)
    # Get the weights
    w = get_w(alpha, t, x)
    w0 = get_w0(alpha, t, x, w, C)
    plot_x(x, t, alpha, C)
    xlim = plt.gca().get_xlim()
    ylim = plt.gca().get_ylim()
    plot_hyperplane(w, w0)
    plot_margin(w, w0)
    plt.xlim(xlim)
    plt.ylim(ylim)
    # Get the misclassification error and display it as title
    predictions = classify_points(x, w, w0)
    err = misclassification_rate(t, predictions)
    title = 'C = ' + str(C) + ', Errors: ' + '{:.1f}'.format(err) + '%'
    title = title + ', total SV = ' + str(len(alpha[alpha > ZERO]))
    plt.title(title)

dat = np.array([[0, 3], [-1, 0], [1, 2], [2, 1], [3, 3],
                [0, 0], [-1, -1], [-3, 1], [3, 1]])
labels = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1])
plot_x(dat, labels)
plt.show()
display_SVM_result(dat, labels, 100)
plt.show()

dat, labels = dt.make_blobs(n_samples=[20, 20], cluster_std=1, random_state=0)
labels[labels==0] = -1
plot_x(dat, labels)

fig = plt.figure(figsize=(8, 25))
i = 0
C_array = [1e-2, 100, 1e5]
for C in C_array:
    fig.add_subplot(311+i)
    display_SVM_result(dat, labels, C)
    i = i + 1

This section provides more resources on the topic if you are looking to go deeper.

- Pattern Recognition and Machine Learning by Christopher M. Bishop

- Support Vector Machines for Machine Learning
- A Tutorial on Support Vector Machines for Pattern Recognition by Christopher J.C. Burges

- SciPy’s optimization library
- Scikit-learn’s sample generation library (sklearn.datasets)
- NumPy random number generator

In this tutorial, you discovered how to implement an SVM classifier from scratch.

Specifically, you learned:

- How to write the objective function and constraints for the SVM optimization problem
- How to write code to determine the hyperplane from Lagrange multipliers
- The effect of C on determining the margin

Do you have any questions about SVMs discussed in this post? Ask your questions in the comments below and I will do my best to answer.

The post Method of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 3: Implementing An SVM From Scratch In Python) appeared first on Machine Learning Mastery.


In this tutorial, we’ll cover the basics of a linear SVM. We won’t go into details of non-linear SVMs derived using the kernel trick. The content is enough to understand the basic mathematical model behind an SVM classifier.

After completing this tutorial, you will know:

- Concept of a soft margin
- How to maximize the margin while allowing mistakes in classification
- How to formulate the optimization problem and compute the Lagrange dual

Let’s get started.

This tutorial is divided into 2 parts; they are:

- The solution of the SVM problem for the case where positive and negative examples are not linearly separable
- The separating hyperplane and the corresponding relaxed constraints
- The quadratic optimization problem for finding the soft margin

- A worked example

For this tutorial, it is assumed that you are already familiar with the following topics. You can click on the individual links to get more information.

- A Gentle Introduction to Optimization / Mathematical Programming
- A Gentle Introduction To Method Of Lagrange Multipliers
- Lagrange Multiplier Approach with Inequality Constraints
- Method Of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 1: The Separable Case)

This is a continuation of Part 1, so the same notations will be used.

- $m$: Total training points
- $x$: Data point, which is an $n$-dimensional vector. Each dimension is indexed by j.
- $x^+$: Positive example
- $x^-$: Negative example
- $i$: Subscript used to index the training points. $0 \leq i < m$
- $j$: Subscript to index a dimension of the data point. $1 \leq j \leq n$
- $t$: Label of data points. It is an m-dimensional vector
- $T$: Transpose operator
- $w$: Weight vector denoting the coefficients of the hyperplane. It is an $n$-dimensional vector
- $\alpha$: Vector of Lagrange multipliers, an $m$-dimensional vector
- $\mu$: Vector of Lagrange multipliers, again an $m$-dimensional vector
- $\xi$: Error in classification. An $m$-dimensional vector

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Let’s find a separating hyperplane between the positive and negative examples. Just to recall, the separating hyperplane is given by the following expression, with \(w_j\) being the coefficients and \(w_0\) being the arbitrary constant that determines the distance of the hyperplane from the origin:

$$

w^T x_i + w_0 = 0

$$

As we allow positive and negative examples to lie on the wrong side of the hyperplane, we have a set of relaxed constraints. Defining $\xi_i \geq 0, \forall i$, for positive examples we require:

$$

w^T x_i^+ + w_0 \geq 1 - \xi_i

$$

Also for negative examples we require:

$$

w^T x_i^- + w_0 \leq -1 + \xi_i

$$

Combining the above two constraints by using the class label $t_i \in \{-1,+1\}$ we have the following constraint for all points:

$$

t_i(w^T x_i + w_0) \geq 1 - \xi_i

$$

The variable $\xi$ allows more flexibility in our model. It has the following interpretations:

- $\xi_i = 0$: This means that $x_i$ is correctly classified and this data point is on the correct side of the hyperplane and away from the margin.
- $0 < \xi_i < 1$: When this condition is met, $x_i$ lies on the correct side of the hyperplane but inside the margin.
- $\xi_i > 1$: Satisfying this condition implies that $x_i$ is misclassified.

Hence, $\xi$ quantifies the errors in the classification of training points. We can define the soft error as:

$$

E_{soft} = \sum_i \xi_i

$$
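To make the slack variables concrete, here is a small numeric sketch. The hyperplane coefficients $w$, $w_0$ and the three positive examples are made-up values chosen only to illustrate the three regimes of $\xi_i = \max(0, 1 - t_i(w^T x_i + w_0))$:

```python
import numpy as np

# A toy hyperplane with hypothetical coefficients w and w0
w = np.array([1.0, 1.0])
w0 = -3.0

# Three positive examples (t_i = +1) in different positions
x = np.array([[3.0, 3.0],    # well on the correct side
              [2.0, 1.5],    # correct side but inside the margin
              [1.0, 1.0]])   # wrong side of the hyperplane
t = np.array([1.0, 1.0, 1.0])

# The slack xi_i = max(0, 1 - t_i (w^T x_i + w0)) and the soft error
xi = np.maximum(0, 1 - t * (x @ w + w0))
E_soft = xi.sum()
print(xi)       # 0 for the first point, 0.5 for the second, 2 for the third
print(E_soft)   # 2.5
```

The three slack values match the three cases above: zero (outside the margin), between 0 and 1 (inside the margin), and greater than 1 (misclassified).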

We are now in a position to formulate the objective function along with the constraints on it. We still want to maximize the margin, i.e., we want to minimize the norm of the weight vector. Along with this, we also want to keep the soft error as small as possible. Hence, now our new objective function is given by the following expression, with $C$ being a user defined constant and represents the penalty factor or the regularization constant.

$$

\frac{1}{2}||w||^2 + C \sum_i \xi_i

$$

The overall quadratic programming problem is, therefore, given by the following expression:

$$

\min_w \frac{1}{2}||w||^2 + C \sum_i \xi_i \;\text{ subject to } t_i(w^Tx_i+w_0) \geq +1 - \xi_i, \forall i \; \text{ and } \xi_i \geq 0, \forall i

$$

To understand the penalty factor $C$, consider the term $C \sum_i \xi_i$, which has to be minimized. If $C$ is kept large, then the soft error $\sum_i \xi_i$ is forced to be small. If $C$ is close to zero, then the soft error is allowed to be large while keeping the overall term small.

In short, a large value of $C$ means we have a high penalty on errors and hence our model is not allowed to make too many mistakes in classification. A small value of $C$ allows the errors to grow.

Let’s use the method of Lagrange multipliers to solve the quadratic programming problem that we formulated earlier. The Lagrange function is given by:

$$

L(w, w_0, \xi, \alpha, \mu) = \frac{1}{2}||w||^2 + C \sum_i \xi_i - \sum_i \alpha_i\big(t_i(w^Tx_i+w_0) - 1 + \xi_i\big) - \sum_i \mu_i \xi_i

$$

To solve the above, we set the partial derivatives with respect to the primal variables to zero:

\begin{equation}
\frac{\partial L}{\partial w} = 0, \quad
\frac{\partial L}{\partial w_0} = 0, \quad
\frac{\partial L}{\partial \xi_i} = 0
\end{equation}

Solving the above gives us:

$$

w = \sum_i \alpha_i t_i x_i

$$

and

$$

0 = C - \alpha_i - \mu_i

$$

Substituting the above into the Lagrange function gives us the following optimization problem, also called the dual:

$$

L_d = -\frac{1}{2} \sum_i \sum_k \alpha_i \alpha_k t_i t_k (x_i)^T (x_k) + \sum_i \alpha_i

$$

We have to maximize the above subject to the following constraints:

\begin{equation}

\sum_i \alpha_i t_i = 0 \\ \text{ and }

0 \leq \alpha_i \leq C, \forall i

\end{equation}
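As a quick numeric sketch, we can evaluate the dual objective for a made-up two-point problem with hand-picked alphas (not the output of an optimizer) and check that both dual constraints hold:

```python
import numpy as np

# A hypothetical toy problem: one positive and one negative example
x = np.array([[1.0, 1.0], [-1.0, -1.0]])
t = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])   # chosen so that sum_i alpha_i t_i = 0
C = 10.0

# L_d = -1/2 sum_i sum_k alpha_i alpha_k t_i t_k x_i^T x_k + sum_i alpha_i
G = (t[:, None] * x) @ (t[:, None] * x).T   # Gram matrix of the t_i x_i
L_d = -0.5 * alpha @ G @ alpha + alpha.sum()

print(abs(np.dot(alpha, t)))                # 0.0, the equality constraint
print(np.all((alpha >= 0) & (alpha <= C)))  # True, the box constraint
print(L_d)                                  # 0.25
```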

Similar to the separable case, we have an expression for $w$ in terms of the Lagrange multipliers. The objective function involves no $w$ term. There are Lagrange multipliers $\alpha_i$ and $\mu_i$ associated with each data point.

The following cases hold for each training data point $x_i$:

- $\alpha_i = 0$: The ith training point lies on the correct side of the hyperplane away from the margin. This point plays no role in the classification of a test point.
- $0 < \alpha_i < C$: The ith training point is a support vector and lies on the margin. For this point $\xi_i = 0$ and $t_i(w^T x_i + w_0) = 1$ and hence it can be used to compute $w_0$. In practice $w_0$ is computed from all support vectors and an average is taken.
- $\alpha_i = C$: The ith training point is either inside the margin on the correct side of the hyperplane or this point is on the wrong side of the hyperplane.

The picture below will help you understand the above concepts:

The classification of any test point $x$ can be determined using this expression:

$$

y(x) = \sum_i \alpha_i t_i x^T x_i + w_0

$$

A positive value of $y(x)$ implies $x$ belongs to class $+1$, and a negative value means it belongs to class $-1$. Hence, the predicted class of a test point is the sign of $y(x)$.
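A minimal sketch of this prediction rule, using hypothetical training points and multipliers (not produced by an actual optimizer; note that $\sum_i \alpha_i t_i = 0$ holds for these values):

```python
import numpy as np

# Hypothetical training points with already-solved multipliers
x_train = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
t_train = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.5, 1.0])
w0 = -1.0   # assumed bias

def predict(x):
    # y(x) = sum_i alpha_i t_i x^T x_i + w0, classify by the sign of y(x)
    y = np.sum(alpha * t_train * (x_train @ x)) + w0
    return np.sign(y)

print(predict(np.array([2.0, 2.0])))     # 1.0
print(predict(np.array([-2.0, -2.0])))   # -1.0
```

Note that this form uses the multipliers directly, without ever building the weight vector $w$ explicitly.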

The above constrained optimization problem satisfies the Karush-Kuhn-Tucker (KKT) conditions, given by:

\begin{eqnarray}
\alpha_i &\geq& 0 \\
t_i y(x_i) - 1 + \xi_i &\geq& 0 \\
\alpha_i(t_i y(x_i) - 1 + \xi_i) &=& 0 \\
\mu_i &\geq& 0 \\
\xi_i &\geq& 0 \\
\mu_i\xi_i &=& 0
\end{eqnarray}

Shown above is a solved example for 2D training points to illustrate all the concepts. A few things to note about this solution are:

- The training data points and their corresponding labels act as input
- The user defined constant $C$ is set to 10
- The solution satisfies all the constraints; however, it is not the optimal solution
- We have to make sure that all the $\alpha$ lie between 0 and C
- The sum of alphas of all negative examples should equal the sum of alphas of all positive examples
- The points (1,2), (2,1), and (-2,-2) lie on the soft margin on the correct side of the hyperplane. Their $\alpha$ values have been set to 3, 3, and 6 respectively to balance the problem and satisfy the constraints.
- The points with $\alpha=C=10$ lie either inside the margin or on the wrong side of the hyperplane

This section provides more resources on the topic if you are looking to go deeper.

- Pattern Recognition and Machine Learning by Christopher M. Bishop

- Support Vector Machines for Machine Learning
- A Tutorial on Support Vector Machines for Pattern Recognition by Christopher J.C. Burges

In this tutorial, you discovered the method of Lagrange multipliers for finding the soft margin in an SVM classifier.

Specifically, you learned:

- How to formulate the optimization problem for the non-separable case
- How to find the hyperplane and the soft margin using the method of Lagrange multipliers
- How to find the equation of the separating hyperplane for very simple problems

Do you have any questions about SVMs discussed in this post? Ask your questions in the comments below and I will do my best to answer.

The post Method of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 2: The Non-Separable Case) appeared first on Machine Learning Mastery.

The post Application of differentiations in neural networks appeared first on Machine Learning Mastery.

In this tutorial, we will see how the back-propagation technique is used in finding the gradients in neural networks.

After completing this tutorial, you will know

- What is a total differential and total derivative
- How to compute the total derivatives in neural networks
- How back-propagation helped in computing the total derivatives

Let’s get started.

This tutorial is divided into 5 parts; they are:

- Total differential and total derivatives
- Algebraic representation of a multilayer perceptron model
- Finding the gradient by back-propagation
- Matrix form of gradient equations
- Implementing back-propagation

For a function such as $f(x)$, we denote its derivative as $f'(x)$ or $\frac{df}{dx}$. But for a multivariate function, such as $f(u,v)$, we have a partial derivative of $f$ with respect to $u$ denoted as $\frac{\partial f}{\partial u}$, or sometimes written as $f_u$. A partial derivative is obtained by differentiating $f$ with respect to $u$ while assuming the other variable $v$ is a constant. Therefore, we use $\partial$ instead of $d$ as the symbol for differentiation to signify the difference.

However, what if the $u$ and $v$ in $f(u,v)$ are both functions of $x$? In other words, we can write $u(x)$, $v(x)$, and $f(u(x), v(x))$. So $x$ determines the values of $u$ and $v$, and in turn determines $f(u,v)$. In this case, it is perfectly fine to ask what $\frac{df}{dx}$ is, as $f$ is ultimately determined by $x$.

This is the concept of total derivatives. In fact, for a multivariate function $f(t,u,v)=f(t(x),u(x),v(x))$, we always have

$$

\frac{df}{dx} = \frac{\partial f}{\partial t}\frac{dt}{dx} + \frac{\partial f}{\partial u}\frac{du}{dx} + \frac{\partial f}{\partial v}\frac{dv}{dx}

$$

The above notation is called the total derivative because it is a sum of partial derivatives. In essence, it applies the chain rule to find the differentiation.

If we take away the $dx$ part in the above equation, what we get is an approximate change in $f$ with respect to $x$, i.e.,

$$

df = \frac{\partial f}{\partial t}dt + \frac{\partial f}{\partial u}du + \frac{\partial f}{\partial v}dv

$$

We call this notation the total differential.
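Since the environment assumes SymPy is installed, we can verify the total derivative formula on an arbitrary choice of $f$, $t(x)$, $u(x)$, and $v(x)$ (all hypothetical, for illustration only):

```python
import sympy as sp

x = sp.Symbol('x')
t, u, v = sp.symbols('t u v')

# An arbitrary f(t, u, v) and arbitrary inner functions of x
f = t*u + v
tx, ux, vx = x**2, sp.sin(x), sp.exp(x)

# Total derivative: sum of each partial derivative times the d/dx of its argument
total = (sp.diff(f, t)*sp.diff(tx, x)
         + sp.diff(f, u)*sp.diff(ux, x)
         + sp.diff(f, v)*sp.diff(vx, x)).subs({t: tx, u: ux, v: vx})

# Differentiating f(t(x), u(x), v(x)) directly gives the same result
direct = sp.diff(f.subs({t: tx, u: ux, v: vx}), x)
print(sp.simplify(total - direct))   # 0
```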

Consider the network:

This is a simple, fully-connected, 4-layer neural network. Let’s call the input layer layer 0, the two hidden layers layers 1 and 2, and the output layer layer 3. In this picture, we see that we have $n_0=3$ input units, $n_1=4$ units in the first hidden layer, and $n_2=2$ units in the second hidden layer. There are $n_3=2$ output units.

Let us denote the input to the network as $x_i$, where $i=1,\cdots,n_0$, and the network’s output as $\hat{y}_i$, where $i=1,\cdots,n_3$. Then we can write

$$

\begin{aligned}

h_{1i} &= f_1(\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i) & \text{for } i &= 1,\cdots,n_1\\

h_{2i} &= f_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) & i &= 1,\cdots,n_2\\

\hat{y}_i &= f_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i) & i &= 1,\cdots,n_3

\end{aligned}

$$

Here the activation function at layer $i$ is denoted as $f_i$. The outputs of first hidden layer are denoted as $h_{1i}$ for the $i$-th unit. Similarly, the outputs of second hidden layer are denoted as $h_{2i}$. The weights and bias of unit $i$ in layer $k$ are denoted as $w^{(k)}_{ij}$ and $b^{(k)}_i$ respectively.

In the above, we can see that the output of layer $k-1$ feeds into layer $k$. Therefore, while $\hat{y}_i$ is expressed as a function of $h_{2j}$, $h_{2i}$ is also a function of $h_{1j}$ and, in turn, of $x_j$.

The above describes the construction of a neural network in terms of algebraic equations. Training a neural network also requires a *loss function* so we can minimize it in the training loop. Depending on the application, we commonly use cross entropy for categorization problems or mean squared error for regression problems. With the target variables as $y_i$, the mean squared error loss function is specified as

$$

L = \sum_{i=1}^{n_3} (y_i-\hat{y}_i)^2

$$

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In the above construct, $x_i$ and $y_i$ come from the dataset. The parameters of the neural network are $w$ and $b$. The activation functions $f_i$ are fixed by design, while the outputs at each layer, $h_{1i}$, $h_{2i}$, and $\hat{y}_i$, are dependent variables. In training the neural network, our goal is to update $w$ and $b$ in each iteration, namely, by the gradient descent update rule:

$$

\begin{aligned}

w^{(k)}_{ij} &= w^{(k)}_{ij} - \eta \frac{\partial L}{\partial w^{(k)}_{ij}} \\

b^{(k)}_{i} &= b^{(k)}_{i} - \eta \frac{\partial L}{\partial b^{(k)}_{i}}

\end{aligned}

$$

where $\eta$ is the learning rate parameter to gradient descent.
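As a toy illustration of this update rule, applied to a one-dimensional function $L(w)=w^2$ rather than a neural network:

```python
# A toy gradient descent on L(w) = w^2, where dL/dw = 2w
eta = 0.1    # learning rate
w = 3.0      # initial parameter value
for _ in range(3):
    grad = 2 * w          # the gradient of w^2
    w = w - eta * grad    # the update rule above
print(w)   # shrinks towards the minimum at w = 0
```

Each step multiplies $w$ by $(1 - 2\eta)$, so the parameter moves steadily towards the minimizer.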

From the equation of $L$, we know that $L$ does not depend on $w^{(k)}_{ij}$ or $b^{(k)}_i$ directly but on $\hat{y}_i$. However, $\hat{y}_i$ can eventually be written as a function of $w^{(k)}_{ij}$ or $b^{(k)}_i$. Let’s see one by one how the weights and biases at layer $k$ are connected to $\hat{y}_i$ at the output layer.

We begin with the loss metric. If we consider the loss of a single data point, we have

$$

\begin{aligned}

L &= \sum_{i=1}^{n_3} (y_i-\hat{y}_i)^2\\

\frac{\partial L}{\partial \hat{y}_i} &= -2(y_i - \hat{y}_i) & \text{for } i &= 1,\cdots,n_3

\end{aligned}

$$

Here we see that the loss function depends on all outputs $\hat{y}_i$ and therefore we can find a partial derivative $\frac{\partial L}{\partial \hat{y}_i}$.
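A finite-difference sketch of this partial derivative, using made-up values of $y$ and $\hat{y}$ (differentiating $(y_i-\hat{y}_i)^2$ with respect to $\hat{y}_i$ gives $-2(y_i-\hat{y}_i)$, with the minus sign coming from the inner function):

```python
import numpy as np

# Hypothetical target and prediction vectors
y = np.array([1.0, 0.0])
yhat = np.array([0.8, 0.3])

def L(yh):
    return np.sum((y - yh)**2)

# Analytic gradient: dL/dyhat_i = -2 (y_i - yhat_i)
analytic = -2 * (y - yhat)

# Central finite differences on each component
eps = 1e-6
numeric = np.array([
    (L(yhat + eps*np.eye(2)[i]) - L(yhat - eps*np.eye(2)[i])) / (2*eps)
    for i in range(2)
])
print(analytic)   # approximately -0.4 and 0.6
print(numeric)    # matches the analytic gradient
```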

Now let’s look at the output layer:

$$

\begin{aligned}

\hat{y}_i &= f_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i) & \text{for }i &= 1,\cdots,n_3 \\

\frac{\partial L}{\partial w^{(3)}_{ij}} &= \frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial w^{(3)}_{ij}} & i &= 1,\cdots,n_3;\ j=1,\cdots,n_2 \\

&= \frac{\partial L}{\partial \hat{y}_i} f'_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i)h_{2j} \\

\frac{\partial L}{\partial b^{(3)}_i} &= \frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial b^{(3)}_i} & i &= 1,\cdots,n_3 \\

&= \frac{\partial L}{\partial \hat{y}_i}f'_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i)

\end{aligned}

$$

The weight $w^{(3)}_{ij}$ at layer 3 applies to input $h_{2j}$ and affects output $\hat{y}_i$ only. Hence we can write the derivative $\frac{\partial L}{\partial w^{(3)}_{ij}}$ as the product of two derivatives, $\frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial w^{(3)}_{ij}}$. The same applies to the bias $b^{(3)}_i$. In the above, we make use of $\frac{\partial L}{\partial \hat{y}_i}$, which we already derived previously.

In fact, we can also write the partial derivative of $L$ with respect to the output of the second layer, $h_{2j}$. It is not used for the update of the weights and bias on layer 3, but we will see its importance later:

$$

\begin{aligned}

\frac{\partial L}{\partial h_{2j}} &= \sum_{i=1}^{n_3}\frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial h_{2j}} & \text{for }j &= 1,\cdots,n_2 \\

&= \sum_{i=1}^{n_3}\frac{\partial L}{\partial \hat{y}_i}f'_3(\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i)w^{(3)}_{ij}

\end{aligned}

$$

This one is interesting and different from the previous partial derivatives. Note that $h_{2j}$ is an output of layer 2. Each and every output in layer 2 affects the output $\hat{y}_i$ in layer 3. Therefore, to find $\frac{\partial L}{\partial h_{2j}}$ we need to add up the contributions from every output at layer 3; thus the summation sign in the equation above. We can consider $\frac{\partial L}{\partial h_{2j}}$ as a total derivative, in which we applied the chain rule $\frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial h_{2j}}$ for every output $i$ and then summed them up.

If we move back to layer 2, we can derive the derivatives similarly:

$$

\begin{aligned}

h_{2i} &= f_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) & \text{for }i &= 1,\cdots,n_2\\

\frac{\partial L}{\partial w^{(2)}_{ij}} &= \frac{\partial L}{\partial h_{2i}}\frac{\partial h_{2i}}{\partial w^{(2)}_{ij}} & i&=1,\cdots,n_2;\ j=1,\cdots,n_1 \\

&= \frac{\partial L}{\partial h_{2i}}f'_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i)h_{1j} \\

\frac{\partial L}{\partial b^{(2)}_i} &= \frac{\partial L}{\partial h_{2i}}\frac{\partial h_{2i}}{\partial b^{(2)}_i} & i &= 1,\cdots,n_2 \\

&= \frac{\partial L}{\partial h_{2i}}f'_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) \\

\frac{\partial L}{\partial h_{1j}} &= \sum_{i=1}^{n_2}\frac{\partial L}{\partial h_{2i}}\frac{\partial h_{2i}}{\partial h_{1j}} & j&= 1,\cdots,n_1 \\

&= \sum_{i=1}^{n_2}\frac{\partial L}{\partial h_{2i}}f'_2(\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) w^{(2)}_{ij}

\end{aligned}

$$

In the equations above, we reuse $\frac{\partial L}{\partial h_{2i}}$, which we derived earlier. Again, this derivative is computed as a sum of several products from the chain rule. Similarly to before, we also derived $\frac{\partial L}{\partial h_{1j}}$. It is not used to train $w^{(2)}_{ij}$ or $b^{(2)}_i$, but will be used for the layer before it. So for layer 1, we have

$$

\begin{aligned}

h_{1i} &= f_1(\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i) & \text{for } i &= 1,\cdots,n_1\\

\frac{\partial L}{\partial w^{(1)}_{ij}} &= \frac{\partial L}{\partial h_{1i}}\frac{\partial h_{1i}}{\partial w^{(1)}_{ij}} & i&=1,\cdots,n_1;\ j=1,\cdots,n_0 \\

&= \frac{\partial L}{\partial h_{1i}}f'_1(\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i)x_j \\

\frac{\partial L}{\partial b^{(1)}_i} &= \frac{\partial L}{\partial h_{1i}}\frac{\partial h_{1i}}{\partial b^{(1)}_i} & i&=1,\cdots,n_1 \\

&= \frac{\partial L}{\partial h_{1i}}f'_1(\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i)

\end{aligned}

$$

and this completes all the derivatives needed for training the neural network using the gradient descent algorithm.

Recall how we derived the above: we started from the loss function $L$ and found the derivatives one by one in the reverse order of the layers. We write down the derivatives at layer $k$ and reuse them for the derivatives at layer $k-1$. While computing the output $\hat{y}_i$ from the input $x_i$ proceeds from layer 0 forward, computing the gradients proceeds in the reverse order. Hence the name “back-propagation”.

While we did not use them above, it is cleaner to write the equations with vectors and matrices. We can rewrite the layers and the outputs as:

$$

\mathbf{a}_k = f_k(\mathbf{z}_k) = f_k(\mathbf{W}_k\mathbf{a}_{k-1}+\mathbf{b}_k)

$$

where $\mathbf{a}_k$ is a vector of outputs of layer $k$, and assume $\mathbf{a}_0=\mathbf{x}$ is the input vector and $\mathbf{a}_3=\hat{\mathbf{y}}$ is the output vector. Also denote $\mathbf{z}_k = \mathbf{W}_k\mathbf{a}_{k-1}+\mathbf{b}_k$ for convenience of notation.
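A shape-only sketch of this recursion for the 3-4-2-2 network described above, with random parameters and tanh as a stand-in activation (the actual model below uses ReLU and sigmoid):

```python
import numpy as np

# Layer sizes of the network described earlier: n0=3, n1=4, n2=2, n3=2
sizes = [3, 4, 2, 2]
rng = np.random.default_rng(42)

# Column-vector convention: z_k = W_k a_{k-1} + b_k
W = [rng.standard_normal((sizes[k+1], sizes[k])) for k in range(3)]
b = [rng.standard_normal(sizes[k+1]) for k in range(3)]

a = rng.standard_normal(3)       # a_0 = x, the input vector
for Wk, bk in zip(W, b):
    a = np.tanh(Wk @ a + bk)     # a_k = f_k(W_k a_{k-1} + b_k)
print(a.shape)                   # (2,), the two output units
```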

Under such notation, we can represent $\frac{\partial L}{\partial\mathbf{a}_k}$ as a vector (as are those of $\mathbf{z}_k$ and $\mathbf{b}_k$) and $\frac{\partial L}{\partial\mathbf{W}_k}$ as a matrix. Then, if $\frac{\partial L}{\partial\mathbf{a}_k}$ is known, we have

$$

\begin{aligned}

\frac{\partial L}{\partial\mathbf{z}_k} &= \frac{\partial L}{\partial\mathbf{a}_k}\odot f_k'(\mathbf{z}_k) \\

\frac{\partial L}{\partial\mathbf{W}_k} &= \left(\frac{\partial L}{\partial\mathbf{z}_k}\right)^\top \cdot \mathbf{a}_{k-1} \\

\frac{\partial L}{\partial\mathbf{b}_k} &= \frac{\partial L}{\partial\mathbf{z}_k} \\

\frac{\partial L}{\partial\mathbf{a}_{k-1}} &= \left(\frac{\partial\mathbf{z}_k}{\partial\mathbf{a}_{k-1}}\right)^\top\cdot\frac{\partial L}{\partial\mathbf{z}_k} = \mathbf{W}_k^\top\cdot\frac{\partial L}{\partial\mathbf{z}_k}

\end{aligned}

$$

where $\frac{\partial\mathbf{z}_k}{\partial\mathbf{a}_{k-1}}$ is a Jacobian matrix as both $\mathbf{z}_k$ and $\mathbf{a}_{k-1}$ are vectors, and this Jacobian matrix happens to be $\mathbf{W}_k$.
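We can check numerically that multiplying the upstream gradient by $\mathbf{W}_k^\top$ indeed gives $\frac{\partial L}{\partial\mathbf{a}_{k-1}}$, using random values and finite differences (since $\mathbf{z} = \mathbf{W}\mathbf{a}+\mathbf{b}$ is linear in $\mathbf{a}$, the match is exact up to floating-point error):

```python
import numpy as np

# Numeric check that dL/da_{k-1} = W^T dL/dz_k
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))     # z has 3 units, a has 4
a = rng.standard_normal(4)
b = rng.standard_normal(3)
dL_dz = rng.standard_normal(3)      # some upstream gradient

analytic = W.T @ dL_dz

# Finite differences, treating L locally as dL_dz . z(a)
eps = 1e-6
numeric = np.zeros(4)
for j in range(4):
    da = np.zeros(4)
    da[j] = eps
    numeric[j] = dL_dz @ ((W @ (a + da) + b) - (W @ a + b)) / eps

print(np.max(np.abs(analytic - numeric)))   # a very small number
```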

We need the matrix form of the equations because it makes our code simpler and avoids a lot of loops. Let’s see how we can convert these equations into code and build a multilayer perceptron model for classification from scratch using numpy.

The first thing we need is to implement the activation functions and the loss function. Both need to be differentiable, or otherwise our gradient descent procedure would not work. Nowadays, it is common to use ReLU activation in the hidden layers and sigmoid activation in the output layer. We define them as functions (which assume the input is a numpy array), together with their differentiation:

import numpy as np

# Find a small float to avoid division by zero
epsilon = np.finfo(float).eps

# Sigmoid function and its differentiation
def sigmoid(z):
    return 1/(1+np.exp(-z.clip(-500, 500)))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1-s)

# ReLU function and its differentiation
def relu(z):
    return np.maximum(0, z)

def drelu(z):
    return (z > 0).astype(float)

We deliberately clip the input of the sigmoid function to between -500 and +500 to avoid overflow. Otherwise, these functions are trivial. Then, for classification, we care about accuracy, but the accuracy function is not differentiable. Therefore, we use the cross entropy function as the loss for training:

# Loss function L(y, yhat) and its differentiation
def cross_entropy(y, yhat):
    """Binary cross entropy function
        L = - y log yhat - (1-y) log (1-yhat)

    Args:
        y, yhat (np.array): nx1 matrices, with n the number of data instances
    Returns:
        average cross entropy value of shape 1x1, averaging over the n instances
    """
    return -(y.T @ np.log(yhat.clip(epsilon)) +
             (1-y.T) @ np.log((1-yhat).clip(epsilon))) / y.shape[0]

def d_cross_entropy(y, yhat):
    """ dL/dyhat """
    return - np.divide(y, yhat.clip(epsilon)) + np.divide(1-y, (1-yhat).clip(epsilon))

In the above, we assume the output and the target variables are numpy matrices of shape (n, 1), with one row per sample. Hence we use the matrix product operator `@` to compute the sum across the samples. Note that this design computes the cross entropy over an entire **batch** of samples in one shot (the division by `y.shape[1]` is a division by 1 for this shape and has no effect).
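Because derivative functions like these are easy to get wrong, a quick sanity check is to compare an analytic derivative against a central finite difference. This is a small check of my own (not part of the original tutorial), using the same sigmoid formula as above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)   # derivative of the sigmoid: s(1-s)

# Central finite difference: f'(z) is approximately (f(z+h) - f(z-h)) / (2h)
def numerical_grad(f, z, h=1e-6):
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.linspace(-3, 3, 7)
assert np.allclose(dsigmoid(z), numerical_grad(sigmoid, z), atol=1e-6)
```

The same check works for the ReLU derivative, as long as the test points avoid $z=0$, where ReLU is not differentiable.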

Then we can implement our multilayer perceptron model. To make it easier to read, we create the model by providing the number of neurons at each layer as well as the activation function at each layer. At the same time, we also need the derivatives of the activation functions as well as the derivative of the loss function for training. The loss function itself is not required for training, but it is useful for us to track the progress. We create a class to encapsulate the entire model, and define each layer $k$ according to the formula:

$$

\mathbf{a}_k = f_k(\mathbf{z}_k) = f_k(\mathbf{a}_{k-1}\mathbf{W}_k+\mathbf{b}_k)

$$

```python
class mlp:
    '''Multilayer perceptron using numpy
    '''
    def __init__(self, layersizes, activations, derivatives, lossderiv):
        """remember config, then initialize array to hold NN parameters without init"""
        # hold NN config
        self.layersizes = layersizes
        self.activations = activations
        self.derivatives = derivatives
        self.lossderiv = lossderiv
        # parameters, each is a 2D numpy array
        L = len(self.layersizes)
        self.z = [None] * L
        self.W = [None] * L
        self.b = [None] * L
        self.a = [None] * L
        self.dz = [None] * L
        self.dW = [None] * L
        self.db = [None] * L
        self.da = [None] * L

    def initialize(self, seed=42):
        np.random.seed(seed)
        sigma = 0.1
        for l, (insize, outsize) in enumerate(zip(self.layersizes, self.layersizes[1:]), 1):
            self.W[l] = np.random.randn(insize, outsize) * sigma
            self.b[l] = np.random.randn(1, outsize) * sigma

    def forward(self, x):
        self.a[0] = x
        for l, func in enumerate(self.activations, 1):
            # z = a W + b, with `a` as output from previous layer
            # `a` is of size nxs with n the number of data instances, `W` is sxr, so `z` is nxr
            # `b` is 1xr and broadcast to each row of `z`
            self.z[l] = (self.a[l-1] @ self.W[l]) + self.b[l]
            # a = g(z), with `a` as output of this layer, of size nxr
            self.a[l] = func(self.z[l])
        return self.a[-1]
```

The variables `z`, `W`, `b`, and `a` in this class are for the forward pass, and the variables `dz`, `dW`, `db`, and `da` are their respective gradients, to be computed in the back-propagation. All of these variables are numpy arrays.

As we will see later, we are going to test our model using data generated by scikit-learn. Hence our data comes as a numpy array of shape (number of samples, number of features). Therefore, each sample is presented as a row of a matrix, and in the function `forward()`, the weight matrix is multiplied on the right of each input `a` to the layer. While the activation function and the dimension of each layer can be different, the process is the same. Thus we transform the neural network’s input `x` into its output with a loop in the `forward()` function. The network’s output is simply the output of the last layer.

To train the network, we need to run the back-propagation after each forward pass. The back-propagation computes the gradients of the weights and biases of each layer, starting from the output layer and working back to the input layer. Using the equations we derived above, the back-propagation function is implemented as:

```python
class mlp:
    ...

    def backward(self, y, yhat):
        # first `da`, at the output
        self.da[-1] = self.lossderiv(y, yhat)
        for l, func in reversed(list(enumerate(self.derivatives, 1))):
            # compute the differentials at this layer
            self.dz[l] = self.da[l] * func(self.z[l])
            self.dW[l] = self.a[l-1].T @ self.dz[l]
            self.db[l] = np.mean(self.dz[l], axis=0, keepdims=True)
            self.da[l-1] = self.dz[l] @ self.W[l].T

    def update(self, eta):
        for l in range(1, len(self.W)):
            self.W[l] -= eta * self.dW[l]
            self.b[l] -= eta * self.db[l]
```

The only difference here is that we compute `db` not for one training sample, but for the entire batch, by averaging `dz` across the samples.

Up to here, our model is complete. The `update()` function simply applies the gradients found by back-propagation to the parameters `W` and `b` using the gradient descent update rule.
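The update rule itself is just one line of arithmetic per parameter. As a standalone illustration (with hypothetical values, not from the model above):

```python
eta = 0.1                    # learning rate
theta = 2.0                  # a parameter of the model
grad = 0.5                   # gradient of the loss with respect to theta
theta = theta - eta * grad   # one gradient descent step
print(theta)                 # 1.95
```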

To test out our model, we make use of scikit-learn to generate a classification dataset:

```python
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score

# Make data: Two circles on x-y plane as a classification problem
X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)
y = y.reshape(-1,1) # our model expects a 2D array of (n_sample, n_dim)
```

and then we build our model: the input is two-dimensional and the output is one-dimensional (a binary classifier). We add two hidden layers of 4 and 3 neurons, respectively:

```python
# Build a model
model = mlp(layersizes=[2, 4, 3, 1],
            activations=[relu, relu, sigmoid],
            derivatives=[drelu, drelu, dsigmoid],
            lossderiv=d_cross_entropy)
model.initialize()
yhat = model.forward(X)
loss = cross_entropy(y, yhat)
print("Before training - loss value {} accuracy {}".format(loss, accuracy_score(y, (yhat > 0.5))))
```

We see that, with random weights, the accuracy is 50%:

Before training - loss value [[693.62972747]] accuracy 0.5

Now we train our network. To keep things simple, we perform full-batch gradient descent with a fixed learning rate:

```python
# train for each epoch
n_epochs = 150
learning_rate = 0.005
for n in range(n_epochs):
    model.forward(X)
    yhat = model.a[-1]
    model.backward(y, yhat)
    model.update(learning_rate)
    loss = cross_entropy(y, yhat)
    print("Iteration {} - loss value {} accuracy {}".format(n, loss, accuracy_score(y, (yhat > 0.5))))
```

and the output is:

```
Iteration 0 - loss value [[693.62972747]] accuracy 0.5
Iteration 1 - loss value [[693.62166655]] accuracy 0.5
Iteration 2 - loss value [[693.61534159]] accuracy 0.5
Iteration 3 - loss value [[693.60994018]] accuracy 0.5
...
Iteration 145 - loss value [[664.60120828]] accuracy 0.818
Iteration 146 - loss value [[697.97739669]] accuracy 0.58
Iteration 147 - loss value [[681.08653776]] accuracy 0.642
Iteration 148 - loss value [[665.06165774]] accuracy 0.71
Iteration 149 - loss value [[683.6170298]] accuracy 0.614
```

Although not perfect, we can see improvement from training. At least in the example above, the accuracy rose above 80% at iteration 145, but then the model diverged. That can be improved by reducing the learning rate, which we didn’t implement above. Nonetheless, this shows how we computed the gradients through back-propagation and the chain rule.
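One way to implement the suggested learning rate reduction is a simple decay schedule. The sketch below is my own addition (the `decayed_rate` helper is not part of the original code); it shrinks the step size each epoch with inverse-time decay:

```python
def decayed_rate(eta0, epoch, decay=0.01):
    """Inverse-time decay: eta0 / (1 + decay * epoch)."""
    return eta0 / (1 + decay * epoch)

# For example, inside the training loop one could call:
#   model.update(decayed_rate(0.005, n))
rates = [decayed_rate(0.005, n) for n in range(150)]
assert rates[0] == 0.005
assert all(r1 >= r2 for r1, r2 in zip(rates, rates[1:]))
```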

The complete code is as follows:

```python
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
import numpy as np

np.random.seed(0)

# Find a small float to avoid division by zero
epsilon = np.finfo(float).eps

# Sigmoid function and its differentiation
def sigmoid(z):
    return 1/(1+np.exp(-z.clip(-500, 500)))
def dsigmoid(z):
    s = sigmoid(z)
    return s * (1-s)

# ReLU function and its differentiation
def relu(z):
    return np.maximum(0, z)
def drelu(z):
    return (z > 0).astype(float)

# Loss function L(y, yhat) and its differentiation
def cross_entropy(y, yhat):
    """Binary cross entropy function
        L = - y log yhat - (1-y) log (1-yhat)

    Args:
        y, yhat (np.array): nx1 matrices, with n the number of data instances
    Returns:
        cross entropy value of shape 1x1, summed over the n instances
    """
    return -(y.T @ np.log(yhat.clip(epsilon)) + (1-y.T) @ np.log((1-yhat).clip(epsilon))) / y.shape[1]

def d_cross_entropy(y, yhat):
    """ dL/dyhat """
    return - np.divide(y, yhat.clip(epsilon)) + np.divide(1-y, (1-yhat).clip(epsilon))

class mlp:
    '''Multilayer perceptron using numpy
    '''
    def __init__(self, layersizes, activations, derivatives, lossderiv):
        """remember config, then initialize array to hold NN parameters without init"""
        # hold NN config
        self.layersizes = tuple(layersizes)
        self.activations = tuple(activations)
        self.derivatives = tuple(derivatives)
        self.lossderiv = lossderiv
        assert len(self.layersizes)-1 == len(self.activations), \
            "number of layers and the number of activation functions does not match"
        assert len(self.activations) == len(self.derivatives), \
            "number of activation functions and number of derivatives does not match"
        assert all(isinstance(n, int) and n >= 1 for n in layersizes), \
            "Only positive integral number of perceptrons is allowed in each layer"
        # parameters, each is a 2D numpy array
        L = len(self.layersizes)
        self.z = [None] * L
        self.W = [None] * L
        self.b = [None] * L
        self.a = [None] * L
        self.dz = [None] * L
        self.dW = [None] * L
        self.db = [None] * L
        self.da = [None] * L

    def initialize(self, seed=42):
        """initialize the value of weight matrices and bias vectors with small random numbers."""
        np.random.seed(seed)
        sigma = 0.1
        for l, (insize, outsize) in enumerate(zip(self.layersizes, self.layersizes[1:]), 1):
            self.W[l] = np.random.randn(insize, outsize) * sigma
            self.b[l] = np.random.randn(1, outsize) * sigma

    def forward(self, x):
        """Feed forward using existing `W` and `b`, and overwrite the result variables `a` and `z`

        Args:
            x (numpy.ndarray): Input data to feed forward
        """
        self.a[0] = x
        for l, func in enumerate(self.activations, 1):
            # z = a W + b, with `a` as output from previous layer
            # `a` is of size nxs with n the number of data instances, `W` is sxr, so `z` is nxr
            # `b` is 1xr and broadcast to each row of `z`
            self.z[l] = (self.a[l-1] @ self.W[l]) + self.b[l]
            # a = g(z), with `a` as output of this layer, of size nxr
            self.a[l] = func(self.z[l])
        return self.a[-1]

    def backward(self, y, yhat):
        """back propagation using NN output yhat and the reference output y, generates dW, dz, db, da
        """
        assert y.shape[1] == self.layersizes[-1], "Output size doesn't match network output size"
        assert y.shape == yhat.shape, "Output size doesn't match reference"
        # first `da`, at the output
        self.da[-1] = self.lossderiv(y, yhat)
        for l, func in reversed(list(enumerate(self.derivatives, 1))):
            # compute the differentials at this layer
            self.dz[l] = self.da[l] * func(self.z[l])
            self.dW[l] = self.a[l-1].T @ self.dz[l]
            self.db[l] = np.mean(self.dz[l], axis=0, keepdims=True)
            self.da[l-1] = self.dz[l] @ self.W[l].T
            assert self.z[l].shape == self.dz[l].shape
            assert self.W[l].shape == self.dW[l].shape
            assert self.b[l].shape == self.db[l].shape
            assert self.a[l].shape == self.da[l].shape

    def update(self, eta):
        """Updates W and b

        Args:
            eta (float): Learning rate
        """
        for l in range(1, len(self.W)):
            self.W[l] -= eta * self.dW[l]
            self.b[l] -= eta * self.db[l]

# Make data: Two circles on x-y plane as a classification problem
X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)
y = y.reshape(-1,1) # our model expects a 2D array of (n_sample, n_dim)
print(X.shape)
print(y.shape)

# Build a model
model = mlp(layersizes=[2, 4, 3, 1],
            activations=[relu, relu, sigmoid],
            derivatives=[drelu, drelu, dsigmoid],
            lossderiv=d_cross_entropy)
model.initialize()
yhat = model.forward(X)
loss = cross_entropy(y, yhat)
print("Before training - loss value {} accuracy {}".format(loss, accuracy_score(y, (yhat > 0.5))))

# train for each epoch
n_epochs = 150
learning_rate = 0.005
for n in range(n_epochs):
    model.forward(X)
    yhat = model.a[-1]
    model.backward(y, yhat)
    model.update(learning_rate)
    loss = cross_entropy(y, yhat)
    print("Iteration {} - loss value {} accuracy {}".format(n, loss, accuracy_score(y, (yhat > 0.5))))
```

The back-propagation algorithm is at the center of all neural network training, regardless of which variation of the gradient descent algorithm you use. A textbook that covers it:

*Deep Learning*, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016. (https://www.amazon.com/dp/0262035618)

A previous tutorial also implemented a neural network from scratch without discussing the math; it explains the steps in greater detail.

In this tutorial, you learned how differentiation is applied to training a neural network.

Specifically, you learned:

- What is a total differential and how it is expressed as a sum of partial differentials
- How to express a neural network as equations and derive the gradients by differentiation
- How back-propagation helped us to express the gradients of each layer in the neural network
- How to convert the gradients into code to make a neural network model

The post Application of differentiations in neural networks appeared first on Machine Learning Mastery.

The post Method of Lagrange Multipliers: The Theory Behind Support Vector Machines (Part 1: The Separable Case) appeared first on Machine Learning Mastery.

In this tutorial, we’ll look at the simplest SVM that assumes that the positive and negative examples can be completely separated via a linear hyperplane.

After completing this tutorial, you will know:

- How the hyperplane acts as the decision boundary
- Mathematical constraints on the positive and negative examples
- What is the margin and how to maximize the margin
- Role of Lagrange multipliers in maximizing the margin
- How to determine the separating hyperplane for the separable case

Let’s get started.

This tutorial is divided into three parts; they are:

- Formulation of the mathematical model of SVM
- Solution of finding the maximum margin hyperplane via the method of Lagrange multipliers
- Solved example to demonstrate all concepts

- $m$: Total training points.
- $n$: Total features, or the dimensionality of the data points.
- $x$: Data point, which is an n-dimensional vector.
- $x^+$: Data point labelled as +1.
- $x^-$: Data point labelled as -1.
- $i$: Subscript used to index the training points. $0 \leq i < m$
- $j$: Subscript used to index the individual dimension of a data point. $1 \leq j \leq n$
- $t$: Label of a data point.
- $T$: Transpose operator.
- $w$: Weight vector denoting the coefficients of the hyperplane. It is also an n-dimensional vector.
- $\alpha$: Lagrange multipliers, one per each training point. This is an m-dimensional vector.
- $d$: Perpendicular distance of a data point from the decision boundary.

The support vector machine is designed to discriminate data points belonging to two different classes. One set of points is labelled +1, also called the positive class. The other set of points is labelled -1, also called the negative class. For now, we’ll make a simplifying assumption that points from both classes can be discriminated via a linear hyperplane.

The SVM assumes a linear decision boundary between the two classes, and the goal is to find a hyperplane that gives the maximum separation between the two classes. For this reason, the alternate term `maximum margin classifier` is also sometimes used to refer to an SVM. The perpendicular distance between the closest data point and the decision boundary is referred to as the `margin`. As the margin completely separates the positive and negative examples and does not tolerate any errors, it is also called the `hard margin`.

The mathematical expression for a hyperplane is given below with \(w_j\) being the coefficients and \(w_0\) being the arbitrary constant that determines the distance of the hyperplane from the origin:

$$

w^T x_i + w_0 = 0

$$

For the $i$th two-dimensional point $(x_{i1}, x_{i2})$, the above expression reduces to:

$$

w_1x_{i1} + w_2 x_{i2} + w_0 = 0

$$

As we are looking to maximize the margin between positive and negative data points, we would like the positive data points to satisfy the following constraint:

$$

w^T x_i^+ + w_0 \geq +1

$$

Similarly, the negative data points should satisfy:

$$

w^T x_i^- + w_0 \leq -1

$$

We can use a neat trick to write a uniform equation for both set of points by using $t_i \in \{-1,+1\}$ to denote the class label of data point $x_i$:

$$

t_i(w^T x_i + w_0) \geq +1

$$

The perpendicular distance $d_i$ of a data point $x_i$ from the decision boundary is given by:

$$

d_i = \frac{|w^T x_i + w_0|}{||w||}

$$

To maximize this distance, we can minimize the square of the denominator to give us a quadratic programming problem given by:

$$

\min \frac{1}{2}||w||^2 \;\text{ subject to } t_i(w^Tx_i+w_0) \geq +1, \forall i

$$

To solve the above quadratic programming problem with inequality constraints, we can use the method of Lagrange multipliers. The Lagrange function is therefore:

$$

L(w, w_0, \alpha) = \frac{1}{2}||w||^2 - \sum_i \alpha_i\big(t_i(w^Tx_i+w_0) - 1\big)

$$

To solve the above, we set the following derivatives to zero:

\begin{equation}

\frac{\partial L}{ \partial w} = 0, \\

\frac{\partial L}{ \partial w_0} = 0

\end{equation}

Plugging these back into the Lagrange function gives us the following optimization problem, also called the dual:

$$

L_d = -\frac{1}{2} \sum_i \sum_k \alpha_i \alpha_k t_i t_k (x_i)^T (x_k) + \sum_i \alpha_i

$$

We have to maximize the above subject to $\alpha_i \geq 0$, together with the following relations obtained by setting the derivatives to zero:

$$

w = \sum_i \alpha_i t_i x_i

$$

and

$$

0=\sum_i \alpha_i t_i

$$

The nice thing about the above is that we have an expression for $w$ in terms of the Lagrange multipliers, and the objective function involves no $w$ term. There is one Lagrange multiplier associated with each data point. The computation of $w_0$ is explained later.

The classification of any test point $x$ can be determined using this expression:

$$

y(x) = \sum_i \alpha_i t_i x^T x_i + w_0

$$

A positive value of $y(x)$ implies that $x$ belongs to class +1, and a negative value means $x$ belongs to class -1.


The Karush-Kuhn-Tucker (KKT) conditions are also satisfied by the above constrained optimization problem, as given by:

\begin{eqnarray}

\alpha_i &\geq& 0 \\

t_i y(x_i) -1 &\geq& 0 \\

\alpha_i(t_i y(x_i) -1) &=& 0

\end{eqnarray}

The KKT conditions dictate that for each data point one of the following is true:

- The Lagrange multiplier is zero, i.e., \(\alpha_i=0\). This point, therefore, plays no role in classification

OR

- $ t_i y(x_i) = 1$ and $\alpha_i > 0$: In this case, the data point has a role in deciding the value of $w$. Such a point is called a support vector.

For $w_0$, we can select any support vector $x_s$ and solve

$$

t_s y(x_s) = 1

$$

giving us:

$$

t_s(\sum_i \alpha_i t_i x_s^T x_i + w_0) = 1

$$

To help you understand the above concepts, here is a simple worked example. Of course, for a large number of points you would use optimization software to solve this. Also, this is one possible solution that satisfies all the constraints. The objective function can be maximized further, but the slope of the hyperplane will remain the same for an optimal solution. For this example, $w_0$ was computed by taking the average of the $w_0$ values obtained from all three support vectors.

This example will show you that the model is not as complex as it looks.

For the above set of points, we can see that (1,2), (2,1) and (0,0) are points closest to the separating hyperplane and hence, act as support vectors. Points far away from the boundary (e.g. (-3,1)) do not play any role in determining the classification of the points.
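We can cross-check this kind of example with scikit-learn’s `SVC`. The labeling below is hypothetical (the exact dataset from the figure is not reproduced here), chosen so that (1,2) and (2,1) lie on the positive side and (0,0) on the negative side; a large `C` approximates the hard-margin SVM discussed above:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical labeling consistent with the description: (1,2), (2,1)
# positive; (0,0) negative; (-3,1) and the rest far from the boundary.
X = np.array([[1, 2], [2, 1], [3, 3], [0, 0], [-3, 1], [-1, -2]])
t = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, t)
print(clf.support_vectors_)  # the points nearest the hyperplane
```

With this labeling, the fitted support vectors are exactly the three points closest to the boundary, while a far-away point such as (-3, 1) gets a zero Lagrange multiplier and plays no role.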

This section provides more resources on the topic if you are looking to go deeper.

- Pattern Recognition and Machine Learning by Christopher M. Bishop
- Thomas’ Calculus, 14th edition, 2017 (based on the original works of George B. Thomas, revised by Joel Hass, Christopher Heil, Maurice Weir)

- Support Vector Machines for Machine Learning
- A Tutorial on Support Vector Machines for Pattern Recognition by Christopher J.C. Burges

In this tutorial, you discovered how to use the method of Lagrange multipliers to solve the problem of maximizing the margin via a quadratic programming problem with inequality constraints.

Specifically, you learned:

- The mathematical expression for a separating linear hyperplane
- The maximum margin as a solution of a quadratic programming problem with inequality constraint
- How to find a linear hyperplane between positive and negative examples using the method of Lagrange multipliers

Do you have any questions about the SVM discussed in this post? Ask your questions in the comments below and I will do my best to answer.


The post Lagrange Multiplier Approach with Inequality Constraints appeared first on Machine Learning Mastery.

In this tutorial, you will discover the method of Lagrange multipliers applied to find the local minimum or maximum of a function when inequality constraints are present, optionally together with equality constraints.

After completing this tutorial, you will know:

- How to find points of local maximum or minimum of a function with inequality constraints
- Method of Lagrange multipliers with inequality constraints and the KKT conditions

Let’s get started.

For this tutorial, we assume that you already have reviewed:

- Derivative of functions
- Function of several variables, partial derivatives and gradient vectors
- A gentle introduction to optimization
- Gradient descent

as well as

You can review these concepts by clicking on the links above.

Extending from our previous post, a constrained optimization problem can be generally considered as

$$

\begin{aligned}

\min && f(X) \\

\textrm{subject to} && g(X) &= 0 \\

&& h(X) &\ge 0 \\

&& k(X) &\le 0

\end{aligned}

$$

where $X$ is a scalar or a vector. Here, $g(X)=0$ is the equality constraint, and $h(X)\ge 0$ and $k(X)\le 0$ are inequality constraints. Note that we always use $\ge$ and $\le$ rather than $\gt$ and $\lt$ in optimization problems because the former define a **closed set** in mathematics, within which we should look for the value of $X$. There can be many constraints of each type in an optimization problem.

The equality constraints are easy to handle but the inequality constraints are not. Therefore, one way to make it easier to tackle is to convert the inequalities into equalities, by introducing **slack variables**:

$$

\begin{aligned}

\min && f(X) \\

\textrm{subject to} && g(X) &= 0 \\

&& h(X) - s^2 &= 0 \\

&& k(X) + t^2 &= 0

\end{aligned}

$$

When something is negative, adding a certain positive quantity to it will make it equal to zero, and vice versa. That quantity is the slack variable; the $s^2$ and $t^2$ above are examples. We deliberately write them as squares, $s^2$ and $t^2$, to denote that they must not be negative.

With the slack variables introduced, we can use the Lagrange multipliers approach to solve it, in which the Lagrangian is defined as:

$$

L(X, \lambda, \theta, \phi) = f(X) - \lambda g(X) - \theta (h(X)-s^2) + \phi (k(X)+t^2)

$$

It is useful to know that, at the optimal solution $X^*$ to the problem, each inequality constraint either holds with equality (its slack variable is zero) or it does not. Inequality constraints that hold with equality are called the **active constraints**; the others are the **inactive constraints**. In this sense, you can consider the equality constraints as always active.

The reason we need to know whether a constraint is active or not is the Karush-Kuhn-Tucker (KKT) conditions. Precisely, the KKT conditions describe what happens when $X^*$ is the optimal solution to a constrained optimization problem:

- The gradient of the Lagrangian function is zero
- All constraints are satisfied
- The inequality constraints satisfy the complementary slackness condition

The most important of these is the complementary slackness condition. While we learned that an optimization problem with equality constraints can be solved using Lagrange multipliers, with the gradient of the Lagrangian being zero at the optimal solution, the complementary slackness condition extends this to the case of inequality constraints: at the optimal solution $X^*$, either the Lagrange multiplier is zero or the corresponding inequality constraint is active.

The complementary slackness condition helps us explore the different cases in solving the optimization problem. It is best explained with an example.

This is an example from finance. Suppose we have 1 dollar to split between two different investments, whose returns are modeled as a bivariate Gaussian distribution. How much should we invest in each to minimize the overall variance of the return?

This optimization problem, also known as Markowitz mean-variance portfolio optimization, is formulated as:

$$

\begin{aligned}

\min && f(w_1, w_2) &= w_1^2\sigma_1^2+w_2^2\sigma_2^2+2w_1w_2\sigma_{12} \\

\textrm{subject to} && w_1+w_2 &= 1 \\

&& w_1 &\ge 0 \\

&& w_1 &\le 1

\end{aligned}

$$

where the last two bound the weight of each investment to between 0 and 1 dollar. Let’s assume $\sigma_1^2=0.25$, $\sigma_2^2=0.10$, and $\sigma_{12} = 0.15$. Then the Lagrangian function is defined as:

$$

\begin{aligned}

L(w_1,w_2,\lambda,\theta,\phi) =& 0.25w_1^2+0.1w_2^2+0.3w_1w_2 \\

&- \lambda(w_1+w_2-1) \\

&- \theta(w_1-s^2) - \phi(w_1-1+t^2)

\end{aligned}

$$

and we have the gradients:

$$

\begin{aligned}

\frac{\partial L}{\partial w_1} &= 0.5w_1+0.3w_2-\lambda-\theta-\phi \\

\frac{\partial L}{\partial w_2} &= 0.2w_2+0.3w_1-\lambda \\

\frac{\partial L}{\partial\lambda} &= 1-w_1-w_2 \\

\frac{\partial L}{\partial\theta} &= s^2-w_1 \\

\frac{\partial L}{\partial\phi} &= 1-w_1-t^2

\end{aligned}

$$

From this point onward, the complementary slackness condition has to be considered. We have two slack variables, $s$ and $t$, and the corresponding Lagrange multipliers are $\theta$ and $\phi$. We now have to consider, for each constraint, whether its slack variable is zero (the corresponding inequality constraint is active) or its Lagrange multiplier is zero (the constraint is inactive). There are four possible cases:

- $\theta=\phi=0$ and $s^2>0$, $t^2>0$
- $\theta\ne 0$ but $\phi=0$, and $s^2=0$, $t^2>0$
- $\theta=0$ but $\phi\ne 0$, and $s^2>0$, $t^2=0$
- $\theta\ne 0$ and $\phi\ne 0$, and $s^2=t^2=0$

For case 1, using $\partial L/\partial\lambda=0$, $\partial L/\partial w_1=0$ and $\partial L/\partial w_2=0$ we get

$$

\begin{align}

w_2 &= 1-w_1 \\

0.5w_1 + 0.3w_2 &= \lambda \\

0.3w_1 + 0.2w_2 &= \lambda

\end{align}

$$

which gives us $w_1=-1$, $w_2=2$, $\lambda=0.1$. But with $\partial L/\partial\theta=0$, we get $s^2=-1$, for which we cannot find a solution ($s^2$ cannot be negative). Thus this case is infeasible.
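Case 1 can be checked symbolically. Here is a sketch using SymPy to solve the three stationarity equations above (variable names are my own choice):

```python
from sympy import symbols, solve, Rational

w1, w2, lam = symbols("w1 w2 lam")

# Case 1 stationarity conditions with theta = phi = 0:
eqs = [
    Rational(1, 2) * w1 + Rational(3, 10) * w2 - lam,  # dL/dw1 = 0
    Rational(3, 10) * w1 + Rational(1, 5) * w2 - lam,  # dL/dw2 = 0
    1 - w1 - w2,                                       # dL/dlam = 0
]
sol = solve(eqs, [w1, w2, lam], dict=True)[0]
print(sol)  # w1 = -1, w2 = 2, lam = 1/10
```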

For case 2, with $\partial L/\partial\theta=0$ we get $w_1=0$. Hence from $\partial L/\partial\lambda=0$, we know $w_2=1$. With $\partial L/\partial w_2=0$, we find $\lambda=0.2$, and from $\partial L/\partial w_1=0$ we get $\theta=0.1$. In this case, the objective function is 0.1.

For case 3, with $\partial L/\partial\phi=0$ we get $w_1=1$. Hence from $\partial L/\partial\lambda=0$, we know $w_2=0$. With $\partial L/\partial w_2=0$, we get $\lambda=0.3$, and from $\partial L/\partial w_1=0$ we get $\phi=0.2$. In this case, the objective function is 0.25.

For case 4, we get $w_1=0$ from $\partial L/\partial\theta=0$ but $w_1=1$ from $\partial L/\partial\phi=0$. Hence this case is infeasible.

Comparing the objective function from case 2 and case 3, we see that the value from case 2 is lower. Hence that is taken as our solution to the optimization problem, with the optimal solution attained at $w_1=0$, $w_2=1$.

As an exercise, you can retry the above with $\sigma_{12}=-0.15$. The solution would be 0.0038 attained when $w_1=\frac{5}{13}$, with the two inequality constraints inactive.
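The exercise’s answer can be verified numerically with SciPy; a sketch using `minimize` with an equality constraint and bounds (this check is my addition, not part of the original post):

```python
from scipy.optimize import minimize

# Portfolio variance with sigma12 = -0.15 (the exercise variant)
def variance(w):
    w1, w2 = w
    return 0.25 * w1**2 + 0.10 * w2**2 + 2 * (-0.15) * w1 * w2

res = minimize(variance, x0=[0.5, 0.5],
               constraints=[{"type": "eq", "fun": lambda w: w[0] + w[1] - 1}],
               bounds=[(0, 1), (0, 1)])
print(res.x, res.fun)  # approx [5/13, 8/13] and 0.0038
```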


This is an example from communication engineering. If we have a channel (say, a wireless bandwidth) in which the noise power is $N$ and the signal power is $S$, the channel capacity (in bits per second) is proportional to $\log_2(1+S/N)$. If we have $k$ similar channels, each with its own noise and signal level, the total capacity of all channels is the sum $\sum_i \log_2(1+S_i/N_i)$.

Assume we are using a battery that can give only 1 watt of power, and this power has to be distributed across the $k$ channels (with the allocations denoted as $p_1,\cdots,p_k$). Each channel may have a different attenuation, so at the end, the signal power is discounted by a gain $g_i$ for each channel. The maximum total capacity we can achieve using these $k$ channels is then formulated as an optimization problem:

$$

\begin{aligned}

\max && f(p_1,\cdots,p_k) &= \sum_{i=1}^k \log_2\left(1+\frac{g_ip_i}{n_i}\right) \\

\textrm{subject to} && \sum_{i=1}^k p_i &= 1 \\

&& p_1,\cdots,p_k &\ge 0 \\

\end{aligned}

$$

For convenience of differentiation, we notice that $\log_2 x=\log x/\log 2$ and $\log(1+g_ip_i/n_i)=\log(n_i+g_ip_i)-\log(n_i)$; hence, up to constant terms, the objective function can be replaced with:

$$

f(p_1,\cdots,p_k) = \sum_{i=1}^k \log(n_i+g_ip_i)

$$

Assume we have $k=3$ channels, with noise levels 1.0, 0.9, and 1.0 respectively, and channel gains 0.9, 0.8, and 0.7; then the optimization problem is:

$$

\begin{aligned}

\max && f(p_1,p_2,p_3) &= \log(1+0.9p_1) + \log(0.9+0.8p_2) + \log(1+0.7p_3)\\

\textrm{subject to} && p_1+p_2+p_3 &= 1 \\

&& p_1,p_2,p_3 &\ge 0

\end{aligned}

$$

We have three inequality constraints here. The Lagrangian function is defined as

$$

\begin{aligned}

& L(p_1,p_2,p_3,\lambda,\theta_1,\theta_2,\theta_3) \\

=\ & \log(1+0.9p_1) + \log(0.9+0.8p_2) + \log(1+0.7p_3) \\

& – \lambda(p_1+p_2+p_3-1) \\

& – \theta_1(p_1-s_1^2) – \theta_2(p_2-s_2^2) – \theta_3(p_3-s_3^2)

\end{aligned}

$$

The gradient is therefore

$$

\begin{aligned}

\frac{\partial L}{\partial p_1} & = \frac{0.9}{1+0.9p_1}-\lambda-\theta_1 \\

\frac{\partial L}{\partial p_2} & = \frac{0.8}{0.9+0.8p_2}-\lambda-\theta_2 \\

\frac{\partial L}{\partial p_3} & = \frac{0.7}{1+0.7p_3}-\lambda-\theta_3 \\

\frac{\partial L}{\partial\lambda} &= 1-p_1-p_2-p_3 \\

\frac{\partial L}{\partial\theta_1} &= s_1^2-p_1 \\

\frac{\partial L}{\partial\theta_2} &= s_2^2-p_2 \\

\frac{\partial L}{\partial\theta_3} &= s_3^2-p_3 \\

\end{aligned}

$$

But now we have 3 slack variables and we have to consider 8 cases:

- $\theta_1=\theta_2=\theta_3=0$, hence none of $s_1^2,s_2^2,s_3^2$ are zero
- $\theta_1=\theta_2=0$ but $\theta_3\ne 0$, hence only $s_3^2=0$
- $\theta_1=\theta_3=0$ but $\theta_2\ne 0$, hence only $s_2^2=0$
- $\theta_2=\theta_3=0$ but $\theta_1\ne 0$, hence only $s_1^2=0$
- $\theta_1=0$ but $\theta_2,\theta_3$ non-zero, hence only $s_2^2=s_3^2=0$
- $\theta_2=0$ but $\theta_1,\theta_3$ non-zero, hence only $s_1^2=s_3^2=0$
- $\theta_3=0$ but $\theta_1,\theta_2$ non-zero, hence only $s_1^2=s_2^2=0$
- all of $\theta_1,\theta_2,\theta_3$ are non-zero, hence $s_1^2=s_2^2=s_3^2=0$

Immediately we can tell that case 8 is infeasible, since $\partial L/\partial\theta_i=0$ would force $p_1=p_2=p_3=0$, which cannot satisfy $\partial L/\partial\lambda=0$.

For case 1, we have

$$

\frac{0.9}{1+0.9p_1}=\frac{0.8}{0.9+0.8p_2}=\frac{0.7}{1+0.7p_3}=\lambda

$$

from $\partial L/\partial p_1=\partial L/\partial p_2=\partial L/\partial p_3=0$. Together with $p_3=1-p_1-p_2$ from $\partial L/\partial\lambda=0$, we found the solution to be $p_1=0.444$, $p_2=0.430$, $p_3=0.126$, and the objective function $f(p_1,p_2,p_3)=0.639$.

For case 2, we have $p_3=0$ from $\partial L/\partial\theta_3=0$. Further, using $p_2=1-p_1$ from $\partial L/\partial\lambda=0$, and

$$

\frac{0.9}{1+0.9p_1}=\frac{0.8}{0.9+0.8p_2}=\lambda

$$

from $\partial L/\partial p_1=\partial L/\partial p_2=0$, we can solve for $p_1=0.507$ and $p_2=0.493$. The objective function $f(p_1,p_2,p_3)=0.634$.

Similarly in case 3, $p_2=0$ and we solved $p_1=0.659$ and $p_3=0.341$, with the objective function $f(p_1,p_2,p_3)=0.574$.

In case 4, we have $p_1=0$, $p_2=0.652$, $p_3=0.348$, and the objective function $f(p_1,p_2,p_3)=0.570$.

For case 5, we have $p_2=p_3=0$ and hence $p_1=1$. The objective function is then $f(p_1,p_2,p_3)=0.536$.

Similarly, in case 6 and case 7 we have $p_2=1$ and $p_3=1$ respectively. The objective function attains 0.531 and 0.425 respectively.

Comparing all these cases, we find that the maximum value of the objective function is attained in case 1. Hence the solution to this optimization problem is

$p_1=0.444$, $p_2=0.430$, $p_3=0.126$, with $f(p_1,p_2,p_3)=0.639$.
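The whole case analysis can be cross-checked by handing the problem directly to a general-purpose constrained solver. A sketch using SciPy's SLSQP, with the objective reconstructed from the partial derivatives quoted above (SciPy minimizes, so we negate $f$):

```python
import numpy as np
from scipy.optimize import minimize

# Objective to maximize, reconstructed from the partial derivatives above
def f(p):
    return np.log(1 + 0.9 * p[0]) + np.log(0.9 + 0.8 * p[1]) + np.log(1 + 0.7 * p[2])

res = minimize(
    lambda p: -f(p),                 # SciPy minimizes, so negate the objective
    x0=[1 / 3, 1 / 3, 1 / 3],
    bounds=[(0, 1)] * 3,             # the inequality constraints p_i >= 0
    constraints={"type": "eq", "fun": lambda p: np.sum(p) - 1},
    method="SLSQP",
)
print(res.x, f(res.x))   # agrees with case 1: p ≈ (0.444, 0.430, 0.126), f ≈ 0.639
```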

While in the above example we introduced slack variables into the Lagrangian function, some books may prefer not to add slack variables but instead to restrict the Lagrange multipliers for inequality constraints to be non-negative. In that case you may see the Lagrangian function written as

$$
L(X, \lambda, \theta, \phi) = f(X) - \lambda g(X) - \theta h(X) + \phi k(X)
$$

but with the requirement that $\theta\ge 0$ and $\phi\ge 0$.

The Lagrangian function is also useful in the primal-dual approach for finding the maximum or minimum. This is particularly helpful when the objective or the constraints are non-linear, so that the solution may not be easily found otherwise.

Some books that cover this topic are:

- Convex Optimization by Stephen Boyd and Lieven Vandenberghe, 2004
- Chapter 4 of Deep Learning by Ian Goodfellow et al, 2016

In this tutorial, you discovered how the method of Lagrange multipliers can be applied to inequality constraints. Specifically, you learned:

- Lagrange multipliers and the Lagrange function in presence of inequality constraints
- How to use KKT conditions to solve an optimization problem when inequality constraints are given

The post Lagrange Multiplier Approach with Inequality Constraints appeared first on Machine Learning Mastery.

The post Calculus in Action: Neural Networks appeared first on Machine Learning Mastery.

An artificial neural network is inspired by the structure of the human brain, in that it is similarly composed of a network of interconnected neurons that propagate information upon receiving sets of stimuli from neighbouring neurons.

Training a neural network involves a process that employs the backpropagation and gradient descent algorithms in tandem. As we will be seeing, both of these algorithms make extensive use of calculus.

In this tutorial, you will discover how aspects of calculus are applied in neural networks.

After completing this tutorial, you will know:

- An artificial neural network is organized into layers of neurons and connections, where the latter are attributed a weight value each.
- Each neuron implements a nonlinear function that maps a set of inputs to an output activation.
- In training a neural network, calculus is used extensively by the backpropagation and gradient descent algorithms.

Let’s get started.

This tutorial is divided into three parts; they are:

- An Introduction to the Neural Network
- The Mathematics of a Neuron
- Training the Network

For this tutorial, we assume that you already know what are:

- Function approximation
- Rate of change
- Partial derivatives
- The chain rule
- The chain rule on more functions
- Gradient descent

You can review these concepts by clicking on the links given above.

Artificial neural networks can be considered as function approximation algorithms.

In a supervised learning setting, when presented with many input observations representing the problem of interest, together with their corresponding target outputs, the artificial neural network will seek to approximate the mapping that exists between the two.

A neural network is a computational model that is inspired by the structure of the human brain. – Page 65, Deep Learning, 2019.

The human brain consists of a massive network of interconnected neurons (around one hundred billion of them), with each comprising a cell body, a set of fibres called dendrites, and an axon:

The dendrites act as the input channels to a neuron, whereas the axon acts as the output channel. Therefore, a neuron would receive input signals through its dendrites, which in turn would be connected to the (output) axons of other neighbouring neurons. In this manner, a sufficiently strong electrical pulse (also called an action potential) can be transmitted along the axon of one neuron, to all the other neurons that are connected to it. This permits signals to be propagated along the structure of the human brain.

So, a neuron acts as an all-or-none switch, that takes in a set of inputs and either outputs an action potential or no output. – Page 66, Deep Learning, 2019.

An artificial neural network is analogous to the structure of the human brain, because (1) it is similarly composed of a large number of interconnected neurons that, (2) seek to propagate information across the network by, (3) receiving sets of stimuli from neighbouring neurons and mapping these to outputs, to be fed to the next layer of neurons.

The structure of an artificial neural network is typically organised into layers of neurons (recall the depiction of a tree diagram). For example, the following diagram illustrates a fully-connected neural network, where all the neurons in one layer are connected to all the neurons in the next layer:

The inputs are presented on the left hand side of the network, and the information propagates (or flows) rightward towards the outputs at the opposite end. Since the information is, hereby, propagating in the *forward* direction through the network, then we would also refer to such a network as a *feedforward neural network*.

The layers of neurons in between the input and output layers are called *hidden* layers, because they are not directly accessible.

Each connection (represented by an arrow in the diagram) between two neurons is attributed a weight, which acts on the data flowing through the network, as we will see shortly.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

More specifically, let’s say that a particular artificial neuron (or a *perceptron*, as Frank Rosenblatt had initially named it) receives *n* inputs, [*x*_{1}, …, *x*_{n}], where each connection is attributed a corresponding weight, [*w*_{1}, …, *w*_{n}].

The first operation that is carried out multiplies the input values by their corresponding weight, and adds a bias term, *b*, to their sum, producing an output, *z*:

*z* = ((*x*_{1} × *w*_{1}) + (*x*_{2} × *w*_{2}) + … + (*x*_{n} × *w*_{n})) + *b*

We can, alternatively, represent this operation in a more compact form as follows:

$$z = \sum_{i=1}^{n} x_i w_i + b$$

This weighted sum calculation that we have performed so far is a linear operation. If every neuron had to implement this particular calculation alone, then the neural network would be restricted to learning only linear input-output mappings.

However, many of the relationships in the world that we might want to model are nonlinear, and if we attempt to model these relationships using a linear model, then the model will be very inaccurate. – Page 77, Deep Learning, 2019.

Hence, a second operation is performed by each neuron that transforms the weighted sum by the application of a nonlinear activation function, *a*(.):

$$a(z) = a\left(\sum_{i=1}^{n} x_i w_i + b\right)$$

We can represent the operations performed by each neuron even more compactly, if we had to integrate the bias term into the sum as another weight, *w*_{0} (notice that the sum now starts from 0, with *x*_{0} = 1):

$$a\left(\sum_{i=0}^{n} x_i w_i\right)$$

The operations performed by each neuron can be illustrated as follows:

Therefore, each neuron can be considered to implement a nonlinear function that maps a set of inputs to an output activation.
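The two operations of a neuron translate directly into code. The sketch below uses the sigmoid as the nonlinear activation function, with arbitrary illustrative values for the inputs, weights and bias:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b):
    # weighted sum of the inputs plus the bias, then the nonlinear activation
    z = np.dot(x, w) + b
    return sigmoid(z)

x = np.array([0.5, -1.0, 2.0])   # example inputs (arbitrary values)
w = np.array([0.1, 0.4, -0.2])   # example weights (arbitrary values)
print(neuron(x, w, b=0.3))       # a single output activation in (0, 1)
```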

Training an artificial neural network involves the process of searching for the set of weights that model best the patterns in the data. It is a process that employs the backpropagation and gradient descent algorithms in tandem. Both of these algorithms make extensive use of calculus.

Each time that the network is traversed in the forward (or rightward) direction, the error of the network can be calculated as the difference between the output produced by the network and the expected ground truth, by means of a loss function (such as the sum of squared errors (SSE)). The backpropagation algorithm, then, calculates the gradient (or the rate of change) of this error with respect to changes in the weights. In order to do so, it requires the use of the chain rule and partial derivatives.

For simplicity, consider a network made up of two neurons connected by a single path of activation. If we had to break them open, we would find that the neurons perform the following operations in cascade:

The first application of the chain rule connects the overall error of the network to the input, *z*_{2}, of the activation function *a*_{2} of the second neuron, and subsequently to the weight, *w*_{2}, as follows:

You may notice that the application of the chain rule involves, among other terms, a multiplication by the partial derivative of the neuron’s activation function with respect to its input, *z*_{2}. There are different activation functions to choose from, such as the sigmoid or the logistic functions. If we had to take the logistic function as an example, then its partial derivative would be computed as follows:

Hence, we can compute 𝛿_{2} as follows:

Here, *t*_{2} is the expected activation, and in finding the difference between *t*_{2} and *a*_{2} we are, therefore, computing the error between the activation generated by the network and the expected ground truth.

Since we are computing the derivative of the activation function, it should, therefore, be continuous and differentiable over the entire space of real numbers. In the case of deep neural networks, the error gradient is propagated backwards over a large number of hidden layers. This can cause the error signal to rapidly diminish to zero, especially if the maximum value of the derivative function is already small to begin with (for instance, the derivative of the logistic function has a maximum value of 0.25). This is known as the *vanishing gradient problem*. The ReLU function is popularly used in deep learning to alleviate this problem, because its derivative in the positive portion of its domain is equal to 1.

The next weight backwards is deeper into the network and, hence, the application of the chain rule can similarly be extended to connect the overall error to the weight, *w*_{1}, as follows:

If we take the logistic function again as the activation function of choice, then we would compute 𝛿_{1} as follows:

Once we have computed the gradient of the network error with respect to each weight, then the gradient descent algorithm can be applied to update each weight for the next *forward propagation* at time, *t*+1. For the weight, *w*_{1}, the weight update rule using gradient descent would be specified as follows:

$$w_1^{t+1} = w_1^t - \eta \frac{\partial \text{error}}{\partial w_1}$$
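The full loop for the two-neuron chain fits in a short script. This is a sketch under stated assumptions: the loss is taken as half the squared error, biases are omitted, the logistic function is the activation, and the input, target, initial weights and learning rate are arbitrary illustrative values.

```python
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

x, t2 = 0.5, 0.8      # a single input and its target (illustrative values)
w1, w2 = 0.4, -0.6    # initial weights (illustrative values)
eta = 0.5             # learning rate

for _ in range(5000):
    # forward pass through the two neurons
    a1 = logistic(w1 * x)
    a2 = logistic(w2 * a1)
    # backward pass: the delta terms follow from the chain rule
    d2 = (a2 - t2) * a2 * (1 - a2)   # dE/dz2 for E = (t2 - a2)**2 / 2
    d1 = d2 * w2 * a1 * (1 - a1)     # dE/dz1
    # gradient descent updates for the next forward propagation
    w2 -= eta * d2 * a1
    w1 -= eta * d1 * x

print(a2)   # the output activation approaches the target 0.8
```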

Even though we have hereby considered a simple network, the process that we have gone through can be extended to evaluate more complex and deeper ones, such convolutional neural networks (CNNs).

If the network under consideration is characterised by multiple branches coming from multiple inputs (and possibly flowing towards multiple outputs), then its evaluation would involve the summation of different derivative chains for each path, similarly to how we have previously derived the generalized chain rule.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2019.
- Pattern Recognition and Machine Learning, 2016.

In this tutorial, you discovered how aspects of calculus are applied in neural networks.

Specifically, you learned:

- An artificial neural network is organized into layers of neurons and connections, where the latter are each attributed a weight value.
- Each neuron implements a nonlinear function that maps a set of inputs to an output activation.
- In training a neural network, calculus is used extensively by the backpropagation and gradient descent algorithms.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post A Gentle Introduction to Taylor Series appeared first on Machine Learning Mastery.

Taylor series expansion is an awesome concept, not only in the world of mathematics, but also in optimization theory, function approximation and machine learning. It is widely applied in numerical computations when estimates of a function’s values at different points are required.

In this tutorial, you will discover Taylor series and how to approximate the values of a function around different points using its Taylor series expansion.

After completing this tutorial, you will know:

- Taylor series expansion of a function
- How to approximate functions using Taylor series expansion

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Power series and Taylor series
- Taylor polynomials
- Function approximation using Taylor polynomials

The following is a power series about the center x=a with constant coefficients c_0, c_1, etc.:

$$\sum_{k=0}^{\infty} c_k(x-a)^k = c_0 + c_1(x-a) + c_2(x-a)^2 + \cdots$$

It is an amazing fact that functions which are infinitely differentiable can generate a power series called the Taylor series. Suppose we have a function f(x) and f(x) has derivatives of all orders on a given interval, then the Taylor series generated by f(x) at x=a is given by:

$$f(x) = \sum_{k=0}^{\infty} c_k(x-a)^k$$

$$c_k = \frac{f^{(k)}(a)}{k!}$$

The second line of the above expression gives the value of the kth coefficient.

If we set a=0, then we have an expansion called the Maclaurin series expansion of f(x).


The Taylor series generated by f(x) = 1/x can be found by first differentiating the function and finding a general expression for the kth derivative:

$$f^{(k)}(x) = (-1)^k \frac{k!}{x^{k+1}}$$

The Taylor series about various points can now be found. For example:

A Taylor polynomial of order k, generated by f(x) at x=a, is given by:

$$P_k(x) = \sum_{i=0}^{k} \frac{f^{(i)}(a)}{i!}(x-a)^i$$

For the example of f(x)=1/x, the Taylor polynomial of order 2 about x=a is given by:

$$P_2(x) = \frac{1}{a} - \frac{x-a}{a^2} + \frac{(x-a)^2}{a^3}$$

We can approximate the value of a function around a point x=a using Taylor polynomials. The higher the order of the polynomial, the more terms it has, and the closer the approximation is to the actual value of the function at that point.

In the graph below, the function 1/x is plotted around the point x=1 (left) and x=3 (right). The line in green is the actual function f(x)= 1/x. The pink line represents the approximation via an order 2 polynomial.
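The order-2 polynomial shown in the plot can be generated programmatically. A sketch with SymPy, taking the center x=1 as in the left plot:

```python
import sympy as sp

x = sp.symbols('x')
f = 1 / x

# Order-2 Taylor polynomial of 1/x about x = 1 (keep terms up to (x - 1)**2)
p2 = sp.series(f, x, 1, 3).removeO()
print(sp.expand(p2))   # equivalent to 1 - (x - 1) + (x - 1)**2

# Near the center the polynomial tracks the function closely
print(p2.subs(x, 1.2), f.subs(x, 1.2))   # roughly 0.84 versus 0.8333...
```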

Let’s look at the function g(x) = e^x. Noting the fact that the kth order derivative of g(x) is also g(x), the expansion of g(x) about x=a is given by:

$$g(x) = \sum_{k=0}^{\infty} \frac{e^a}{k!}(x-a)^k$$

Hence, around x=0, the series expansion of g(x) is given by (obtained by setting a=0):

$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

The polynomial of order k generated for the function e^x around the point x=0 is given by:

$$P_k(x) = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^k}{k!}$$

The plots below show polynomials of different orders that estimate the value of e^x around x=0. We can see that as we move away from zero, we need more terms to approximate e^x more accurately. The green line representing the actual function is hiding behind the blue line of the approximating polynomial of order 7.
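The same behaviour can be checked numerically: the order-k Maclaurin polynomial of e^x is just the finite sum of the terms x^i/i!. A minimal sketch:

```python
import math

def taylor_exp(x, k):
    # order-k Maclaurin polynomial of e^x: the sum of x**i / i! for i = 0..k
    return sum(x ** i / math.factorial(i) for i in range(k + 1))

# Away from zero, more terms are needed for an accurate estimate of e^2
for k in (1, 3, 7):
    print(k, taylor_exp(2.0, k), math.exp(2.0))
```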

A popular method in machine learning for finding the optimal points of a function is Newton’s method. Newton’s method uses second-order polynomials to approximate a function’s value at a point. Such methods that use second-order derivatives are called second-order optimization algorithms.
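As a sketch of the idea, Newton’s method minimizes the local second-order Taylor approximation at each step, which gives the update x ← x − f′(x)/f″(x). Here it is applied to f(x) = x − ln x, whose minimum is at x = 1 (an illustrative choice):

```python
def newton_minimize(fprime, fprime2, x, steps=10):
    # repeatedly minimize the local quadratic (order-2 Taylor) model
    for _ in range(steps):
        x = x - fprime(x) / fprime2(x)
    return x

# f(x) = x - ln(x): f'(x) = 1 - 1/x, f''(x) = 1/x**2
x_star = newton_minimize(lambda x: 1 - 1 / x, lambda x: 1 / x ** 2, x=0.5)
print(x_star)   # converges quadratically to the minimizer x = 1
```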

This section lists some ideas for extending the tutorial that you may wish to explore.

- Newton’s method
- Second order optimization algorithms

If you explore any of these extensions, I’d love to know. Post your findings in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Jason Brownlee’s excellent resource on Calculus Books for Machine Learning

- Pattern recognition and machine learning by Christopher M. Bishop.
- Deep learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville.
- Thomas’ Calculus, 14th edition, 2017. (based on the original works of George B. Thomas, revised by Joel Hass, Christopher Heil, Maurice Weir)
- Calculus, 3rd Edition, 2017. (Gilbert Strang)
- Calculus, 8th edition, 2015. (James Stewart)

In this tutorial, you discovered what the Taylor series expansion of a function about a point is. Specifically, you learned:

- Power series and Taylor series
- Taylor polynomials
- How to approximate functions around a value using Taylor polynomials

Ask your questions in the comments below and I will do my best to answer.


The post A Gentle Introduction To Approximation appeared first on Machine Learning Mastery.

In this tutorial, you will discover what approximation is and its importance in machine learning and pattern recognition.

After completing this tutorial, you will know:

- What is approximation
- Importance of approximation in machine learning

Let’s get started.

This tutorial is divided into 3 parts; they are:

- What is approximation?
- Approximation when the form of function is not known
- Approximation when the form of function is known

We come across approximation very often. For example, the irrational number π can be approximated by the number 3.14. A more accurate value is 3.141593, which remains an approximation. You can similarly approximate the values of all irrational numbers like sqrt(3), sqrt(7), etc.

Approximation is used whenever a numerical value, a model, a structure or a function is either unknown or difficult to compute. In this article we’ll focus on function approximation and describe its application to machine learning problems. There are two different cases:

- The function is known, but it is difficult or numerically expensive to compute its exact value. In this case, approximation methods are used to find values that are close to the function’s actual values.
- The function itself is unknown, and hence a model or learning algorithm is used to find a function that can produce outputs close to the unknown function’s outputs.

If the form of a function is known, then a well known method in calculus and mathematics is approximation via Taylor series. The Taylor series of a function is an infinite sum of terms, which are computed using the function’s derivatives. The Taylor series expansion of a function is discussed in this tutorial.

Another well known method for approximation in calculus and mathematics is Newton’s method. It can be used to approximate the roots of polynomials, making it a useful technique for approximating quantities such as the square roots or reciprocals of different numbers.
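For instance, approximating sqrt(7) amounts to finding the positive root of f(x) = x² − 7, and Newton’s update x ← x − f(x)/f′(x) converges in a handful of iterations (a sketch):

```python
import math

x = 3.0   # initial guess for the root of f(x) = x**2 - 7
for _ in range(6):
    x = x - (x * x - 7) / (2 * x)   # Newton's update: x - f(x) / f'(x)
print(x, math.sqrt(7))              # both approximately 2.6457513...
```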


In data science and machine learning, it is assumed that there is an underlying function that holds the key to the relationship between the inputs and outputs. The form of this function is unknown. Here, we discuss several machine learning problems that employ approximation.

Regression involves the prediction of an output variable when given a set of inputs. In regression, the function that truly maps the input variables to outputs is not known. It is assumed that some linear or non-linear regression model can approximate the mapping of inputs to outputs.

For example, we may have data related to consumed calories per day and the corresponding blood sugar. To describe the relationship between the calorie input and blood sugar output, we can assume a straight line relationship/mapping function. The straight line is therefore the approximation of the mapping of inputs to outputs. A learning method such as the method of least squares is used to find this line.
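A sketch of such a fit with NumPy’s least-squares polynomial fit; the calorie/blood-sugar numbers below are synthetic, generated only to illustrate the idea:

```python
import numpy as np

# Synthetic data: blood sugar roughly linear in calories, plus noise
rng = np.random.default_rng(0)
calories = rng.uniform(1500, 3000, size=50)
blood_sugar = 0.02 * calories + 60 + rng.normal(0, 5, size=50)

# Least-squares fit of the straight line y = m*x + c
m, c = np.polyfit(calories, blood_sugar, deg=1)
print(m, c)   # close to the generating slope 0.02 and intercept 60
```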

A classic example of models that approximate functions in classification problems is that of neural networks. It is assumed that the neural network as a whole can approximate a true function that maps the inputs to the class labels. Gradient descent or some other learning algorithm is then used to learn that function approximation by adjusting the weights of the neural network.

Below is a typical example of unsupervised learning. Here we have points in 2D space, and none of these points has a given label. A clustering algorithm generally assumes a model according to which a point can be assigned to a class or label. For example, k-means learns the labels of data by assuming that data clusters are circular, and hence assigns the same label or class to points lying in the same circle, or an n-sphere in the case of multi-dimensional data. In the figure below, we are approximating the relationship between points and their labels via circular functions.
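The circular-cluster assumption is easy to see in code. Below is a minimal k-means sketch on synthetic 2D data with two well-separated blobs (all numbers are illustrative; in practice a library implementation would normally be used):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic circular clusters of 100 points each, centred at (0,0) and (4,4)
points = np.vstack([
    rng.normal([0, 0], 0.5, size=(100, 2)),
    rng.normal([4, 4], 0.5, size=(100, 2)),
])

# Minimal k-means: alternate between assigning each point to its nearest
# centroid and moving each centroid to the mean of its assigned points.
centroids = points[[0, 100]]   # one simple starting point taken from each blob
for _ in range(10):
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(centroids)   # one centroid near (0, 0), the other near (4, 4)
```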

This section lists some ideas for extending the tutorial that you may wish to explore.

- Maclaurin series
- Taylor’s series

If you explore any of these extensions, I’d love to know. Post your findings in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Jason Brownlee’s excellent resource on Calculus Books for Machine Learning

- Pattern recognition and machine learning by Christopher M. Bishop.
- Deep learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville.
- Thomas’ Calculus, 14th edition, 2017. (based on the original works of George B. Thomas, revised by Joel Hass, Christopher Heil, Maurice Weir)

In this tutorial, you discovered what approximation is. Specifically, you learned:

- Approximation
- Approximation when the form of a function is known
- Approximation when the form of a function is unknown

Ask your questions in the comments below and I will do my best to answer.


The post The Chain Rule of Calculus – Even More Functions appeared first on Machine Learning Mastery.

In this tutorial, you will discover how to apply the chain rule of calculus to challenging functions.

After completing this tutorial, you will know:

- The process of applying the chain rule to univariate functions can be extended to multivariate ones.
- The application of the chain rule follows a similar process, no matter how complex the function is: take the derivative of the outer function first, and then move inwards. Along the way, the application of other derivative rules might be required.
- Applying the chain rule to multivariate functions requires the use of partial derivatives.

Let’s get started.

This tutorial is divided into two parts; they are:

- The Chain Rule on Univariate Functions
- The Chain Rule on Multivariate Functions

For this tutorial, we assume that you already know what are:

You can review these concepts by clicking on the links given above.

We have already discovered the chain rule for univariate and multivariate functions, but we have only seen a few simple examples so far. Let’s see a few more challenging ones here. We will be starting with univariate functions first, and then apply what we learn to multivariate functions.

**EXAMPLE 1**: Let’s raise the bar a little by considering the following composite function:

We can separate the composite function into the inner function, *f*(*x*) = *x*^{2} – 10, and the outer function, *g*(*x*) = √*x* = (*x*)^{1/2}. The output of the inner function is denoted by the intermediate variable, *u*, and its value will be fed into the input of the outer function.

The first step is to find the derivative of the outer part of the composite function, while ignoring whatever is inside. For this purpose, we can apply the power rule:

*dh / du* = (1/2) (*x*^{2} – 10)^{-1/2}

The next step is to find the derivative of the inner part of the composite function, this time ignoring whatever is outside. We can apply the power rule here too:

*du / dx* = 2*x*

Putting the two parts together and simplifying, we have:

**EXAMPLE 2**: Let’s repeat the procedure, this time with a different composite function:

We will again use, *u*, the output of the inner function, as our intermediate variable.

The outer function in this case is, cos *x*. Finding its derivative, again ignoring the inside, gives us:

*dh* / *du* = (cos(*x*^{3} – 1))’ = -sin(*x*^{3} – 1)

The inner function is, *x*^{3} – 1. Hence, its derivative becomes:

*du* / *dx* = (*x*^{3} – 1)’ = 3*x*^{2}

Putting the two parts together, we obtain the derivative of the composite function:

$$\frac{dh}{dx} = -3x^2 \sin(x^3-1)$$

**EXAMPLE 3**: Let’s now raise the bar a little further by considering a more challenging composite function:

If we observe this closely, we realize that not only do we have nested functions for which we will need to apply the chain rule multiple times, but we also have a product to which we will need to apply the product rule.

We find that the outermost function is a cosine. In finding its derivative by the chain rule, we shall be using the intermediate variable, *u*:

*dh* / *du* = (cos(*x *√(*x*^{2} – 10) ))’ = -sin(*x *√(*x*^{2} – 10) )

Inside the cosine, we have the product, *x *√(x^{2} – 10), to which we will be applying the product rule to find its derivative (notice that we are always moving from the outside to the inside, in order to discover the operation that needs to be tackled next):

*du* / *dx* = (*x *√(x^{2} – 10) )’ = √(x^{2} – 10) + *x* ( √(x^{2} – 10) )’

One of the components in the resulting term is, ( √(x^{2} – 10) )’, to which we shall be applying the chain rule again. Indeed, we have already done so above, and hence we can simply re-utilise the result:

( √(x^{2} – 10) )’ = *x* (*x*^{2} – 10)^{-1/2}

Putting all the parts together, we obtain the derivative of the composite function:

$$\frac{dh}{dx} = -\sin\left(x\sqrt{x^2-10}\right)\left(\sqrt{x^2-10} + x^2(x^2-10)^{-1/2}\right)$$

This can be simplified further into:

$$\frac{dh}{dx} = -\frac{2x^2-10}{\sqrt{x^2-10}}\sin\left(x\sqrt{x^2-10}\right)$$
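All three univariate results can be double-checked with SymPy’s `diff` (a quick verification sketch; the expected expressions are the ones derived in the examples above):

```python
import sympy as sp

x = sp.symbols('x')

# Example 1: h(x) = sqrt(x**2 - 10)
h1 = sp.sqrt(x ** 2 - 10)
assert sp.simplify(sp.diff(h1, x) - x / sp.sqrt(x ** 2 - 10)) == 0

# Example 2: h(x) = cos(x**3 - 1)
h2 = sp.cos(x ** 3 - 1)
assert sp.simplify(sp.diff(h2, x) + 3 * x ** 2 * sp.sin(x ** 3 - 1)) == 0

# Example 3: h(x) = cos(x * sqrt(x**2 - 10))
h3 = sp.cos(x * sp.sqrt(x ** 2 - 10))
expected = -sp.sin(x * sp.sqrt(x ** 2 - 10)) * (
    sp.sqrt(x ** 2 - 10) + x ** 2 / sp.sqrt(x ** 2 - 10)
)
assert sp.simplify(sp.diff(h3, x) - expected) == 0

print("all chain-rule results verified")
```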


**EXAMPLE 4**: Suppose that we are now presented by a multivariate function of two independent variables, *s* and *t*, with each of these variables being dependent on another two independent variables, *x* and *y*:

*h* = *g*(*s*, *t*) = *s*^{2} + *t*^{3}

Where the functions, *s *= *xy*, and *t* = 2*x* – *y*.

Implementing the chain rule here requires the computation of partial derivatives, since we are working with multiple independent variables. Furthermore, *s* and *t* will also act as our intermediate variables. The formulae that we will be working with, defined with respect to each input, are the following:

$$\frac{\partial h}{\partial x}=\frac{\partial h}{\partial s}\frac{\partial s}{\partial x}+\frac{\partial h}{\partial t}\frac{\partial t}{\partial x}$$

$$\frac{\partial h}{\partial y}=\frac{\partial h}{\partial s}\frac{\partial s}{\partial y}+\frac{\partial h}{\partial t}\frac{\partial t}{\partial y}$$

From these formulae, we can see that we will need to find six different partial derivatives:

$$\frac{\partial h}{\partial s}=2s,\quad \frac{\partial h}{\partial t}=3t^2,\quad \frac{\partial s}{\partial x}=y,\quad \frac{\partial s}{\partial y}=x,\quad \frac{\partial t}{\partial x}=2,\quad \frac{\partial t}{\partial y}=-1$$

We can now proceed to substitute these terms in the formulae for ∂*h*/∂*x* and ∂*h*/∂*y*:

$$\frac{\partial h}{\partial x}=2s\cdot y+3t^2\cdot 2$$

$$\frac{\partial h}{\partial y}=2s\cdot x+3t^2\cdot(-1)$$

And subsequently substitute for *s* and *t* to find the derivatives:

$$\frac{\partial h}{\partial x}=2xy^2+6(2x-y)^2$$

$$\frac{\partial h}{\partial y}=2x^2y-3(2x-y)^2$$
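Example 4 can likewise be verified with SymPy, by substituting s and t into h and differentiating directly; the result matches the chain-rule construction (a quick check):

```python
import sympy as sp

x, y = sp.symbols('x y')
s, t = x * y, 2 * x - y          # the inner functions of Example 4
h = s ** 2 + t ** 3

# Direct partial derivatives after substitution...
dh_dx = sp.diff(h, x)
dh_dy = sp.diff(h, y)

# ...match the chain-rule expressions dh/ds * ds/dx + dh/dt * dt/dx, etc.
assert sp.simplify(dh_dx - (2 * s * y + 3 * t ** 2 * 2)) == 0
assert sp.simplify(dh_dy - (2 * s * x + 3 * t ** 2 * (-1))) == 0
print(sp.expand(dh_dx), sp.expand(dh_dy))
```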

**EXAMPLE 5**: Let’s repeat this again, this time with a multivariate function of three independent variables, $r$, $s$ and $t$, with each of these variables being dependent on another two independent variables, $x$ and $y$:

$$h=g(r,s,t)=r^2-rs+t^3$$

Where the functions, $r = x \cos y$, $s=xe^y$, and $t=x+y$.

This time round, $r$, $s$ and $t$ will act as our intermediate variables. The formulae that we will be working with, defined with respect to each input, are the following:

$$\frac{\partial h}{\partial x}=\frac{\partial h}{\partial r}\frac{\partial r}{\partial x}+\frac{\partial h}{\partial s}\frac{\partial s}{\partial x}+\frac{\partial h}{\partial t}\frac{\partial t}{\partial x}$$

$$\frac{\partial h}{\partial y}=\frac{\partial h}{\partial r}\frac{\partial r}{\partial y}+\frac{\partial h}{\partial s}\frac{\partial s}{\partial y}+\frac{\partial h}{\partial t}\frac{\partial t}{\partial y}$$

From these formulae, we can see that we will now need to find nine different partial derivatives:

$$\frac{\partial h}{\partial r}=2r-s,\quad \frac{\partial h}{\partial s}=-r,\quad \frac{\partial h}{\partial t}=3t^2$$

$$\frac{\partial r}{\partial x}=\cos y,\quad \frac{\partial s}{\partial x}=e^y,\quad \frac{\partial t}{\partial x}=1$$

$$\frac{\partial r}{\partial y}=-x\sin y,\quad \frac{\partial s}{\partial y}=xe^y,\quad \frac{\partial t}{\partial y}=1$$

Again, we proceed to substitute these terms in the formulae for ∂*h*/∂*x* and ∂*h*/∂*y*:

$$\frac{\partial h}{\partial x}=(2r-s)\cos y-re^y+3t^2$$

$$\frac{\partial h}{\partial y}=-(2r-s)x\sin y-rxe^y+3t^2$$

And subsequently substitute for $r$, $s$ and $t$ to find the derivatives:

$$\frac{\partial h}{\partial x}=2x\cos^2 y-2xe^y\cos y+3(x+y)^2$$

$$\frac{\partial h}{\partial y}=-2x^2\sin y\cos y+x^2e^y\sin y-x^2e^y\cos y+3(x+y)^2$$

Which may be simplified a little further (hint: apply the trigonometric identity $2\sin y\cos y=\sin 2y$ to $\partial h/\partial y$):

$$\frac{\partial h}{\partial y}=-x^2\sin 2y+x^2e^y(\sin y-\cos y)+3(x+y)^2$$

No matter how complex the expression is, the procedure to follow remains similar:

Your last computation tells you the first thing to do. – Page 143, Calculus for Dummies, 2016.

Hence, start by tackling the outer function first, then move inwards to the next one. You may need to apply other rules along the way, as we have seen for Example 3. Do not forget to take the partial derivatives if you are working with multivariate functions.

This section provides more resources on the topic if you are looking to go deeper.

- Calculus for Dummies, 2016.
- Single and Multivariable Calculus, 2020.
- Mathematics for Machine Learning, 2020.

In this tutorial, you discovered how to apply the chain rule of calculus to challenging functions.

Specifically, you learned:

- The process of applying the chain rule to univariate functions can be extended to multivariate ones.
- The application of the chain rule follows a similar process, no matter how complex the function is: take the derivative of the outer function first, and then move inwards. Along the way, the application of other derivative rules might be required.
- Applying the chain rule to multivariate functions requires the use of partial derivatives.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

