The post How to Choose an Activation Function for Deep Learning appeared first on Machine Learning Mastery.

Last Updated on January 19, 2021

**Activation functions** are a critical part of the design of a neural network.

The choice of activation function in the hidden layer will control how well the network model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make.

As such, a careful choice of activation function must be made for each deep learning neural network project.

In this tutorial, you will discover how to choose activation functions for neural network models.

After completing this tutorial, you will know:

- Activation functions are a key part of neural network design.
- The modern default activation function for hidden layers is the ReLU function.
- The activation function for output layers depends on the type of prediction problem.

Let’s get started.

This tutorial is divided into three parts; they are:

- Activation Functions
- Activation for Hidden Layers
- Activation for Output Layers

An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network.

Sometimes the activation function is called a “*transfer function*.” If the output range of the activation function is limited, then it may be called a “*squashing function*.” Many activation functions are nonlinear and may be referred to as the “*nonlinearity*” in the layer or the network design.

The choice of activation function has a large impact on the capability and performance of the neural network, and different activation functions may be used in different parts of the model.

Technically, the activation function is used within or after the internal processing of each node in the network, although networks are designed to use the same activation function for all nodes in a layer.

A network may have three types of layers: input layers that take raw input from the domain, **hidden layers** that take input from another layer and pass output to another layer, and **output layers** that make a prediction.

All hidden layers typically use the same activation function. The output layer will typically use a different activation function from the hidden layers and is dependent upon the type of prediction required by the model.

Activation functions are also typically differentiable, meaning the first-order derivative can be calculated for a given input value. This is required given that neural networks are typically trained using the backpropagation of error algorithm that requires the derivative of prediction error in order to update the weights of the model.

There are many different types of activation functions used in neural networks, although perhaps only a small number of functions are used in practice for hidden and output layers.

Let’s take a look at the activation functions used for each type of layer in turn.

A hidden layer in a neural network is a layer that receives input from another layer (such as another hidden layer or an input layer) and provides output to another layer (such as another hidden layer or an output layer).

A hidden layer does not directly contact input data or produce outputs for a model, at least in general.

A neural network may have zero or more hidden layers.

Typically, a differentiable nonlinear activation function is used in the hidden layers of a neural network. This allows the model to learn more complex functions than a network trained using a linear activation function.
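To see why the nonlinearity matters, the minimal sketch below stacks two layers with a linear activation and shows that they collapse into a single linear layer. The layer sizes, weights, and input are hypothetical random values chosen only for illustration.

```python
# sketch: two stacked linear layers are equivalent to one linear layer
from numpy import allclose
from numpy.random import default_rng

rng = default_rng(1)
# hypothetical weights and biases for a 3 -> 4 -> 2 network
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
# hypothetical input vector
x = rng.normal(size=3)

# forward pass with a linear (identity) activation in the hidden layer
two_layer = W2 @ (W1 @ x + b1) + b2

# a single linear layer with combined weights and bias
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

# confirm both give the same output
print(allclose(two_layer, one_layer))
```

Because the two computations agree, adding more linear layers does not add capacity; a nonlinear activation between the layers is what prevents this collapse.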

In order to get access to a much richer hypothesis space that would benefit from deep representations, you need a non-linearity, or activation function.

— Page 72, Deep Learning with Python, 2017.

There are perhaps three activation functions you may want to consider for use in hidden layers; they are:

- Rectified Linear Activation (**ReLU**)
- Logistic (**Sigmoid**)
- Hyperbolic Tangent (**Tanh**)

This is not an exhaustive list of activation functions used for hidden layers, but they are the most commonly used.

Let’s take a closer look at each in turn.

The rectified linear activation function, or ReLU activation function, is perhaps the most common function used for hidden layers.

It is common because it is both simple to implement and effective at overcoming the limitations of other previously popular activation functions, such as Sigmoid and Tanh. Specifically, it is less susceptible to vanishing gradients that prevent deep models from being trained, although it can suffer from other problems like saturated or “*dead*” units.

The ReLU function is calculated as follows:

- max(0.0, x)

This means that if the input value (x) is negative, then the value 0.0 is returned; otherwise, the input value is returned unchanged.

You can learn more about the details of the ReLU activation function in the tutorial "A Gentle Introduction to the Rectified Linear Unit (ReLU)", listed in the further reading section.

We can get an intuition for the shape of this function with the worked example below.

    # example plot for the relu activation function
    from matplotlib import pyplot

    # rectified linear function
    def rectified(x):
        return max(0.0, x)

    # define input data
    inputs = [x for x in range(-10, 10)]
    # calculate outputs
    outputs = [rectified(x) for x in inputs]
    # plot inputs vs outputs
    pyplot.plot(inputs, outputs)
    pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar kink shape of the ReLU activation function.

When using the ReLU function for hidden layers, it is a good practice to use a “*He Normal*” or “*He Uniform*” weight initialization and scale input data to the range 0-1 (normalize) prior to training.
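As a rough illustration of this advice, the sketch below draws weights with He normal scaling and normalizes inputs to the range 0-1 using plain NumPy. The function names, layer sizes, and data are hypothetical, and deep learning libraries may differ in the details (for example, by truncating the Gaussian).

```python
# sketch: He normal initialization and 0-1 input scaling with NumPy
from math import sqrt
from numpy.random import default_rng

# draw weights from a zero-mean Gaussian with std = sqrt(2 / fan_in)
def he_normal(fan_in, fan_out, rng):
    std = sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# rescale each input column to the range 0-1 (assumes no constant columns)
def min_max_scale(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

rng = default_rng(1)
# hypothetical layer with 256 inputs and 128 nodes
weights = he_normal(256, 128, rng)
# hypothetical raw input data: 100 rows, 256 columns
X = rng.uniform(-50.0, 50.0, size=(100, 256))
X_scaled = min_max_scale(X)
# summarize the weight spread and the scaled input range
print(weights.std(), X_scaled.min(), X_scaled.max())
```

In practice, the minimum and maximum used for scaling would be estimated on the training data only and then reused for new data.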

The sigmoid activation function is also called the logistic function.

It is the same function used in the logistic regression classification algorithm.

The function takes any real value as input and outputs values in the range 0 to 1. The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.

The sigmoid activation function is calculated as follows:

- 1.0 / (1.0 + e^-x)

Where e is a mathematical constant, which is the base of the natural logarithm.

We can get an intuition for the shape of this function with the worked example below.

    # example plot for the sigmoid activation function
    from math import exp
    from matplotlib import pyplot

    # sigmoid activation function
    def sigmoid(x):
        return 1.0 / (1.0 + exp(-x))

    # define input data
    inputs = [x for x in range(-10, 10)]
    # calculate outputs
    outputs = [sigmoid(x) for x in inputs]
    # plot inputs vs outputs
    pyplot.plot(inputs, outputs)
    pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar S-shape of the sigmoid activation function.
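One practical caveat: the sigmoid() function above calls exp(-x), and math.exp raises an OverflowError for very large negative inputs such as x=-1000. The sketch below shows one common way to avoid this by branching on the sign of the input; the name stable_sigmoid is hypothetical and this is only one possible approach.

```python
# sketch: a sigmoid that avoids overflow for large negative inputs
from math import exp

# numerically safer sigmoid
def stable_sigmoid(x):
    if x >= 0.0:
        # exp(-x) is at most 1.0 here, so it cannot overflow
        return 1.0 / (1.0 + exp(-x))
    # for negative x, use the equivalent form e^x / (1 + e^x)
    z = exp(x)
    return z / (1.0 + z)

# evaluate a few inputs, including extreme values
for x in [-1000.0, -2.0, 0.0, 2.0, 1000.0]:
    print('sigmoid(%.1f) = %.6f' % (x, stable_sigmoid(x)))
```

Both branches compute the same function; they only differ in which exponential is evaluated.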

When using the Sigmoid function for hidden layers, it is a good practice to use a “*Xavier Normal*” or “*Xavier Uniform*” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and scale input data to the range 0-1 (e.g. the range of the activation function) prior to training.

The hyperbolic tangent activation function is also referred to simply as the Tanh (also “*tanh*” and “*TanH*”) function.

It is very similar to the sigmoid activation function and even has the same S-shape.

The function takes any real value as input and outputs values in the range -1 to 1. The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.

The Tanh activation function is calculated as follows:

- (e^x – e^-x) / (e^x + e^-x)

Where e is a mathematical constant that is the base of the natural logarithm.

We can get an intuition for the shape of this function with the worked example below.

    # example plot for the tanh activation function
    from math import exp
    from matplotlib import pyplot

    # tanh activation function
    def tanh(x):
        return (exp(x) - exp(-x)) / (exp(x) + exp(-x))

    # define input data
    inputs = [x for x in range(-10, 10)]
    # calculate outputs
    outputs = [tanh(x) for x in inputs]
    # plot inputs vs outputs
    pyplot.plot(inputs, outputs)
    pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar S-shape of the Tanh activation function.
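The similarity to the sigmoid is more than visual: Tanh is a rescaled and shifted sigmoid, since tanh(x) = 2 * sigmoid(2x) - 1. The short sketch below checks this identity numerically for a handful of arbitrarily chosen inputs, using the tanh() function from Python's math module as a reference.

```python
# sketch: tanh is a rescaled and shifted sigmoid
from math import exp, tanh, isclose

# sigmoid activation function
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# compare the rescaled sigmoid to the library tanh for a few inputs
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
    print('x=%.1f rescaled_sigmoid=%.6f tanh=%.6f' % (x, rescaled, tanh(x)))
```

This is why the two functions share the same S-shape but differ in output range: 0 to 1 for the sigmoid and -1 to 1 for Tanh.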

When using the TanH function for hidden layers, it is a good practice to use a “*Xavier Normal*” or “*Xavier Uniform*” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and scale input data to the range -1 to 1 (e.g. the range of the activation function) prior to training.

A neural network will almost always have the same activation function in all hidden layers.

It is most unusual to vary the activation function through a network model.

Traditionally, the sigmoid activation function was the default activation function in the 1990s. Then, from perhaps the mid-to-late 1990s through the 2010s, the Tanh function became the default activation function for hidden layers.

… the hyperbolic tangent activation function typically performs better than the logistic sigmoid.

— Page 195, Deep Learning, 2016.

Both the sigmoid and Tanh functions can make the model more susceptible to problems during training, via the so-called vanishing gradients problem.
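A rough way to see where the problem comes from is to look at the derivative of the sigmoid, which is sigmoid(x) * (1 - sigmoid(x)) and never exceeds 0.25. The sketch below multiplies this best-case value across a hypothetical 10-layer network; it deliberately ignores the weights, so treat it as an intuition rather than a full analysis.

```python
# sketch: the sigmoid derivative shrinks when multiplied across layers
from math import exp

# sigmoid activation function
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# derivative of the sigmoid with respect to its input
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# the derivative peaks at x=0.0 with a value of 0.25
print('max derivative: %.2f' % sigmoid_grad(0.0))

# best-case product of derivatives through a hypothetical 10-layer network
n_layers = 10
best_case = sigmoid_grad(0.0) ** n_layers
print('product over %d layers: %.10f' % (n_layers, best_case))
```

Even in this best case, the error signal reaching the first layer is scaled by less than one millionth, which helps explain why deep sigmoid networks were difficult to train.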

You can learn more about this problem in the tutorial "How to Fix the Vanishing Gradients Problem Using the ReLU", listed in the further reading section.

The activation function used in hidden layers is typically chosen based on the type of neural network architecture.

Modern neural network models with common architectures, such as MLP and CNN, will make use of the ReLU activation function, or extensions.

In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU …

— Page 174, Deep Learning, 2016.

Recurrent networks still commonly use Tanh or sigmoid activation functions, or even both. For example, the LSTM commonly uses the Sigmoid activation for recurrent connections and the Tanh activation for output.

- **Multilayer Perceptron (MLP)**: ReLU activation function.
- **Convolutional Neural Network (CNN)**: ReLU activation function.
- **Recurrent Neural Network**: Tanh and/or Sigmoid activation function.

If you’re unsure which activation function to use for your network, try a few and compare the results.

The figure below summarizes how to choose an activation function for the hidden layers of your neural network model.

The output layer is the layer in a neural network model that directly outputs a prediction.

All feed-forward neural network models have an output layer.

There are perhaps three activation functions you may want to consider for use in the output layer; they are:

- Linear
- Logistic (Sigmoid)
- Softmax

This is not an exhaustive list of activation functions used for output layers, but they are the most commonly used.

Let’s take a closer look at each in turn.

The linear activation function is also called “*identity*” (multiplied by 1.0) or “*no activation*.”

This is because the linear activation function does not change the weighted sum of the input in any way and instead returns the value directly.

We can get an intuition for the shape of this function with the worked example below.

    # example plot for the linear activation function
    from matplotlib import pyplot

    # linear activation function
    def linear(x):
        return x

    # define input data
    inputs = [x for x in range(-10, 10)]
    # calculate outputs
    outputs = [linear(x) for x in inputs]
    # plot inputs vs outputs
    pyplot.plot(inputs, outputs)
    pyplot.show()

We can see a diagonal line shape where inputs are plotted against identical outputs.

Target values used to train a model with a linear activation function in the output layer are typically scaled prior to modeling using normalization or standardization transforms.
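As a small illustration, the sketch below standardizes a hypothetical set of regression targets and then inverts the transform, which is what you would do to convert model predictions back to the original units. The target values are made up for the example.

```python
# sketch: standardize regression targets and invert the transform
from numpy import array

# hypothetical target values in their original units
y = array([120.0, 95.5, 143.2, 101.0, 130.7])

# estimate the statistics (on the training data only, in practice)
mean, std = y.mean(), y.std()

# standardize: zero mean and unit standard deviation
y_scaled = (y - mean) / std
print(y_scaled)

# invert the transform, e.g. to report predictions in original units
y_restored = y_scaled * std + mean
print(y_restored)
```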

The sigmoid or logistic activation function was described in the previous section.

Nevertheless, to add some symmetry, we can review the shape of this function with the worked example below.

    # example plot for the sigmoid activation function
    from math import exp
    from matplotlib import pyplot

    # sigmoid activation function
    def sigmoid(x):
        return 1.0 / (1.0 + exp(-x))

    # define input data
    inputs = [x for x in range(-10, 10)]
    # calculate outputs
    outputs = [sigmoid(x) for x in inputs]
    # plot inputs vs outputs
    pyplot.plot(inputs, outputs)
    pyplot.show()

We can see the familiar S-shape of the sigmoid activation function.

Target labels used to train a model with a sigmoid activation function in the output layer will have the values 0 or 1.

The softmax function outputs a vector of values that sum to 1.0 that can be interpreted as probabilities of class membership.

It is related to the argmax function that outputs a 0 for all options and 1 for the chosen option. Softmax is a “*softer*” version of argmax that allows a probability-like output of a winner-take-all function.

As such, the input to the function is a vector of real values and the output is a vector of the same length with values that sum to 1.0 like probabilities.

The softmax function is calculated as follows:

- e^x / sum(e^x)

Where *x* is a vector of outputs and e is a mathematical constant that is the base of the natural logarithm.

You can learn more about the details of the Softmax function in the tutorial "Softmax Activation Function with Python", listed in the further reading section.

We cannot plot the softmax function, but we can give an example of calculating it in Python.

    # example of calculating the softmax function
    from numpy import exp

    # softmax activation function
    def softmax(x):
        return exp(x) / exp(x).sum()

    # define input data
    inputs = [1.0, 3.0, 2.0]
    # calculate outputs
    outputs = softmax(inputs)
    # report the probabilities
    print(outputs)
    # report the sum of the probabilities
    print(outputs.sum())

Running the example calculates the softmax output for the input vector.

We then confirm that the sum of the outputs of the softmax indeed sums to the value 1.0.

    [0.09003057 0.66524096 0.24472847]
    1.0
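The softmax formula above exponentiates the raw inputs, which can overflow when the inputs are large. A common workaround, sketched below, is to subtract the maximum input before exponentiating; this leaves the result unchanged because the shift cancels in the ratio. The name stable_softmax is hypothetical.

```python
# sketch: softmax that subtracts the maximum input before exponentiating
from numpy import array, exp

# numerically safer softmax
def stable_softmax(x):
    x = array(x, dtype=float)
    # the largest shifted value is 0.0, so exp() cannot overflow
    shifted = x - x.max()
    e = exp(shifted)
    return e / e.sum()

# same input as the example above
print(stable_softmax([1.0, 3.0, 2.0]))
# large inputs that would overflow a direct exp()
print(stable_softmax([1000.0, 1001.0, 1002.0]))
```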

Target labels used to train a model with the softmax activation function in the output layer will be vectors with 1 for the target class and 0 for all other classes.
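This representation is often called a one hot encoding. As a minimal sketch, assuming the class labels are already integers starting at 0, the target vectors can be built by indexing the rows of an identity matrix; the labels below are hypothetical.

```python
# sketch: one hot encode integer class labels with NumPy
from numpy import array, eye

# hypothetical integer class labels for four examples and three classes
labels = array([0, 2, 1, 2])
n_classes = 3

# each label selects the matching row of the identity matrix
one_hot = eye(n_classes)[labels]
print(one_hot)
```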

You must choose the activation function for your output layer based on the type of prediction problem that you are solving.

Specifically, the type of variable that is being predicted.

For example, you may divide prediction problems into two main groups, predicting a categorical variable (*classification*) and predicting a numerical variable (*regression*).

If your problem is a regression problem, you should use a linear activation function.

**Regression**: One node, linear activation.

If your problem is a classification problem, then there are three main types of classification problems and each may use a different activation function.

Predicting a probability is not a regression problem; it is classification. In all cases of classification, your model will predict the probability of class membership (e.g. probability that an example belongs to each class) that you can convert to a crisp class label by rounding (for sigmoid) or argmax (for softmax).
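The sketch below shows both conversions on hypothetical model outputs. For the sigmoid case it uses a threshold of 0.5 in place of rounding, which is an assumption made here to keep the behavior at exactly 0.5 explicit.

```python
# sketch: convert predicted probabilities to crisp class labels
from numpy import array, argmax

# hypothetical sigmoid outputs for three examples (binary classification)
sigmoid_out = array([0.08, 0.51, 0.97])
# threshold at 0.5 to get a 0 or 1 class label
binary_labels = (sigmoid_out >= 0.5).astype(int)
print(binary_labels)

# hypothetical softmax outputs for two examples and three classes
softmax_out = array([[0.1, 0.7, 0.2],
                     [0.8, 0.15, 0.05]])
# pick the class with the largest probability in each row
multiclass_labels = argmax(softmax_out, axis=1)
print(multiclass_labels)
```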

If there are two mutually exclusive classes (binary classification), then your output layer will have one node and a sigmoid activation function should be used. If there are more than two mutually exclusive classes (multiclass classification), then your output layer will have one node per class and a softmax activation should be used. If there are two or more mutually inclusive classes (multilabel classification), then your output layer will have one node for each class and a sigmoid activation function is used.

- **Binary Classification**: One node, sigmoid activation.
- **Multiclass Classification**: One node per class, softmax activation.
- **Multilabel Classification**: One node per class, sigmoid activation.
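These rules, together with the regression rule above, can be written down as a small lookup. The helper below is purely illustrative: the function name and the problem type strings are hypothetical and not part of any library.

```python
# sketch: look up the output layer design for a prediction problem

# returns (number of nodes, activation name) for the output layer
def output_layer_config(problem_type, n_classes=None):
    if problem_type == 'regression':
        return 1, 'linear'
    if problem_type == 'binary':
        return 1, 'sigmoid'
    if problem_type == 'multiclass':
        return n_classes, 'softmax'
    if problem_type == 'multilabel':
        return n_classes, 'sigmoid'
    raise ValueError('unknown problem type: %s' % problem_type)

# summarize the choice for each type of problem
print(output_layer_config('regression'))
print(output_layer_config('binary'))
print(output_layer_config('multiclass', n_classes=5))
print(output_layer_config('multilabel', n_classes=5))
```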

The figure below summarizes how to choose an activation function for the output layer of your neural network model.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Rectified Linear Unit (ReLU)
- Softmax Activation Function with Python
- 4 Types of Classification Tasks in Machine Learning
- How to Fix the Vanishing Gradients Problem Using the ReLU

- Deep Learning, 2016.
- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Neural Networks for Pattern Recognition, 1996.
- Deep Learning with Python, 2017.

In this tutorial, you discovered how to choose activation functions for neural network models.

Specifically, you learned:

- Activation functions are a key part of neural network design.
- The modern default activation function for hidden layers is the ReLU function.
- The activation function for output layers depends on the type of prediction problem.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Visualization for Function Optimization in Python appeared first on Machine Learning Mastery.

**Function optimization** involves finding the input that results in the optimal value from an objective function.

Optimization algorithms navigate the search space of input variables in order to locate the optima, and both the shape of the objective function and behavior of the algorithm in the search space are opaque on real-world problems.

As such, it is common to study optimization algorithms using simple low-dimensional functions that can be easily visualized directly. Additionally, the samples in the input space of these simple functions made by an optimization algorithm can be visualized with their appropriate context.

Visualization of lower-dimensional functions and algorithm behavior on those functions can help to develop the intuitions that can carry over to more complex higher-dimensional function optimization problems later.

In this tutorial, you will discover how to create visualizations for function optimization in Python.

After completing this tutorial, you will know:

- Visualization is an important tool when studying function optimization algorithms.
- How to visualize one-dimensional functions and samples using line plots.
- How to visualize two-dimensional functions and samples using contour and surface plots.

Let’s get started.

This tutorial is divided into three parts; they are:

- Visualization for Function Optimization
- Visualize 1D Function Optimization
    - Test Function
    - Sample Test Function
    - Line Plot of Test Function
    - Scatter Plot of Test Function
    - Line Plot with Marked Optima
    - Line Plot with Samples
- Visualize 2D Function Optimization
    - Test Function
    - Sample Test Function
    - Contour Plot of Test Function
    - Filled Contour Plot of Test Function
    - Filled Contour Plot of Test Function with Samples
    - Surface Plot of Test Function

Function optimization is a field of mathematics concerned with finding the inputs to a function that result in the optimal output for the function, typically a minimum or maximum value.

Optimization may be straightforward for simple differentiable functions where the solution can be calculated analytically. However, most functions we’re interested in solving in applied machine learning may or may not be well behaved and may be complex, nonlinear, multivariate, and non-differentiable.

As such, it is important to have an understanding of a wide range of different algorithms that can be used to address function optimization problems.

An important aspect of studying function optimization is understanding the objective function that is being optimized and understanding the behavior of an optimization algorithm over time.

Visualization plays an important role when getting started with function optimization.

We can select simple and well-understood test functions to study optimization algorithms. These simple functions can be plotted to understand the relationship between the input to the objective function and the output of the objective function, highlighting hills, valleys, and optima.

In addition, the samples selected from the search space by an optimization algorithm can also be plotted on top of plots of the objective function. These plots of algorithm behavior can provide insight and intuition into how specific optimization algorithms work and navigate a search space that can generalize to new problems in the future.

Typically, one-dimensional or two-dimensional functions are chosen to study optimization algorithms as they are easy to visualize using standard plots, like line plots and surface plots. We will explore both in this tutorial.

First, let’s explore how we might visualize a one-dimensional function optimization.

A one-dimensional function takes a single input variable and outputs the evaluation of that input variable.

Input variables are typically continuous, represented by a real-valued floating-point value. Often, the input domain is unconstrained, although for test problems we impose a domain of interest.

In this case we will explore function visualization with a simple x^2 objective function:

- f(x) = x^2

This function has its optimal value at the input x=0.0, where f(x) equals 0.0.

The example below implements this objective function and evaluates a single input.

    # example of a 1d objective function

    # objective function
    def objective(x):
        return x**2.0

    # evaluate inputs to the objective function
    x = 4.0
    result = objective(x)
    print('f(%.3f) = %.3f' % (x, result))

Running the example evaluates the value 4.0 with the objective function, which equals 16.0.

f(4.000) = 16.000

The first thing we might want to do with a new function is define an input range of interest and sample the domain of interest using a uniform grid.

This sample will provide the basis for generating a plot later.

In this case, we will define a domain of interest around the optima of x=0.0 from x=-5.0 to x=5.0 and sample a grid of values in this range with 0.1 increments, such as -5.0, -4.9, -4.8, etc.

    ...
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # summarize some of the input domain
    print(inputs[:5])

We can then evaluate each of the x values in our sample.

    ...
    # compute targets
    results = objective(inputs)
    # summarize some of the results
    print(results[:5])

Finally, we can check some of the input and their corresponding outputs.

    ...
    # create a mapping of some inputs to some results
    for i in range(5):
        print('f(%.3f) = %.3f' % (inputs[i], results[i]))

Tying this together, the complete example of sampling the input space and evaluating all points in the sample is listed below.

    # sample 1d objective function
    from numpy import arange

    # objective function
    def objective(x):
        return x**2.0

    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # summarize some of the input domain
    print(inputs[:5])
    # compute targets
    results = objective(inputs)
    # summarize some of the results
    print(results[:5])
    # create a mapping of some inputs to some results
    for i in range(5):
        print('f(%.3f) = %.3f' % (inputs[i], results[i]))

Running the example first generates a uniform sample of input points as we expected.

The input points are then evaluated using the objective function and finally, we can see a simple mapping of inputs to outputs of the objective function.

    [-5. -4.9 -4.8 -4.7 -4.6]
    [25. 24.01 23.04 22.09 21.16]
    f(-5.000) = 25.000
    f(-4.900) = 24.010
    f(-4.800) = 23.040
    f(-4.700) = 22.090
    f(-4.600) = 21.160
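One detail worth knowing is that arange() excludes the end of the range, so the sample above stops just short of x=5.0. If you want both endpoints included, one alternative is the NumPy linspace() function, sketched below with a point count of 101 chosen to give the same spacing of 0.1.

```python
# sketch: compare arange() and linspace() for sampling the input domain
from numpy import arange, linspace

# define range for input
r_min, r_max = -5.0, 5.0

# arange() excludes the end of the range
inputs = arange(r_min, r_max, 0.1)
print(inputs[-1] < r_max)

# linspace() takes a number of points and includes both endpoints
grid = linspace(r_min, r_max, 101)
print(grid[0], grid[-1], len(grid))
```

Either approach works for plotting; the difference only matters if you need the boundary of the domain in the sample.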

Now that we have some confidence in generating a sample of inputs and evaluating them with the objective function, we can look at generating plots of the function.

We could sample the input space randomly, but the benefit of a uniform line or grid of points is that it can be used to generate a smooth plot.

It is smooth because the points in the input space are ordered from smallest to largest. This ordering is important as we expect (hope) that the output of the objective function has a similar smooth relationship between values, e.g. small changes in input result in locally consistent (smooth) changes in the output of the function.

In this case, we can use the samples to generate a line plot of the objective function with the input points (x) on the x-axis of the plot and the objective function output (results) on the y-axis of the plot.

    ...
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # show the plot
    pyplot.show()

Tying this together, the complete example is listed below.

    # line plot of input vs result for a 1d objective function
    from numpy import arange
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return x**2.0

    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # show the plot
    pyplot.show()

Running the example creates a line plot of the objective function.

We can see that the function has a large U-shape, called a parabola. This is a common shape when studying curves, e.g. the study of calculus.

The line is a construct. It is not really the function, just a smooth summary of the function. Always keep this in mind.

Recall that we, in fact, generated a sample of points in the input space and corresponding evaluation of those points.

As such, it would be more accurate to create a scatter plot of points; for example:

    # scatter plot of input vs result for a 1d objective function
    from numpy import arange
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return x**2.0

    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # create a scatter plot of input vs result
    pyplot.scatter(inputs, results)
    # show the plot
    pyplot.show()

Running the example creates a scatter plot of the objective function.

We can see the familiar shape of the function, but we don’t gain anything from plotting the points directly.

The line and the smooth interpolation between the points it provides are more useful as we can draw other points on top of the line, such as the location of the optima or the points sampled by an optimization algorithm.

Next, let’s draw the line plot again and this time draw a point where the known optima of the function is located.

This can be helpful when studying an optimization algorithm as we might want to see how close an optimization algorithm can get to the optima.

First, we must define the input for the optima, then evaluate that point to give the x-axis and y-axis values for plotting.

    ...
    # define the known function optima
    optima_x = 0.0
    optima_y = objective(optima_x)

We can then plot this point with any shape or color we like, in this case, a red square.

    ...
    # draw the function optima as a red square
    pyplot.plot([optima_x], [optima_y], 's', color='r')

Tying this together, the complete example of creating a line plot of the function with the optima highlighted by a point is listed below.

    # line plot of input vs result for a 1d objective function and show optima
    from numpy import arange
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return x**2.0

    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # define the known function optima
    optima_x = 0.0
    optima_y = objective(optima_x)
    # draw the function optima as a red square
    pyplot.plot([optima_x], [optima_y], 's', color='r')
    # show the plot
    pyplot.show()

Running the example creates the familiar line plot of the function, and this time, the optima of the function, e.g. the input that results in the minimum output of the function, is marked with a red square.

This is a very simple function and the red square for the optima is easy to see.

Sometimes the function might be more complex, with lots of hills and valleys, and we might want to make the optima more visible.

In this case, we can draw a vertical line across the whole plot.

    ...
    # draw a vertical line at the optimal input
    pyplot.axvline(x=optima_x, ls='--', color='red')

Tying this together, the complete example is listed below.

    # line plot of input vs result for a 1d objective function and show optima as line
    from numpy import arange
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return x**2.0

    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # define the known function optima
    optima_x = 0.0
    # draw a vertical line at the optimal input
    pyplot.axvline(x=optima_x, ls='--', color='red')
    # show the plot
    pyplot.show()

Running the example creates the same plot and this time draws a red line clearly marking the point in the input space that marks the optima.

Finally, we might want to draw the samples of the input space selected by an optimization algorithm.

We will simulate these samples with random points drawn from the input domain.

    ...
    # simulate a sample made by an optimization algorithm
    seed(1)
    sample = r_min + rand(10) * (r_max - r_min)
    # evaluate the sample
    sample_eval = objective(sample)

We can then plot this sample, in this case using small black circles.

    ...
    # plot the sample as black circles
    pyplot.plot(sample, sample_eval, 'o', color='black')

The complete example of creating a line plot of a function with the optima marked by a red line and an algorithm sample drawn with small black dots is listed below.

    # line plot of domain for a 1d function with optima and algorithm sample
    from numpy import arange
    from numpy.random import seed
    from numpy.random import rand
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return x**2.0

    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # simulate a sample made by an optimization algorithm
    seed(1)
    sample = r_min + rand(10) * (r_max - r_min)
    # evaluate the sample
    sample_eval = objective(sample)
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # define the known function optima
    optima_x = 0.0
    # draw a vertical line at the optimal input
    pyplot.axvline(x=optima_x, ls='--', color='red')
    # plot the sample as black circles
    pyplot.plot(sample, sample_eval, 'o', color='black')
    # show the plot
    pyplot.show()

Running the example creates the line plot of the domain and marks the optima with a red line as before.

This time, the sample from the domain selected by an algorithm (really a random sample of points) is drawn with black dots.

We can imagine that a real optimization algorithm will show points narrowing in on the optima as it searches downhill from a starting point.
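To make this concrete, the sketch below swaps the random sample for the points visited by a very simple gradient descent search on the same objective function. The starting point, step size, and number of iterations are arbitrary choices for illustration, and gradient descent is only one of many algorithms you might study this way.

```python
# sketch: collect the points visited by gradient descent on f(x) = x^2

# objective function
def objective(x):
    return x**2.0

# derivative of the objective function
def derivative(x):
    return 2.0 * x

# run gradient descent and record every point visited
def gradient_descent(start, step_size, n_iter):
    x = start
    sample = [x]
    for _ in range(n_iter):
        # take a step in the downhill direction
        x = x - step_size * derivative(x)
        sample.append(x)
    return sample

# hypothetical configuration for the search
sample = gradient_descent(start=-4.0, step_size=0.2, n_iter=10)
# evaluate the sample
sample_eval = [objective(x) for x in sample]
# summarize the search
for x, y in zip(sample, sample_eval):
    print('f(%.5f) = %.5f' % (x, y))
```

The resulting sample and sample_eval lists can be passed to pyplot.plot() in place of the random sample in the complete example above, which should show the black dots moving steadily toward x=0.0.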

Next, let’s look at how we might perform similar visualizations for the optimization of a two-dimensional function.

A two-dimensional function is a function that takes two input variables, e.g. *x* and *y*.

We can use the same *x^2* function and scale it up to be a two-dimensional function; for example:

- f(x, y) = x^2 + y^2

This function has its optimal value at the input [x=0.0, y=0.0], where f(x, y) equals 0.0.

The example below implements this objective function and evaluates a single input.

    # example of a 2d objective function

    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0

    # evaluate inputs to the objective function
    x = 4.0
    y = 4.0
    result = objective(x, y)
    print('f(%.3f, %.3f) = %.3f' % (x, y, result))

Running the example evaluates the point [x=4, y=4], which equals 32.

f(4.000, 4.000) = 32.000

Next, we need a way to sample the domain so that we can, in turn, sample the objective function.

A common way for sampling a two-dimensional function is to first generate a uniform sample along each variable, *x* and *y*, then use these two uniform samples to create a grid of samples, called a mesh grid.

This is not a two-dimensional array across the input space; instead, it is two two-dimensional arrays that, when used together, define a grid across the two input variables.

This is achieved by duplicating the entire *x* sample array for each *y* sample point and similarly duplicating the entire *y* sample array for each *x* sample point.

This can be achieved using the meshgrid() NumPy function; for example:

```python
...
# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# summarize some of the input domain
print(x[:5, :5])
```
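To make the duplication concrete, the small sketch below builds a mesh from three *x* samples and two *y* samples. The axis values are illustrative only and are not part of the tutorial's example.

```python
# tiny mesh grid to show how the axis samples are duplicated
from numpy import arange
from numpy import meshgrid

# three samples along x and two along y (illustrative values)
xaxis = arange(0.0, 3.0, 1.0)
yaxis = arange(10.0, 12.0, 1.0)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# both arrays have one row per y sample and one column per x sample
print(x.shape, y.shape)
# every row of x repeats the x samples
print(x)
# every column of y repeats the y samples
print(y)
```

Pairing the same position in both arrays, e.g. *x[1, 2]* and *y[1, 2]*, gives one coordinate on the grid.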

We can then evaluate each pair of points using our objective function.

```python
...
# compute targets
results = objective(x, y)
# summarize some of the results
print(results[:5, :5])
```

Finally, we can review the mapping of some of the inputs to their corresponding output values.

```python
...
# create a mapping of some inputs to some results
for i in range(5):
    print('f(%.3f, %.3f) = %.3f' % (x[i,0], y[i,0], results[i,0]))
```

The example below demonstrates how we can create a uniform sample grid across the two-dimensional input space and objective function.

```python
# sample 2d objective function
from numpy import arange
from numpy import meshgrid

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# summarize some of the input domain
print(x[:5, :5])
# compute targets
results = objective(x, y)
# summarize some of the results
print(results[:5, :5])
# create a mapping of some inputs to some results
for i in range(5):
    print('f(%.3f, %.3f) = %.3f' % (x[i,0], y[i,0], results[i,0]))
```

Running the example first summarizes some points in the mesh grid, then the objective function evaluation for some points.

Finally, we enumerate coordinates in the two-dimensional input space and their corresponding function evaluation.

```
[[-5. -4.9 -4.8 -4.7 -4.6]
 [-5. -4.9 -4.8 -4.7 -4.6]
 [-5. -4.9 -4.8 -4.7 -4.6]
 [-5. -4.9 -4.8 -4.7 -4.6]
 [-5. -4.9 -4.8 -4.7 -4.6]]
[[50. 49.01 48.04 47.09 46.16]
 [49.01 48.02 47.05 46.1 45.17]
 [48.04 47.05 46.08 45.13 44.2 ]
 [47.09 46.1 45.13 44.18 43.25]
 [46.16 45.17 44.2 43.25 42.32]]
f(-5.000, -5.000) = 50.000
f(-5.000, -4.900) = 49.010
f(-5.000, -4.800) = 48.040
f(-5.000, -4.700) = 47.090
f(-5.000, -4.600) = 46.160
```

Now that we are familiar with how to sample the input space and evaluate points, let’s look at how we might plot the function.

A popular plot for two-dimensional functions is a contour plot.

This plot creates a flat representation of the objective function outputs for each x and y coordinate where the color and contour lines indicate the relative value or height of the output of the objective function.

This is just like a contour map of a landscape where mountains can be distinguished from valleys.

This can be achieved using the contour() Matplotlib function that takes the mesh grid and the evaluation of the mesh grid as input directly.

We can then specify the number of levels to draw on the contour and the color scheme to use. In this case, we will use 50 levels and a popular “*jet*” color scheme where low-levels use a cold color scheme (blue) and high-levels use a hot color scheme (red).

```python
...
# create a contour plot with 50 levels and jet color scheme
pyplot.contour(x, y, results, 50, alpha=1.0, cmap='jet')
# show the plot
pyplot.show()
```

Tying this together, the complete example of creating a contour plot of the two-dimensional objective function is listed below.

```python
# contour plot for 2d objective function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a contour plot with 50 levels and jet color scheme
pyplot.contour(x, y, results, 50, alpha=1.0, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates the contour plot.

We can see that the steeper parts of the surface around the edges have more closely spaced contours to show the detail, and the flatter part of the surface in the middle has fewer contours.

We can see that the lowest part of the domain is the middle, as expected.

It is also helpful to color the plot between the contours to show a more complete surface.

Again, the colors are just a simple linear interpolation, not the true function evaluation. This must be kept in mind on more complex functions where fine detail will not be shown.

We can fill the contour plot using the contourf() version of the function that takes the same arguments.

```python
...
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
```

We can also show the optima on the plot, in this case as a white star that will stand out against the blue background color of the lowest part of the plot.

```python
...
# define the known function optima
optima_x = [0.0, 0.0]
# draw the function optima as a white star
pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
```

Tying this together, the complete example of a filled contour plot with the optima marked is listed below.

```python
# filled contour plot for 2d objective function and show the optima
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# define the known function optima
optima_x = [0.0, 0.0]
# draw the function optima as a white star
pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
# show the plot
pyplot.show()
```

Running the example creates the filled contour plot that gives a better idea of the shape of the objective function.

The optima at [x=0, y=0] is then marked clearly with a white star.

We may want to show the progress of an optimization algorithm to get an idea of its behavior in the context of the shape of the objective function.

In this case, we can simulate the points chosen by an optimization algorithm with random coordinates in the input space.

```python
...
# simulate a sample made by an optimization algorithm
seed(1)
sample_x = r_min + rand(10) * (r_max - r_min)
sample_y = r_min + rand(10) * (r_max - r_min)
```

These points can then be plotted directly as black circles and their context color can give an idea of their relative quality.

```python
...
# plot the sample as black circles
pyplot.plot(sample_x, sample_y, 'o', color='black')
```

Tying this together, the complete example of a filled contour plot with optimal and input sample plotted is listed below.

```python
# filled contour plot for 2d objective function and show the optima and sample
from numpy import arange
from numpy import meshgrid
from numpy.random import seed
from numpy.random import rand
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# simulate a sample made by an optimization algorithm
seed(1)
sample_x = r_min + rand(10) * (r_max - r_min)
sample_y = r_min + rand(10) * (r_max - r_min)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# define the known function optima
optima_x = [0.0, 0.0]
# draw the function optima as a white star
pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
# plot the sample as black circles
pyplot.plot(sample_x, sample_y, 'o', color='black')
# show the plot
pyplot.show()
```

Running the example, we can see the filled contour plot as before with the optima marked.

We can now see the sample drawn as black dots, and their surrounding color and relative distance to the optima give an idea of how close the algorithm (random points in this case) got to solving the problem.

Finally, we may want to create a three-dimensional plot of the objective function to get a fuller idea of the curvature of the function.

This can be achieved using the plot_surface() Matplotlib function, that, like the contour plot, takes the mesh grid and function evaluation directly.

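```python
...
# create a surface plot with the jet color scheme
figure = pyplot.figure()
# add_subplot() is used because recent Matplotlib releases no longer accept gca(projection='3d')
axis = figure.add_subplot(111, projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
```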

The complete example of creating a surface plot is listed below.

```python
# surface plot for 2d objective function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
# add_subplot() is used because recent Matplotlib releases no longer accept gca(projection='3d')
axis = figure.add_subplot(111, projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a three-dimensional surface plot of the objective function.

Additionally, the plot is interactive, meaning that you can use the mouse to drag the perspective on the surface around and view it from different angles.

This section provides more resources on the topic if you are looking to go deeper.

- Optimization and root finding (scipy.optimize)
- Optimization (scipy.optimize)
- numpy.meshgrid API.
- matplotlib.pyplot.contour API.
- matplotlib.pyplot.contourf API.
- mpl_toolkits.mplot3d.Axes3D.plot_surface API.

In this tutorial, you discovered how to create visualizations for function optimization in Python.

Specifically, you learned:

- Visualization is an important tool when studying function optimization algorithms.
- How to visualize one-dimensional functions and samples using line plots.
- How to visualize two-dimensional functions and samples using contour and surface plots.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Visualization for Function Optimization in Python appeared first on Machine Learning Mastery.

The post Code Adam Gradient Descent Optimization From Scratch appeared first on Machine Learning Mastery.

Last Updated on January 16, 2021

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like AdaGrad and RMSProp update the algorithm to use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

The **Adaptive Moment Estimation** algorithm, or **Adam** for short, is an extension to gradient descent and a natural successor to techniques like AdaGrad and RMSProp. It automatically adapts a learning rate for each input variable of the objective function and further smooths the search process by using an exponentially decaying moving average of the gradient to make updates to variables.

In this tutorial, you will discover how to develop gradient descent with Adam optimization algorithm from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adam.
- How to implement the Adam optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gradient Descent
- Adam Optimization Algorithm
- Gradient Descent With Adam
  - Two-Dimensional Test Problem
  - Gradient Descent Optimization With Adam
  - Visualization of Adam

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

- First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “*derivative*,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t) = x(t-1) – step_size * f'(x(t-1))

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**(*alpha*): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
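The effect of the step size can be sketched on the one-dimensional function f(x) = x^2. This is a minimal illustration, not part of the tutorial's worked example; the *gradient_descent()* helper, the starting point, and the three step sizes are assumptions chosen to show the behavior.

```python
# effect of the step size on gradient descent for f(x) = x^2

# objective function
def objective(x):
    return x**2.0

# derivative of objective function
def derivative(x):
    return x * 2.0

# plain gradient descent from a fixed starting point (hypothetical helper)
def gradient_descent(start, n_iter, step_size):
    x = start
    for _ in range(n_iter):
        # x(t) = x(t-1) - step_size * f'(x(t-1))
        x = x - step_size * derivative(x)
    return x

# compare a small, a moderate and a too-large step size
for step_size in [0.001, 0.1, 1.1]:
    x = gradient_descent(1.0, 20, step_size)
    print('step_size=%.3f x=%.5f f(x)=%.5f' % (step_size, x, objective(x)))
```

With the smallest step size, the search has barely moved from the starting point after 20 iterations. With the moderate value it gets close to the optima, and with the largest value each step overshoots the optima by more than it corrects, so the search moves away from it.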

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the Adam algorithm.

The Adaptive Moment Estimation algorithm, or Adam for short, is an extension to the gradient descent optimization algorithm.

The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A Method for Stochastic Optimization.”

Adam is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

This is achieved by calculating a step size for each input parameter that is being optimized. Importantly, each step size is automatically adapted throughout the search process based on the gradients (partial derivatives) encountered for each variable.

We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation

— Adam: A Method for Stochastic Optimization

This involves maintaining a first and second moment of the gradient, e.g. an exponentially decaying mean gradient (first moment) and uncentered variance (second moment) for each input variable.

The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient.

— Adam: A Method for Stochastic Optimization

Let’s step through each element of the algorithm.

First, we must maintain a first moment vector and a second moment vector (an exponentially decaying average of the squared gradient) for each parameter being optimized as part of the search, referred to as m and v respectively. They are initialized to 0.0 at the start of the search.

- m = 0
- v = 0

The algorithm is executed iteratively over time t starting at *t=1*, and each iteration involves calculating a new set of parameter values *x*, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradient (partial derivatives) are calculated for the current time step.

- g(t) = f'(x(t-1))

Next, the first moment is updated using the gradient and a hyperparameter *beta1*.

- m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

Then the second moment is updated using the squared gradient and a hyperparameter *beta2*.

- v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

The first and second moments are biased because they are initialized with zero values.

… these moving averages are initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the betas are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates …

— Adam: A Method for Stochastic Optimization

Next, the first and second moments are bias-corrected, starting with the first moment:

- mhat(t) = m(t) / (1 – beta1(t))

And then the second moment:

- vhat(t) = v(t) / (1 – beta2(t))

Note, *beta1(t)* and *beta2(t)* refer to the beta1 and beta2 hyperparameters that are decayed on a schedule over the iterations of the algorithm. A static decay schedule can be used, although the paper recommends the following:

- beta1(t) = beta1^t
- beta2(t) = beta2^t
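The effect of the bias correction can be seen with a small sketch for a single variable. It assumes a constant gradient of 1.0 and a beta1 of 0.9, both chosen for illustration only.

```python
# bias correction of the first moment for a constant gradient

# illustrative values: a constant gradient and a typical beta1
g = 1.0
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
    m = beta1 * m + (1.0 - beta1) * g
    # mhat(t) = m(t) / (1 - beta1^t)
    mhat = m / (1.0 - beta1**t)
    print('t=%d m=%.5f mhat=%.5f' % (t, m, mhat))
```

The raw moving average *m* starts near zero and only slowly approaches the true gradient, whereas the corrected value *mhat* matches the constant gradient from the first iteration.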

Finally, we can calculate the value for the parameter for this iteration.

- x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + eps)

Where *alpha* is the step size hyperparameter, *eps* is a small value (*epsilon*) such as 1e-8 that ensures we do not encounter a divide by zero error, and *sqrt()* is the square root function.

Note, a more efficient reordering of the update rule listed in the paper can be used:

- alpha(t) = alpha * sqrt(1 – beta2(t)) / (1 – beta1(t))
- x(t) = x(t-1) – alpha(t) * m(t) / (sqrt(v(t)) + eps)
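As a quick check, the sketch below applies both forms of the update to a single parameter for a few iterations. The gradient is held at a fixed illustrative value rather than recomputed from an objective function, and the hyperparameter values are the typical ones, used here as assumptions.

```python
# compare the standard and reordered Adam updates for one parameter
from math import sqrt

# illustrative hyperparameters and a single fixed gradient value
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
g = 0.5
x_a = x_b = 1.0
m = v = 0.0
for t in range(1, 4):
    # shared moment updates
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g**2
    # standard form with bias-corrected moments
    mhat = m / (1.0 - beta1**t)
    vhat = v / (1.0 - beta2**t)
    x_a = x_a - alpha * mhat / (sqrt(vhat) + eps)
    # reordered form that folds the correction into the step size
    alpha_t = alpha * sqrt(1.0 - beta2**t) / (1.0 - beta1**t)
    x_b = x_b - alpha_t * m / (sqrt(v) + eps)
    print('t=%d standard=%.8f reordered=%.8f' % (t, x_a, x_b))
```

The two forms differ only in how *eps* enters the denominator, so the values agree up to a very small difference.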

To review, there are three hyperparameters for the algorithm, they are:

- **alpha**: Initial step size (learning rate), a typical value is 0.001.
- **beta1**: Decay factor for the first moment, a typical value is 0.9.
- **beta2**: Decay factor for the second moment, a typical value is 0.999.

And that’s it.

For a full derivation of the Adam algorithm, I recommend reading the paper.

Next, let’s look at how we might implement the algorithm from scratch in Python.

In this section, we will explore how to implement the gradient descent optimization algorithm with Adam.

First, let’s define an objective function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

```python
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
```

We can create a three-dimensional plot of the function to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

```python
# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
# add_subplot() is used because recent Matplotlib releases no longer accept gca(projection='3d')
axis = figure.add_subplot(111, projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

```python
# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the Adam optimization algorithm.

We can apply the gradient descent with Adam to the test problem.

First, we need a function that calculates the derivative for this function.

- f(x) = x^2
- f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function implements this below.

```python
# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
```
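A hand-written derivative is easy to get wrong, so it can be worth comparing it against a numerical approximation. The sketch below does this with a central finite difference; the *numerical_gradient()* helper, the step *h*, and the test point are assumptions for illustration, and the derivative returns a plain list here so the example has no dependencies.

```python
# check the analytic derivative against a central finite difference

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# analytic partial derivatives, returned as a plain list for this sketch
def derivative(x, y):
    return [x * 2.0, y * 2.0]

# hypothetical helper: central difference approximation of the gradient
def numerical_gradient(f, x, y, h=1e-6):
    dx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return [dx, dy]

# compare the two at an arbitrary point
point = (0.3, -0.7)
analytic = derivative(*point)
numeric = numerical_gradient(objective, *point)
print(analytic, numeric)
```

If the two results disagree by more than a small tolerance, the analytic derivative is the first thing to re-check.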

Next, we can implement gradient descent optimization.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

```python
...
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
```

Next, we need to initialize the first and second moments to zero.

```python
...
# initialize first and second moments
m = [0.0 for _ in range(bounds.shape[0])]
v = [0.0 for _ in range(bounds.shape[0])]
```

We then run a fixed number of iterations of the algorithm defined by the “*n_iter*” hyperparameter.

```python
...
# run iterations of gradient descent
for t in range(n_iter):
    ...
```

The first step is to calculate the gradient (partial derivatives) for the current set of parameters using the *derivative()* function.

```python
...
# calculate gradient g(t)
g = derivative(x[0], x[1])
```

Next, we need to perform the Adam update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

```python
...
# build a solution one variable at a time
for i in range(x.shape[0]):
    ...
```

First, we need to calculate the moment.

```python
...
# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
```

Then the second moment.

```python
...
# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
```

Then the bias correction for the first and second moments.

```python
...
# mhat(t) = m(t) / (1 - beta1(t))
mhat = m[i] / (1.0 - beta1**(t+1))
# vhat(t) = v(t) / (1 - beta2(t))
vhat = v[i] / (1.0 - beta2**(t+1))
```

Then finally the updated variable value.

```python
...
# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
```

This is then repeated for each parameter that is being optimized.
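One consequence of these calculations is worth seeing in isolation: on the very first iteration, the bias-corrected moments reduce to the gradient and its square, so the size of the step is close to *alpha* whatever the magnitude of the gradient. The sketch below checks this for one variable; the *first_step()* helper and the gradient values are assumptions for illustration.

```python
# size of the first Adam step for gradients of very different magnitude
from math import sqrt

# hypothetical helper: one Adam update for a single variable at t=1
def first_step(g, alpha=0.02, beta1=0.8, beta2=0.999, eps=1e-8):
    # moments start at zero, so only the gradient term remains
    m = (1.0 - beta1) * g
    v = (1.0 - beta2) * g**2
    # bias correction at t=1
    mhat = m / (1.0 - beta1)
    vhat = v / (1.0 - beta2)
    # change that would be subtracted from the variable
    return alpha * mhat / (sqrt(vhat) + eps)

for g in [0.001, 1.0, -1000.0]:
    print('g=%10.3f step=%.6f' % (g, first_step(g)))
```

Only the sign of the gradient carries through to the first step, which is one reason *alpha* can be read as an approximate bound on how far each variable moves per iteration.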

At the end of the iteration we can evaluate the new parameter values and report the performance of the search.

```python
...
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
```

We can tie all of this together into a function named *adam()* that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

This complete function is listed below.

```python
# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize first and second moments
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent updates
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
            # mhat(t) = m(t) / (1 - beta1(t))
            mhat = m[i] / (1.0 - beta1**(t+1))
            # vhat(t) = v(t) / (1 - beta2(t))
            vhat = v[i] / (1.0 - beta2**(t+1))
            # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]
```

**Note**: we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.
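As a rough sketch of what such an adaptation might look like, the version below replaces the inner loop over variables with NumPy array operations. The *adam_vectorized()* name is hypothetical, and unlike the function above it takes a fixed starting point as an argument and does not print progress, which keeps the example deterministic.

```python
# vectorized sketch of the adam search using NumPy arrays
from numpy import asarray
from numpy import sqrt
from numpy import zeros

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical variant: takes the starting point directly and prints nothing
def adam_vectorized(objective, derivative, start, n_iter, alpha, beta1, beta2, eps=1e-8):
    x = asarray(start, dtype=float).copy()
    # initialize first and second moments
    m = zeros(x.shape)
    v = zeros(x.shape)
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # update both moments for all variables at once
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        # bias-corrected estimates
        mhat = m / (1.0 - beta1**(t+1))
        vhat = v / (1.0 - beta2**(t+1))
        # update all variables
        x = x - alpha * mhat / (sqrt(vhat) + eps)
    return [x, objective(x[0], x[1])]

# run from a fixed, arbitrary starting point
best, score = adam_vectorized(objective, derivative, [-0.5, 0.8], 60, 0.02, 0.8, 0.999)
print('f(%s) = %f' % (best, score))
```

The arithmetic is the same as in the loop-based version; the moments and the solution are simply held as arrays so each update applies to every variable in one expression.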

We can then define our hyperparameters and call the *adam()* function to optimize our test objective function.

In this case, we will use 60 iterations of the algorithm with an initial step size of 0.02 and beta1 and beta2 values of 0.8 and 0.999 respectively. These hyperparameter values were found after a little trial and error.

```python
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Tying all of this together, the complete example of gradient descent optimization with Adam is listed below.

```python
# gradient descent optimization with adam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize first and second moments
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent updates
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
            # mhat(t) = m(t) / (1 - beta1(t))
            mhat = m[i] / (1.0 - beta1**(t+1))
            # vhat(t) = v(t) / (1 - beta2(t))
            vhat = v[i] / (1.0 - beta2**(t+1))
            # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Running the example applies the Adam optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 53 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

```
...
>50 f([-0.00056912 -0.00321961]) = 0.00001
>51 f([-0.00052452 -0.00286514]) = 0.00001
>52 f([-0.00043908 -0.00251304]) = 0.00001
>53 f([-0.0003283 -0.00217044]) = 0.00000
>54 f([-0.00020731 -0.00184302]) = 0.00000
>55 f([-8.95352320e-05 -1.53514076e-03]) = 0.00000
>56 f([ 1.43050285e-05 -1.25002847e-03]) = 0.00000
>57 f([ 9.67123406e-05 -9.89850279e-04]) = 0.00000
>58 f([ 0.00015359 -0.00075587]) = 0.00000
>59 f([ 0.00018407 -0.00054858]) = 0.00000
Done!
f([ 0.00018407 -0.00054858]) = 0.000000
```

We can plot the progress of the Adam search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *adam()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

    # gradient descent algorithm with adam
    def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
        solutions = list()
        # generate an initial point
        x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        score = objective(x[0], x[1])
        # initialize first and second moments
        m = [0.0 for _ in range(bounds.shape[0])]
        v = [0.0 for _ in range(bounds.shape[0])]
        # run the gradient descent updates
        for t in range(n_iter):
            # calculate gradient g(t)
            g = derivative(x[0], x[1])
            # build a solution one variable at a time
            for i in range(bounds.shape[0]):
                # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
                m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
                # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
                v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
                # mhat(t) = m(t) / (1 - beta1(t))
                mhat = m[i] / (1.0 - beta1**(t+1))
                # vhat(t) = v(t) / (1 - beta2(t))
                vhat = v[i] / (1.0 - beta2**(t+1))
                # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
                x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
            # evaluate candidate point
            score = objective(x[0], x[1])
            # keep track of solutions
            solutions.append(x.copy())
            # report progress
            print('>%d f(%s) = %.5f' % (t, x, score))
        return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

    ...
    # seed the pseudo random number generator
    seed(1)
    # define range for input
    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    # define the total iterations
    n_iter = 60
    # step size
    alpha = 0.02
    # factor for average gradient
    beta1 = 0.8
    # factor for average squared gradient
    beta2 = 0.999
    # perform the gradient descent search with adam
    solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

    ...
    # sample input range uniformly at 0.1 increments
    xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
    yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a filled contour plot with 50 levels and jet color scheme
    pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

    ...
    # plot the solutions as white dots connected by a line
    solutions = asarray(solutions)
    pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the Adam optimization on the test problem and plotting the results on a contour plot is listed below.

    # example of plotting the adam search on a contour plot of the test function
    from math import sqrt
    from numpy import asarray
    from numpy import arange
    from numpy.random import rand
    from numpy.random import seed
    from numpy import meshgrid
    from matplotlib import pyplot

    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0

    # derivative of objective function
    def derivative(x, y):
        return asarray([x * 2.0, y * 2.0])

    # gradient descent algorithm with adam
    def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
        solutions = list()
        # generate an initial point
        x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        score = objective(x[0], x[1])
        # initialize first and second moments
        m = [0.0 for _ in range(bounds.shape[0])]
        v = [0.0 for _ in range(bounds.shape[0])]
        # run the gradient descent updates
        for t in range(n_iter):
            # calculate gradient g(t)
            g = derivative(x[0], x[1])
            # build a solution one variable at a time
            for i in range(bounds.shape[0]):
                # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
                m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
                # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
                v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
                # mhat(t) = m(t) / (1 - beta1(t))
                mhat = m[i] / (1.0 - beta1**(t+1))
                # vhat(t) = v(t) / (1 - beta2(t))
                vhat = v[i] / (1.0 - beta2**(t+1))
                # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
                x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
            # evaluate candidate point
            score = objective(x[0], x[1])
            # keep track of solutions
            solutions.append(x.copy())
            # report progress
            print('>%d f(%s) = %.5f' % (t, x, score))
        return solutions

    # seed the pseudo random number generator
    seed(1)
    # define range for input
    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    # define the total iterations
    n_iter = 60
    # step size
    alpha = 0.02
    # factor for average gradient
    beta1 = 0.8
    # factor for average squared gradient
    beta2 = 0.999
    # perform the gradient descent search with adam
    solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
    # sample input range uniformly at 0.1 increments
    xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
    yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a filled contour plot with 50 levels and jet color scheme
    pyplot.contourf(x, y, results, levels=50, cmap='jet')
    # plot the solutions as white dots connected by a line
    solutions = asarray(solutions)
    pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
    # show the plot
    pyplot.show()

Running the example performs the search as before, except in this case, a contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.

In this tutorial, you discovered how to develop gradient descent with the Adam optimization algorithm from scratch.

Specifically, you learned:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adam.
- How to implement the Adam optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Code Adam Gradient Descent Optimization From Scratch appeared first on Machine Learning Mastery.

The post 3 Books on Optimization for Machine Learning appeared first on Machine Learning Mastery.

**Optimization** is a field of mathematics concerned with finding a good or best solution among many candidates.

It is an important foundational topic required in machine learning as most machine learning algorithms are fit on historical data using an optimization algorithm. Additionally, broader problems, such as model selection and hyperparameter tuning, can also be framed as an optimization problem.

Although having some background in optimization is critical for machine learning practitioners, it can be a daunting topic given that it is often described using highly mathematical language.

In this post, you will discover top books on optimization that will be helpful to machine learning practitioners.

Let’s get started.

The field of optimization is enormous as it touches many other fields of study.

As such, there are hundreds of books on the topic, and most are textbooks filled with math and proofs. This is fair enough given that it is a highly mathematical subject.

Nevertheless, there are books that provide a more approachable description of optimization algorithms.

Not all optimization algorithms are relevant to machine learning; instead, it is useful to focus on a small subset of algorithms.

Frankly, it is hard to group optimization algorithms as there are many concerns. Nevertheless, it is important to have some idea of the optimization that underlies simpler algorithms, such as linear regression and logistic regression (e.g. convex optimization, least squares, Newton's method, etc.), and neural networks (first-order methods, gradient descent, etc.).

These are foundational optimization algorithms covered in most optimization textbooks.

Not all optimization problems in machine learning are well behaved, such as optimization used in AutoML and hyperparameter tuning. Therefore, knowledge of stochastic optimization algorithms is required (simulated annealing, genetic algorithms, particle swarm, etc.). Although these are optimization algorithms, they are also a type of learning algorithm referred to as biologically inspired computation or computational intelligence.

Therefore, we will take a look at both books that cover classical optimization algorithms as well as books on alternate optimization algorithms.

In fact, the first book we will look at covers both types of algorithms, and much more.

This book was written by Mykel Kochenderfer and Tim Wheeler and was published in 2019.

This book might be one of the very few textbooks that I’ve seen that broadly covers the field of optimization techniques relevant to modern machine learning.

This book provides a broad introduction to optimization with a focus on practical algorithms for the design of engineering systems. We cover a wide variety of optimization topics, introducing the underlying mathematical problem formulations and the algorithms for solving them. Figures, examples, and exercises are provided to convey the intuition behind the various approaches.

— Page xiiix, Algorithms for Optimization, 2019.

Importantly, the algorithms range from univariate methods (bisection, line search, etc.) to first-order methods (gradient descent), second-order methods (Newton’s method), direct methods (pattern search), stochastic methods (simulated annealing), and population methods (genetic algorithms, particle swarm), and so much more.

It includes both technical descriptions of algorithms with references and worked examples of algorithms in Julia. It’s a shame the examples are not in Python as this would make the book near perfect in my eyes.

The complete table of contents for the book is listed below.

- Chapter 01: Introduction
- Chapter 02: Derivatives and Gradients
- Chapter 03: Bracketing
- Chapter 04: Local Descent
- Chapter 05: First-Order Methods
- Chapter 06: Second-Order Methods
- Chapter 07: Direct Methods
- Chapter 08: Stochastic Methods
- Chapter 09: Population Methods
- Chapter 10: Constraints
- Chapter 11: Linear Constrained Optimization
- Chapter 12: Multiobjective Optimization
- Chapter 13: Sampling Plans
- Chapter 14: Surrogate Models
- Chapter 15: Probabilistic Surrogate Models
- Chapter 16: Surrogate Optimization
- Chapter 17: Optimization under Uncertainty
- Chapter 18: Uncertainty Propagation
- Chapter 19: Discrete Optimization
- Chapter 20: Expression Optimization
- Chapter 21: Multidisciplinary Optimization

I like this book a lot; it is full of valuable practical advice. I highly recommend it!

- Algorithms for Optimization, 2019.

This book was written by Jorge Nocedal and Stephen Wright and was published in 2006.

This book is focused on the math and theory of the optimization algorithms presented and does cover many of the foundational techniques used by common machine learning algorithms. It may be a little too heavy for the average practitioner.

The book is intended as a textbook for graduate students in mathematical subjects.

We intend that this book will be used in graduate-level courses in optimization, as offered in engineering, operations research, computer science, and mathematics departments.

— Page xviii, Numerical Optimization, 2006.

Even though it is highly mathematical, the descriptions of the algorithms are precise and may provide a useful alternative description to complement the other books listed.

The complete table of contents for the book is listed below.

- Chapter 01: Introduction
- Chapter 02: Fundamentals of Unconstrained Optimization
- Chapter 03: Line Search Methods
- Chapter 04: Trust-Region Methods
- Chapter 05: Conjugate Gradient Methods
- Chapter 06: Quasi-Newton Methods
- Chapter 07: Large-Scale Unconstrained Optimization
- Chapter 08: Calculating Derivatives
- Chapter 09: Derivative-Free Optimization
- Chapter 10: Least-Squares Problems
- Chapter 11: Nonlinear Equations
- Chapter 12: Theory of Constrained Optimization
- Chapter 13: Linear Programming: The Simplex Method
- Chapter 14: Linear Programming: Interior-Point Methods
- Chapter 15: Fundamentals of Algorithms for Nonlinear Constrained Optimization
- Chapter 16: Quadratic Programming
- Chapter 17: Penalty and Augmented Lagrangian Methods
- Chapter 18: Sequential Quadratic Programming
- Chapter 19: Interior-Point Methods for Nonlinear Programming

It’s a solid textbook on optimization.

- Numerical Optimization, 2006.

If you do prefer the theoretical approach to the subject, another widely used mathematical book on optimization is “Convex Optimization” written by Stephen Boyd and Lieven Vandenberghe and published in 2004.

This book was written by Andries Engelbrecht and published in 2007.

This book provides an excellent overview of the field of nature-inspired optimization algorithms, also referred to as computational intelligence. This includes fields such as evolutionary computation and swarm intelligence.

This book is far less mathematical than the previous textbooks and is more focused on the metaphor of the inspired system and how to configure and use the specific algorithms with lots of pseudocode explanations.

While the material is introductory in nature, it does not shy away from details, and does present the mathematical foundations to the interested reader. The intention of the book is not to provide thorough attention to all computational intelligence paradigms and algorithms, but to give an overview of the most popular and frequently used models.

— Page xxix, Computational Intelligence: An Introduction, 2007.

Algorithms like genetic algorithms, genetic programming, evolutionary strategies, differential evolution, and particle swarm optimization are useful to know for machine learning model hyperparameter tuning and perhaps even model selection. They also form the core of many modern AutoML systems.

The complete table of contents for the book is listed below.

- Part I Introduction
- Chapter 01: Introduction to Computational Intelligence

- Part II Artificial Neural Networks
- Chapter 02: The Artificial Neuron
- Chapter 03: Supervised Learning Neural Networks
- Chapter 04: Unsupervised Learning Neural Networks
- Chapter 05: Radial Basis Function Networks
- Chapter 06: Reinforcement Learning
- Chapter 07: Performance Issues (Supervised Learning)

- Part III Evolutionary Computation
- Chapter 08: Introduction to Evolutionary Computation
- Chapter 09: Genetic Algorithms
- Chapter 10: Genetic Programming
- Chapter 11: Evolutionary Programming
- Chapter 12: Evolution Strategies
- Chapter 13: Differential Evolution
- Chapter 14: Cultural Algorithms
- Chapter 15: Coevolution

- Part IV Computational Swarm Intelligence
- Chapter 16: Particle Swarm Optimization
- Chapter 17: Ant Algorithms

- Part V Artificial Immune Systems
- Chapter 18: Natural Immune System
- Chapter 19: Artificial Immune Models

- Part VI Fuzzy Systems
- Chapter 20: Fuzzy Sets
- Chapter 21: Fuzzy Logic and Reasoning

I’m a fan of this book and recommend it.

In this post, you discovered books on optimization algorithms that are helpful to know for applied machine learning.

**Did I miss a good book on optimization?**

Let me know in the comments below.

**Have you read any of the books listed?**

Let me know what you think of it in the comments.

The post Univariate Function Optimization in Python appeared first on Machine Learning Mastery.

Univariate function optimization involves finding the input to a function that results in the optimal output from an objective function.

This is a common procedure in machine learning when fitting a model with one parameter or tuning a model that has a single hyperparameter.

An efficient algorithm is required to solve optimization problems of this type that will find the best solution with the minimum number of evaluations of the objective function, given that each evaluation of the objective function could be computationally expensive, such as fitting and evaluating a model on a dataset.

This rules out expensive grid search and random search algorithms in favor of efficient algorithms like Brent’s method.

In this tutorial, you will discover how to perform univariate function optimization in Python.

After completing this tutorial, you will know:

- Univariate function optimization involves finding an optimal input for an objective function that takes a single continuous argument.
- How to perform univariate function optimization for an unconstrained convex function.
- How to perform univariate function optimization for an unconstrained non-convex function.

Let’s get started.

This tutorial is divided into three parts; they are:

- Univariate Function Optimization
- Convex Univariate Function Optimization
- Non-Convex Univariate Function Optimization

We may need to find an optimal value of a function that takes a single parameter.

In machine learning, this may occur in many situations, such as:

- Finding the coefficient of a model to fit to a training dataset.
- Finding the value of a single hyperparameter that results in the best model performance.

This is called univariate function optimization.

We may be interested in the minimum outcome or maximum outcome of the function, although this can be simplified to minimization as a maximizing function can be made minimizing by adding a negative sign to all outcomes of the function.
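As a small illustration of the negation trick, the sketch below maximizes a function by minimizing its negative. The function score() and its peak at 2.0 are made up for this example, and it uses the SciPy minimize_scalar() function that is introduced later in this tutorial.

```python
# maximize a function by minimizing its negative (illustrative sketch)
from scipy.optimize import minimize_scalar

# hypothetical function to maximize: a downward parabola with its peak at x=2.0
def score(x):
    return -(x - 2.0)**2.0 + 3.0

# wrapper whose minimum is located at the maximum of score()
def negated(x):
    return -score(x)

# minimize the negated function
result = minimize_scalar(negated, method='brent')
# flip the sign back to recover the maximum value
best_x = result['x']
best_score = -result['fun']
print('Best input: %.6f, maximum output: %.6f' % (best_x, best_score))
```

The input that minimizes the wrapper is the same input that maximizes the original function; only the sign of the reported output needs to be flipped.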

There may or may not be limits on the inputs to the function, so-called unconstrained or constrained optimization, and we assume that small changes in input correspond to small changes in the output of the function, e.g. that it is smooth.

The function may or may not have a single optima, although we prefer that it does have a single optima and that the shape of the function looks like a large basin. If this is the case, we know we can sample the function at one point and find the path down to the minima of the function. Technically, this is referred to as a convex function for minimization (concave for maximization), and functions that don’t have this basin shape are referred to as non-convex.

**Convex Target Function**: There is a single optima and the shape of the target function leads to this optima.

Nevertheless, the target function is sufficiently complex that we don’t know the derivative, meaning we cannot just use calculus to analytically compute the minimum or maximum of the function where the gradient is zero. Such a function may be non-differentiable, or its derivative may simply be unavailable, as with a black-box function.

Although we might be able to sample the function with candidate values, we don’t know the input that will result in the best outcome, and we cannot try every input because, for any number of reasons, it may be expensive to evaluate candidate solutions.

Therefore, we require an algorithm that efficiently samples input values to the function.

One approach to solving univariate function optimization problems is to use Brent’s method.

Brent’s method is an optimization algorithm that combines a bisecting algorithm (Dekker’s method) and inverse quadratic interpolation. It can be used for constrained and unconstrained univariate function optimization.

The Brent-Dekker method is an extension of the bisection method. It is a root-finding algorithm that combines elements of the secant method and inverse quadratic interpolation. It has reliable and fast convergence properties, and it is the univariate optimization algorithm of choice in many popular numerical optimization packages.

— Pages 49-51, Algorithms for Optimization, 2019.

Bisecting algorithms use a bracket (lower and upper) of input values and split up the input domain, bisecting it in order to locate where in the domain the optima is located, much like a binary search. Dekker’s method is one way this is achieved efficiently for a continuous domain.
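To make the idea of shrinking a bracket concrete, the listing below sketches golden-section search, a simple bracketing method for minimization. It is not Dekker's or Brent's method. The function name golden_section_search(), the tolerance, and the test function are assumptions chosen for illustration, and the approach assumes the function has a single minimum inside the bracket.

```python
# golden-section search: a simple bracketing method (illustrative sketch)
from math import sqrt

# test function with a single minimum at x=-5.0
def objective(x):
    return (5.0 + x)**2.0

# shrink the bracket [lower, upper] until it is narrower than tol
def golden_section_search(f, lower, upper, tol=1e-6):
    # fraction of the bracket that is kept each iteration (about 0.618)
    ratio = (sqrt(5.0) - 1.0) / 2.0
    # two interior points, with c < d
    c = upper - ratio * (upper - lower)
    d = lower + ratio * (upper - lower)
    fc, fd = f(c), f(d)
    n_evals = 2
    while (upper - lower) > tol:
        if fc < fd:
            # the minimum lies in [lower, d]; the old c becomes the new d
            upper, d, fd = d, c, fc
            c = upper - ratio * (upper - lower)
            fc = f(c)
        else:
            # the minimum lies in [c, upper]; the old d becomes the new c
            lower, c, fc = c, d, fd
            d = lower + ratio * (upper - lower)
            fd = f(d)
        n_evals += 1
    # report the middle of the final bracket
    return (lower + upper) / 2.0, n_evals

best_x, n_evals = golden_section_search(objective, -10.0, 10.0)
print('Input: %.6f, evaluations: %d' % (best_x, n_evals))
```

Each iteration reuses one of the two interior points, so only one new evaluation of the function is needed to discard part of the bracket.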

Dekker’s method can get stuck, or converge very slowly, on some problems. Brent’s method modifies Dekker’s method to avoid getting stuck and also uses interpolation between previously evaluated points (such as the secant method) in an effort to accelerate the search.

As such, Brent’s method for univariate function optimization is generally preferred over most other univariate function optimization algorithms given its efficiency.

Brent’s method is available in Python via the minimize_scalar() SciPy function that takes the function to be minimized as an argument. If your target function is constrained to a range, the range can be specified via the “*bounds*” argument, which SciPy uses with its “*bounded*” method (a bounded variant of Brent’s method).

It returns an OptimizeResult object, a dictionary-like object containing the solution. Importantly, the ‘*x*’ key gives the input for the optima, the ‘*fun*’ key gives the function output for the optima, and the ‘*nfev*’ key gives the number of evaluations of the target function that were performed.

    ...
    # minimize the function
    result = minimize_scalar(objective, method='brent')
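For a constrained search, a minimal sketch might look like the following. The range of 0 to 10 is an arbitrary choice for illustration, and the objective function is a simple offset parabola whose unconstrained minimum at -5.0 lies outside that range.

```python
# constrained univariate optimization with the bounded method (illustrative sketch)
from scipy.optimize import minimize_scalar

# objective function with an unconstrained minimum at x=-5.0
def objective(x):
    return (5.0 + x)**2.0

# limit the search to the range [0, 10]
result = minimize_scalar(objective, bounds=(0.0, 10.0), method='bounded')
# values can be read as dictionary keys or as attributes
print('Input: %.6f' % result['x'])
print('Output: %.6f' % result['fun'])
print('Evaluations: %d' % result['nfev'])
print('Success: %s' % result.success)
```

In this case, the search should stop close to the lower bound of the range, as that is the lowest point of the function that the constraint allows.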

Now that we know how to perform univariate function optimization in Python, let’s look at some examples.

In this section, we will explore how to solve a convex univariate function optimization problem.

First, we can define a function that implements our function.

In this case, we will use a simple offset version of the x^2 function e.g. a simple parabola (u-shape) function. It is a minimization objective function with an optima at -5.0.

    # objective function
    def objective(x):
        return (5.0 + x)**2.0

We can plot a coarse grid of this function with input values from -10 to 10 to get an idea of the shape of the target function.

The complete example is listed below.

    # plot a convex target function
    from numpy import arange
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return (5.0 + x)**2.0

    # define range
    r_min, r_max = -10.0, 10.0
    # prepare inputs
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    targets = [objective(x) for x in inputs]
    # plot inputs vs target
    pyplot.plot(inputs, targets, '--')
    pyplot.show()

Running the example evaluates input values in our specified range using our target function and creates a plot of the function inputs to function outputs.

We can see the U-shape of the function and that the minimum of the objective is at an input of -5.0.

**Note**: in a real optimization problem, we would not be able to perform so many evaluations of the objective function so easily. This simple function is used for demonstration purposes so we can learn how to use the optimization algorithm.

Next, we can use the optimization algorithm to find the optima.

    ...
    # minimize the function
    result = minimize_scalar(objective, method='brent')

Once optimized, we can summarize the result, including the input and evaluation of the optima and the number of function evaluations required to locate the optima.

    ...
    # summarize the result
    opt_x, opt_y = result['x'], result['fun']
    print('Optimal Input x: %.6f' % opt_x)
    print('Optimal Output f(x): %.6f' % opt_y)
    print('Total Evaluations n: %d' % result['nfev'])

Finally, we can plot the function again and mark the optima to confirm it was located in the place we expected for this function.

    ...
    # define the range
    r_min, r_max = -10.0, 10.0
    # prepare inputs
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    targets = [objective(x) for x in inputs]
    # plot inputs vs target
    pyplot.plot(inputs, targets, '--')
    # plot the optima
    pyplot.plot([opt_x], [opt_y], 's', color='r')
    # show the plot
    pyplot.show()

The complete example of optimizing an unconstrained convex univariate function is listed below.

    # optimize convex objective function
    from numpy import arange
    from scipy.optimize import minimize_scalar
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return (5.0 + x)**2.0

    # minimize the function
    result = minimize_scalar(objective, method='brent')
    # summarize the result
    opt_x, opt_y = result['x'], result['fun']
    print('Optimal Input x: %.6f' % opt_x)
    print('Optimal Output f(x): %.6f' % opt_y)
    print('Total Evaluations n: %d' % result['nfev'])
    # define the range
    r_min, r_max = -10.0, 10.0
    # prepare inputs
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    targets = [objective(x) for x in inputs]
    # plot inputs vs target
    pyplot.plot(inputs, targets, '--')
    # plot the optima
    pyplot.plot([opt_x], [opt_y], 's', color='r')
    # show the plot
    pyplot.show()

Running the example first solves the optimization problem and reports the result.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the optima was located after 10 evaluations of the objective function with an input of -5.0, achieving an objective function value of 0.0.

    Optimal Input x: -5.000000
    Optimal Output f(x): 0.000000
    Total Evaluations n: 10

A plot of the function is created again and this time, the optima is marked as a red square.
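To put the number of evaluations in context, the sketch below compares Brent's method with a naive grid search over the same range used for the plot. The grid spacing of 0.1 is an assumption made for illustration; a finer grid would cost more evaluations and a coarser grid would be less precise.

```python
# compare the evaluations used by grid search and brent's method (illustrative sketch)
from numpy import arange
from scipy.optimize import minimize_scalar

# objective function
def objective(x):
    return (5.0 + x)**2.0

# grid search: evaluate every candidate input on a fixed grid
grid = arange(-10.0, 10.0, 0.1)
grid_scores = [objective(x) for x in grid]
grid_best = grid[grid_scores.index(min(grid_scores))]
print('Grid search: x=%.6f after %d evaluations' % (grid_best, len(grid)))

# brent's method: choose each new candidate based on earlier evaluations
result = minimize_scalar(objective, method='brent')
print('Brent: x=%.6f after %d evaluations' % (result['x'], result['nfev']))
```

The grid search can only be as precise as its spacing, whereas Brent's method refines its estimate using far fewer evaluations on this function.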

A non-convex function is one that does not resemble a basin, meaning that it may have more than one hill or valley.

This can make it more challenging to locate the global optima as the multiple hills and valleys can cause the search to get stuck and report a false or local optima instead.

We can define a non-convex univariate function as follows.

    # objective function
    def objective(x):
        return (x - 2.0) * x * (x + 2.0)**2.0

We can sample this function and create a line plot of input values to objective values.

The complete example is listed below.

    # plot a non-convex univariate function
    from numpy import arange
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return (x - 2.0) * x * (x + 2.0)**2.0

    # define range
    r_min, r_max = -3.0, 2.5
    # prepare inputs
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    targets = [objective(x) for x in inputs]
    # plot inputs vs target
    pyplot.plot(inputs, targets, '--')
    pyplot.show()

Running the example evaluates input values in our specified range using our target function and creates a plot of the function inputs to function outputs.

We can see a function with one false optima around -2.0 and a global optima around 1.2.

**Note**: in a real optimization problem, we would not be able to perform so many evaluations of the objective function so easily. This simple function is used for demonstration purposes so we can learn how to use the optimization algorithm.

Next, we can use the optimization algorithm to find the optima.

As before, we can call the minimize_scalar() function to optimize the function, then summarize the result and plot the optima on a line plot.

The complete example of optimization of an unconstrained non-convex univariate function is listed below.

    # optimize non-convex objective function
    from numpy import arange
    from scipy.optimize import minimize_scalar
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return (x - 2.0) * x * (x + 2.0)**2.0

    # minimize the function
    result = minimize_scalar(objective, method='brent')
    # summarize the result
    opt_x, opt_y = result['x'], result['fun']
    print('Optimal Input x: %.6f' % opt_x)
    print('Optimal Output f(x): %.6f' % opt_y)
    print('Total Evaluations n: %d' % result['nfev'])
    # define the range
    r_min, r_max = -3.0, 2.5
    # prepare inputs
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    targets = [objective(x) for x in inputs]
    # plot inputs vs target
    pyplot.plot(inputs, targets, '--')
    # plot the optima
    pyplot.plot([opt_x], [opt_y], 's', color='r')
    # show the plot
    pyplot.show()

Running the example first solves the optimization problem and reports the result.

In this case, we can see that the optima was located after 15 evaluations of the objective function with an input of about 1.28, achieving an objective function value of about -9.91.

    Optimal Input x: 1.280776
    Optimal Output f(x): -9.914950
    Total Evaluations n: 15

A plot of the function is created again, and this time, the optima is marked as a red square.

We can see that the optimization was not deceived by the false optima and successfully located the global optima.
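This outcome depends on where the search starts. As a hedged illustration, the sketch below passes a starting bracket around -2.0 via the “*bracket*” argument of minimize_scalar(); the bracket values are chosen by hand for this example so that the middle point has a lower function value than both end points.

```python
# show that the starting bracket can lead brent's method to the false optima (illustrative sketch)
from scipy.optimize import minimize_scalar

# objective function
def objective(x):
    return (x - 2.0) * x * (x + 2.0)**2.0

# default search, as in the example above
default_result = minimize_scalar(objective, method='brent')
print('Default: x=%.6f f(x)=%.6f' % (default_result['x'], default_result['fun']))

# search started from a hand-picked bracket around the false optima at -2.0
local_result = minimize_scalar(objective, bracket=(-3.0, -2.0, -1.0), method='brent')
print('Bracketed: x=%.6f f(x)=%.6f' % (local_result['x'], local_result['fun']))
```

With this bracket, the search should settle at the false optima near -2.0 rather than the global optima, which is a reminder that Brent's method is a local search and that the result on a non-convex function is not guaranteed to be global.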

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.

- Optimization (scipy.optimize).
- Optimization and root finding (scipy.optimize)
- scipy.optimize.minimize_scalar API.

In this tutorial, you discovered how to perform univariate function optimization in Python.

Specifically, you learned:

- Univariate function optimization involves finding an optimal input for an objective function that takes a single continuous argument.
- How to perform univariate function optimization for an unconstrained convex function.
- How to perform univariate function optimization for an unconstrained non-convex function.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Machine Learning Modeling Pipelines appeared first on Machine Learning Mastery.

Applied machine learning is typically focused on finding a single model that performs well or best on a given dataset.

Effective use of the model will require appropriate preparation of the input data and hyperparameter tuning of the model.

Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the **modeling pipeline**. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction).

Nevertheless, working with modeling pipelines can be confusing to beginners as it requires a shift in perspective of the applied machine learning process.

In this tutorial, you will discover modeling pipelines for applied machine learning.

After completing this tutorial, you will know:

- Applied machine learning is concerned with more than finding a good performing model; it also requires finding an appropriate sequence of data preparation steps and steps for the post-processing of predictions.
- Collectively, the operations required to address a predictive modeling problem can be considered an atomic unit called a modeling pipeline.
- Approaching applied machine learning through the lens of modeling pipelines requires a change in thinking from evaluating specific model configurations to sequences of transforms and algorithms.

Let’s get started.

This tutorial is divided into three parts; they are:

- Finding a Skillful Model Is Not Enough
- What Is a Modeling Pipeline?
- Implications of a Modeling Pipeline

Applied machine learning is the process of discovering the model that performs best for a given predictive modeling dataset.

In fact, it’s more than this.

In addition to discovering which model performs the best on your dataset, you must also discover:

- **Data transforms** that best expose the unknown underlying structure of the problem to the learning algorithms.
- **Model hyperparameters** that result in a good or best configuration of a chosen model.

There may also be additional considerations such as techniques that transform the predictions made by the model, like threshold moving or model calibration for predicted probabilities.

As such, it is common to think of applied machine learning as a large combinatorial search problem across data transforms, models, and model configurations.

This can be quite challenging in practice as it requires that the sequence of one or more data preparation schemes, the model, the model configuration, and any prediction transform schemes must be evaluated consistently and correctly on a given test harness.

Although tricky, it may be manageable with a simple train-test split but becomes quite unmanageable when using k-fold cross-validation or even repeated k-fold cross-validation.

The solution is to use a modeling pipeline to keep everything straight.

A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.

It allows the sequence of steps to be specified, evaluated, and used as an atomic unit.

**Pipeline**: A linear sequence of data preparation and modeling steps that can be treated as an atomic unit.

To make the idea clear, let’s look at two simple examples:

The first example uses data normalization for the input variables and fits a logistic regression model:

- [Input], [Normalization], [Logistic Regression], [Predictions]

The second example standardizes the input variables, applies RFE feature selection, and fits a support vector machine:

- [Input], [Standardization], [RFE], [SVM], [Predictions]
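As a rough sketch, the two example pipelines might be written with the scikit-learn Pipeline class as follows. The specific classes (MinMaxScaler, StandardScaler, RFE wrapping a logistic regression, SVC), the step names, and the small synthetic dataset are assumptions chosen for illustration only.

```python
# sketch of the two example pipelines (class choices and step names are assumptions)
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# small synthetic dataset used only for illustration
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# [Input], [Normalization], [Logistic Regression], [Predictions]
pipeline1 = Pipeline(steps=[('norm', MinMaxScaler()), ('model', LogisticRegression())])

# [Input], [Standardization], [RFE], [SVM], [Predictions]
# RFE needs an estimator that exposes coefficients, so a linear model is used to rank features
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=5)
pipeline2 = Pipeline(steps=[('std', StandardScaler()), ('rfe', rfe), ('model', SVC())])

# each pipeline is fit and used as a single unit
for name, pipeline in [('pipeline1', pipeline1), ('pipeline2', pipeline2)]:
    pipeline.fit(X, y)
    yhat = pipeline.predict(X[:5])
    print(name, yhat)
```

In both cases, calling *fit()* on the pipeline fits every step in order, and calling *predict()* applies the fitted transforms before the model makes a prediction.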

You can imagine other examples of modeling pipelines.

As an atomic unit, the pipeline can be evaluated using a preferred resampling scheme such as a train-test split or k-fold cross-validation.

This is important for two main reasons:

- Avoid data leakage.
- Consistency and reproducibility.

A modeling pipeline avoids the most common type of data leakage where data preparation techniques, such as scaling input values, are applied to the entire dataset. This is data leakage because it shares knowledge of the test dataset (such as observations that contribute to a mean or maximum known value) with the training dataset, and in turn, may result in overly optimistic model performance.

Instead, data transforms must be prepared on the training dataset only, then applied to the training dataset, test dataset, validation dataset, and any other datasets that require the transform prior to being used with the model.
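To make the point concrete, the sketch below prepares a scaling transform on the training dataset only and then applies it to both the training and test datasets. The use of MinMaxScaler and the synthetic dataset are assumptions made for the example.

```python
# sketch of leakage-free scaling: fit the transform on the training data only
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic dataset used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# the minimum and maximum values are learned from the training rows only
scaler = MinMaxScaler()
scaler.fit(X_train)

# the same fitted transform is then applied to every dataset
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# the training data spans the scaled range exactly; the test data may fall slightly outside it
print('Train range:', X_train_scaled.min(), X_train_scaled.max())
print('Test range:', X_test_scaled.min(), X_test_scaled.max())
```

Test values that fall outside the training range are expected and are a sign that no information from the test dataset was used to prepare the transform.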

A modeling pipeline ensures that the sequence of data preparation operations performed is reproducible.

Without a modeling pipeline, the data preparation steps may be performed manually twice: once for evaluating the model and once for making predictions. Any changes to the sequence must be kept consistent in both cases, otherwise differences will impact the capability and skill of the model.

A pipeline ensures that the sequence of operations is defined once and is consistent when used for model evaluation or making predictions.
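The following is a minimal sketch of this idea, assuming scikit-learn's Pipeline class, a StandardScaler transform, and a logistic regression model on a synthetic dataset. The pipeline is defined once, evaluated with cross-validation, and then fit on all data to make a prediction.

```python
# sketch: one pipeline definition used for both evaluation and prediction
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic dataset used only for illustration
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# define the sequence of operations once
pipeline = Pipeline(steps=[('std', StandardScaler()), ('model', LogisticRegression())])

# evaluate the pipeline: the scaler is re-fit on the training folds of each split
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f' % mean(scores))

# fit the same pipeline on all data and use it to make a prediction
pipeline.fit(X, y)
yhat = pipeline.predict(X[:1])
print('Predicted class:', yhat[0])
```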

The Python scikit-learn machine learning library provides a machine learning modeling pipeline via the Pipeline class.

You can learn more about how to use this Pipeline API in this tutorial:

The modeling pipeline is an important tool for machine learning practitioners.

Nevertheless, there are important implications that must be considered when using pipelines.

The main confusion for beginners when using pipelines comes in understanding what the pipeline has learned or the specific configuration discovered by the pipeline.

For example, a pipeline may use a data transform that configures itself automatically, such as the RFECV technique for feature selection.

- When evaluating a pipeline that uses an automatically-configured data transform, what configuration does it choose? Or, when fitting this pipeline as a final model for making predictions, what configuration did it choose?

**The answer is, it doesn’t matter**.
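As a sketch of this situation, the example below places RFECV as the first step of a pipeline and evaluates the pipeline as a whole. The wrapped logistic regression, the fold counts, and the synthetic dataset are assumptions for illustration. The number of selected features is printed at the end purely as an exercise in analysis.

```python
# sketch: a pipeline with a self-configuring feature selection step
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# synthetic dataset used only for illustration
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, n_redundant=4, random_state=1)

# RFECV chooses the number of features itself using internal cross-validation
rfecv = RFECV(estimator=LogisticRegression(), cv=3)
pipeline = Pipeline(steps=[('select', rfecv), ('model', LogisticRegression())])

# evaluate the whole pipeline; each fold may select a different set of features
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
print('Mean Accuracy: %.3f' % mean(scores))

# fit a final pipeline and, optionally, inspect what was chosen
pipeline.fit(X, y)
n_selected = pipeline.named_steps['select'].n_features_
print('Features selected by the final pipeline:', n_selected)
```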

Another example is the use of hyperparameter tuning as the final step of the pipeline.

The grid search will be performed on the data provided by any prior transform steps in the pipeline and will then search for the best combination of hyperparameters for the model using that data, then fit a model with those hyperparameters on the data.

- When evaluating a pipeline that grid searches model hyperparameters, what configuration does it choose? Or, when fitting this pipeline as a final model for making predictions, what configuration did it choose?

**The answer again is, it doesn’t matter**.
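A minimal sketch of a grid search used as the final step of a pipeline is shown below. The grid of values for the *C* hyperparameter, the StandardScaler step, and the synthetic dataset are assumptions chosen for the example.

```python
# sketch: hyperparameter grid search as the final step of a pipeline
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic dataset used only for illustration
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# the grid search receives data already prepared by the earlier steps
grid = {'C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(), grid, scoring='accuracy', cv=3)
pipeline = Pipeline(steps=[('std', StandardScaler()), ('search', search)])

# evaluate the pipeline, including the search, as a single unit
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
print('Mean Accuracy: %.3f' % mean(scores))

# fit a final pipeline; the chosen value can be inspected but is not needed to use the model
pipeline.fit(X, y)
best = pipeline.named_steps['search'].best_params_
print('Chosen configuration:', best)
```

Each outer fold runs its own search and may settle on a different value, which is why the score describes the procedure rather than any one configuration.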

The same answer applies when using a threshold moving or probability calibration step at the end of the pipeline.

The reason is the same reason that we are not concerned about the specific internal structure or coefficients of the chosen model.

For example, when evaluating a logistic regression model, we don’t need to inspect the coefficients chosen on each k-fold cross-validation round in order to choose the model. Instead, we focus on its out-of-fold predictive skill.

Similarly, when using a logistic regression model as the final model for making predictions on new data, we do not need to inspect the coefficients chosen when fitting the model on the entire dataset before making predictions.

We can inspect and discover the coefficients used by the model as an exercise in analysis, but it does not impact the selection and use of the model.

This same answer generalizes when considering a modeling pipeline.

We are not concerned about which features may have been automatically selected by a data transform in the pipeline. We are also not concerned about which hyperparameters were chosen for the model when using a grid search as the final step in the modeling pipeline.

In all three cases: the single model, the pipeline with automatic feature selection, and the pipeline with a grid search, we are evaluating the “*model*” or “*modeling pipeline*” as an atomic unit.

The pipeline allows us as machine learning practitioners to move up one level of abstraction and be less concerned with the specific outcomes of the algorithms and more concerned with the capability of a sequence of procedures.

As such, we can focus on evaluating the capability of the algorithms on the dataset, not the product of the algorithms, i.e. the model. Once we have an estimate of the pipeline’s performance, we can apply it and be confident that we will get similar performance, on average.

It is a shift in thinking and may take some time to get used to.

It is also the philosophy behind modern AutoML (automatic machine learning) techniques that treat applied machine learning as a large combinatorial search problem.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered modeling pipelines for applied machine learning.

Specifically, you learned:

- Applied machine learning is concerned with more than finding a good performing model; it also requires finding an appropriate sequence of data preparation steps and steps for the post-processing of predictions.
- Collectively, the operations required to address a predictive modeling problem can be considered an atomic unit called a modeling pipeline.
- Approaching applied machine learning through the lens of modeling pipelines requires a change in thinking from evaluating specific model configurations to sequences of transforms and algorithms.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Machine Learning Modeling Pipelines appeared first on Machine Learning Mastery.

The post Semi-Supervised Learning With Label Spreading appeared first on Machine Learning Mastery.

**Semi-supervised learning** refers to algorithms that attempt to make use of both labeled and unlabeled training data.

Semi-supervised learning algorithms are unlike supervised learning algorithms that are only able to learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagates known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the **label spreading algorithm** for classification predictive modeling.

In this tutorial, you will discover how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

- An intuition for how the label spreading semi-supervised learning algorithm works.
- How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
- How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

This tutorial is divided into three parts; they are:

- Label Spreading Algorithm
- Semi-Supervised Classification Dataset
- Label Spreading for Semi-Supervised Learning

Label Spreading is a semi-supervised learning algorithm.

The algorithm was introduced by Dengyong Zhou, et al. in their 2003 paper titled “Learning With Local And Global Consistency.”

The intuition for the broader approach of semi-supervised learning is that nearby points in the input space should have the same label, and points in the same structure or manifold in the input space should have the same label.

The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label.

— Learning With Local And Global Consistency, 2003.

The label spreading algorithm is inspired by a technique from experimental psychology called spreading activation networks.

This algorithm can be understood intuitively in terms of spreading activation networks from experimental psychology.

— Learning With Local And Global Consistency, 2003.

Points in the dataset are connected in a graph based on their relative distances in the input space. The weight matrix of the graph is normalized symmetrically, much like spectral clustering. Information is passed through the graph, which is adapted to capture the structure in the input space.

The approach is very similar to the label propagation algorithm for semi-supervised learning.

Another similar label propagation algorithm was given by Zhou et al.: at each step a node i receives a contribution from its neighbors j (weighted by the normalized weight of the edge (i,j)), and an additional small contribution given by its initial value

— Page 196, Semi-Supervised Learning, 2006.

After convergence, labels are applied based on nodes that passed on the most information.

Finally, the label of each unlabeled point is set to be the class of which it has received most information during the iteration process.

— Learning With Local And Global Consistency, 2003.
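To build some intuition, the steps described above can be sketched from scratch with NumPy. This is a simplified illustration under stated assumptions, not the scikit-learn implementation: the Gaussian affinity with width *sigma*, the clamping factor *alpha*, the fixed number of iterations, and the six-point toy dataset are all choices made for the example.

```python
# from-scratch sketch of label spreading (simplified; parameter values are assumptions)
import numpy as np

def label_spreading(X, y, n_classes, sigma=1.0, alpha=0.9, n_iter=200):
    n = X.shape[0]
    # connect all points in a graph weighted by a Gaussian of their squared distance
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # symmetric normalization of the weight matrix: S = D^-1/2 W D^-1/2
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    # one-hot matrix of initial labels; unlabeled rows (marked -1) stay all zero
    Y = np.zeros((n, n_classes))
    idx = np.flatnonzero(y >= 0)
    Y[idx, y[idx]] = 1.0
    # each node receives information from its neighbors plus a share of its initial value
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    # assign each point the class from which it received the most information
    return F.argmax(axis=1)

# toy dataset: two well separated groups with one labeled point in each
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]])
y = np.array([0, -1, -1, 1, -1, -1])
labels = label_spreading(X, y, n_classes=2)
print(labels)
```

Because the two groups are far apart, almost no information passes between them, and each unlabeled point takes the label of the labeled point in its own group.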

Now that we are familiar with the label spreading algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification) and two input variables and 1,000 examples.

    ...
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (e.g. 500 rows in each).

    ...
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

    ...
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

    # prepare semi-supervised learning dataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
    # summarize training set size
    print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
    print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
    # summarize test set size
    print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

    Labeled Train Set: (250, 2) (250,)
    Unlabeled Train Set: (250, 2) (250,)
    Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

    ...
    # define model
    model = LogisticRegression()
    # fit model on labeled dataset
    model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy.

    ...
    # make predictions on hold out test set
    yhat = model.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

    # baseline performance on the semi-supervised learning dataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
    # define model
    model = LogisticRegression()
    # fit model on labeled dataset
    model.fit(X_train_lab, y_train_lab)
    # make predictions on hold out test set
    yhat = model.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the labeled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve a better accuracy than this.

Accuracy: 84.800

Next, let’s explore how to apply the label spreading algorithm to the dataset.

The label spreading algorithm is available in the scikit-learn Python machine learning library via the LabelSpreading class.

The model can be fit just like any other classification model by calling the *fit()* function and used to make predictions for new data via the *predict()* function.

    ...
    # define model
    model = LabelSpreading()
    # fit model on training dataset
    model.fit(..., ...)
    # make predictions on hold out test set
    yhat = model.predict(...)

Importantly, the training dataset provided to the *fit()* function must include labeled examples that are ordinal encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the “*transduction_*” attribute on the *LabelSpreading* class.

    ...
    # get labels for entire training dataset data
    tran_labels = model.transduction_
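As a small illustration of the -1 convention and the “*transduction_*” attribute, the sketch below hides the labels for most rows of a synthetic dataset, fits the model, and compares the estimated labels to the labels that were hidden. The dataset size and the choice to keep labels for only the first 30 rows are assumptions made for the example.

```python
# sketch: mark unlabeled rows with -1 and recover estimated labels after fitting
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading

# synthetic dataset used only for illustration
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=1)

# pretend that only the first 30 rows are labeled; mark the rest as unlabeled
y_mixed = y.copy()
y_mixed[30:] = -1

# fit on the labeled and unlabeled rows together
model = LabelSpreading()
model.fit(X, y_mixed)

# one estimated label per training row, with no -1 values remaining
tran_labels = model.transduction_
print(tran_labels.shape)

# compare the estimated labels to the labels we hid
score = accuracy_score(y[30:], tran_labels[30:])
print('Accuracy on hidden labels: %.3f' % (score*100))
```

On a real project the hidden labels would not be available, so this check is only possible here because we are pretending that part of the dataset is unlabeled.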

Now that we are familiar with how to use the label spreading algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

    ...
    # create the training dataset input
    X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 values (the unlabeled marker), one for each row in the unlabeled portion of the training dataset.

    ...
    # create "no label" for unlabeled data
    nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

    ...
    # recombine training dataset labels
    y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the *LabelSpreading* model on the entire training dataset.

    ...
    # define model
    model = LabelSpreading()
    # fit model on training dataset
    model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

    ...
    # make predictions on hold out test set
    yhat = model.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating label spreading on the semi-supervised learning dataset is listed below.

    # evaluate label spreading on the semi-supervised learning dataset
    from numpy import concatenate
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.semi_supervised import LabelSpreading
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
    # create the training dataset input
    X_train_mixed = concatenate((X_train_lab, X_test_unlab))
    # create "no label" for unlabeled data
    nolabel = [-1 for _ in range(len(y_test_unlab))]
    # recombine training dataset labels
    y_train_mixed = concatenate((y_train_lab, nolabel))
    # define model
    model = LabelSpreading()
    # fit model on training dataset
    model.fit(X_train_mixed, y_train_mixed)
    # make predictions on hold out test set
    yhat = model.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the label spreading model achieves a classification accuracy of about 85.4 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.400

So far so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label spreading model as follows:

    ...
    # get labels for entire training dataset data
    tran_labels = model.transduction_

We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

    ...
    # define supervised learning model
    model2 = LogisticRegression()
    # fit supervised learning model on entire training dataset
    model2.fit(X_train_mixed, tran_labels)
    # make predictions on hold out test set
    yhat = model2.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

    # evaluate logistic regression fit on label spreading for semi-supervised learning
    from numpy import concatenate
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.semi_supervised import LabelSpreading
    from sklearn.linear_model import LogisticRegression
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
    # create the training dataset input
    X_train_mixed = concatenate((X_train_lab, X_test_unlab))
    # create "no label" for unlabeled data
    nolabel = [-1 for _ in range(len(y_test_unlab))]
    # recombine training dataset labels
    y_train_mixed = concatenate((y_train_lab, nolabel))
    # define model
    model = LabelSpreading()
    # fit model on training dataset
    model.fit(X_train_mixed, y_train_mixed)
    # get labels for entire training dataset data
    tran_labels = model.transduction_
    # define supervised learning model
    model2 = LogisticRegression()
    # fit supervised learning model on entire training dataset
    model2.fit(X_train_mixed, tran_labels)
    # make predictions on hold out test set
    yhat = model2.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with inferred labels and evaluates it on the holdout dataset, printing the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that this hierarchical approach of semi-supervised model followed by supervised model achieves a classification accuracy of about 85.8 percent on the holdout dataset, slightly better than the semi-supervised learning algorithm used alone that achieved an accuracy of about 85.4 percent.

Accuracy: 85.800

**Can you achieve better results by tuning the hyperparameters of the LabelSpreading model?**

Let me know what you discover in the comments below.
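As a possible starting point, the sketch below loops over a few values of the *alpha* and *gamma* hyperparameters of the *LabelSpreading* class with the RBF kernel and reports the holdout accuracy for each. The specific values tried are assumptions, and for a careful comparison a separate validation dataset would be a better choice than the holdout test set for choosing a configuration.

```python
# sketch: try a few LabelSpreading configurations (the values tried are assumptions)
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading

# define and split the dataset as in the examples above
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

# combine the labeled rows with the unlabeled rows marked as -1
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))

# evaluate each combination of alpha (clamping factor) and gamma (RBF kernel width)
results = dict()
for alpha in [0.1, 0.2, 0.5, 0.9]:
    for gamma in [5, 20, 50]:
        model = LabelSpreading(kernel='rbf', gamma=gamma, alpha=alpha)
        model.fit(X_train_mixed, y_train_mixed)
        yhat = model.predict(X_test)
        results[(alpha, gamma)] = accuracy_score(y_test, yhat)
        print('alpha=%.1f, gamma=%d: %.3f' % (alpha, gamma, results[(alpha, gamma)]*100))
```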

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Semi-Supervised Learning, 2009.
- Chapter 11: Label Propagation and Quadratic Criterion, Semi-Supervised Learning, 2006.

- sklearn.semi_supervised.LabelSpreading API.
- Section 1.14. Semi-Supervised, Scikit-Learn User Guide.
- sklearn.model_selection.train_test_split API.
- sklearn.linear_model.LogisticRegression API.
- sklearn.datasets.make_classification API.

In this tutorial, you discovered how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

- An intuition for how the label spreading semi-supervised learning algorithm works.
- How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
- How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Multinomial Logistic Regression With Python appeared first on Machine Learning Mastery.

**Multinomial logistic regression** is an extension of logistic regression that adds native support for multi-class classification problems.

Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-vs-rest can allow logistic regression to be used for multi-class classification problems, although they require that the classification problem first be transformed into multiple binary classification problems.

Instead, the multinomial logistic regression algorithm is an extension to the logistic regression model that involves changing the loss function to cross-entropy loss and the predicted probability distribution to a multinomial probability distribution in order to natively support multi-class classification problems.

In this tutorial, you will discover how to develop multinomial logistic regression models in Python.

After completing this tutorial, you will know:

- Multinomial logistic regression is an extension of logistic regression for multi-class classification.
- How to develop and evaluate multinomial logistic regression and develop a final model for making predictions on new data.
- How to tune the penalty hyperparameter for the multinomial logistic regression model.

Let’s get started.

This tutorial is divided into three parts; they are:

- Multinomial Logistic Regression
- Evaluate Multinomial Logistic Regression Model
- Tune Penalty for Multinomial Logistic Regression

Logistic regression is a classification algorithm.

It is intended for datasets that have numerical input variables and a categorical target variable that has two values or classes. Problems of this type are referred to as binary classification problems.

Logistic regression is designed for two-class problems, modeling the target using a binomial probability distribution function. The class labels are mapped to 1 for the positive class or outcome and 0 for the negative class or outcome. The fit model predicts the probability that an example belongs to class 1.

By default, logistic regression cannot be used for classification tasks that have more than two class labels, so-called multi-class classification.

Instead, it requires modification to support multi-class classification problems.

One popular approach for adapting logistic regression to multi-class classification problems is to split the multi-class classification problem into multiple binary classification problems and fit a standard logistic regression model on each subproblem. Techniques of this type include one-vs-rest and one-vs-one wrapper models.
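For comparison, a one-vs-rest wrapper might look like the sketch below, assuming the OneVsRestClassifier class from scikit-learn and a synthetic three-class dataset. One binary logistic regression model is fit for each class.

```python
# sketch: one-vs-rest wrapper around binary logistic regression
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# synthetic three-class dataset used only for illustration
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)

# the wrapper splits the problem into one binary problem per class
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

# one fitted binary model per class label
print('Number of binary models:', len(ovr.estimators_))

# the class whose binary model is most confident is predicted
yhat = ovr.predict(X[:1])
print('Predicted class:', yhat[0])
```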

An alternate approach involves changing the logistic regression model to support the prediction of multiple class labels directly. Specifically, to predict the probability that an input example belongs to each known class label.

The probability distribution that defines multi-class probabilities is called a multinomial probability distribution. A logistic regression model that is adapted to learn and predict a multinomial probability distribution is referred to as Multinomial Logistic Regression. Similarly, we might refer to default or standard logistic regression as Binomial Logistic Regression.

- **Binomial Logistic Regression**: Standard logistic regression that predicts a binomial probability (i.e. for two classes) for each input example.
- **Multinomial Logistic Regression**: Modified version of logistic regression that predicts a multinomial probability (i.e. more than two classes) for each input example.

If you are new to binomial and multinomial probability distributions, you may want to read the tutorial:

Changing logistic regression from binomial to multinomial probability requires a change to the loss function used to train the model (e.g. log loss to cross-entropy loss), and a change to the output from a single probability value to one probability for each class label.
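A small worked example may help. The sketch below assumes the softmax function is used to turn one raw score per class into one probability per class, which is the standard choice for multinomial logistic regression although it is not named above, and then calculates the cross-entropy loss for a single example. The raw scores are made-up values.

```python
# sketch: one probability per class via softmax, scored with cross-entropy
import numpy as np

def softmax(scores):
    # subtract the maximum score for numerical stability before exponentiating
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

def cross_entropy(true_class, probs):
    # negative log of the probability assigned to the true class
    return -np.log(probs[true_class])

# made-up raw model outputs for three classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print('Probabilities:', probs)
print('Sum of probabilities:', probs.sum())

# the loss is small when the true class receives a high probability
loss_likely = cross_entropy(0, probs)
loss_unlikely = cross_entropy(2, probs)
print('Loss if true class is 0: %.3f' % loss_likely)
print('Loss if true class is 2: %.3f' % loss_unlikely)
```

The probabilities sum to one across the classes, and the loss grows as the probability assigned to the true class shrinks.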

Now that we are familiar with multinomial logistic regression, let’s look at how we might develop and evaluate multinomial logistic regression models in Python.

In this section, we will develop and evaluate a multinomial logistic regression model using the scikit-learn Python machine learning library.

First, we will define a synthetic multi-class classification dataset to use as the basis of the investigation. This is a generic dataset that you can easily replace with your own loaded dataset later.

The make_classification() function can be used to generate a dataset with a given number of rows, columns, and classes. In this case, we will generate a dataset with 1,000 rows, 10 input variables or columns, and 3 classes.

The example below generates the dataset and summarizes the shape of the arrays and the distribution of examples across the three classes.

    # test classification dataset
    from collections import Counter
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)
    print(Counter(y))

Running the example confirms that the dataset has 1,000 rows and 10 columns, as we expected, and that the rows are distributed approximately evenly across the three classes, with about 334 examples in each class.

    (1000, 10) (1000,)
    Counter({1: 334, 2: 334, 0: 332})

Logistic regression is supported in the scikit-learn library via the LogisticRegression class.

The *LogisticRegression* class can be configured for multinomial logistic regression by setting the “*multi_class*” argument to “*multinomial*” and the “*solver*” argument to a solver that supports multinomial logistic regression, such as “*lbfgs*”.

    ...
    # define the multinomial logistic regression model
    model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

The multinomial logistic regression model will be fit using cross-entropy loss and will predict the integer-encoded class label for each input example.
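The sketch below shows the two kinds of output for a single row: one probability per class via *predict_proba()* and an integer class label via *predict()*. It omits the “*multi_class*” argument on the assumption that a recent version of scikit-learn is installed, where the “*lbfgs*” solver fits a multinomial model for multi-class targets by default.

```python
# sketch: class probabilities and predicted label from a multinomial model
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic three-class dataset used only for illustration
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)

# assumes a recent scikit-learn where lbfgs is multinomial for multi-class targets
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)

# one probability for each of the three class labels
row = X[:1]
probs = model.predict_proba(row)
print('Probabilities:', probs[0])

# the predicted label is the class with the largest probability
yhat = model.predict(row)
print('Predicted class:', yhat[0])
```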

Now that we are familiar with the multinomial logistic regression API, we can look at how we might evaluate a multinomial logistic regression model on our synthetic multi-class classification dataset.

It is a good practice to evaluate classification models using repeated stratified k-fold cross-validation. The stratification ensures that each cross-validation fold has approximately the same distribution of examples in each class as the whole training dataset.

We will use three repeats with 10 folds, which is a good default, and evaluate model performance using classification accuracy given that the classes are balanced.
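To see what the stratification does in practice, the sketch below enumerates the folds produced by this procedure and counts the classes in each test fold. It reuses the synthetic dataset defined earlier; the `fold_counts` name is only for illustration.

```python
# sketch: inspect the class balance of each test fold
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# three repeats of 10-fold cross-validation gives 30 train/test splits
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
fold_counts = list()
for train_ix, test_ix in cv.split(X, y):
    # count the examples of each class in the held-out fold
    fold_counts.append(Counter(y[test_ix].tolist()))
# report the first three folds and the total number of folds
for counts in fold_counts[:3]:
    print(sorted(counts.items()))
print('Total folds: %d' % len(fold_counts))
```

Each test fold should contain roughly one tenth of the examples from every class, mirroring the balance of the whole dataset.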

The complete example of evaluating multinomial logistic regression for multi-class classification is listed below.

    # evaluate multinomial logistic regression model
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.linear_model import LogisticRegression
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
    # define the multinomial logistic regression model
    model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
    # define the model evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model and collect the scores
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report the model performance
    print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean classification accuracy across all folds and repeats of the evaluation procedure.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the multinomial logistic regression model with default penalty achieved a mean classification accuracy of about 68.1 percent on our synthetic classification dataset.

Mean Accuracy: 0.681 (0.042)

We may decide to use the multinomial logistic regression model as our final model and make predictions on new data.

This can be achieved by first fitting the model on all available data, then calling the *predict()* function to make a prediction for new data.

The example below demonstrates how to make a prediction for new data using the multinomial logistic regression model.

    # make a prediction with a multinomial logistic regression model
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
    # define the multinomial logistic regression model
    model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
    # fit the model on the whole dataset
    model.fit(X, y)
    # define a single row of input data
    row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426]
    # predict the class label
    yhat = model.predict([row])
    # summarize the predicted class
    print('Predicted Class: %d' % yhat[0])

Running the example first fits the model on all available data, then defines a row of data, which is provided to the model in order to make a prediction.

In this case, we can see that the model predicted the class “1” for the single row of data.

Predicted Class: 1

A benefit of multinomial logistic regression is that it can predict calibrated probabilities across all known class labels in the dataset.

This can be achieved by calling the *predict_proba()* function on the model.

The example below demonstrates how to predict a multinomial probability distribution for a new example using the multinomial logistic regression model.

    # predict probabilities with a multinomial logistic regression model
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
    # define the multinomial logistic regression model
    model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
    # fit the model on the whole dataset
    model.fit(X, y)
    # define a single row of input data
    row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426]
    # predict a multinomial probability distribution
    yhat = model.predict_proba([row])
    # summarize the predicted probabilities
    print('Predicted Probabilities: %s' % yhat[0])

Running the example first fits the model on all available data, then defines a row of data, which is provided to the model in order to predict class probabilities.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that class 1 (i.e. the array index maps to the integer class value) has the largest predicted probability, at about 0.50.

Predicted Probabilities: [0.16470456 0.50297138 0.33232406]
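As a small check on how the predicted probabilities relate to class labels, the sketch below confirms that each row of probabilities sums to one and that the column with the largest probability matches the label returned by *predict()*. It leaves out the `multi_class` argument and relies on the library default for multi-class problems, which is an assumption about the installed scikit-learn version; `max_iter=1000` is also an illustrative choice.

```python
# sketch: relate predicted probabilities to predicted class labels
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define and fit the model (default multi-class behavior is assumed to be multinomial)
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X, y)
# predict probabilities and labels for the first five rows
probs = model.predict_proba(X[:5])
labels = model.predict(X[:5])
# column i of the probabilities corresponds to model.classes_[i]
from_probs = model.classes_[np.argmax(probs, axis=1)]
print(model.classes_)
print(probs.sum(axis=1))
print(from_probs, labels)
```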

Now that we are familiar with evaluating and using multinomial logistic regression models, let’s explore how we might tune the model hyperparameters.

An important hyperparameter to tune for multinomial logistic regression is the penalty term.

This term imposes pressure on the model to seek smaller model weights. This is achieved by adding a weighted sum of the model coefficients to the loss function, encouraging the model to reduce the size of the weights along with the error while fitting the model.

A popular type of penalty is the L2 penalty that adds the (weighted) sum of the squared coefficients to the loss function. A weighting of the coefficients can be used that reduces the strength of the penalty from full penalty to a very slight penalty.

By default, the *LogisticRegression* class uses the L2 penalty with the “*C*” weighting set to 1.0. The type of penalty can be set via the “*penalty*” argument with values of “*l1*”, “*l2*”, or “*elasticnet*” (i.e. both), although not all solvers support all penalty types. The weighting of the coefficients in the penalty can be set via the “*C*” argument.

    ...
    # define the multinomial logistic regression model with a default penalty
    LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='l2', C=1.0)

The weighting for the penalty is actually an inverse weighting: the regularization strength is proportional to 1/C, so the penalty grows as C shrinks.

From the documentation:

C : float, default=1.0

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

This means that large values of C indicate very little penalty and values close to zero indicate a strong penalty. The default C value of 1.0 still applies a penalty; the penalty only vanishes as C becomes very large or when it is disabled explicitly.

- **Large C**: Light penalty.
- **C close to 0.0**: Strong penalty.
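One way to see the effect of C is to fit the same model with a few values and compare the overall size of the learned coefficients. The sketch below is a minimal illustration on the synthetic dataset used earlier; the specific C values, the `norms` dictionary, and `max_iter=5000` are assumptions chosen for the demonstration.

```python
# sketch: smaller C values shrink the model coefficients more strongly
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
norms = dict()
for c in [0.001, 1.0, 1000.0]:
    # fit an L2-penalized model with the given inverse regularization strength
    model = LogisticRegression(solver='lbfgs', penalty='l2', C=c, max_iter=5000)
    model.fit(X, y)
    # overall magnitude of all coefficients across the three classes
    norms[c] = np.linalg.norm(model.coef_)
    print('C=%g coefficient norm: %.3f' % (c, norms[c]))
```

The coefficient norm should be noticeably smaller for the smallest C value, reflecting the stronger penalty.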

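The penalty can be disabled by setting the “*penalty*” argument to the string “*none*”. Recent scikit-learn releases no longer accept the string and expect the Python value *None* instead.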

    ...
    # define the multinomial logistic regression model without a penalty
    LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='none')

Now that we are familiar with the penalty, let’s look at how we might explore the effect of different penalty values on the performance of the multinomial logistic regression model.

It is common to test penalty values on a log scale in order to quickly discover the scale of penalty that works well for a model. Once found, further tuning at that scale may be beneficial.
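A log-scale grid of this kind can be generated rather than typed by hand. The sketch below uses NumPy for this; the follow-up fine grid around 0.1 is a hypothetical example of tuning at a chosen scale, not a recommendation.

```python
# sketch: build a log-scale grid of penalty weightings
import numpy as np
# five values from 10^-4 to 10^0, evenly spaced on a log scale
c_values = np.logspace(-4, 0, num=5)
print(c_values)
# once a promising scale is found (say around 0.1), search more finely near it
fine_values = np.linspace(0.05, 0.5, num=10)
print(fine_values)
```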

We will explore the L2 penalty with weighting values in the range from 0.0001 to 1.0 on a log scale, in addition to no penalty or 0.0.

The complete example of evaluating L2 penalty values for multinomial logistic regression is listed below.

    # tune regularization for multinomial logistic regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from matplotlib import pyplot

    # get the dataset
    def get_dataset():
        X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
        return X, y

    # get a list of models to evaluate
    def get_models():
        models = dict()
        for p in [0.0, 0.0001, 0.001, 0.01, 0.1, 1.0]:
            # create name for model
            key = '%.4f' % p
            # turn off penalty in some cases
            if p == 0.0:
                # no penalty in this case
                models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='none')
            else:
                models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='l2', C=p)
        return models

    # evaluate a given model using cross-validation
    def evaluate_model(model, X, y):
        # define the evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate the model
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
        return scores

    # define dataset
    X, y = get_dataset()
    # get the models to evaluate
    models = get_models()
    # evaluate the models and store results
    results, names = list(), list()
    for name, model in models.items():
        # evaluate the model and collect the scores
        scores = evaluate_model(model, X, y)
        # store the results
        results.append(scores)
        names.append(name)
        # summarize progress along the way
        print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # plot model performance for comparison
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example reports the mean classification accuracy for each configuration along the way.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a C value of 1.0 has the best score of about 77.7 percent, which matches the score achieved with no penalty.

>0.0000 0.777 (0.037) >0.0001 0.683 (0.049) >0.0010 0.762 (0.044) >0.0100 0.775 (0.040) >0.1000 0.774 (0.038) >1.0000 0.777 (0.037)

A box and whisker plot is created for the accuracy scores for each configuration and all plots are shown side by side on a figure on the same scale for direct comparison.

In this case, we can see that the larger penalty we use on this dataset (i.e. the smaller the C value), the worse the performance of the model.

This section provides more resources on the topic if you are looking to go deeper.

- Logistic Regression Tutorial for Machine Learning
- Logistic Regression for Machine Learning
- A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation
- How To Implement Logistic Regression From Scratch in Python
- Cost-Sensitive Logistic Regression for Imbalanced Classification

In this tutorial, you discovered how to develop multinomial logistic regression models in Python.

Specifically, you learned:

- Multinomial logistic regression is an extension of logistic regression for multi-class classification.
- How to develop and evaluate multinomial logistic regression and develop a final model for making predictions on new data.
- How to tune the penalty hyperparameter for the multinomial logistic regression model.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Multinomial Logistic Regression With Python appeared first on Machine Learning Mastery.

The post Semi-Supervised Learning With Label Propagation appeared first on Machine Learning Mastery.

**Semi-supervised learning** refers to algorithms that attempt to make use of both labeled and unlabeled training data.

Semi-supervised learning algorithms are unlike supervised learning algorithms that are only able to learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagate known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the **label propagation algorithm** for classification predictive modeling.

In this tutorial, you will discover how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

- An intuition for how the label propagation semi-supervised learning algorithm works.
- How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
- How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

This tutorial is divided into three parts; they are:

- Label Propagation Algorithm
- Semi-Supervised Classification Dataset
- Label Propagation for Semi-Supervised Learning

Label Propagation is a semi-supervised learning algorithm.

The algorithm was proposed in the 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani titled “Learning From Labeled And Unlabeled Data With Label Propagation.”

The intuition for the algorithm is that a graph is created that connects all examples (rows) in the dataset based on their distance, such as Euclidean distance. Nodes in the graph then have soft labels, or label distributions, based on the labels or label distributions of the examples connected nearby in the graph.

Many semi-supervised learning algorithms rely on the geometry of the data induced by both labeled and unlabeled examples to improve on supervised methods that use only the labeled data. This geometry can be naturally represented by an empirical graph g = (V,E) where nodes V = {1,…,n} represent the training data and edges E represent similarities between them

— Page 193, Semi-Supervised Learning, 2006.

Propagation refers to the iterative way in which labels are assigned to nodes in the graph and then spread along the edges of the graph to connected nodes.

This procedure is sometimes called label propagation, as it “propagates” labels from the labeled vertices (which are fixed) gradually through the edges to all the unlabeled vertices.

— Page 48, Introduction to Semi-Supervised Learning, 2009.

The process is repeated, either until the labels converge or for a fixed number of iterations, to strengthen the labels assigned to unlabeled examples.

Starting with nodes 1, 2,…,l labeled with their known label (1 or −1) and nodes l + 1,…,n labeled with 0, each node starts to propagate its label to its neighbors, and the process is repeated until convergence.

— Page 194, Semi-Supervised Learning, 2006.
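The procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions rather than the scikit-learn implementation: the six one-dimensional points, the RBF bandwidth of 0.5, and the fixed 100 iterations are all made up for the example.

```python
# sketch: label propagation on a tiny fully connected graph
import numpy as np

# six points on a line: two tight groups, with one labeled point in each group
X = np.array([[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]])
# known labels, with -1 marking unlabeled points
y = np.array([0, -1, -1, -1, -1, 1])
labeled = y != -1
n_classes = 2

# edge weights from an RBF kernel on squared Euclidean distance
sq_dists = (X - X.T) ** 2
weights = np.exp(-sq_dists / 0.5)
# row-normalize so each row gives the share of influence from every neighbor
transition = weights / weights.sum(axis=1, keepdims=True)

# start with one-hot distributions for labeled nodes and uniform for the rest
dist = np.full((len(X), n_classes), 1.0 / n_classes)
dist[labeled] = np.eye(n_classes)[y[labeled]]
clamped = dist[labeled].copy()

for _ in range(100):
    # each node takes the weighted average of its neighbors' label distributions
    dist = transition @ dist
    # labeled nodes are reset (clamped) to their known labels
    dist[labeled] = clamped

# the most probable class for each node
inferred = dist.argmax(axis=1)
print(inferred)
```

The unlabeled points in each group end up with the label of the nearby labeled point, because almost all of their edge weight connects to members of their own group.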

Now that we are familiar with the Label Propagation algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification) and two input variables and 1,000 examples.

    ...
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (i.e. 500 rows in each).

    ...
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

    ...
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

    # prepare semi-supervised learning dataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
    # summarize training set size
    print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
    print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
    # summarize test set size
    print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

    Labeled Train Set: (250, 2) (250,)
    Unlabeled Train Set: (250, 2) (250,)
    Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

    ...
    # define model
    model = LogisticRegression()
    # fit model on labeled dataset
    model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire hold out test dataset and evaluated using classification accuracy.

    ...
    # make predictions on hold out test set
    yhat = model.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

    # baseline performance on the semi-supervised learning dataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
    # define model
    model = LogisticRegression()
    # fit model on labeled dataset
    model.fit(X_train_lab, y_train_lab)
    # make predictions on hold out test set
    yhat = model.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the labeled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.

Accuracy: 84.800

Next, let’s explore how to apply the label propagation algorithm to the dataset.

The Label Propagation algorithm is available in the scikit-learn Python machine learning library via the LabelPropagation class.

The model can be fit just like any other classification model by calling the *fit()* function and used to make predictions for new data via the *predict()* function.

    ...
    # define model
    model = LabelPropagation()
    # fit model on training dataset
    model.fit(..., ...)
    # make predictions on hold out test set
    yhat = model.predict(...)

Importantly, the training dataset provided to the *fit()* function must include labeled examples that are integer encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the “*transduction_*” attribute of the fit *LabelPropagation* model.

    ...
    # get labels for entire training dataset
    tran_labels = model.transduction_

Now that we are familiar with how to use the Label Propagation algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

    ...
    # create the training dataset input
    X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 values (unlabeled), one for each row in the unlabeled portion of the training dataset.

    ...
    # create "no label" for unlabeled data
    nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

    ...
    # recombine training dataset labels
    y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the *LabelPropagation* model on the entire training dataset.

    ...
    # define model
    model = LabelPropagation()
    # fit model on training dataset
    model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

Tying this together, the complete example of evaluating label propagation on the semi-supervised learning dataset is listed below.

    # evaluate label propagation on the semi-supervised learning dataset
    from numpy import concatenate
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.semi_supervised import LabelPropagation
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
    # create the training dataset input
    X_train_mixed = concatenate((X_train_lab, X_test_unlab))
    # create "no label" for unlabeled data
    nolabel = [-1 for _ in range(len(y_test_unlab))]
    # recombine training dataset labels
    y_train_mixed = concatenate((y_train_lab, nolabel))
    # define model
    model = LabelPropagation()
    # fit model on training dataset
    model.fit(X_train_mixed, y_train_mixed)
    # make predictions on hold out test set
    yhat = model.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the label propagation model achieves a classification accuracy of about 85.6 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.600

So far, so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label propagation model as follows:

    ...
    # get labels for entire training dataset
    tran_labels = model.transduction_
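Because the "unlabeled" rows in this tutorial do have true labels that we held back, it is possible to check how accurate the inferred labels are before reusing them. The sketch below is an optional sanity check, not part of the original workflow; the `n_lab` and `inferred_unlab` names are introduced for illustration, and the check relies on the unlabeled rows having been appended after the labeled rows.

```python
# sketch: check the inferred labels against the held-back true labels
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# combine labeled and unlabeled rows, marking unlabeled rows with -1
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))
# fit the label propagation model
model = LabelPropagation()
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset
tran_labels = model.transduction_
# the unlabeled rows were appended last, so they follow the labeled rows
n_lab = len(y_train_lab)
inferred_unlab = tran_labels[n_lab:]
# compare inferred labels to the labels we pretended not to have
score = accuracy_score(y_test_unlab, inferred_unlab)
print('Accuracy of inferred labels: %.3f' % (score*100))
```

On a real semi-supervised problem the true labels would not be available, so this check is only possible in an experimental setup like this one.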

We can then use these labels along with all of the input data to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

    ...
    # define supervised learning model
    model2 = LogisticRegression()
    # fit supervised learning model on entire training dataset
    model2.fit(X_train_mixed, tran_labels)
    # make predictions on hold out test set
    yhat = model2.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

    # evaluate logistic regression fit on label propagation for semi-supervised learning
    from numpy import concatenate
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.semi_supervised import LabelPropagation
    from sklearn.linear_model import LogisticRegression
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
    # create the training dataset input
    X_train_mixed = concatenate((X_train_lab, X_test_unlab))
    # create "no label" for unlabeled data
    nolabel = [-1 for _ in range(len(y_test_unlab))]
    # recombine training dataset labels
    y_train_mixed = concatenate((y_train_lab, nolabel))
    # define model
    model = LabelPropagation()
    # fit model on training dataset
    model.fit(X_train_mixed, y_train_mixed)
    # get labels for entire training dataset
    tran_labels = model.transduction_
    # define supervised learning model
    model2 = LogisticRegression()
    # fit supervised learning model on entire training dataset
    model2.fit(X_train_mixed, tran_labels)
    # make predictions on hold out test set
    yhat = model2.predict(X_test)
    # calculate score for test set
    score = accuracy_score(y_test, yhat)
    # summarize score
    print('Accuracy: %.3f' % (score*100))

Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with inferred labels and evaluates it on the holdout dataset, printing the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that this hierarchical approach of the semi-supervised model followed by supervised model achieves a classification accuracy of about 86.2 percent on the holdout dataset, even better than the semi-supervised learning used alone that achieved an accuracy of about 85.6 percent.

Accuracy: 86.200

**Can you achieve better results by tuning the hyperparameters of the LabelPropagation model?**

Let me know what you discover in the comments below.
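As a possible starting point for this exercise, the sketch below compares a few settings of the k-nearest neighbor kernel. The *LabelPropagation* class also supports an RBF kernel controlled by the “*gamma*” argument; the specific `n_neighbors` values tried here and the `results` dictionary are assumptions for illustration, and no particular outcome is implied.

```python
# sketch: compare a few LabelPropagation kernel settings
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# combine labeled and unlabeled rows, marking unlabeled rows with -1
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))
# evaluate each configuration on the hold out test set
results = dict()
for k in [3, 7, 15, 31]:
    # graph edges connect each example to its k nearest neighbors
    model = LabelPropagation(kernel='knn', n_neighbors=k, max_iter=1000)
    model.fit(X_train_mixed, y_train_mixed)
    yhat = model.predict(X_test)
    results[k] = accuracy_score(y_test, yhat)
    print('>n_neighbors=%d: %.3f' % (k, results[k]*100))
```

A single train/test split gives a noisy estimate, so differences of a fraction of a percent between configurations should be treated with caution.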

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Semi-Supervised Learning, 2009.
- Chapter 11: Label Propagation and Quadratic Criterion, Semi-Supervised Learning, 2006.

- sklearn.semi_supervised.LabelPropagation API.
- Section 1.14. Semi-Supervised, Scikit-Learn User Guide.
- sklearn.model_selection.train_test_split API.
- sklearn.linear_model.LogisticRegression API.
- sklearn.datasets.make_classification API.

In this tutorial, you discovered how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

- An intuition for how the label propagation semi-supervised learning algorithm works.
- How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Histogram-Based Gradient Boosting Ensembles in Python appeared first on Machine Learning Mastery.

Gradient boosting is an ensemble algorithm built from decision trees.

It may be one of the most popular techniques for structured (tabular) classification and regression predictive modeling problems given that it performs so well across a wide range of datasets in practice.

A major problem of gradient boosting is that it is slow to train the model. This is particularly a problem when using the model on large datasets with tens of thousands of examples (rows).

Training the trees that are added to the ensemble can be dramatically accelerated by discretizing (binning) the continuous input variables to a few hundred unique values. Gradient boosting ensembles that implement this technique and tailor the training algorithm around input variables under this transform are referred to as **histogram-based gradient boosting ensembles**.

In this tutorial, you will discover how to develop histogram-based gradient boosting tree ensembles.

After completing this tutorial, you will know:

- Histogram-based gradient boosting is a technique for faster training of the decision trees used in a gradient boosting ensemble.
- How to use the experimental implementation of histogram-based gradient boosting in the scikit-learn library.
- How to use histogram-based gradient boosting ensembles with the XGBoost and LightGBM third-party libraries.

Let’s get started.

This tutorial is divided into four parts; they are:

- Histogram Gradient Boosting
- Histogram Gradient Boosting With Scikit-Learn
- Histogram Gradient Boosting With XGBoost
- Histogram Gradient Boosting With LightGBM

Gradient boosting is an ensemble machine learning algorithm.

Boosting refers to a class of ensemble learning algorithms that add tree models to an ensemble sequentially. Each tree model added to the ensemble attempts to correct the prediction errors made by the tree models already present in the ensemble.

Gradient boosting is a generalization of boosting algorithms like AdaBoost to a statistical framework that treats the training process as an additive model and allows arbitrary loss functions to be used, greatly improving the capability of the technique. As such, gradient boosting ensembles are the go-to technique for most structured (e.g. tabular data) predictive modeling tasks.

Although gradient boosting performs very well in practice, the models can be slow to train. This is because trees must be created and added sequentially, unlike other ensemble models like random forest where ensemble members can be trained in parallel, exploiting multiple CPU cores. As such, a lot of effort has been put into techniques that improve the efficiency of the gradient boosting training algorithm.

Two notable libraries that wrap up many modern efficiency techniques for training gradient boosting algorithms are Extreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM).

One aspect of the training algorithm that can be accelerated is the construction of each decision tree, the speed of which is bounded by the number of examples (rows) and number of features (columns) in the training dataset. Large datasets, e.g. tens of thousands of examples or more, can result in the very slow construction of trees, as a split point on each value of each feature must be considered during the construction of the trees.

If we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT.

— LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.

The construction of decision trees can be sped up significantly by reducing the number of values for continuous input features. This can be achieved by discretization or binning values into a fixed number of buckets. This can reduce the number of unique values for each feature from tens of thousands down to a few hundred.

This allows the decision tree to operate upon the ordinal bucket (an integer) instead of specific values in the training dataset. This coarse approximation of the input data often has little impact on model skill, and sometimes even improves it, while dramatically accelerating the construction of the decision tree.
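The snippet below is a small sketch of this discretization using NumPy alone. It assumes quantile-based bin edges and 255 bins; the libraries discussed later may choose their bin edges differently, so treat this as an illustration of the idea and not a reproduction of any one implementation.

```python
# sketch of discretizing a continuous feature into ordinal integer bins
from numpy import linspace
from numpy import percentile
from numpy import searchsorted
from numpy import unique
from numpy.random import default_rng

# a synthetic continuous feature with tens of thousands of values
rng = default_rng(1)
feature = rng.normal(size=50000)

n_bins = 255  # assumed number of bins
# interior bin edges placed at evenly spaced quantiles (an assumption)
edges = percentile(feature, linspace(0, 100, n_bins + 1)[1:-1])
# map each raw value to the integer index of its bin
binned = searchsorted(edges, feature, side='right')

# compare the number of candidate values before and after binning
print('Unique raw values: %d' % len(unique(feature)))
print('Unique binned values: %d' % len(unique(binned)))
```

After binning, a tree only needs to consider one candidate split per bin boundary for this feature instead of one per unique raw value.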

Additionally, efficient data structures can be used to represent the binning of the input data; for example, histograms can be used and the tree construction algorithm can be further tailored for the efficient use of histograms in the construction of each tree.

These techniques were originally developed in the late 1990s for efficiently developing single decision trees on large datasets, but can be used in ensembles of decision trees, such as gradient boosting.

As such, it is common to refer to a gradient boosting algorithm supporting “*histograms*” in modern machine learning libraries as **histogram-based gradient boosting**.

Instead of finding the split points on the sorted feature values, histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we will develop our work on its basis.

— LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
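The split search described in the quote can be sketched with a few NumPy calls. The example below assumes a single feature already binned into 16 bins, squared error loss, and a simplified gain that omits regularization; the data is synthetic, with a step deliberately placed at bin 8 so there is a known best split to recover. It is an illustration of the idea, not LightGBM's actual implementation.

```python
# sketch of choosing a split point for one feature from a gradient histogram
from numpy import argmax
from numpy import bincount
from numpy import cumsum
from numpy.random import default_rng

rng = default_rng(1)
n_bins = 16  # assumed number of bins
# binned feature values and per-example gradients with a step at bin 8
binned = rng.integers(0, n_bins, size=10000)
gradients = (binned >= 8).astype(float) + rng.normal(scale=0.1, size=10000)

# a single pass over the rows builds the histogram for this feature
grad_sum = bincount(binned, weights=gradients, minlength=n_bins)
count = bincount(binned, minlength=n_bins)

# candidate split b sends bins <= b to the left child
left_sum = cumsum(grad_sum)[:-1]
left_count = cumsum(count)[:-1]
right_sum = grad_sum.sum() - left_sum
right_count = count.sum() - left_count

# simplified gain for squared error (constant terms and regularization omitted)
gain = left_sum ** 2 / left_count + right_sum ** 2 / right_count
best_bin = int(argmax(gain))
print('Best split: bins <= %d go left' % best_bin)
```

Once the histogram is built, only 15 candidate splits are scored for this feature, regardless of how many rows are in the dataset. On this data the search should recover the boundary between bins 7 and 8.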

Now that we are familiar with the idea of adding histograms to the construction of decision trees in gradient boosting, let’s review some common implementations we can use on our predictive modeling projects.

There are three main libraries that support the technique; they are Scikit-Learn, XGBoost, and LightGBM.

Let’s take a closer look at each in turn.

**Note**: We are not racing the algorithms; instead, we are just demonstrating how to configure each implementation to use the histogram method and hold all other unrelated hyperparameters constant at their default values.

The scikit-learn machine learning library provides an experimental implementation of gradient boosting that supports the histogram technique.

Specifically, this is provided in the HistGradientBoostingClassifier and HistGradientBoostingRegressor classes.

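In order to use these classes, you must add an additional line to your project that indicates you are happy to use these experimental techniques and that their behavior may change with subsequent releases of the library. In scikit-learn 1.0 and later, these estimators are no longer experimental and can be imported directly from *sklearn.ensemble* without this line.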

    ...
    # explicitly require this experimental feature
    from sklearn.experimental import enable_hist_gradient_boosting

The scikit-learn documentation claims that these histogram-based implementations of gradient boosting can be orders of magnitude faster than the default gradient boosting implementation provided by the library when the dataset is large.

These histogram-based estimators can be orders of magnitude faster than GradientBoostingClassifier and GradientBoostingRegressor when the number of samples is larger than tens of thousands of samples.

— Histogram-Based Gradient Boosting, Scikit-Learn User Guide.
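We can get a rough feel for this claim by timing a single fit of each class. The snippet below is a sketch only: the dataset size and the number of trees are assumptions kept small so that it runs quickly, and the try/except import is an assumption added so the same code runs on releases both before and after the estimators left the experimental module.

```python
# rough timing sketch of the two scikit-learn gradient boosting classifiers
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
try:
    from sklearn.ensemble import HistGradientBoostingClassifier
except ImportError:
    # older releases require the experimental feature to be enabled first
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier

# small synthetic dataset (sizes are assumptions, increase them to see a larger gap)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=5, random_state=1)

# use the same number of trees in both ensembles
models = {
    'GradientBoostingClassifier': GradientBoostingClassifier(n_estimators=20),
    'HistGradientBoostingClassifier': HistGradientBoostingClassifier(max_iter=20),
}

# time a single fit of each model
timings = dict()
for name, model in models.items():
    start = perf_counter()
    model.fit(X, y)
    timings[name] = perf_counter() - start
    print('>%s %.3f seconds' % (name, timings[name]))
```

The quote above ties the speedup to datasets with more than tens of thousands of samples, so on a dataset this small the difference may be modest or even reversed.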

The classes can be used just like any other scikit-learn model.

By default, the ensemble uses 255 bins for each continuous input feature, and this can be set via the “*max_bins*” argument. Setting this to smaller values, such as 50 or 100, may result in further efficiency improvements, although perhaps at the cost of some model skill.

The number of trees can be set via the “*max_iter*” argument and defaults to 100.

    ...
    # define the model
    model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)

The example below shows how to evaluate a histogram gradient boosting algorithm on a synthetic classification dataset with 10,000 examples and 100 features.

The model is evaluated using repeated stratified k-fold cross-validation and the mean accuracy across all folds and repeats is reported.

    # evaluate sklearn histogram gradient boosting algorithm for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingClassifier
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=100, n_informative=50, n_redundant=50, random_state=1)
    # define the model
    model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model and collect the scores
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report performance
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the model performance on the synthetic dataset and reports the mean and standard deviation of the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the scikit-learn histogram gradient boosting algorithm achieves a mean accuracy of about 94.3 percent on the synthetic dataset.

Accuracy: 0.943 (0.007)
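The HistGradientBoostingRegressor class mentioned earlier follows the same pattern for regression problems. The snippet below is a minimal sketch on a synthetic regression dataset; the dataset sizes and the simple train/test split are assumptions for illustration, and the try/except import is an assumption added so the code runs on both older and newer releases of the library.

```python
# sketch of fitting the histogram gradient boosting regressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
try:
    from sklearn.ensemble import HistGradientBoostingRegressor
except ImportError:
    # older releases require the experimental feature to be enabled first
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingRegressor

# small synthetic regression dataset (sizes are assumptions)
X, y = make_regression(n_samples=2000, n_features=20, n_informative=10, noise=0.1, random_state=1)
# hold back some data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# define the model with the same arguments used for the classifier
model = HistGradientBoostingRegressor(max_bins=255, max_iter=100)
# fit the model and make predictions on the hold out set
model.fit(X_train, y_train)
yhat = model.predict(X_test)
# report the mean absolute error
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
```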

We can also explore the effect of the number of bins on model performance.

The example below evaluates the performance of the model with a different number of bins for each continuous input feature: 10, then 50 to 200 in increments of 50, and finally the default of 255.

The complete example is listed below.

    # compare number of bins for sklearn histogram gradient boosting
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingClassifier
    from matplotlib import pyplot

    # get the dataset
    def get_dataset():
        X, y = make_classification(n_samples=10000, n_features=100, n_informative=50, n_redundant=50, random_state=1)
        return X, y

    # get a list of models to evaluate
    def get_models():
        models = dict()
        for i in [10, 50, 100, 150, 200, 255]:
            models[str(i)] = HistGradientBoostingClassifier(max_bins=i, max_iter=100)
        return models

    # evaluate a given model using cross-validation
    def evaluate_model(model, X, y):
        # define the evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate the model and collect the scores
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
        return scores

    # define dataset
    X, y = get_dataset()
    # get the models to evaluate
    models = get_models()
    # evaluate the models and store results
    results, names = list(), list()
    for name, model in models.items():
        # evaluate the model and collect the scores
        scores = evaluate_model(model, X, y)
        # store the results
        results.append(scores)
        names.append(name)
        # report performance along the way
        print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # plot model performance for comparison
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example evaluates each configuration, reporting the mean and standard deviation of the classification accuracy along the way and finally creating a plot of the distribution of scores.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that increasing the number of bins may decrease the mean accuracy of the model on this dataset, although the differences are small relative to the standard deviation of the scores.

We might expect that an increase in the number of bins may also require an increase in the number of trees (*max_iter*) to ensure that the additional split points can be effectively explored and harnessed by the model.

Importantly, fitting an ensemble where trees use 10 or 50 bins per input variable is dramatically faster than fitting one that uses 255 bins per input variable.

    >10 0.945 (0.009)
    >50 0.944 (0.007)
    >100 0.944 (0.008)
    >150 0.944 (0.008)
    >200 0.944 (0.007)
    >255 0.943 (0.007)

A figure is created comparing the distribution in accuracy scores for each configuration using box and whisker plots.

In this case, we can see that increasing the number of bins in the histogram appears to reduce the spread of the distribution, although it may lower the mean performance of the model.
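The earlier point about fewer bins being faster to fit can be checked with a simple timing loop. The snippet below is a rough sketch: the dataset size and number of trees are assumptions kept small so it finishes quickly, a single timed fit is noisy, and the try/except import is an assumption added so the code runs on both older and newer releases of the library.

```python
# rough timing sketch of fit time for different numbers of bins
from time import perf_counter
from sklearn.datasets import make_classification
try:
    from sklearn.ensemble import HistGradientBoostingClassifier
except ImportError:
    # older releases require the experimental feature to be enabled first
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier

# small synthetic dataset (sizes are assumptions)
X, y = make_classification(n_samples=5000, n_features=50, n_informative=25, n_redundant=25, random_state=1)

# time a single fit for each number of bins
timings = dict()
for n_bins in [10, 50, 255]:
    model = HistGradientBoostingClassifier(max_bins=n_bins, max_iter=50)
    start = perf_counter()
    model.fit(X, y)
    timings[n_bins] = perf_counter() - start
    print('>%d bins: %.3f seconds' % (n_bins, timings[n_bins]))
```

Timings depend heavily on the hardware and the dataset, so it is worth repeating the fit a few times and averaging before drawing conclusions.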

Extreme Gradient Boosting, or XGBoost for short, is a library that provides a highly optimized implementation of gradient boosting.

One of the techniques implemented in the library is the use of histograms for the continuous input variables.

The XGBoost library can be installed using your favorite Python package manager, such as Pip; for example:

sudo pip install xgboost

We can develop XGBoost models for use with the scikit-learn library via the XGBClassifier and XGBRegressor classes.

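The training algorithm can be configured to use a histogram-based approximation by setting the “*tree_method*” argument to ‘*approx*’, and the number of bins can be set via the “*max_bin*” argument. XGBoost also provides a ‘*hist*’ tree method, which its documentation describes as a faster histogram-optimized version of the approximate algorithm; depending on the library version, “*max_bin*” may only take effect with ‘*hist*’.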

    ...
    # define the model
    model = XGBClassifier(tree_method='approx', max_bin=255, n_estimators=100)

The example below demonstrates evaluating an XGBoost model configured to use the histogram or approximate technique for constructing trees with 255 bins per continuous input feature and 100 trees in the model.

    # evaluate xgboost histogram gradient boosting algorithm for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from xgboost import XGBClassifier
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=100, n_informative=50, n_redundant=50, random_state=1)
    # define the model
    model = XGBClassifier(tree_method='approx', max_bin=255, n_estimators=100)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model and collect the scores
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report performance
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the model performance on the synthetic dataset and reports the mean and standard deviation of the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the XGBoost histogram gradient boosting algorithm achieves a mean accuracy of about 95.7 percent on the synthetic dataset.

Accuracy: 0.957 (0.007)

Light Gradient Boosting Machine, or LightGBM for short, is another third-party library like XGBoost that provides a highly optimized implementation of gradient boosting.

It may have implemented the histogram technique before XGBoost, but XGBoost later implemented the same technique, highlighting the “*gradient boosting efficiency*” competition between gradient boosting libraries.

The LightGBM library can be installed using your favorite Python package manager, such as Pip; for example:

sudo pip install lightgbm

We can develop LightGBM models for use with the scikit-learn library via the LGBMClassifier and LGBMRegressor classes.

The training algorithm uses histograms by default. The maximum bins per continuous input variable can be set via the “*max_bin*” argument.

    ...
    # define the model
    model = LGBMClassifier(max_bin=255, n_estimators=100)

The example below demonstrates evaluating a LightGBM model configured to use the histogram or approximate technique for constructing trees with 255 bins per continuous input feature and 100 trees in the model.

    # evaluate lightgbm histogram gradient boosting algorithm for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from lightgbm import LGBMClassifier
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=100, n_informative=50, n_redundant=50, random_state=1)
    # define the model
    model = LGBMClassifier(max_bin=255, n_estimators=100)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model and collect the scores
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report performance
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the model performance on the synthetic dataset and reports the mean and standard deviation of the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the LightGBM histogram gradient boosting algorithm achieves a mean accuracy of about 94.2 percent on the synthetic dataset.

Accuracy: 0.942 (0.006)

This section provides more resources on the topic if you are looking to go deeper.

- How to Develop a Gradient Boosting Machine Ensemble in Python
- Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost

- Sprint: A scalable parallel classifier for data mining, 1996.
- CLOUDS: A decision tree classifier for large datasets, 1998.
- Communication and memory efficient parallel decision tree construction, 2003.
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
- XGBoost: A Scalable Tree Boosting System, 2016.

- sklearn.ensemble.HistGradientBoostingClassifier API.
- sklearn.ensemble.HistGradientBoostingRegressor API.
- XGBoost, Fast Histogram Optimized Grower, 8x to 10x Speedup
- xgboost.XGBClassifier API.
- xgboost.XGBRegressor API.
- lightgbm.LGBMClassifier API.
- lightgbm.LGBMRegressor API.

In this tutorial, you discovered how to develop histogram-based gradient boosting tree ensembles.

Specifically, you learned:

- Histogram-based gradient boosting is a technique for faster training of the decision trees used in the gradient boosting ensemble.
- How to use the experimental implementation of histogram-based gradient boosting in the scikit-learn library.
- How to use histogram-based gradient boosting ensembles with the XGBoost and LightGBM third-party libraries.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Histogram-Based Gradient Boosting Ensembles in Python appeared first on Machine Learning Mastery.

]]>