The post Visualization for Function Optimization in Python appeared first on Machine Learning Mastery.

Optimization algorithms navigate the search space of input variables in order to locate the optima, and both the shape of the objective function and the behavior of the algorithm in the search space are opaque on real-world problems.

As such, it is common to study optimization algorithms using simple low-dimensional functions that can be easily visualized directly. Additionally, the samples an optimization algorithm draws from the input space of these simple functions can be visualized in context.

Visualization of lower-dimensional functions and algorithm behavior on those functions can help to develop the intuitions that can carry over to more complex higher-dimensional function optimization problems later.

In this tutorial, you will discover how to create visualizations for function optimization in Python.

After completing this tutorial, you will know:

- Visualization is an important tool when studying function optimization algorithms.
- How to visualize one-dimensional functions and samples using line plots.
- How to visualize two-dimensional functions and samples using contour and surface plots.

Let’s get started.

This tutorial is divided into three parts; they are:

- Visualization for Function Optimization
- Visualize 1D Function Optimization
    - Test Function
    - Sample Test Function
    - Line Plot of Test Function
    - Scatter Plot of Test Function
    - Line Plot with Marked Optima
    - Line Plot with Samples
- Visualize 2D Function Optimization
    - Test Function
    - Sample Test Function
    - Contour Plot of Test Function
    - Filled Contour Plot of Test Function
    - Filled Contour Plot of Test Function with Samples
    - Surface Plot of Test Function

Function optimization is a field of mathematics concerned with finding the inputs to a function that result in the optimal output for the function, typically a minimum or maximum value.

Optimization may be straightforward for simple differentiable functions where the solution can be calculated analytically. However, most functions we’re interested in solving in applied machine learning may or may not be well behaved and may be complex, nonlinear, multivariate, and non-differentiable.

As such, it is important to have an understanding of a wide range of different algorithms that can be used to address function optimization problems.

An important aspect of studying function optimization is understanding the objective function that is being optimized and understanding the behavior of an optimization algorithm over time.

Visualization plays an important role when getting started with function optimization.

We can select simple and well-understood test functions to study optimization algorithms. These simple functions can be plotted to understand the relationship between the input to the objective function and its output, and to highlight hills, valleys, and optima.

In addition, the samples selected from the search space by an optimization algorithm can also be plotted on top of plots of the objective function. These plots of algorithm behavior can provide insight and intuition into how specific optimization algorithms work and navigate a search space that can generalize to new problems in the future.

Typically, one-dimensional or two-dimensional functions are chosen to study optimization algorithms as they are easy to visualize using standard plots, like line plots and surface plots. We will explore both in this tutorial.

First, let’s explore how we might visualize a one-dimensional function optimization.

A one-dimensional function takes a single input variable and outputs the evaluation of that input variable.

Input variables are typically continuous, represented by a real-valued floating-point value. Often, the input domain is unconstrained, although for test problems we impose a domain of interest.

In this case we will explore function visualization with a simple x^2 objective function:

- f(x) = x^2

This function has its optima (a minimum) at an input of x=0.0, where the output f(x) equals 0.0.

The example below implements this objective function and evaluates a single input.

    # example of a 1d objective function
    # objective function
    def objective(x):
        return x**2.0
    # evaluate inputs to the objective function
    x = 4.0
    result = objective(x)
    print('f(%.3f) = %.3f' % (x, result))

Running the example evaluates the value 4.0 with the objective function, which equals 16.0.

    f(4.000) = 16.000

The first thing we might want to do with a new function is define an input range of interest and sample the domain of interest using a uniform grid.

This sample will provide the basis for generating a plot later.

In this case, we will define a domain of interest around the optima of x=0.0 from x=-5.0 to x=5.0 and sample a grid of values in this range with 0.1 increments, such as -5.0, -4.9, -4.8, etc.

    ...
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # summarize some of the input domain
    print(inputs[:5])
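One detail worth checking is that NumPy's arange() excludes the stop value, so the grid above ends just short of x=5.0. The sketch below confirms this and shows linspace() as one possible alternative if the endpoint is wanted; the choice of 101 points is an assumption that reproduces the 0.1 spacing.

```python
# check the endpoints of the sampled grid
from numpy import arange
from numpy import linspace

r_min, r_max = -5.0, 5.0
# arange() excludes the stop value
inputs = arange(r_min, r_max, 0.1)
print(len(inputs), inputs[0], inputs[-1])
# linspace() includes both endpoints; 101 points gives 0.1 spacing (assumed choice)
inputs_closed = linspace(r_min, r_max, 101)
print(len(inputs_closed), inputs_closed[0], inputs_closed[-1])
```

For plotting, the missing endpoint makes little visible difference, but it matters if the optima or a sample sits exactly on the boundary of the domain.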

We can then evaluate each of the x values in our sample.

    ...
    # compute targets
    results = objective(inputs)
    # summarize some of the results
    print(results[:5])

Finally, we can check some of the inputs and their corresponding outputs.

    ...
    # create a mapping of some inputs to some results
    for i in range(5):
        print('f(%.3f) = %.3f' % (inputs[i], results[i]))

Tying this together, the complete example of sampling the input space and evaluating all points in the sample is listed below.

    # sample 1d objective function
    from numpy import arange
    # objective function
    def objective(x):
        return x**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # summarize some of the input domain
    print(inputs[:5])
    # compute targets
    results = objective(inputs)
    # summarize some of the results
    print(results[:5])
    # create a mapping of some inputs to some results
    for i in range(5):
        print('f(%.3f) = %.3f' % (inputs[i], results[i]))

Running the example first generates a uniform sample of input points as we expected.

The input points are then evaluated using the objective function and finally, we can see a simple mapping of inputs to outputs of the objective function.

    [-5. -4.9 -4.8 -4.7 -4.6]
    [25. 24.01 23.04 22.09 21.16]
    f(-5.000) = 25.000
    f(-4.900) = 24.010
    f(-4.800) = 23.040
    f(-4.700) = 22.090
    f(-4.600) = 21.160

Now that we have some confidence in generating a sample of inputs and evaluating them with the objective function, we can look at generating plots of the function.

We could sample the input space randomly, but the benefit of a uniform line or grid of points is that it can be used to generate a smooth plot.

It is smooth because the points in the input space are ordered from smallest to largest. This ordering is important as we expect (hope) that the output of the objective function has a similar smooth relationship between values, e.g. small changes in input result in locally consistent (smooth) changes in the output of the function.
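To see why the ordering matters, here is a minimal sketch that draws inputs at random (a hypothetical alternative to the uniform grid) and sorts them before they would be passed to a line plot. Without the sort, a line plot would connect the points in the order they were drawn and produce a zig-zag rather than a smooth curve. The sample size of 20 is an arbitrary choice for illustration.

```python
# random sample of a 1d domain, sorted so it can be drawn as a line
from numpy import argsort
from numpy.random import seed
from numpy.random import rand

# objective function
def objective(x):
    return x**2.0

r_min, r_max = -5.0, 5.0
seed(1)
# random inputs arrive in no particular order
inputs = r_min + rand(20) * (r_max - r_min)
results = objective(inputs)
# reorder inputs (and their results) from smallest to largest
order = argsort(inputs)
inputs_sorted, results_sorted = inputs[order], results[order]
print(inputs_sorted[:5])
print(results_sorted[:5])
```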

In this case, we can use the samples to generate a line plot of the objective function with the input points (x) on the x-axis of the plot and the objective function output (results) on the y-axis of the plot.

    ...
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # show the plot
    pyplot.show()

Tying this together, the complete example is listed below.

    # line plot of input vs result for a 1d objective function
    from numpy import arange
    from matplotlib import pyplot
    # objective function
    def objective(x):
        return x**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # show the plot
    pyplot.show()

Running the example creates a line plot of the objective function.

We can see that the function has a large U-shape, called a parabola. This is a common shape when studying curves, e.g. the study of calculus.

The line is a construct. It is not really the function, just a smooth summary of the function. Always keep this in mind.
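A rough way to see the gap between the drawn line and the function is to compare straight-line interpolation between the sampled points with the true function value. The sketch below does this with NumPy's interp() for a coarse and a fine grid; the step sizes of 1.0 and 0.1 and the density of the check points are illustrative assumptions.

```python
# compare the straight-line interpolation a line plot draws with the true function
from numpy import arange
from numpy import interp
from numpy import absolute

# objective function
def objective(x):
    return x**2.0

r_min, r_max = -5.0, 5.0
# dense set of points at which to check the error (kept inside both grids)
check = arange(r_min, r_max - 1.0, 0.01)
errors = {}
for step in [1.0, 0.1]:
    inputs = arange(r_min, r_max, step)
    results = objective(inputs)
    # value the line would show at each check point
    drawn = interp(check, inputs, results)
    errors[step] = absolute(drawn - objective(check)).max()
    print('step=%.1f max error=%.5f' % (step, errors[step]))
```

The finer grid tracks the function far more closely, which is why the 0.1 increment is enough for a smooth-looking plot of this function but may not be for one with sharper features.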

Recall that we, in fact, generated a sample of points in the input space and corresponding evaluation of those points.

As such, it would be more accurate to create a scatter plot of points; for example:

    # scatter plot of input vs result for a 1d objective function
    from numpy import arange
    from matplotlib import pyplot
    # objective function
    def objective(x):
        return x**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # create a scatter plot of input vs result
    pyplot.scatter(inputs, results)
    # show the plot
    pyplot.show()

Running the example creates a scatter plot of the objective function.

We can see the familiar shape of the function, but we don’t gain anything from plotting the points directly.

The line and the smooth interpolation between the points it provides are more useful as we can draw other points on top of the line, such as the location of the optima or the points sampled by an optimization algorithm.

Next, let’s draw the line plot again and this time draw a point where the known optima of the function is located.

This can be helpful when studying an optimization algorithm as we might want to see how close an optimization algorithm can get to the optima.

First, we must define the input for the optima, then evaluate that point to give the x-axis and y-axis values for plotting.

    ...
    # define the known function optima
    optima_x = 0.0
    optima_y = objective(optima_x)

We can then plot this point with any shape or color we like, in this case, a red square.

    ...
    # draw the function optima as a red square
    pyplot.plot([optima_x], [optima_y], 's', color='r')

Tying this together, the complete example of creating a line plot of the function with the optima highlighted by a point is listed below.

    # line plot of input vs result for a 1d objective function and show optima
    from numpy import arange
    from matplotlib import pyplot
    # objective function
    def objective(x):
        return x**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # define the known function optima
    optima_x = 0.0
    optima_y = objective(optima_x)
    # draw the function optima as a red square
    pyplot.plot([optima_x], [optima_y], 's', color='r')
    # show the plot
    pyplot.show()

Running the example creates the familiar line plot of the function, and this time, the optima of the function, e.g. the input that results in the minimum output of the function, is marked with a red square.

This is a very simple function and the red square for the optima is easy to see.

Sometimes the function might be more complex, with lots of hills and valleys, and we might want to make the optima more visible.

In this case, we can draw a vertical line across the whole plot.

    ...
    # draw a vertical line at the optimal input
    pyplot.axvline(x=optima_x, ls='--', color='red')

Tying this together, the complete example is listed below.

    # line plot of input vs result for a 1d objective function and show optima as line
    from numpy import arange
    from matplotlib import pyplot
    # objective function
    def objective(x):
        return x**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # define the known function optima
    optima_x = 0.0
    # draw a vertical line at the optimal input
    pyplot.axvline(x=optima_x, ls='--', color='red')
    # show the plot
    pyplot.show()

Running the example creates the same plot and this time draws a red line clearly marking the point in the input space that marks the optima.

Finally, we might want to draw the samples of the input space selected by an optimization algorithm.

We will simulate these samples with random points drawn from the input domain.

    ...
    # simulate a sample made by an optimization algorithm
    seed(1)
    sample = r_min + rand(10) * (r_max - r_min)
    # evaluate the sample
    sample_eval = objective(sample)

We can then plot this sample, in this case using small black circles.

    ...
    # plot the sample as black circles
    pyplot.plot(sample, sample_eval, 'o', color='black')

The complete example of creating a line plot of a function with the optima marked by a red line and an algorithm sample drawn with small black dots is listed below.

    # line plot of domain for a 1d function with optima and algorithm sample
    from numpy import arange
    from numpy.random import seed
    from numpy.random import rand
    from matplotlib import pyplot
    # objective function
    def objective(x):
        return x**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = objective(inputs)
    # simulate a sample made by an optimization algorithm
    seed(1)
    sample = r_min + rand(10) * (r_max - r_min)
    # evaluate the sample
    sample_eval = objective(sample)
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # define the known function optima
    optima_x = 0.0
    # draw a vertical line at the optimal input
    pyplot.axvline(x=optima_x, ls='--', color='red')
    # plot the sample as black circles
    pyplot.plot(sample, sample_eval, 'o', color='black')
    # show the plot
    pyplot.show()

Running the example creates the line plot of the domain and marks the optima with a red line as before.

This time, the sample from the domain selected by an algorithm (really a random sample of points) is drawn with black dots.

We can imagine that a real optimization algorithm will show points narrowing in on the optima as it searches downhill from a starting point.
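As a rough illustration of that narrowing, the sketch below swaps the random sample for a plain gradient descent run on the same x^2 function. The derivative 2x, the starting point of 4.0, the step size of 0.2, and the ten iterations are all assumptions chosen for this example and are not part of the plots above.

```python
# trace of a simple gradient descent run on the 1d objective (illustrative)
# objective function
def objective(x):
    return x**2.0

# derivative of the objective function x^2
def derivative(x):
    return 2.0 * x

x = 4.0           # assumed starting point
step_size = 0.2   # assumed learning rate
sample = [x]
for _ in range(10):
    # move against the gradient
    x = x - step_size * derivative(x)
    sample.append(x)
# evaluate every visited point
sample_eval = [objective(v) for v in sample]
for v, e in zip(sample, sample_eval):
    print('f(%.3f) = %.3f' % (v, e))
```

The sample and sample_eval lists can be passed to pyplot.plot(sample, sample_eval, 'o', color='black') in place of the random sample to see the points cluster toward x=0.0.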

Next, let’s look at how we might perform similar visualizations for the optimization of a two-dimensional function.

A two-dimensional function is a function that takes two input variables, e.g. *x* and *y*.

We can use the same *x^2* function and scale it up to be a two-dimensional function; for example:

- f(x, y) = x^2 + y^2

This function has its optima (a minimum) at an input of [x=0.0, y=0.0], where the output equals 0.0.

The example below implements this objective function and evaluates a single input.

    # example of a 2d objective function
    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0
    # evaluate inputs to the objective function
    x = 4.0
    y = 4.0
    result = objective(x, y)
    print('f(%.3f, %.3f) = %.3f' % (x, y, result))

Running the example evaluates the point [x=4, y=4], which equals 32.

    f(4.000, 4.000) = 32.000

Next, we need a way to sample the domain so that we can, in turn, sample the objective function.

A common way for sampling a two-dimensional function is to first generate a uniform sample along each variable, *x* and *y*, then use these two uniform samples to create a grid of samples, called a mesh grid.

This is not a two-dimensional array across the input space; instead, it is two two-dimensional arrays that, when used together, define a grid across the two input variables.

This is achieved by duplicating the entire *x* sample array for each *y* sample point and similarly duplicating the entire *y* sample array for each *x* sample point.

This can be achieved using the meshgrid() NumPy function; for example:

    ...
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    xaxis = arange(r_min, r_max, 0.1)
    yaxis = arange(r_min, r_max, 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # summarize some of the input domain
    print(x[:5, :5])
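To make the duplication concrete, a tiny mesh is easier to read than the full 100x100 grid. The sketch below uses three x values and two y values, chosen purely for illustration.

```python
# small mesh grid to show how the axis samples are duplicated
from numpy import array
from numpy import meshgrid

# illustrative axis samples (not the tutorial's domain)
xaxis = array([0.0, 1.0, 2.0])
yaxis = array([10.0, 20.0])
x, y = meshgrid(xaxis, yaxis)
# both arrays have one row per y value and one column per x value
print(x.shape, y.shape)
# each row of x repeats the whole x axis
print(x)
# each column of y repeats the whole y axis
print(y)
```

Pairing x[i, j] with y[i, j] gives one coordinate of the grid, which is why the objective function can be applied to both arrays at once.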

We can then evaluate each pair of points using our objective function.

    ...
    # compute targets
    results = objective(x, y)
    # summarize some of the results
    print(results[:5, :5])

Finally, we can review the mapping of some of the inputs to their corresponding output values.

    ...
    # create a mapping of some inputs to some results
    for i in range(5):
        print('f(%.3f, %.3f) = %.3f' % (x[i,0], y[i,0], results[i,0]))

The example below demonstrates how we can create a uniform sample grid across the two-dimensional input space and objective function.

    # sample 2d objective function
    from numpy import arange
    from numpy import meshgrid
    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    xaxis = arange(r_min, r_max, 0.1)
    yaxis = arange(r_min, r_max, 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # summarize some of the input domain
    print(x[:5, :5])
    # compute targets
    results = objective(x, y)
    # summarize some of the results
    print(results[:5, :5])
    # create a mapping of some inputs to some results
    for i in range(5):
        print('f(%.3f, %.3f) = %.3f' % (x[i,0], y[i,0], results[i,0]))

Running the example first summarizes some points in the mesh grid, then the objective function evaluation for some points.

Finally, we enumerate coordinates in the two-dimensional input space and their corresponding function evaluation.

    [[-5. -4.9 -4.8 -4.7 -4.6]
     [-5. -4.9 -4.8 -4.7 -4.6]
     [-5. -4.9 -4.8 -4.7 -4.6]
     [-5. -4.9 -4.8 -4.7 -4.6]
     [-5. -4.9 -4.8 -4.7 -4.6]]
    [[50. 49.01 48.04 47.09 46.16]
     [49.01 48.02 47.05 46.1 45.17]
     [48.04 47.05 46.08 45.13 44.2 ]
     [47.09 46.1 45.13 44.18 43.25]
     [46.16 45.17 44.2 43.25 42.32]]
    f(-5.000, -5.000) = 50.000
    f(-5.000, -4.900) = 49.010
    f(-5.000, -4.800) = 48.040
    f(-5.000, -4.700) = 47.090
    f(-5.000, -4.600) = 46.160

Now that we are familiar with how to sample the input space and evaluate points, let’s look at how we might plot the function.

A popular plot for two-dimensional functions is a contour plot.

This plot creates a flat representation of the objective function outputs for each x and y coordinate where the color and contour lines indicate the relative value or height of the output of the objective function.

This is just like a contour map of a landscape where mountains can be distinguished from valleys.

This can be achieved using the contour() Matplotlib function that takes the mesh grid and the evaluation of the mesh grid as input directly.

We can then specify the number of levels to draw on the contour and the color scheme to use. In this case, we will use 50 levels and a popular “*jet*” color scheme where low-levels use a cold color scheme (blue) and high-levels use a hot color scheme (red).

    ...
    # create a contour plot with 50 levels and jet color scheme
    pyplot.contour(x, y, results, 50, alpha=1.0, cmap='jet')
    # show the plot
    pyplot.show()

Tying this together, the complete example of creating a contour plot of the two-dimensional objective function is listed below.

    # contour plot for 2d objective function
    from numpy import arange
    from numpy import meshgrid
    from matplotlib import pyplot
    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    xaxis = arange(r_min, r_max, 0.1)
    yaxis = arange(r_min, r_max, 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a contour plot with 50 levels and jet color scheme
    pyplot.contour(x, y, results, 50, alpha=1.0, cmap='jet')
    # show the plot
    pyplot.show()

Running the example creates the contour plot.

We can see that the steeper parts of the surface around the edges have more closely spaced contours, and the flatter part of the surface in the middle has contours that are spaced further apart.

We can see that the lowest part of the domain is the middle, as expected.

It is also helpful to color the plot between the contours to show a more complete surface.

Again, the colors are just a simple linear interpolation, not the true function evaluation. This must be kept in mind on more complex functions where fine detail will not be shown.

We can fill the contour plot using the contourf() version of the function that takes the same arguments.

    ...
    # create a filled contour plot with 50 levels and jet color scheme
    pyplot.contourf(x, y, results, levels=50, cmap='jet')

We can also show the optima on the plot, in this case as a white star that will stand out against the blue background color of the lowest part of the plot.

    ...
    # define the known function optima
    optima_x = [0.0, 0.0]
    # draw the function optima as a white star
    pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')

Tying this together, the complete example of a filled contour plot with the optima marked is listed below.

    # filled contour plot for 2d objective function and show the optima
    from numpy import arange
    from numpy import meshgrid
    from matplotlib import pyplot
    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    xaxis = arange(r_min, r_max, 0.1)
    yaxis = arange(r_min, r_max, 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a filled contour plot with 50 levels and jet color scheme
    pyplot.contourf(x, y, results, levels=50, cmap='jet')
    # define the known function optima
    optima_x = [0.0, 0.0]
    # draw the function optima as a white star
    pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
    # show the plot
    pyplot.show()

Running the example creates the filled contour plot that gives a better idea of the shape of the objective function.

The optima at [x=0, y=0] is then marked clearly with a white star.

We may want to show the progress of an optimization algorithm to get an idea of its behavior in the context of the shape of the objective function.

In this case, we can simulate the points chosen by an optimization algorithm with random coordinates in the input space.

    ...
    # simulate a sample made by an optimization algorithm
    seed(1)
    sample_x = r_min + rand(10) * (r_max - r_min)
    sample_y = r_min + rand(10) * (r_max - r_min)

These points can then be plotted directly as black circles and their context color can give an idea of their relative quality.

    ...
    # plot the sample as black circles
    pyplot.plot(sample_x, sample_y, 'o', color='black')

Tying this together, the complete example of a filled contour plot with the optima and the input sample plotted is listed below.

    # filled contour plot for 2d objective function and show the optima and sample
    from numpy import arange
    from numpy import meshgrid
    from numpy.random import seed
    from numpy.random import rand
    from matplotlib import pyplot
    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    xaxis = arange(r_min, r_max, 0.1)
    yaxis = arange(r_min, r_max, 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # simulate a sample made by an optimization algorithm
    seed(1)
    sample_x = r_min + rand(10) * (r_max - r_min)
    sample_y = r_min + rand(10) * (r_max - r_min)
    # create a filled contour plot with 50 levels and jet color scheme
    pyplot.contourf(x, y, results, levels=50, cmap='jet')
    # define the known function optima
    optima_x = [0.0, 0.0]
    # draw the function optima as a white star
    pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
    # plot the sample as black circles
    pyplot.plot(sample_x, sample_y, 'o', color='black')
    # show the plot
    pyplot.show()

Running the example, we can see the filled contour plot as before with the optima marked.

We can now see the sample drawn as black dots and their surrounding color and relative distance to the optima gives an idea of how close the algorithm (random points in this case) got to solving the problem.
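The same judgment can be made numerically. As a sketch, the snippet below evaluates the simulated sample with the objective function and reports the best point; using argmin() to pick it is an illustrative choice, not something the plot requires.

```python
# find the best point in the simulated sample
from numpy import argmin
from numpy.random import seed
from numpy.random import rand

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

r_min, r_max = -5.0, 5.0
# simulate a sample made by an optimization algorithm
seed(1)
sample_x = r_min + rand(10) * (r_max - r_min)
sample_y = r_min + rand(10) * (r_max - r_min)
# evaluate every sampled point
sample_eval = objective(sample_x, sample_y)
# index of the sample with the lowest objective value
best = argmin(sample_eval)
print('best: f(%.3f, %.3f) = %.3f' % (sample_x[best], sample_y[best], sample_eval[best]))
```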

Finally, we may want to create a three-dimensional plot of the objective function to get a fuller idea of the curvature of the function.

This can be achieved using the plot_surface() Matplotlib function, that, like the contour plot, takes the mesh grid and function evaluation directly.

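    ...
    # create a surface plot with the jet color scheme
    figure = pyplot.figure()
    # add_subplot() is used because recent Matplotlib versions no longer accept gca(projection='3d')
    axis = figure.add_subplot(projection='3d')
    axis.plot_surface(x, y, results, cmap='jet')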

The complete example of creating a surface plot is listed below.

    # surface plot for 2d objective function
    from numpy import arange
    from numpy import meshgrid
    from matplotlib import pyplot
    # registers the 3d projection on older Matplotlib versions
    from mpl_toolkits.mplot3d import Axes3D
    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0
    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    xaxis = arange(r_min, r_max, 0.1)
    yaxis = arange(r_min, r_max, 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a surface plot with the jet color scheme
    figure = pyplot.figure()
    # add_subplot() is used because recent Matplotlib versions no longer accept gca(projection='3d')
    axis = figure.add_subplot(projection='3d')
    axis.plot_surface(x, y, results, cmap='jet')
    # show the plot
    pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

Additionally, the plot is interactive, meaning that you can use the mouse to drag the perspective on the surface around and view it from different angles.

This section provides more resources on the topic if you are looking to go deeper.

- Optimization and root finding (scipy.optimize)
- Optimization (scipy.optimize)
- numpy.meshgrid API.
- matplotlib.pyplot.contour API.
- matplotlib.pyplot.contourf API.
- mpl_toolkits.mplot3d.Axes3D.plot_surface API.

In this tutorial, you discovered how to create visualizations for function optimization in Python.

Specifically, you learned:

- Visualization is an important tool when studying function optimization algorithms.
- How to visualize one-dimensional functions and samples using line plots.
- How to visualize two-dimensional functions and samples using contour and surface plots.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Code Adam Gradient Descent Optimization From Scratch appeared first on Machine Learning Mastery.

A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like AdaGrad and RMSProp update the algorithm to use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

The **Adaptive Moment Estimation** algorithm, or **Adam** for short, is an extension to gradient descent and a natural successor to techniques like AdaGrad and RMSProp. It automatically adapts a learning rate for each input variable of the objective function and further smooths the search process by using an exponentially decaying moving average of the gradient to make updates to variables.

In this tutorial, you will discover how to develop gradient descent with the Adam optimization algorithm from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adam.
- How to implement the Adam optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gradient Descent
- Adam Optimization Algorithm
- Gradient Descent With Adam
    - Two-Dimensional Test Problem
    - Gradient Descent Optimization With Adam
    - Visualization of Adam

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

- First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “*derivative*,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.
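As a small check of this idea, the sketch below estimates the gradient of a two-dimensional function with central finite differences and compares it to the analytical gradient. The test function f(x, y) = x^2 + y^2, its gradient [2x, 2y], the point [1, 2], and the difference width h are all assumptions made for the example.

```python
# numerically estimate a gradient and compare with the analytical form
# assumed test function
def objective(x, y):
    return x**2.0 + y**2.0

# analytical partial derivatives of x^2 + y^2
def gradient(x, y):
    return [2.0 * x, 2.0 * y]

# central difference estimate for each input variable
def numerical_gradient(f, x, y, h=1e-5):
    dx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return [dx, dy]

point = (1.0, 2.0)
print(gradient(*point))
print(numerical_gradient(objective, *point))
# a small step along the gradient increases the function value
g = gradient(*point)
uphill = objective(point[0] + 0.01 * g[0], point[1] + 0.01 * g[1])
print(uphill > objective(*point))
```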

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t) = x(t-1) - step_size * f'(x(t-1))

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**(*alpha*): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
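The effect of the step size can be seen on a simple one-dimensional function. The sketch below applies the update rule to f(x) = x^2 with three step sizes; the function, the starting point, the iteration count, and the specific values 0.001, 0.1 and 1.1 are assumptions picked to illustrate slow progress, steady convergence, and overshooting.

```python
# effect of step size on gradient descent for f(x) = x^2 (illustrative)
# derivative of the assumed objective function x^2
def derivative(x):
    return 2.0 * x

# run a fixed number of gradient descent updates and return the final point
def gradient_descent(x, step_size, n_iter):
    for _ in range(n_iter):
        # x(t) = x(t-1) - step_size * f'(x(t-1))
        x = x - step_size * derivative(x)
    return x

start = 4.0
for step_size in [0.001, 0.1, 1.1]:
    final = gradient_descent(start, step_size, 20)
    print('step_size=%.3f final x=%.5f' % (step_size, final))
```

With the smallest step size the search barely moves from the starting point, while the largest one overshoots the minimum on every step and ends up further away than where it began.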

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the Adam algorithm.

The Adaptive Moment Estimation algorithm, or Adam for short, is an extension to the gradient descent optimization algorithm.

The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A Method for Stochastic Optimization.”

Adam is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

This is achieved by calculating a step size for each input parameter that is being optimized. Importantly, each step size is automatically adapted throughout the search process based on the gradients (partial derivatives) encountered for each variable.

We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation

— Adam: A Method for Stochastic Optimization

This involves maintaining a first and second moment of the gradient, e.g. an exponentially decaying mean gradient (first moment) and variance (second moment) for each input variable.

The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient.

— Adam: A Method for Stochastic Optimization

Let’s step through each element of the algorithm.

First, we must maintain a first moment vector and a second moment vector for each parameter being optimized as part of the search, referred to as m and v respectively. They are initialized to 0.0 at the start of the search.

- m = 0
- v = 0

The algorithm is executed iteratively over time t starting at *t=1*, and each iteration involves calculating a new set of parameter values *x*, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easiest to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradient (the partial derivatives) is calculated for the current time step.

- g(t) = f'(x(t-1))

Next, the first moment is updated using the gradient and a hyperparameter *beta1*.

- m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)

Then the second moment is updated using the squared gradient and a hyperparameter *beta2*.

- v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2

The first and second moments are biased because they are initialized with zero values.

… these moving averages are initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the betas are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates …

— Adam: A Method for Stochastic Optimization

Next, the first and second moments are bias-corrected, starting with the first moment:

- mhat(t) = m(t) / (1 - beta1(t))

And then the second moment:

- vhat(t) = v(t) / (1 - beta2(t))

Note, *beta1(t)* and *beta2(t)* refer to the beta1 and beta2 hyperparameters that are decayed on a schedule over the iterations of the algorithm. A static decay schedule can be used, although the paper recommends the following:

- beta1(t) = beta1^t
- beta2(t) = beta2^t
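To see why the correction matters, consider a gradient that is constant over time. An unbiased average of a constant gradient should equal that gradient, yet the raw moving average starts near zero. The sketch below is a minimal illustration, assuming a fixed gradient of 1.0 and beta1 = 0.9; the variable names are chosen for this example only.

```python
# constant gradient, so an unbiased moving average should equal g
g = 1.0
beta1 = 0.9
m = 0.0
history = []
for t in range(1, 6):
    # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
    m = beta1 * m + (1.0 - beta1) * g
    # mhat(t) = m(t) / (1 - beta1^t)
    mhat = m / (1.0 - beta1**t)
    history.append((t, m, mhat))
    print('t=%d m=%.5f mhat=%.5f' % (t, m, mhat))
```

The raw value of m starts at about 0.1 and only slowly approaches 1.0, whereas the bias-corrected mhat recovers the true gradient from the first iteration.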

Finally, we can calculate the value for the parameter for this iteration.

- x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)

Where *alpha* is the step size hyperparameter, *eps* is a small value (*epsilon*) such as 1e-8 that ensures we do not encounter a divide by zero error, and *sqrt()* is the square root function.

Note, a more efficient reordering of the update rule listed in the paper can be used:

- alpha(t) = alpha * sqrt(1 - beta2(t)) / (1 - beta1(t))
- x(t) = x(t-1) - alpha(t) * m(t) / (sqrt(v(t)) + eps)
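The two forms of the update can be compared numerically. The sketch below assumes an arbitrary, made-up sequence of gradients and the hyperparameter values alpha = 0.02, beta1 = 0.8 and beta2 = 0.999; it computes the step from both forms at each iteration. The two are not bit-for-bit identical because *eps* ends up scaled differently in the denominator, so the comparison uses a tolerance.

```python
from math import sqrt, isclose

# assumed hyperparameters and an arbitrary gradient sequence for illustration
alpha, beta1, beta2, eps = 0.02, 0.8, 0.999, 1e-8
gradients = [1.0, 0.8, -0.3, 0.5, 0.1]

m, v = 0.0, 0.0
steps = []
for t, g in enumerate(gradients, start=1):
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g**2
    # form 1: bias-correct the moments, then compute the step
    mhat = m / (1.0 - beta1**t)
    vhat = v / (1.0 - beta2**t)
    step_a = alpha * mhat / (sqrt(vhat) + eps)
    # form 2: fold the bias correction into the step size alpha(t)
    alpha_t = alpha * sqrt(1.0 - beta2**t) / (1.0 - beta1**t)
    step_b = alpha_t * m / (sqrt(v) + eps)
    steps.append((step_a, step_b))

# the two forms agree up to the small effect of eps
print(all(isclose(a, b, rel_tol=1e-5) for a, b in steps))
```

The reordered form avoids computing mhat and vhat for every parameter, which is where the efficiency gain comes from when the update is applied to large parameter vectors.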

To review, there are three hyperparameters for the algorithm; they are:

- **alpha**: Initial step size (learning rate), a typical value is 0.001.
- **beta1**: Decay factor for the first moment, a typical value is 0.9.
- **beta2**: Decay factor for the second moment, a typical value is 0.999.

And that’s it.

For the full derivation of the Adam algorithm, I recommend reading the paper.

Next, let’s look at how we might implement the algorithm from scratch in Python.

In this section, we will explore how to implement the gradient descent optimization algorithm with Adam.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0

We can create a three-dimensional plot of the function to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

    # 3d plot of the test function
    from numpy import arange
    from numpy import meshgrid
    from matplotlib import pyplot

    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0

    # define range for input
    r_min, r_max = -1.0, 1.0
    # sample input range uniformly at 0.1 increments
    xaxis = arange(r_min, r_max, 0.1)
    yaxis = arange(r_min, r_max, 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a surface plot with the jet color scheme
    # (add_subplot is used because gca(projection=...) was removed from recent Matplotlib)
    figure = pyplot.figure()
    axis = figure.add_subplot(projection='3d')
    axis.plot_surface(x, y, results, cmap='jet')
    # show the plot
    pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

    # contour plot of the test function
    from numpy import asarray
    from numpy import arange
    from numpy import meshgrid
    from matplotlib import pyplot

    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0

    # define range for input
    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    # sample input range uniformly at 0.1 increments
    xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
    yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a filled contour plot with 50 levels and jet color scheme
    pyplot.contourf(x, y, results, levels=50, cmap='jet')
    # show the plot
    pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the Adam optimization algorithm.

We can apply the gradient descent with Adam to the test problem.

First, we need a function that calculates the derivative for this function.

- f(x) = x^2
- f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function implements this below.

    # derivative of objective function
    def derivative(x, y):
        return asarray([x * 2.0, y * 2.0])
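A hand-derived gradient can be sanity checked against a numerical estimate. The sketch below is an optional check rather than part of the tutorial: it redefines simplified copies of *objective()* and *derivative()* so that it runs on its own, and the helper `numerical_gradient()`, the test point and the step `h` are assumptions made for illustration.

```python
# simplified copies of the tutorial functions so the check is self-contained
def objective(x, y):
    return x**2.0 + y**2.0

def derivative(x, y):
    return [x * 2.0, y * 2.0]

# central finite-difference estimate of the gradient (illustrative helper)
def numerical_gradient(f, x, y, h=1e-6):
    dx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return [dx, dy]

# compare the two at an arbitrary point
point = (0.3, -0.7)
analytic = derivative(*point)
numeric = numerical_gradient(objective, *point)
print('analytic: %s' % analytic)
print('numeric:  %s' % numeric)
```

If the two results disagree by more than a small numerical error, the analytic derivative is probably wrong, which is worth ruling out before debugging the optimizer itself.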

Next, we can implement gradient descent optimization.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

    ...
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])

Next, we need to initialize the first and second moments to zero.

    ...
    # initialize first and second moments
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm defined by the “*n_iter*” hyperparameter.

    ...
    # run iterations of gradient descent
    for t in range(n_iter):
        ...

The first step is to calculate the gradient (the partial derivatives) for the current set of parameters using the *derivative()* function.

    ...
    # calculate gradient g(t)
    g = derivative(x[0], x[1])

Next, we need to perform the Adam update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

    ...
    # build a solution one variable at a time
    for i in range(x.shape[0]):
        ...

First, we need to calculate the first moment.

    ...
    # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
    m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]

Then the second moment.

    ...
    # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
    v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2

Then the bias correction for the first and second moments.

    ...
    # mhat(t) = m(t) / (1 - beta1(t))
    mhat = m[i] / (1.0 - beta1**(t+1))
    # vhat(t) = v(t) / (1 - beta2(t))
    vhat = v[i] / (1.0 - beta2**(t+1))

Then finally the updated variable value.

    ...
    # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
    x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)

This is then repeated for each parameter that is being optimized.

At the end of the iteration we can evaluate the new parameter values and report the performance of the search.

    ...
    # evaluate candidate point
    score = objective(x[0], x[1])
    # report progress
    print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function named *adam()* that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

This complete function is listed below.

    # gradient descent algorithm with adam
    def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
        # generate an initial point
        x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        score = objective(x[0], x[1])
        # initialize first and second moments
        m = [0.0 for _ in range(bounds.shape[0])]
        v = [0.0 for _ in range(bounds.shape[0])]
        # run the gradient descent updates
        for t in range(n_iter):
            # calculate gradient g(t)
            g = derivative(x[0], x[1])
            # build a solution one variable at a time
            for i in range(x.shape[0]):
                # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
                m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
                # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
                v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
                # mhat(t) = m(t) / (1 - beta1(t))
                mhat = m[i] / (1.0 - beta1**(t+1))
                # vhat(t) = v(t) / (1 - beta2(t))
                vhat = v[i] / (1.0 - beta2**(t+1))
                # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
                x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
            # evaluate candidate point
            score = objective(x[0], x[1])
            # report progress
            print('>%d f(%s) = %.5f' % (t, x, score))
        return [x, score]

**Note**: we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.
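As a rough idea of what a vectorized version might look like, the sketch below updates all variables at once with NumPy arrays. It is one possible adaptation under stated assumptions, not the tutorial's implementation: the name `adam_vectorized()` is hypothetical, the objective and derivative are rewritten to take a single array, the iteration counter runs from 1 so no `t+1` is needed, progress printing is dropped, and the starting point comes from NumPy's `default_rng()` rather than the seeded `rand()`, so the search path will differ.

```python
import numpy as np

# objective and derivative rewritten to accept one array of inputs (assumption)
def objective(x):
    return float(np.sum(x**2.0))

def derivative(x):
    return 2.0 * x

# vectorized sketch of gradient descent with adam (hypothetical helper)
def adam_vectorized(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8, rng=None):
    rng = np.random.default_rng(1) if rng is None else rng
    # generate an initial point inside the bounds
    x = bounds[:, 0] + rng.random(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize first and second moments as arrays of zeros
    m = np.zeros(len(bounds))
    v = np.zeros(len(bounds))
    for t in range(1, n_iter + 1):
        g = derivative(x)
        # update both moments for all variables at once
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        # bias-correct and take the step
        mhat = m / (1.0 - beta1**t)
        vhat = v / (1.0 - beta2**t)
        x = x - alpha * mhat / (np.sqrt(vhat) + eps)
    return x, objective(x)

bounds = np.asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = adam_vectorized(objective, derivative, bounds, 60, 0.02, 0.8, 0.999)
print('f(%s) = %f' % (best, score))
```

The inner loop over variables disappears because each arithmetic operation is applied element-wise, which matters more as the number of variables grows.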

We can then define our hyperparameters and call the *adam()* function to optimize our test objective function.

In this case, we will use 60 iterations of the algorithm with an initial step size of 0.02 and beta1 and beta2 values of 0.8 and 0.999 respectively. These hyperparameter values were found after a little trial and error.

    ...
    # seed the pseudo random number generator
    seed(1)
    # define range for input
    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    # define the total iterations
    n_iter = 60
    # step size
    alpha = 0.02
    # factor for average gradient
    beta1 = 0.8
    # factor for average squared gradient
    beta2 = 0.999
    # perform the gradient descent search with adam
    best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
    print('Done!')
    print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with Adam is listed below.

    # gradient descent optimization with adam for a two-dimensional test function
    from math import sqrt
    from numpy import asarray
    from numpy.random import rand
    from numpy.random import seed

    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0

    # derivative of objective function
    def derivative(x, y):
        return asarray([x * 2.0, y * 2.0])

    # gradient descent algorithm with adam
    def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
        # generate an initial point
        x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        score = objective(x[0], x[1])
        # initialize first and second moments
        m = [0.0 for _ in range(bounds.shape[0])]
        v = [0.0 for _ in range(bounds.shape[0])]
        # run the gradient descent updates
        for t in range(n_iter):
            # calculate gradient g(t)
            g = derivative(x[0], x[1])
            # build a solution one variable at a time
            for i in range(x.shape[0]):
                # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
                m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
                # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
                v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
                # mhat(t) = m(t) / (1 - beta1(t))
                mhat = m[i] / (1.0 - beta1**(t+1))
                # vhat(t) = v(t) / (1 - beta2(t))
                vhat = v[i] / (1.0 - beta2**(t+1))
                # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
                x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
            # evaluate candidate point
            score = objective(x[0], x[1])
            # report progress
            print('>%d f(%s) = %.5f' % (t, x, score))
        return [x, score]

    # seed the pseudo random number generator
    seed(1)
    # define range for input
    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    # define the total iterations
    n_iter = 60
    # step size
    alpha = 0.02
    # factor for average gradient
    beta1 = 0.8
    # factor for average squared gradient
    beta2 = 0.999
    # perform the gradient descent search with adam
    best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
    print('Done!')
    print('f(%s) = %f' % (best, score))

Running the example applies the Adam optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 53 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

    ...
    >50 f([-0.00056912 -0.00321961]) = 0.00001
    >51 f([-0.00052452 -0.00286514]) = 0.00001
    >52 f([-0.00043908 -0.00251304]) = 0.00001
    >53 f([-0.0003283 -0.00217044]) = 0.00000
    >54 f([-0.00020731 -0.00184302]) = 0.00000
    >55 f([-8.95352320e-05 -1.53514076e-03]) = 0.00000
    >56 f([ 1.43050285e-05 -1.25002847e-03]) = 0.00000
    >57 f([ 9.67123406e-05 -9.89850279e-04]) = 0.00000
    >58 f([ 0.00015359 -0.00075587]) = 0.00000
    >59 f([ 0.00018407 -0.00054858]) = 0.00000
    Done!
    f([ 0.00018407 -0.00054858]) = 0.000000

We can plot the progress of the Adam search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *adam()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

    # gradient descent algorithm with adam
    def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
        solutions = list()
        # generate an initial point
        x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        score = objective(x[0], x[1])
        # initialize first and second moments
        m = [0.0 for _ in range(bounds.shape[0])]
        v = [0.0 for _ in range(bounds.shape[0])]
        # run the gradient descent updates
        for t in range(n_iter):
            # calculate gradient g(t)
            g = derivative(x[0], x[1])
            # build a solution one variable at a time
            for i in range(bounds.shape[0]):
                # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
                m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
                # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
                v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
                # mhat(t) = m(t) / (1 - beta1(t))
                mhat = m[i] / (1.0 - beta1**(t+1))
                # vhat(t) = v(t) / (1 - beta2(t))
                vhat = v[i] / (1.0 - beta2**(t+1))
                # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
                x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
            # evaluate candidate point
            score = objective(x[0], x[1])
            # keep track of solutions
            solutions.append(x.copy())
            # report progress
            print('>%d f(%s) = %.5f' % (t, x, score))
        return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

    ...
    # seed the pseudo random number generator
    seed(1)
    # define range for input
    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    # define the total iterations
    n_iter = 60
    # step size
    alpha = 0.02
    # factor for average gradient
    beta1 = 0.8
    # factor for average squared gradient
    beta2 = 0.999
    # perform the gradient descent search with adam
    solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

    ...
    # sample input range uniformly at 0.1 increments
    xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
    yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a filled contour plot with 50 levels and jet color scheme
    pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

    ...
    # plot the solutions as white dots connected by a line
    solutions = asarray(solutions)
    pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the Adam optimization on the test problem and plotting the results on a contour plot is listed below.

    # example of plotting the adam search on a contour plot of the test function
    from math import sqrt
    from numpy import asarray
    from numpy import arange
    from numpy.random import rand
    from numpy.random import seed
    from numpy import meshgrid
    from matplotlib import pyplot

    # objective function
    def objective(x, y):
        return x**2.0 + y**2.0

    # derivative of objective function
    def derivative(x, y):
        return asarray([x * 2.0, y * 2.0])

    # gradient descent algorithm with adam
    def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
        solutions = list()
        # generate an initial point
        x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        score = objective(x[0], x[1])
        # initialize first and second moments
        m = [0.0 for _ in range(bounds.shape[0])]
        v = [0.0 for _ in range(bounds.shape[0])]
        # run the gradient descent updates
        for t in range(n_iter):
            # calculate gradient g(t)
            g = derivative(x[0], x[1])
            # build a solution one variable at a time
            for i in range(bounds.shape[0]):
                # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
                m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
                # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
                v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
                # mhat(t) = m(t) / (1 - beta1(t))
                mhat = m[i] / (1.0 - beta1**(t+1))
                # vhat(t) = v(t) / (1 - beta2(t))
                vhat = v[i] / (1.0 - beta2**(t+1))
                # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
                x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
            # evaluate candidate point
            score = objective(x[0], x[1])
            # keep track of solutions
            solutions.append(x.copy())
            # report progress
            print('>%d f(%s) = %.5f' % (t, x, score))
        return solutions

    # seed the pseudo random number generator
    seed(1)
    # define range for input
    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    # define the total iterations
    n_iter = 60
    # step size
    alpha = 0.02
    # factor for average gradient
    beta1 = 0.8
    # factor for average squared gradient
    beta2 = 0.999
    # perform the gradient descent search with adam
    solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
    # sample input range uniformly at 0.1 increments
    xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
    yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
    # create a mesh from the axis
    x, y = meshgrid(xaxis, yaxis)
    # compute targets
    results = objective(x, y)
    # create a filled contour plot with 50 levels and jet color scheme
    pyplot.contourf(x, y, results, levels=50, cmap='jet')
    # plot the solutions as white dots connected by a line
    solutions = asarray(solutions)
    pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
    # show the plot
    pyplot.show()

Running the example performs the search as before, except in this case, a contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.

In this tutorial, you discovered how to develop gradient descent with the Adam optimization algorithm from scratch.

Specifically, you learned:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adam.
- How to implement the Adam optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Code Adam Gradient Descent Optimization From Scratch appeared first on Machine Learning Mastery.

The post 3 Books on Optimization for Machine Learning appeared first on Machine Learning Mastery.

Optimization is an important foundational topic required in machine learning, as most machine learning algorithms are fit on historical data using an optimization algorithm. Additionally, broader problems, such as model selection and hyperparameter tuning, can also be framed as optimization problems.

Although having some background in optimization is critical for machine learning practitioners, it can be a daunting topic given that it is often described using highly mathematical language.

In this post, you will discover top books on optimization that will be helpful to machine learning practitioners.

Let’s get started.

The field of optimization is enormous as it touches many other fields of study.

As such, there are hundreds of books on the topic, and most are textbooks filled with math and proofs. This is fair enough given that it is a highly mathematical subject.

Nevertheless, there are books that provide a more approachable description of optimization algorithms.

Not all optimization algorithms are relevant to machine learning; instead, it is useful to focus on a small subset of algorithms.

Frankly, it is hard to group optimization algorithms as there are many concerns. Nevertheless, it is important to have some idea of the optimization that underlies simpler algorithms, such as linear regression and logistic regression (e.g. convex optimization, least squares, Newton methods, etc.), and neural networks (first-order methods, gradient descent, etc.).

These are foundational optimization algorithms covered in most optimization textbooks.

Not all optimization problems in machine learning are well behaved, such as optimization used in AutoML and hyperparameter tuning. Therefore, knowledge of stochastic optimization algorithms is required (simulated annealing, genetic algorithms, particle swarm, etc.). Although these are optimization algorithms, they are also a type of learning algorithm referred to as biologically inspired computation or computational intelligence.

Therefore, we will take a look at both books that cover classical optimization algorithms as well as books on alternate optimization algorithms.

In fact, the first book we will look at covers both types of algorithms, and much more.

This book was written by Mykel Kochenderfer and Tim Wheeler and was published in 2019.

This book might be one of the very few textbooks that I’ve seen that broadly covers the field of optimization techniques relevant to modern machine learning.

This book provides a broad introduction to optimization with a focus on practical algorithms for the design of engineering systems. We cover a wide variety of optimization topics, introducing the underlying mathematical problem formulations and the algorithms for solving them. Figures, examples, and exercises are provided to convey the intuition behind the various approaches.

— Page xiiix, Algorithms for Optimization, 2019.

Importantly the algorithms range from univariate methods (bisection, line search, etc.) to first-order methods (gradient descent), second-order methods (Newton’s method), direct methods (pattern search), stochastic methods (simulated annealing), and population methods (genetic algorithms, particle swarm), and so much more.

It includes both technical descriptions of algorithms with references and worked examples of algorithms in Julia. It’s a shame the examples are not in Python as this would make the book near perfect in my eyes.

The complete table of contents for the book is listed below.

- Chapter 01: Introduction
- Chapter 02: Derivatives and Gradients
- Chapter 03: Bracketing
- Chapter 04: Local Descent
- Chapter 05: First-Order Methods
- Chapter 06: Second-Order Methods
- Chapter 07: Direct Methods
- Chapter 08: Stochastic Methods
- Chapter 09: Population Methods
- Chapter 10: Constraints
- Chapter 11: Linear Constrained Optimization
- Chapter 12: Multiobjective Optimization
- Chapter 13: Sampling Plans
- Chapter 14: Surrogate Models
- Chapter 15: Probabilistic Surrogate Models
- Chapter 16: Surrogate Optimization
- Chapter 17: Optimization under Uncertainty
- Chapter 18: Uncertainty Propagation
- Chapter 19: Discrete Optimization
- Chapter 20: Expression Optimization
- Chapter 21: Multidisciplinary Optimization

I like this book a lot; it is full of valuable practical advice. I highly recommend it!

- Algorithms for Optimization, 2019.

This book was written by Jorge Nocedal and Stephen Wright and was published in 2006.

This book is focused on the math and theory of the optimization algorithms presented and does cover many of the foundational techniques used by common machine learning algorithms. It may be a little too heavy for the average practitioner.

The book is intended as a textbook for graduate students in mathematical subjects.

We intend that this book will be used in graduate-level courses in optimization, as offered in engineering, operations research, computer science, and mathematics departments.

— Page xviii, Numerical Optimization, 2006.

Even though it is highly mathematical, the descriptions of the algorithms are precise and may provide a useful alternative description to complement the other books listed.

The complete table of contents for the book is listed below.

- Chapter 01: Introduction
- Chapter 02: Fundamentals of Unconstrained Optimization
- Chapter 03: Line Search Methods
- Chapter 04: Trust-Region Methods
- Chapter 05: Conjugate Gradient Methods
- Chapter 06: Quasi-Newton Methods
- Chapter 07: Large-Scale Unconstrained Optimization
- Chapter 08: Calculating Derivatives
- Chapter 09: Derivative-Free Optimization
- Chapter 10: Least-Squares Problems
- Chapter 11: Nonlinear Equations
- Chapter 12: Theory of Constrained Optimization
- Chapter 13: Linear Programming: The Simplex Method
- Chapter 14: Linear Programming: Interior-Point Methods
- Chapter 15: Fundamentals of Algorithms for Nonlinear Constrained Optimization
- Chapter 16: Quadratic Programming
- Chapter 17: Penalty and Augmented Lagrangian Methods
- Chapter 18: Sequential Quadratic Programming
- Chapter 19: Interior-Point Methods for Nonlinear Programming

It’s a solid textbook on optimization.

- Numerical Optimization, 2006.

If you do prefer the theoretical approach to the subject, another widely used mathematical book on optimization is “Convex Optimization” written by Stephen Boyd and Lieven Vandenberghe and published in 2004.

This book was written by Andries Engelbrecht and published in 2007.

This book provides an excellent overview of the field of nature-inspired optimization algorithms, also referred to as computational intelligence. This includes fields such as evolutionary computation and swarm intelligence.

This book is far less mathematical than the previous textbooks and is more focused on the metaphor of the inspired system and how to configure and use the specific algorithms with lots of pseudocode explanations.

While the material is introductory in nature, it does not shy away from details, and does present the mathematical foundations to the interested reader. The intention of the book is not to provide thorough attention to all computational intelligence paradigms and algorithms, but to give an overview of the most popular and frequently used models.

— Page xxix, Computational Intelligence: An Introduction, 2007.

Algorithms like genetic algorithms, genetic programming, evolutionary strategies, differential evolution, and particle swarm optimization are useful to know for machine learning model hyperparameter tuning and perhaps even model selection. They also form the core of many modern AutoML systems.

The complete table of contents for the book is listed below.

- Part I Introduction
- Chapter 01: Introduction to Computational Intelligence

- Part II Artificial Neural Networks
- Chapter 02: The Artificial Neuron
- Chapter 03: Supervised Learning Neural Networks
- Chapter 04: Unsupervised Learning Neural Networks
- Chapter 05: Radial Basis Function Networks
- Chapter 06: Reinforcement Learning
- Chapter 07: Performance Issues (Supervised Learning)

- Part III Evolutionary Computation
- Chapter 08: Introduction to Evolutionary Computation
- Chapter 09: Genetic Algorithms
- Chapter 10: Genetic Programming
- Chapter 11: Evolutionary Programming
- Chapter 12: Evolution Strategies
- Chapter 13: Differential Evolution
- Chapter 14: Cultural Algorithms
- Chapter 15: Coevolution

- Part IV Computational Swarm Intelligence
- Chapter 16: Particle Swarm Optimization
- Chapter 17: Ant Algorithms

- Part V Artificial Immune Systems
- Chapter 18: Natural Immune System
- Chapter 19: Artificial Immune Models

- Part VI Fuzzy Systems
- Chapter 20: Fuzzy Sets
- Chapter 21: Fuzzy Logic and Reasoning

I’m a fan of this book and recommend it.

In this post, you discovered books on optimization algorithms that are helpful to know for applied machine learning.

**Did I miss a good book on optimization?**

Let me know in the comments below.

**Have you read any of the books listed?**

Let me know what you think of it in the comments.


The post Univariate Function Optimization in Python appeared first on Machine Learning Mastery.

Univariate function optimization involves finding the input to a function that results in the optimal output from an objective function.

This is a common procedure in machine learning when fitting a model with one parameter or tuning a model that has a single hyperparameter.

An efficient algorithm is required to solve optimization problems of this type that will find the best solution with the minimum number of evaluations of the objective function, given that each evaluation of the objective function could be computationally expensive, such as fitting and evaluating a model on a dataset.

This excludes expensive grid search and random search algorithms in favor of efficient algorithms like Brent’s method.

In this tutorial, you will discover how to perform univariate function optimization in Python.

After completing this tutorial, you will know:

- Univariate function optimization involves finding an optimal input for an objective function that takes a single continuous argument.
- How to perform univariate function optimization for an unconstrained convex function.
- How to perform univariate function optimization for an unconstrained non-convex function.

Let’s get started.

This tutorial is divided into three parts; they are:

- Univariate Function Optimization
- Convex Univariate Function Optimization
- Non-Convex Univariate Function Optimization

We may need to find an optimal value of a function that takes a single parameter.

In machine learning, this may occur in many situations, such as:

- Finding the coefficient of a model to fit to a training dataset.
- Finding the value of a single hyperparameter that results in the best model performance.

This is called univariate function optimization.

We may be interested in the minimum outcome or maximum outcome of the function, although this can be simplified to minimization as a maximizing function can be made minimizing by adding a negative sign to all outcomes of the function.
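As a minimal sketch of this sign flip, the example below maximizes a function by minimizing its negation. The *score()* function is a hypothetical objective invented for illustration (an inverted parabola with its peak at x = 3), and the sketch assumes SciPy is installed.

```python
# maximize a function by minimizing its negation
from scipy.optimize import minimize_scalar

# hypothetical objective to maximize: an inverted parabola with its peak at x = 3
def score(x):
    return -(x - 3.0) ** 2 + 4.0

# minimizing the negated function is equivalent to maximizing the original
result = minimize_scalar(lambda x: -score(x), method='brent')

best_x = result.x
# undo the sign flip to recover the maximum value
best_score = -result.fun
print('argmax x: %.6f, max f(x): %.6f' % (best_x, best_score))
```

The only extra bookkeeping is remembering to negate the reported function value again when summarizing the result.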

There may or may not be limits on the inputs to the function, so-called unconstrained or constrained optimization, and we assume that small changes in input correspond to small changes in the output of the function, e.g. that it is smooth.

The function may or may not have a single optima, although we prefer that it does have a single optima and that the shape of the function looks like a large basin. If this is the case, we know we can sample the function at one point and find the path down to the minima of the function. Technically, this is referred to as a convex function for minimization (concave for maximization), and functions that don’t have this basin shape are referred to as non-convex.

**Convex Target Function**: There is a single optima and the shape of the target function leads to this optima.

Nevertheless, the target function is sufficiently complex that we don’t know the derivative, meaning we cannot just use calculus to analytically compute the minimum or maximum of the function where the gradient is zero. Such a function is often treated as non-differentiable, although strictly speaking the derivative may exist and simply be unavailable to us.

Although we might be able to sample the function with candidate values, we don’t know the input that will result in the best outcome. This may be because evaluating candidate solutions is expensive, for any number of reasons.

Therefore, we require an algorithm that efficiently samples input values to the function.

One approach to solving univariate function optimization problems is to use Brent’s method.

Brent’s method is an optimization algorithm that combines a bisecting algorithm (Dekker’s method) and inverse quadratic interpolation. It can be used for constrained and unconstrained univariate function optimization.

The Brent-Dekker method is an extension of the bisection method. It is a root-finding algorithm that combines elements of the secant method and inverse quadratic interpolation. It has reliable and fast convergence properties, and it is the univariate optimization algorithm of choice in many popular numerical optimization packages.

— Pages 49-51, Algorithms for Optimization, 2019.

Bisecting algorithms use a bracket (lower and upper) of input values and split up the input domain, bisecting it in order to locate where in the domain the optima is located, much like a binary search. Dekker’s method is one way this is achieved efficiently for a continuous domain.
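To make the idea of shrinking a bracket concrete, below is a minimal sketch of a golden-section search. This is not Dekker’s or Brent’s method; it is a simpler bracketing technique swapped in for illustration, and the function name *golden_section_search()* and its *tol* argument are assumptions of this sketch. It assumes the function has a single minimum inside the bracket.

```python
# golden-section search: a simple bracketing method for a unimodal function
import math

def golden_section_search(f, lower, upper, tol=1e-6):
    # ratio used to place the two interior probe points, about 0.618
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = upper - inv_phi * (upper - lower)
    d = lower + inv_phi * (upper - lower)
    fc, fd = f(c), f(d)
    n_evals = 2
    while (upper - lower) > tol:
        if fc < fd:
            # the minimum lies in [lower, d]; the old c becomes the new d
            upper, d, fd = d, c, fc
            c = upper - inv_phi * (upper - lower)
            fc = f(c)
        else:
            # the minimum lies in [c, upper]; the old d becomes the new c
            lower, c, fc = c, d, fd
            d = lower + inv_phi * (upper - lower)
            fd = f(d)
        n_evals += 1
    # report the midpoint of the final bracket
    return (lower + upper) / 2.0, n_evals

# search a simple parabola with its minimum at -5.0
x_min, n_evals = golden_section_search(lambda x: (5.0 + x) ** 2.0, -10.0, 10.0)
print('x = %.6f after %d evaluations' % (x_min, n_evals))
```

Each iteration discards part of the bracket and reuses one of the two previous evaluations, so only one new evaluation of the function is needed per step.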

Dekker’s method gets stuck on non-convex problems. Brent’s method modifies Dekker’s method to avoid getting stuck and also interpolates between previously evaluated points (using the secant method and inverse quadratic interpolation) in an effort to accelerate the search.

As such, Brent’s method for univariate function optimization is generally preferred over most other univariate function optimization algorithms given its efficiency.

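Brent’s method is available in Python via the minimize_scalar() SciPy function, which takes the function to be minimized as its first argument. If your target function is constrained to a range, the range can be specified via the “*bounds*” argument together with *method='bounded'*; the *'brent'* method itself is unconstrained and takes an optional “*bracket*” argument as a starting interval instead.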

It returns an OptimizeResult object, a dictionary-like object containing the solution. Importantly, the ‘*x*‘ key summarizes the input for the optima, the ‘*fun*‘ key summarizes the function output for the optima, and the ‘*nfev*‘ key summarizes the number of evaluations of the target function that were performed.

```python
...
# minimize the function
result = minimize_scalar(objective, method='brent')
```
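When the input is constrained to a range, a bounded search might look like the sketch below. It assumes SciPy’s *method='bounded'* option, and the range of 0 to 10 is an arbitrary choice for illustration that deliberately excludes the unconstrained minimum of this function at -5.0.

```python
# bounded univariate optimization
from scipy.optimize import minimize_scalar

# objective function: a parabola with its unconstrained minimum at -5.0
def objective(x):
    return (5.0 + x) ** 2.0

# restrict the search to the range [0, 10]
result = minimize_scalar(objective, bounds=(0.0, 10.0), method='bounded')
print('x: %.6f, f(x): %.6f, evaluations: %d' % (result.x, result.fun, result.nfev))
```

Because the best input inside the range sits at its lower edge, the search should settle very close to 0.0 rather than at -5.0.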

Now that we know how to perform univariate function optimization in Python, let’s look at some examples.

In this section, we will explore how to solve a convex univariate function optimization problem.

First, we can define a function that implements our function.

In this case, we will use a simple offset version of the x^2 function e.g. a simple parabola (u-shape) function. It is a minimization objective function with an optima at -5.0.

```python
# objective function
def objective(x):
    return (5.0 + x)**2.0
```

We can plot a coarse grid of this function with input values from -10 to 10 to get an idea of the shape of the target function.

The complete example is listed below.

```python
# plot a convex target function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
    return (5.0 + x)**2.0

# define range
r_min, r_max = -10.0, 10.0
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
pyplot.show()
```

Running the example evaluates input values in our specified range using our target function and creates a plot of the function inputs to function outputs.

We can see the U-shape of the function and that the optima is at -5.0.

**Note**: in a real optimization problem, we would not be able to perform so many evaluations of the objective function so easily. This simple function is used for demonstration purposes so we can learn how to use the optimization algorithm.

Next, we can use the optimization algorithm to find the optima.

```python
...
# minimize the function
result = minimize_scalar(objective, method='brent')
```

Once optimized, we can summarize the result, including the input and evaluation of the optima and the number of function evaluations required to locate the optima.

```python
...
# summarize the result
opt_x, opt_y = result['x'], result['fun']
print('Optimal Input x: %.6f' % opt_x)
print('Optimal Output f(x): %.6f' % opt_y)
print('Total Evaluations n: %d' % result['nfev'])
```

Finally, we can plot the function again and mark the optima to confirm it was located in the place we expected for this function.

```python
...
# define the range
r_min, r_max = -10.0, 10.0
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
# plot the optima
pyplot.plot([opt_x], [opt_y], 's', color='r')
# show the plot
pyplot.show()
```

The complete example of optimizing an unconstrained convex univariate function is listed below.

```python
# optimize convex objective function
from numpy import arange
from scipy.optimize import minimize_scalar
from matplotlib import pyplot

# objective function
def objective(x):
    return (5.0 + x)**2.0

# minimize the function
result = minimize_scalar(objective, method='brent')
# summarize the result
opt_x, opt_y = result['x'], result['fun']
print('Optimal Input x: %.6f' % opt_x)
print('Optimal Output f(x): %.6f' % opt_y)
print('Total Evaluations n: %d' % result['nfev'])
# define the range
r_min, r_max = -10.0, 10.0
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
# plot the optima
pyplot.plot([opt_x], [opt_y], 's', color='r')
# show the plot
pyplot.show()
```

Running the example first solves the optimization problem and reports the result.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the optima was located after 10 evaluations of the objective function with an input of -5.0, achieving an objective function value of 0.0.

```
Optimal Input x: -5.000000
Optimal Output f(x): 0.000000
Total Evaluations n: 10
```

A plot of the function is created again and this time, the optima is marked as a red square.
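To put the number of evaluations in context, the sketch below compares Brent’s method with a plain grid search over the same 200 input values that were used to draw the line plot. The grid search here is an illustrative baseline added for comparison, not part of the original example.

```python
# compare the evaluation count of a grid search and Brent's method
from scipy.optimize import minimize_scalar

# objective function
def objective(x):
    return (5.0 + x) ** 2.0

# grid search: evaluate 200 evenly spaced inputs from -10.0 in steps of 0.1
grid = [-10.0 + 0.1 * i for i in range(200)]
grid_x = min(grid, key=objective)

# Brent's method on the same function
result = minimize_scalar(objective, method='brent')

print('Grid search: x=%.3f using %d evaluations' % (grid_x, len(grid)))
print('Brent: x=%.6f using %d evaluations' % (result.x, result.nfev))
```

The grid search is also limited to the resolution of the grid, whereas Brent’s method refines its estimate to within a tolerance.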

A non-convex function is one that does not resemble a basin, meaning that it may have more than one hill or valley.

This can make it more challenging to locate the global optima as the multiple hills and valleys can cause the search to get stuck and report a false or local optima instead.

We can define a non-convex univariate function as follows.

```python
# objective function
def objective(x):
    return (x - 2.0) * x * (x + 2.0)**2.0
```

We can sample this function and create a line plot of input values to objective values.

The complete example is listed below.

```python
# plot a non-convex univariate function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
    return (x - 2.0) * x * (x + 2.0)**2.0

# define range
r_min, r_max = -3.0, 2.5
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
pyplot.show()
```

Running the example evaluates input values in our specified range using our target function and creates a plot of the function inputs to function outputs.

We can see a function with one false optima around -2.0 and a global optima around 1.2.

**Note**: in a real optimization problem, we would not be able to perform so many evaluations of the objective function so easily. This simple function is used for demonstration purposes so we can learn how to use the optimization algorithm.

Next, we can use the optimization algorithm to find the optima.

As before, we can call the minimize_scalar() function to optimize the function, then summarize the result and plot the optima on a line plot.

The complete example of optimization of an unconstrained non-convex univariate function is listed below.

```python
# optimize non-convex objective function
from numpy import arange
from scipy.optimize import minimize_scalar
from matplotlib import pyplot

# objective function
def objective(x):
    return (x - 2.0) * x * (x + 2.0)**2.0

# minimize the function
result = minimize_scalar(objective, method='brent')
# summarize the result
opt_x, opt_y = result['x'], result['fun']
print('Optimal Input x: %.6f' % opt_x)
print('Optimal Output f(x): %.6f' % opt_y)
print('Total Evaluations n: %d' % result['nfev'])
# define the range
r_min, r_max = -3.0, 2.5
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
# plot the optima
pyplot.plot([opt_x], [opt_y], 's', color='r')
# show the plot
pyplot.show()
```

Running the example first solves the optimization problem and reports the result.

In this case, we can see that the optima was located after 15 evaluations of the objective function with an input of about 1.28, achieving an objective function value of about -9.91.

```
Optimal Input x: 1.280776
Optimal Output f(x): -9.914950
Total Evaluations n: 15
```
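Because this test function is a simple polynomial, we can sanity check the reported values by hand. The sketch below is a supplementary check rather than part of the optimization procedure: it uses the factored derivative f'(x) = (x + 2)(4x^2 - 2x - 4), which follows from the product rule, and evaluates the objective at each stationary point.

```python
# check the reported optimum against the stationary points of the polynomial
from math import sqrt

# objective function
def objective(x):
    return (x - 2.0) * x * (x + 2.0)**2.0

# f'(x) = (x + 2) * (4x^2 - 2x - 4), so the stationary points are x = -2
# and the two roots of 4x^2 - 2x - 4 = 0, which are (1 +/- sqrt(17)) / 4
stationary = [-2.0, (1.0 - sqrt(17.0)) / 4.0, (1.0 + sqrt(17.0)) / 4.0]
for x in stationary:
    print('x = %.6f f(x) = %.6f' % (x, objective(x)))

# the stationary point with the lowest objective value
best_x = min(stationary, key=objective)
print('Lowest stationary point: x = %.6f' % best_x)
```

The lowest stationary point agrees with the input and output reported by the optimizer to the precision shown.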

A plot of the function is created again, and this time, the optima is marked as a red square.

We can see that the optimization was not deceived by the false optima and successfully located the global optima.
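This outcome is not guaranteed in general, as the result of a local search depends on where it is allowed to look. As a hedged illustration, the sketch below uses SciPy’s *method='bounded'* option with two search ranges chosen for this example: one that only contains the false optima near -2.0 and one that contains the global optima.

```python
# the search range determines which optima is found
from scipy.optimize import minimize_scalar

# objective function
def objective(x):
    return (x - 2.0) * x * (x + 2.0)**2.0

# a range that only contains the false optima
local = minimize_scalar(objective, bounds=(-3.0, -1.0), method='bounded')
# a range that contains the global optima
best = minimize_scalar(objective, bounds=(0.0, 2.5), method='bounded')

print('Search in [-3, -1]: x=%.4f f(x)=%.4f' % (local.x, local.fun))
print('Search in [0, 2.5]: x=%.4f f(x)=%.4f' % (best.x, best.fun))
```

Running the search from more than one range or starting bracket is a cheap way to gain confidence that a reported optima is not a local one.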

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.

- Optimization (scipy.optimize).
- Optimization and root finding (scipy.optimize)
- scipy.optimize.minimize_scalar API.

In this tutorial, you discovered how to perform univariate function optimization in Python.

Specifically, you learned:

- Univariate function optimization involves finding an optimal input for an objective function that takes a single continuous argument.
- How to perform univariate function optimization for an unconstrained convex function.
- How to perform univariate function optimization for an unconstrained non-convex function.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Univariate Function Optimization in Python appeared first on Machine Learning Mastery.

]]>The post Feature Selection with Stochastic Optimization Algorithms appeared first on Machine Learning Mastery.

Selecting a subset of the input features to use in a predictive model is called feature selection, and there are many different types of algorithms that can be used.

It is possible to frame the problem of feature selection as an optimization problem. In the case that there are few input features, all possible combinations of input features can be evaluated and the best subset found definitively. In the case of a vast number of input features, a stochastic optimization algorithm can be used to explore the search space and find an effective subset of features.

In this tutorial, you will discover how to use optimization algorithms for feature selection in machine learning.

After completing this tutorial, you will know:

- The problem of feature selection can be broadly defined as an optimization problem.
- How to enumerate all possible subsets of input features for a dataset.
- How to apply stochastic optimization to select an optimal subset of input features.

Let’s get started.

This tutorial is divided into three parts; they are:

- Optimization for Feature Selection
- Enumerate All Feature Subsets
- Optimize Feature Subsets

Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. There are many different types of feature selection algorithms, although they can broadly be grouped into two main types: wrapper and filter methods.

Wrapper feature selection methods create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.

Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.

- **Wrapper Feature Selection**: Search for well-performing subsets of features.
- **Filter Feature Selection**: Select subsets of features based on their relationship with the target.

For more on choosing feature selection algorithms, see the tutorial:

- How to Choose a Feature Selection Method For Machine Learning

A popular wrapper method is the Recursive Feature Elimination, or RFE, algorithm.

RFE works by searching for a subset of features by starting with all features in the training dataset and successively removing features until the desired number remains.

This is achieved by fitting the given machine learning algorithm used in the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.
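The procedure described above can be sketched with the RFE class from scikit-learn. The dataset size, the choice of three features to keep, and the fixed *random_state* values are arbitrary assumptions made to keep the sketch small and repeatable.

```python
# recursive feature elimination with a decision tree
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# define a small synthetic dataset
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=3, random_state=1)

# keep 3 features, removing one feature per round
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1),
          n_features_to_select=3, step=1)
rfe.fit(X, y)

# support_ is a boolean mask over columns, ranking_ is 1 for selected features
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print('Selected columns:', selected)
print('Ranking:', list(rfe.ranking_))
```

Note that the number of features to keep must be chosen in advance, which is one reason to consider framing the problem as a search over subsets instead.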

For more on RFE, see the tutorial:

- Recursive Feature Elimination (RFE) for Feature Selection in Python

The problem of wrapper feature selection can be framed as an optimization problem. That is, find a subset of input features that result in the best model performance.

RFE is one approach to solving this problem systematically, although it may be limited by a large number of features.

An alternative approach would be to use a stochastic optimization algorithm, such as a stochastic hill climbing algorithm, when the number of features is very large. When the number of features is relatively small, it may be possible to enumerate all possible subsets of features.

- **Few Input Variables**: Enumerate all possible subsets of features.
- **Many Input Features**: Stochastic optimization algorithm to find good subsets of features.

Now that we are familiar with the idea that feature selection may be explored as an optimization problem, let’s look at how we might enumerate all possible feature subsets.

When the number of input variables is relatively small and the model evaluation is relatively fast, then it may be possible to enumerate all possible subsets of input variables.

This means evaluating the performance of a model using a test harness given every possible unique group of input variables.
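It helps to see how quickly the number of subsets grows. With n input variables there are 2^n - 1 non-empty subsets, and the sketch below confirms this by brute force for a few small values of n using the standard library. The *count_subsets()* helper is a name made up for this illustration.

```python
# count the non-empty subsets of input features
from itertools import product

def count_subsets(n_features):
    total = 0
    # each mask marks every column as included (True) or excluded (False)
    for mask in product([True, False], repeat=n_features):
        # skip the all-False mask where no columns are selected
        if any(mask):
            total += 1
    return total

for n in [3, 5, 10]:
    print('%d features: %d subsets (2^n - 1 = %d)' % (n, count_subsets(n), 2**n - 1))

# enumeration is out of reach for hundreds of features
print('500 features: a %d-digit number of subsets' % len(str(2**500 - 1)))
```

Every one of those subsets requires fitting and evaluating a model, which is why enumeration is only practical for a handful of input variables.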

We will explore how to do this with a worked example.

First, let’s define a small binary classification dataset with few input features. We can use the make_classification() function to define a dataset with five input variables, two of which are informative, and 1,000 rows.

The example below defines the dataset and summarizes its shape.

```python
# define a small classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)
```

Running the example creates the dataset and confirms that it has the desired shape.

(1000, 5) (1000,)

Next, we can establish a baseline in performance using a model evaluated on the entire dataset.

We will use a DecisionTreeClassifier as the model because its performance is quite sensitive to the choice of input variables.

We will evaluate the model using good practices, such as repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.

```python
# evaluate a decision tree on the entire small dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset (the same five-feature dataset defined above)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 80.5 percent.

Mean Accuracy: 0.805 (0.030)

Next, we can try to improve model performance by using a subset of the input features.

First, we must choose a representation to enumerate.

In this case, we will enumerate a list of boolean values, with one value for each input feature: *True* if the feature is to be used and *False* if the feature is not to be used as input.

For example, with the five input features the sequence [*True, True, True, True, True*] would use all input features, and [*True, False, False, False, False*] would only use the first input feature as input.

We can enumerate all sequences of boolean values with the *length=5* using the product() Python function. We must specify the valid values [*True, False*] and the number of steps in the sequence, which is equal to the number of input variables.

The function returns an iterable that we can enumerate directly for each sequence.

```python
...
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
    ...
```

For a given sequence of boolean values, we can enumerate it and transform it into a sequence of column indexes for each *True* in the sequence.

```python
...
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
```

If the sequence has no column indexes (in the case of all *False* values), then we can skip that sequence.

```python
# check for no column (all False)
if len(ix) == 0:
    continue
```

We can then use the column indexes to choose the columns in the dataset.

```python
...
# select columns
X_new = X[:, ix]
```

And this subset of the dataset can then be evaluated as we did before.

```python
...
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize scores
result = mean(scores)
```

If the accuracy for the model is better than the best sequence found so far, we can store it.

```python
...
# check if it is better than the best so far
if best_score is None or result >= best_score:
    # better result
    best_subset, best_score = ix, result
```

And that’s it.

Tying this together, the complete example of feature selection by enumerating all possible feature subsets is listed below.

```python
# feature selection by enumerating all possible subsets of features
from itertools import product
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
    # convert into column indexes
    ix = [i for i, x in enumerate(subset) if x]
    # check for no column (all False)
    if len(ix) == 0:
        continue
    # select columns
    X_new = X[:, ix]
    # define model
    model = DecisionTreeClassifier()
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # summarize scores
    result = mean(scores)
    # report progress
    print('>f(%s) = %f ' % (ix, result))
    # check if it is better than the best so far
    if best_score is None or result >= best_score:
        # better result
        best_subset, best_score = ix, result
# report best
print('Done!')
print('f(%s) = %f' % (best_subset, best_score))
```

Running the example reports the mean classification accuracy of the model for each subset of features considered. The best subset is then reported at the end of the run.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best subset of features involved features at indexes [0, 3, 4] that resulted in a mean classification accuracy of about 83.0 percent, which is better than the result reported previously using all input features.

```
>f([0, 1, 2, 3, 4]) = 0.813667
>f([0, 1, 2, 3]) = 0.827667
>f([0, 1, 2, 4]) = 0.815333
>f([0, 1, 2]) = 0.824000
>f([0, 1, 3, 4]) = 0.821333
>f([0, 1, 3]) = 0.825667
>f([0, 1, 4]) = 0.807333
>f([0, 1]) = 0.817667
>f([0, 2, 3, 4]) = 0.830333
>f([0, 2, 3]) = 0.819000
>f([0, 2, 4]) = 0.828000
>f([0, 2]) = 0.818333
>f([0, 3, 4]) = 0.830333
>f([0, 3]) = 0.821333
>f([0, 4]) = 0.816000
>f([0]) = 0.639333
>f([1, 2, 3, 4]) = 0.823667
>f([1, 2, 3]) = 0.821667
>f([1, 2, 4]) = 0.823333
>f([1, 2]) = 0.818667
>f([1, 3, 4]) = 0.818000
>f([1, 3]) = 0.820667
>f([1, 4]) = 0.809000
>f([1]) = 0.797000
>f([2, 3, 4]) = 0.827667
>f([2, 3]) = 0.755000
>f([2, 4]) = 0.827000
>f([2]) = 0.516667
>f([3, 4]) = 0.824000
>f([3]) = 0.514333
>f([4]) = 0.777667
Done!
f([0, 3, 4]) = 0.830333
```

Now that we know how to enumerate all possible feature subsets, let’s look at how we might use a stochastic optimization algorithm to choose a subset of features.

We can apply a stochastic optimization algorithm to the search space of subsets of input features.

First, let’s define a larger problem that has many more features, making model evaluation too slow and the search space too large for enumerating all subsets.

We will define a classification problem with 10,000 rows and 500 input features, 10 of which are relevant and the remaining 490 are redundant.

```python
# define a large classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)
```

Running the example creates the dataset and confirms that it has the desired shape.

(10000, 500) (10000,)

We can establish a baseline in performance by evaluating a model on the dataset with all input features.

Because the dataset is large and the model is slow to evaluate, we will modify the evaluation of the model to use 3-fold cross-validation, i.e. fewer folds and no repeats.

The complete example is listed below.

```python
# evaluate a decision tree on the entire larger dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = StratifiedKFold(n_splits=3)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 91.3 percent.

This provides a baseline that we would expect to outperform using feature selection.

Mean Accuracy: 0.913 (0.001)

We will use a simple stochastic hill climbing algorithm as the optimization algorithm.

First, we must define the objective function. It will take the dataset and a subset of features to use as input and return an estimated model accuracy from 0 (worst) to 1 (best). It is a maximizing optimization problem.

This objective function is simply the decoding of the sequence and model evaluation step from the previous section.

The *objective()* function below implements this and returns both the score and the decoded subset of columns used for helpful reporting.

```python
# objective function
def objective(X, y, subset):
    # convert into column indexes
    ix = [i for i, x in enumerate(subset) if x]
    # check for no column (all False)
    if len(ix) == 0:
        # return a score and the empty subset so callers can always unpack two values
        return 0.0, ix
    # select columns
    X_new = X[:, ix]
    # define model
    model = DecisionTreeClassifier()
    # evaluate model
    scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
    # summarize scores
    result = mean(scores)
    return result, ix
```

We also need a function that can take a step in the search space.

Given an existing solution, it must modify it and return a new solution in close proximity. In this case, we will achieve this by randomly flipping the inclusion/exclusion of columns in the sequence.

Each position in the sequence will be considered independently and will be flipped probabilistically where the probability of flipping is a hyperparameter.

The *mutate()* function below implements this given a candidate solution (sequence of booleans) and a mutation hyperparameter, creating and returning a modified solution (a step in the search space).

The larger the *p_mutate* value (in the range 0 to 1), the larger the step in the search space.

```python
# mutation operator
def mutate(solution, p_mutate):
    # make a copy
    child = solution.copy()
    for i in range(len(child)):
        # check for a mutation
        if rand() < p_mutate:
            # flip the inclusion
            child[i] = not child[i]
    return child
```
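Since each position is flipped independently, the expected number of flips per mutation is the number of columns multiplied by *p_mutate*. The sketch below checks this empirically for 500 columns and a flip probability of 10/500. It is a standard-library re-implementation written for illustration (a list-based *mutate()* that takes an explicit random number generator), not the NumPy version used in the tutorial.

```python
# estimate the number of flips made by a mutation
import random

def mutate(solution, p_mutate, rng):
    # flip each boolean independently with probability p_mutate
    return [(not bit) if rng.random() < p_mutate else bit for bit in solution]

# fixed seed so the sketch is repeatable
rng = random.Random(1)
n_features, p_mutate = 500, 10.0 / 500.0
solution = [rng.random() < 0.5 for _ in range(n_features)]

# count how many positions differ after each mutation
n_trials = 200
flips = []
for _ in range(n_trials):
    child = mutate(solution, p_mutate, rng)
    flips.append(sum(a != b for a, b in zip(solution, child)))

mean_flips = sum(flips) / n_trials
print('Expected flips: %.1f, observed mean: %.2f' % (n_features * p_mutate, mean_flips))
```

The observed mean should land close to the expected value, which gives a practical way to choose *p_mutate* for a desired step size.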

We can now implement the hill climbing algorithm.

The initial solution is a randomly generated sequence, which is then evaluated.

```python
...
# generate an initial point
solution = choice([True, False], size=X.shape[1])
# evaluate the initial point
solution_eval, ix = objective(X, y, solution)
```

We then loop for a fixed number of iterations, creating mutated versions of the current solution, evaluating them, and saving them if the score is better.

```python
...
# run the hill climb
for i in range(n_iter):
    # take a step
    candidate = mutate(solution, p_mutate)
    # evaluate candidate point
    candidate_eval, ix = objective(X, y, candidate)
    # check if we should keep the new point
    if candidate_eval >= solution_eval:
        # store the new point
        solution, solution_eval = candidate, candidate_eval
    # report progress
    print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
```

The *hillclimbing()* function below implements this, taking the dataset, objective function, and hyperparameters as arguments and returns the best subset of dataset columns and the estimated performance of the model.

```python
# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):
    # generate an initial point
    solution = choice([True, False], size=X.shape[1])
    # evaluate the initial point
    solution_eval, ix = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = mutate(solution, p_mutate)
        # evaluate candidate point
        candidate_eval, ix = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
        # report progress
        print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
    return solution, solution_eval
```
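Because each model evaluation on the large dataset is slow, it can be helpful to first try the search loop on a cheap stand-in objective. The sketch below does this with the standard library only. The *toy_objective()* function and its hidden *target* mask are hypothetical and invented for illustration: the score is simply the fraction of positions that agree with the target, so no model is fit.

```python
# hill climbing on a cheap toy objective
import random

# fixed seed so the sketch is repeatable
rng = random.Random(1)
n_features = 50
# hypothetical ideal subset that the toy objective rewards
target = [rng.random() < 0.2 for _ in range(n_features)]

def toy_objective(subset):
    # fraction of positions that agree with the target mask, from 0 to 1
    return sum(a == b for a, b in zip(subset, target)) / n_features

def mutate(solution, p_mutate):
    # flip each boolean independently with probability p_mutate
    return [(not bit) if rng.random() < p_mutate else bit for bit in solution]

def hillclimbing(objective, n_iter, p_mutate):
    # generate and evaluate a random initial point
    solution = [rng.random() < 0.5 for _ in range(n_features)]
    solution_eval = objective(solution)
    initial_eval = solution_eval
    for _ in range(n_iter):
        # take a step and evaluate it
        candidate = mutate(solution, p_mutate)
        candidate_eval = objective(candidate)
        # keep the candidate if it is at least as good as the current solution
        if candidate_eval >= solution_eval:
            solution, solution_eval = candidate, candidate_eval
    return solution, solution_eval, initial_eval

best, best_eval, initial_eval = hillclimbing(toy_objective, 200, 2.0 / n_features)
print('Initial score: %.3f, final score: %.3f' % (initial_eval, best_eval))
```

The acceptance rule means the score can never decrease from one iteration to the next, which is the same behavior we expect from the full example.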

We can then call this function and pass in our synthetic dataset to perform optimization for feature selection.

In this case, we will run the algorithm for 100 iterations and make about 10 flips to the sequence for a given mutation (a flip probability of 10/500 applied to each of the 500 columns), which is quite conservative.

```python
...
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
```

At the end of the run, we will convert the boolean sequence into column indexes (so we could fit a final model if we wanted) and report the performance of the best subset.

```python
...
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))
```

Tying this all together, the complete example is listed below.

```python
# stochastic optimization for feature selection
from numpy import mean
from numpy.random import rand
from numpy.random import choice
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# objective function
def objective(X, y, subset):
    # convert into column indexes
    ix = [i for i, x in enumerate(subset) if x]
    # check for no column (all False)
    if len(ix) == 0:
        # return a score and the empty subset so callers can always unpack two values
        return 0.0, ix
    # select columns
    X_new = X[:, ix]
    # define model
    model = DecisionTreeClassifier()
    # evaluate model
    scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
    # summarize scores
    result = mean(scores)
    return result, ix

# mutation operator
def mutate(solution, p_mutate):
    # make a copy
    child = solution.copy()
    for i in range(len(child)):
        # check for a mutation
        if rand() < p_mutate:
            # flip the inclusion
            child[i] = not child[i]
    return child

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):
    # generate an initial point
    solution = choice([True, False], size=X.shape[1])
    # evaluate the initial point
    solution_eval, ix = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = mutate(solution, p_mutate)
        # evaluate candidate point
        candidate_eval, ix = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
        # report progress
        print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
    return solution, solution_eval

# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))
```

Running the example reports the mean classification accuracy of the model for each subset of features considered. The best subset is then reported at the end of the run.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best performance was achieved with a subset of 239 features and a classification accuracy of approximately 91.8 percent.

This is better than a model evaluated on all input features.

Although the result is better, we know we can do a lot better, perhaps with tuning of the hyperparameters of the optimization algorithm or perhaps by using an alternate optimization algorithm.

    ...
    >80 f(240) = 0.918099
    >81 f(236) = 0.918099
    >82 f(238) = 0.918099
    >83 f(236) = 0.918099
    >84 f(239) = 0.918099
    >85 f(240) = 0.918099
    >86 f(239) = 0.918099
    >87 f(245) = 0.918099
    >88 f(241) = 0.918099
    >89 f(239) = 0.918099
    >90 f(239) = 0.918099
    >91 f(241) = 0.918099
    >92 f(243) = 0.918099
    >93 f(245) = 0.918099
    >94 f(239) = 0.918099
    >95 f(245) = 0.918099
    >96 f(244) = 0.918099
    >97 f(242) = 0.918099
    >98 f(238) = 0.918099
    >99 f(248) = 0.918099
    >100 f(238) = 0.918099
    Done!
    Best: f(239) = 0.918099

This section provides more resources on the topic if you are looking to go deeper.

- Recursive Feature Elimination (RFE) for Feature Selection in Python
- How to Choose a Feature Selection Method For Machine Learning

In this tutorial, you discovered how to use optimization algorithms for feature selection in machine learning.

Specifically, you learned:

- The problem of feature selection can be broadly defined as an optimization problem.
- How to enumerate all possible subsets of input features for a dataset.
- How to apply stochastic optimization to select an optimal subset of input features.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Feature Selection with Stochastic Optimization Algorithms appeared first on Machine Learning Mastery.

The post How to Choose an Optimization Algorithm appeared first on Machine Learning Mastery.

Optimization is the challenging problem that underlies many machine learning algorithms, from fitting logistic regression models to training artificial neural networks.

There are perhaps hundreds of popular optimization algorithms, and perhaps tens of algorithms to choose from in popular scientific code libraries. This can make it challenging to know which algorithms to consider for a given optimization problem.

In this tutorial, you will discover a guided tour of different optimization algorithms.

After completing this tutorial, you will know:

- Optimization algorithms may be grouped into those that use derivatives and those that do not.
- Classical algorithms use the first and sometimes second derivative of the objective function.
- Direct search and stochastic algorithms are designed for objective functions where function derivatives are unavailable.

Let’s get started.

This tutorial is divided into three parts; they are:

- Optimization Algorithms
- Differentiable Objective Function
- Non-Differentiable Objective Function

Optimization refers to a procedure for finding the input parameters or arguments to a function that result in the minimum or maximum output of the function.

The most common type of optimization problem encountered in machine learning is **continuous function optimization**, where the input arguments to the function are real-valued numeric values, e.g. floating point values. The output from the function is also a real-valued evaluation of the input values.

We might refer to problems of this type as continuous function optimization, to distinguish from functions that take discrete variables and are referred to as combinatorial optimization problems.

There are many different types of optimization algorithms that can be used for continuous function optimization problems, and perhaps just as many ways to group and summarize them.

One approach to grouping optimization algorithms is based on the amount of information available about the target function that the optimization algorithm can use.

Generally, the more information that is available about the target function, the easier the function is to optimize if the information can effectively be used in the search.

Perhaps the major division in optimization algorithms is whether the objective function can be differentiated at a point or not. That is, whether the first derivative (gradient or slope) of the function can be calculated for a given candidate solution or not. This partitions algorithms into those that can make use of the calculated gradient information and those that do not.

- Differentiable Target Function?
- Algorithms that use derivative information.
- Algorithms that do not use derivative information.

We will use this as the major division for grouping optimization algorithms in this tutorial and look at algorithms for differentiable and non-differentiable objective functions.

**Note**: this is not an exhaustive coverage of algorithms for continuous function optimization, although it does cover the major methods that you are likely to encounter as a regular practitioner.

A differentiable function is a function where the derivative can be calculated for any given point in the input space.

The derivative of a function for a value is the rate or amount of change in the function at that point. It is often called the slope.

**First-Order Derivative**: Slope or rate of change of an objective function at a given point.

The derivative of the function with more than one input variable (e.g. multivariate inputs) is commonly referred to as the gradient.

**Gradient**: Derivative of a multivariate continuous objective function.

A derivative for a multivariate objective function is a vector, and each element in the vector is called a partial derivative, or the rate of change for a given variable at the point assuming all other variables are held constant.

**Partial Derivative**: Element of a derivative of a multivariate objective function.

We can calculate the derivative of the derivative of the objective function, that is the rate of change of the rate of change in the objective function. This is called the second derivative.

**Second-Order Derivative**: Rate at which the derivative of the objective function changes.

For a function that takes multiple input variables, this is a matrix and is referred to as the Hessian matrix.

**Hessian matrix**: Second derivative of a function with two or more input variables.
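The definitions above can be made concrete with a small numerical check. The following is a minimal sketch, assuming a hypothetical two-variable objective and central finite differences (the function `f`, the helper names, and the step sizes `h` are illustrative choices, not part of the tutorial). It estimates the gradient as a vector of partial derivatives and the Hessian as a matrix of second derivatives.

```python
# numerical gradient and Hessian via central finite differences (illustrative sketch)

def f(v):
    # example objective (an assumption): f(x, y) = x^2 + 3xy + 2y^2
    x, y = v
    return x ** 2 + 3.0 * x * y + 2.0 * y ** 2

def gradient(func, v, h=1e-5):
    # one partial derivative per input variable, all others held constant
    grad = []
    for i in range(len(v)):
        up, down = list(v), list(v)
        up[i] += h
        down[i] -= h
        grad.append((func(up) - func(down)) / (2.0 * h))
    return grad

def hessian(func, v, h=1e-4):
    # matrix of second partial derivatives
    n = len(v)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pp, pm, mp, mm = list(v), list(v), list(v), list(v)
            pp[i] += h; pp[j] += h
            pm[i] += h; pm[j] -= h
            mp[i] -= h; mp[j] += h
            mm[i] -= h; mm[j] -= h
            hess[i][j] = (func(pp) - func(pm) - func(mp) + func(mm)) / (4.0 * h * h)
    return hess

point = [1.0, 2.0]
grad = gradient(f, point)
hess = hessian(f, point)
print('gradient:', grad)
print('hessian:', hess)
```

For this objective the analytical gradient is (2x + 3y, 3x + 4y), which is (8, 11) at the chosen point, and the Hessian is the constant matrix [[2, 3], [3, 4]], so the printed estimates should be close to those values.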

Simple differentiable functions can be optimized analytically using calculus. Typically, the objective functions that we are interested in cannot be solved analytically.

Optimization is significantly easier if the gradient of the objective function can be calculated, and as such, there has been a lot more research into optimization algorithms that use the derivative than those that do not.

Some groups of algorithms that use gradient information include:

- Bracketing Algorithms
- Local Descent Algorithms
- First-Order Algorithms
- Second-Order Algorithms

**Note**: this taxonomy is inspired by the 2019 book “Algorithms for Optimization.”

Let’s take a closer look at each in turn.

Bracketing optimization algorithms are intended for optimization problems with one input variable where the optimum is known to exist within a specific range.

Bracketing algorithms are able to efficiently navigate the known range and locate the optimum, although they assume only a single optimum is present (referred to as unimodal objective functions).

Some bracketing algorithms can be used without derivative information if it is not available.

Examples of bracketing algorithms include:

- Fibonacci Search
- Golden Section Search
- Bisection Method
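As an illustration of the bracketing idea, here is a minimal sketch of a golden section search on a one-dimensional unimodal function. The objective, the bracket `[-5, 5]`, and the tolerance are assumptions chosen for the example. The version below re-evaluates both interior points on each iteration for clarity rather than reusing one evaluation, as an efficient implementation would.

```python
# golden section search for a 1D unimodal function (illustrative sketch)
from math import sqrt

def golden_section_search(func, a, b, tol=1e-6):
    # ratio used to place the two interior points
    inv_phi = (sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    # shrink the bracket until it is narrower than the tolerance
    while abs(b - a) > tol:
        if func(c) < func(d):
            # the minimum lies in [a, d]
            b = d
        else:
            # the minimum lies in [c, b]
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0

def objective(x):
    # example objective (an assumption) with its minimum at x = 2
    return (x - 2.0) ** 2

best = golden_section_search(objective, -5.0, 5.0)
print('x = %.6f, f(x) = %.6f' % (best, objective(best)))
```

Note that this procedure only compares function values, so it does not need derivative information.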

Local descent optimization algorithms are intended for optimization problems with more than one input variable and a single global optimum (e.g. unimodal objective function).

Perhaps the most common example of a local descent algorithm is the line search algorithm.

- Line Search

There are many variations of the line search (e.g. the Brent-Dekker algorithm), but the procedure generally involves choosing a direction to move in the search space, then performing a bracketing-type search along a line or hyperplane in the chosen direction.

This process is repeated until no further improvements can be made.

The limitation is that it is computationally expensive to optimize each directional move in the search space.

First-order optimization algorithms explicitly involve using the first derivative (gradient) to choose the direction to move in the search space.

The procedures involve first calculating the gradient of the function, then moving in the direction opposite to the gradient (e.g. downhill to the minimum for minimization problems) using a step size (also called the learning rate).

The step size is a hyperparameter that controls how far to move in the search space, unlike “local descent algorithms” that perform a full line search for each directional move.

A step size that is too small results in a search that takes a long time and can get stuck, whereas a step size that is too large will result in zig-zagging or bouncing around the search space, missing the optimum completely.
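The effect of the step size can be seen in a few lines of code. The sketch below assumes a simple one-dimensional objective, f(x) = x^2, with a hand-written derivative; the function names and the two step sizes are illustrative choices only.

```python
# gradient descent on f(x) = x^2 with two different step sizes (illustrative sketch)

def objective(x):
    return x ** 2.0

def derivative(x):
    return 2.0 * x

def gradient_descent(derivative, x0, step_size, n_iter):
    x = x0
    for _ in range(n_iter):
        # move against the gradient, scaled by the step size
        x = x - step_size * derivative(x)
    return x

# a modest step size moves steadily toward the minimum at x = 0
good = gradient_descent(derivative, 3.0, 0.1, 50)
# a step size that is too large overshoots further on every iteration
too_large = gradient_descent(derivative, 3.0, 1.1, 50)
print('step 0.1: x = %.6f, f(x) = %.6f' % (good, objective(good)))
print('step 1.1: x = %.1f, f(x) = %.1f' % (too_large, objective(too_large)))
```

With the smaller step size each update shrinks x by a constant factor, while with the larger one each update flips the sign of x and increases its magnitude, so the search moves away from the minimum.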

First-order algorithms are generally referred to as gradient descent, with more specific names referring to minor extensions to the procedure, e.g.:

- Gradient Descent
- Momentum
- Adagrad
- RMSProp
- Adam

The gradient descent algorithm also provides the template for the popular stochastic version of the algorithm, named Stochastic Gradient Descent (SGD), which is used to train artificial neural network (deep learning) models.

The important difference is that the gradient is approximated rather than calculated directly, using the prediction error on training data: one sample (stochastic), all examples (batch), or a small subset of the training data (mini-batch).

The extensions designed to accelerate the gradient descent algorithm (momentum, etc.) can be and are commonly used with SGD.

- Stochastic Gradient Descent
- Batch Gradient Descent
- Mini-Batch Gradient Descent
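To make the idea of an approximated gradient concrete, the following is a minimal sketch of mini-batch gradient descent fitting a straight line. The synthetic noise-free dataset, the squared-error loss, and the hyperparameter values are all assumptions chosen for the example. Setting `batch_size` to 1 would give the stochastic variant, and setting it to the dataset size would give the batch variant.

```python
# mini-batch gradient descent for a one-variable linear model (illustrative sketch)
import random

rng = random.Random(1)
# synthetic noise-free data (an assumption) drawn from y = 2x + 1
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-20, 21)]

def batch_gradient(batch, w, b):
    # gradient of the mean squared error, estimated from the batch only
    gw, gb = 0.0, 0.0
    for x, y in batch:
        error = (w * x + b) - y
        gw += 2.0 * error * x
        gb += 2.0 * error
    n = len(batch)
    return gw / n, gb / n

w, b = 0.0, 0.0
step_size = 0.05
batch_size = 8
for epoch in range(200):
    # visit the examples in a new random order each epoch
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        gw, gb = batch_gradient(batch, w, b)
        w -= step_size * gw
        b -= step_size * gb
print('w = %.4f, b = %.4f' % (w, b))
```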

Second-order optimization algorithms explicitly involve using the second derivative (Hessian) to choose the direction to move in the search space.

These algorithms are only appropriate for those objective functions where the Hessian matrix can be calculated or approximated.

Examples of second-order optimization algorithms for univariate objective functions include:

- Newton’s Method
- Secant Method
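As a rough illustration of how second-derivative information is used, here is a minimal sketch of Newton's method for a univariate objective. The objective, its hand-derived derivatives, and the starting point are assumptions for the example. Each update divides the first derivative by the second derivative, so no step size needs to be chosen.

```python
# Newton's method for a univariate objective (illustrative sketch)

def objective(x):
    # example objective (an assumption): f(x) = x^4 - 3x^3 + 2
    return x ** 4 - 3.0 * x ** 3 + 2.0

def first_derivative(x):
    return 4.0 * x ** 3 - 9.0 * x ** 2

def second_derivative(x):
    return 12.0 * x ** 2 - 18.0 * x

def newtons_method(d1, d2, x0, n_iter=50, tol=1e-10):
    x = x0
    for _ in range(n_iter):
        # the curvature scales the move, replacing a fixed step size
        step = d1(x) / d2(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

solution = newtons_method(first_derivative, second_derivative, 3.0)
print('x = %.6f, f(x) = %.6f' % (solution, objective(solution)))
```

For this objective the local minimum is at x = 2.25, where the first derivative is zero and the second derivative is positive. The sketch does not guard against a zero second derivative, which a practical implementation would need to handle.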

Second-order methods for multivariate objective functions that approximate the Hessian rather than calculating it directly are referred to as Quasi-Newton Methods.

- Quasi-Newton Method

There are many Quasi-Newton Methods, and they are typically named for the developers of the algorithm, such as:

- Davidon-Fletcher-Powell (DFP)
- Broyden-Fletcher-Goldfarb-Shanno (BFGS)
- Limited-memory BFGS (L-BFGS)

Now that we are familiar with the so-called classical optimization algorithms, let’s look at algorithms used when the objective function is not differentiable.

Optimization algorithms that make use of the derivative of the objective function are fast and efficient.

Nevertheless, there are objective functions where the derivative cannot be calculated, typically because the function is complex for a variety of real-world reasons. In other cases, the derivative can be calculated in some regions of the domain but not all, or it is not a good guide for the search.

Some properties of objective functions that cause difficulties for the classical algorithms described in the previous section include:

- No analytical description of the function (e.g. simulation).
- Multiple global optima (e.g. multimodal).
- Stochastic function evaluation (e.g. noisy).
- Discontinuous objective function (e.g. regions with invalid solutions).

As such, there are optimization algorithms that do not expect first- or second-order derivatives to be available.

These algorithms are sometimes referred to as black-box optimization algorithms as they assume little or nothing (relative to the classical methods) about the objective function.

Groups of these algorithms include:

- Direct Algorithms
- Stochastic Algorithms
- Population Algorithms

Let’s take a closer look at each in turn.

Direct optimization algorithms are for objective functions for which derivatives cannot be calculated.

The algorithms are deterministic procedures and often assume the objective function has a single global optimum, e.g. unimodal.

Direct search methods are also typically referred to as a “*pattern search*” as they may navigate the search space using geometric shapes or decisions, e.g. patterns.

Gradient information is approximated directly (hence the name) from the results of the objective function by comparing the relative difference between scores for points in the search space. These direct estimates are then used to choose a direction to move in the search space and triangulate the region of the optimum.

Examples of direct search algorithms include:

- Cyclic Coordinate Search
- Powell’s Method
- Hooke-Jeeves Method
- Nelder-Mead Simplex Search
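The flavor of these methods can be shown with a minimal sketch of a coordinate-wise pattern search. This is a simplified illustration in the spirit of the methods listed above, not a faithful implementation of any one of them; the objective, starting point, and step-halving rule are assumptions for the example. It probes a fixed step along each axis, keeps any improvement, and halves the step when no probe helps.

```python
# simplified coordinate-wise pattern search (illustrative sketch)

def pattern_search(func, start, step=1.0, tol=1e-6, max_iter=10000):
    point = list(start)
    best = func(point)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        # probe a fixed pattern of points: plus and minus one step along each axis
        for i in range(len(point)):
            for delta in (step, -step):
                candidate = list(point)
                candidate[i] += delta
                score = func(candidate)
                if score < best:
                    point, best = candidate, score
                    improved = True
                    break
        # shrink the pattern when no probe improves the current point
        if not improved:
            step /= 2.0
    return point, best

def objective(v):
    # example unimodal objective (an assumption) with its minimum at (1, -2)
    x, y = v
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

solution, score = pattern_search(objective, [5.0, 5.0])
print('f(%s) = %.6f' % (solution, score))
```

The search is fully deterministic and uses only comparisons between objective values, never a derivative.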

Stochastic optimization algorithms are algorithms that make use of randomness in the search procedure for objective functions for which derivatives cannot be calculated.

Unlike the deterministic direct search methods, stochastic algorithms typically involve a lot more sampling of the objective function, but are able to handle problems with deceptive local optima.

Stochastic optimization algorithms include:

- Simulated Annealing
- Evolution Strategy
- Cross-Entropy Method
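To show how randomness enters the search, the following is a minimal sketch of simulated annealing on a one-dimensional multimodal function. The objective, bounds, cooling schedule (initial temperature divided by the iteration number), and hyperparameter values are assumptions for the example. Worse candidates are sometimes accepted, with a probability that falls as the temperature drops, which is what allows the search to leave a local optimum.

```python
# simulated annealing on a 1D multimodal function (illustrative sketch)
import math
import random

def objective(x):
    # example multimodal objective (an assumption)
    return x ** 2 + 10.0 * math.sin(x)

def simulated_annealing(func, bounds, n_iter, step_size, temp, rng):
    low, high = bounds
    # random starting point within the bounds
    current = rng.uniform(low, high)
    current_eval = func(current)
    best, best_eval = current, current_eval
    for i in range(n_iter):
        # random step from the current point, clipped to the bounds
        candidate = current + rng.gauss(0.0, 1.0) * step_size
        candidate = min(max(candidate, low), high)
        candidate_eval = func(candidate)
        # always remember the best point seen so far
        if candidate_eval < best_eval:
            best, best_eval = candidate, candidate_eval
        # accept improvements, and sometimes accept worse points
        diff = candidate_eval - current_eval
        t = temp / float(i + 1)
        if diff < 0 or rng.random() < math.exp(-diff / t):
            current, current_eval = candidate, candidate_eval
    return best, best_eval

best, best_eval = simulated_annealing(objective, (-5.0, 5.0), 1000, 0.5, 10.0, random.Random(1))
print('f(%.5f) = %.5f' % (best, best_eval))
```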

Population optimization algorithms are stochastic optimization algorithms that maintain a pool (a population) of candidate solutions that together are used to sample, explore, and home in on an optimum.

Algorithms of this type are intended for more challenging objective functions that may have noisy function evaluations and many global optima (multimodal), where finding a good or good enough solution is challenging or infeasible using other methods.

The pool of candidate solutions adds robustness to the search, increasing the likelihood of overcoming local optima.

Examples of population optimization algorithms include:

- Genetic Algorithm
- Differential Evolution
- Particle Swarm Optimization
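As an example of a population-based search, here is a minimal sketch of differential evolution on a two-dimensional function. The objective, bounds, population size, and the scale and crossover settings are assumptions for the example. Each population member is challenged by a trial vector built from three other members, and the trial replaces the member only if it scores at least as well.

```python
# differential evolution on a 2D function (illustrative sketch)
import random

def objective(v):
    # example objective (an assumption) with its minimum at the origin
    return sum(x ** 2 for x in v)

def differential_evolution(func, bounds, pop_size, n_gen, f_scale, cr, rng):
    dim = len(bounds)
    # random initial population within the bounds
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [func(ind) for ind in pop]
    for _ in range(n_gen):
        for j in range(pop_size):
            # pick three distinct members other than the current one
            others = [k for k in range(pop_size) if k != j]
            a, b, c = (pop[k] for k in rng.sample(others, 3))
            # at least one dimension always comes from the mutant
            forced = rng.randrange(dim)
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == forced:
                    value = a[d] + f_scale * (b[d] - c[d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)
                else:
                    value = pop[j][d]
                trial.append(value)
            # greedy selection: keep the trial only if it is no worse
            trial_score = func(trial)
            if trial_score <= scores[j]:
                pop[j], scores[j] = trial, trial_score
    best_index = min(range(pop_size), key=lambda k: scores[k])
    return pop[best_index], scores[best_index]

best, best_eval = differential_evolution(objective, [(-5.0, 5.0)] * 2, 20, 100, 0.5, 0.7, random.Random(1))
print('f(%s) = %.8f' % (best, best_eval))
```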

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Essentials of Metaheuristics, 2011.
- Computational Intelligence: An Introduction, 2007.
- Introduction to Stochastic Search and Optimization, 2003.

In this tutorial, you discovered a guided tour of different optimization algorithms.

Specifically, you learned:

- Optimization algorithms may be grouped into those that use derivatives and those that do not.
- Classical algorithms use the first and sometimes second derivative of the objective function.
- Direct search and stochastic algorithms are designed for objective functions where function derivatives are unavailable.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Choose an Optimization Algorithm appeared first on Machine Learning Mastery.

The post How to Manually Optimize Neural Network Models appeared first on Machine Learning Mastery.

Updates to the weights of the model are made using the backpropagation of error algorithm. The combination of the optimization and weight update algorithm was carefully chosen and is the most efficient approach known to fit neural networks.

Nevertheless, it is possible to use alternate optimization algorithms to fit a neural network model to a training dataset. This can be a useful exercise to learn more about how neural networks function and the central nature of optimization in applied machine learning. It may also be required for neural networks with unconventional model architectures and non-differentiable transfer functions.

In this tutorial, you will discover how to manually optimize the weights of neural network models.

After completing this tutorial, you will know:

- How to develop the forward inference pass for neural network models from scratch.
- How to optimize the weights of a Perceptron model for binary classification.
- How to optimize the weights of a Multilayer Perceptron model using stochastic hill climbing.

Let’s get started.

This tutorial is divided into three parts; they are:

- Optimize Neural Networks
- Optimize a Perceptron Model
- Optimize a Multilayer Perceptron

Deep learning or neural networks are a flexible type of machine learning.

They are models composed of nodes and layers inspired by the structure and function of the brain. A neural network model works by propagating a given input vector through one or more layers to produce a numeric output that can be interpreted for classification or regression predictive modeling.

Models are trained by repeatedly exposing the model to examples of input and output and adjusting the weights to minimize the error of the model’s output compared to the expected output. This is called the stochastic gradient descent optimization algorithm. The weights of the model are adjusted using a specific rule from calculus that assigns error proportionally to each weight in the network. This is called the backpropagation algorithm.

The stochastic gradient descent optimization algorithm with weight updates made using backpropagation is the best way to train neural network models. However, it is not the only way to train a neural network.

It is possible to use any arbitrary optimization algorithm to train a neural network model.

That is, we can define a neural network model architecture and use a given optimization algorithm to find a set of weights for the model that results in a minimum of prediction error or a maximum of classification accuracy.

Using alternate optimization algorithms is expected to be less efficient on average than using stochastic gradient descent with backpropagation. Nevertheless, it may be more efficient in some specific cases, such as non-standard network architectures or non-differentiable transfer functions.

It can also be an interesting exercise to demonstrate the central nature of optimization in training machine learning algorithms, and specifically neural networks.

Next, let’s explore how to train a simple one-node neural network called a Perceptron model using stochastic hill climbing.

The Perceptron algorithm is the simplest type of artificial neural network.

It is a model of a single neuron that can be used for two-class classification problems and provides the foundation for later developing much larger networks.

In this section, we will optimize the weights of a Perceptron neural network model.

First, let’s define a synthetic binary classification problem that we can use as the focus of optimizing the model.

We can use the make_classification() function to define a binary classification problem with 1,000 rows and five input variables.

The example below creates the dataset and summarizes the shape of the data.

    # define a binary classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
    # summarize the shape of the dataset
    print(X.shape, y.shape)

Running the example prints the shape of the created dataset, confirming our expectations.

(1000, 5) (1000,)

Next, we need to define a Perceptron model.

The Perceptron model has a single node that has one input weight for each column in the dataset.

Each input is multiplied by its corresponding weight to give a weighted sum and a bias weight is then added, like an intercept coefficient in a regression model. This weighted sum is called the activation. Finally, the activation is interpreted and used to predict the class label, 1 for a positive or zero activation and 0 for a negative activation.

Before we optimize the model weights, we must develop the model and our confidence in how it works.

Let’s start by defining a function for interpreting the activation of the model.

This is called the activation function, or the transfer function; the latter name is more traditional and is my preference.

The *transfer()* function below takes the activation of the model and returns a class label, class=1 for a positive or zero activation and class=0 for a negative activation. This is called a step transfer function.

    # transfer function
    def transfer(activation):
        if activation >= 0.0:
            return 1
        return 0

Next, we can develop a function that calculates the activation of the model for a given input row of data from the dataset.

This function will take the row of data and the weights for the model and calculate the weighted sum of the input with the addition of the bias weight. The *activate()* function below implements this.

**Note**: We are intentionally using simple Python lists and an imperative programming style instead of NumPy arrays or list comprehensions to make the code more readable for Python beginners. Feel free to optimize it and post your code in the comments below.

    # activation function
    def activate(row, weights):
        # add the bias, the last weight
        activation = weights[-1]
        # add the weighted input
        for i in range(len(row)):
            activation += weights[i] * row[i]
        return activation

Next, we can use the *activate()* and *transfer()* functions together to generate a prediction for a given row of data. The *predict_row()* function below implements this.

    # use model weights to predict 0 or 1 for a given row of data
    def predict_row(row, weights):
        # activate for input
        activation = activate(row, weights)
        # transfer for activation
        return transfer(activation)

Next, we can call the *predict_row()* function for each row in a given dataset. The *predict_dataset()* function below implements this.

Again, we are intentionally using a simple imperative coding style for readability instead of list comprehensions.

    # use model weights to generate predictions for a dataset of rows
    def predict_dataset(X, weights):
        yhats = list()
        for row in X:
            yhat = predict_row(row, weights)
            yhats.append(yhat)
        return yhats
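For readers comfortable with NumPy, the same forward pass can be written without explicit loops. The following is a minimal sketch, not the tutorial's own code: the function name `predict_dataset_vectorized` and the tiny demo inputs are hypothetical, and the sketch assumes the same weight layout as above, with the bias stored as the last weight.

```python
# vectorized Perceptron predictions with NumPy (illustrative sketch)
from numpy import asarray

def predict_dataset_vectorized(X, weights):
    # hypothetical helper: same weight layout as above, bias is the last weight
    X = asarray(X, dtype=float)
    weights = asarray(weights, dtype=float)
    # weighted sum of the inputs plus the bias for every row at once
    activations = X.dot(weights[:-1]) + weights[-1]
    # step transfer function: 1 for a positive or zero activation, 0 otherwise
    return (activations >= 0.0).astype(int)

# tiny hand-made inputs (assumptions) to check the behavior
X_demo = [[1.0, 2.0], [-1.0, -2.0], [0.0, 0.0]]
weights_demo = [0.5, 0.25, -0.1]
yhat = predict_dataset_vectorized(X_demo, weights_demo)
print(yhat)
```

The three activations here are 0.9, -1.1, and -0.1, so the predicted labels are 1, 0, and 0. The loop-based functions are kept in the rest of the tutorial for readability.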

Finally, we can use the model to make predictions on our synthetic dataset to confirm it is all working correctly.

We can generate a random set of model weights using the rand() function.

Recall that we need one weight for each input (five inputs in this dataset) plus an extra weight for the bias weight.

    ...
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
    # determine the number of weights
    n_weights = X.shape[1] + 1
    # generate random weights
    weights = rand(n_weights)

We can then use these weights with the dataset to make predictions.

    ...
    # generate predictions for dataset
    yhat = predict_dataset(X, weights)

We can evaluate the classification accuracy of these predictions.

    ...
    # calculate accuracy
    score = accuracy_score(y, yhat)
    print(score)

That’s it.

We can tie all of this together and demonstrate our simple Perceptron model for classification. The complete example is listed below.

    # simple perceptron model for binary classification
    from numpy.random import rand
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    # transfer function
    def transfer(activation):
        if activation >= 0.0:
            return 1
        return 0
    # activation function
    def activate(row, weights):
        # add the bias, the last weight
        activation = weights[-1]
        # add the weighted input
        for i in range(len(row)):
            activation += weights[i] * row[i]
        return activation
    # use model weights to predict 0 or 1 for a given row of data
    def predict_row(row, weights):
        # activate for input
        activation = activate(row, weights)
        # transfer for activation
        return transfer(activation)
    # use model weights to generate predictions for a dataset of rows
    def predict_dataset(X, weights):
        yhats = list()
        for row in X:
            yhat = predict_row(row, weights)
            yhats.append(yhat)
        return yhats
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
    # determine the number of weights
    n_weights = X.shape[1] + 1
    # generate random weights
    weights = rand(n_weights)
    # generate predictions for dataset
    yhat = predict_dataset(X, weights)
    # calculate accuracy
    score = accuracy_score(y, yhat)
    print(score)

Running the example generates a prediction for each example in the training dataset then prints the classification accuracy for the predictions.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We would expect about 50 percent accuracy given a set of random weights and a dataset with an equal number of examples in each class, and that is approximately what we see in this case.

0.548

We can now optimize the weights of the model to achieve good accuracy on this dataset.

First, we need to split the dataset into train and test sets. It is important to hold back some data not used in optimizing the model so that we can prepare a reasonable estimate of the performance of the model when used to make predictions on new data.

We will use 67 percent of the data for training and the remaining 33 percent as a test set for evaluating the performance of the model.

    ...
    # split into train test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Next, we can develop a stochastic hill climbing algorithm.

The optimization algorithm requires an objective function to optimize. It must take a set of weights and return a score that is to be minimized or maximized corresponding to a better model.

In this case, we will evaluate the accuracy of the model with a given set of weights and return the classification accuracy, which must be maximized.

The *objective()* function below implements this, given the dataset and a set of weights, and returns the accuracy of the model.

    # objective function
    def objective(X, y, weights):
        # generate predictions for dataset
        yhat = predict_dataset(X, weights)
        # calculate accuracy
        score = accuracy_score(y, yhat)
        return score

Next, we can define the stochastic hill climbing algorithm.

The algorithm will require an initial solution (e.g. random weights) and will iteratively keep making small changes to the solution and checking if it results in a better performing model. The amount of change made to the current solution is controlled by a *step_size* hyperparameter. This process will continue for a fixed number of iterations, also provided as a hyperparameter.

The *hillclimbing()* function below implements this, taking the dataset, objective function, initial solution, and hyperparameters as arguments and returning the best set of weights found and the estimated performance.

    # hill climbing local search algorithm
    def hillclimbing(X, y, objective, solution, n_iter, step_size):
        # evaluate the initial point
        solution_eval = objective(X, y, solution)
        # run the hill climb
        for i in range(n_iter):
            # take a step
            candidate = solution + randn(len(solution)) * step_size
            # evaluate candidate point
            candidate_eval = objective(X, y, candidate)
            # check if we should keep the new point
            if candidate_eval >= solution_eval:
                # store the new point
                solution, solution_eval = candidate, candidate_eval
                # report progress
                print('>%d %.5f' % (i, solution_eval))
        return [solution, solution_eval]

We can then call this function, passing in a set of weights as the initial solution and the training dataset as the dataset to optimize the model against.

    ...
    # define the total iterations
    n_iter = 1000
    # define the maximum step size
    step_size = 0.05
    # determine the number of weights
    n_weights = X.shape[1] + 1
    # define the initial solution
    solution = rand(n_weights)
    # perform the hill climbing search
    weights, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
    print('Done!')
    print('f(%s) = %f' % (weights, score))

Finally, we can evaluate the best model on the test dataset and report the performance.

    ...
    # generate predictions for the test dataset
    yhat = predict_dataset(X_test, weights)
    # calculate accuracy
    score = accuracy_score(y_test, yhat)
    print('Test Accuracy: %.5f' % (score * 100))

Tying this together, the complete example of optimizing the weights of a Perceptron model on the synthetic binary classification dataset is listed below.

    # hill climbing to optimize weights of a perceptron model for classification
    from numpy.random import randn
    from numpy.random import rand
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    # transfer function
    def transfer(activation):
        if activation >= 0.0:
            return 1
        return 0
    # activation function
    def activate(row, weights):
        # add the bias, the last weight
        activation = weights[-1]
        # add the weighted input
        for i in range(len(row)):
            activation += weights[i] * row[i]
        return activation
    # use model weights to predict 0 or 1 for a given row of data
    def predict_row(row, weights):
        # activate for input
        activation = activate(row, weights)
        # transfer for activation
        return transfer(activation)
    # use model weights to generate predictions for a dataset of rows
    def predict_dataset(X, weights):
        yhats = list()
        for row in X:
            yhat = predict_row(row, weights)
            yhats.append(yhat)
        return yhats
    # objective function
    def objective(X, y, weights):
        # generate predictions for dataset
        yhat = predict_dataset(X, weights)
        # calculate accuracy
        score = accuracy_score(y, yhat)
        return score
    # hill climbing local search algorithm
    def hillclimbing(X, y, objective, solution, n_iter, step_size):
        # evaluate the initial point
        solution_eval = objective(X, y, solution)
        # run the hill climb
        for i in range(n_iter):
            # take a step
            candidate = solution + randn(len(solution)) * step_size
            # evaluate candidate point
            candidate_eval = objective(X, y, candidate)
            # check if we should keep the new point
            if candidate_eval >= solution_eval:
                # store the new point
                solution, solution_eval = candidate, candidate_eval
                # report progress
                print('>%d %.5f' % (i, solution_eval))
        return [solution, solution_eval]
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
    # split into train test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    # define the total iterations
    n_iter = 1000
    # define the maximum step size
    step_size = 0.05
    # determine the number of weights
    n_weights = X.shape[1] + 1
    # define the initial solution
    solution = rand(n_weights)
    # perform the hill climbing search
    weights, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
    print('Done!')
    print('f(%s) = %f' % (weights, score))
    # generate predictions for the test dataset
    yhat = predict_dataset(X_test, weights)
    # calculate accuracy
    score = accuracy_score(y_test, yhat)
    print('Test Accuracy: %.5f' % (score * 100))

Running the example will report the iteration number and classification accuracy each time there is an improvement made to the model.

At the end of the search, the performance of the best set of weights on the training dataset is reported and the performance of the same model on the test dataset is calculated and reported.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the optimization algorithm found a set of weights that achieved about 88.5 percent accuracy on the training dataset and about 81.8 percent accuracy on the test dataset.

```
...
>111 0.88060
>119 0.88060
>126 0.88209
>134 0.88209
>205 0.88209
>262 0.88209
>280 0.88209
>293 0.88209
>297 0.88209
>336 0.88209
>373 0.88209
>437 0.88358
>463 0.88507
>630 0.88507
>701 0.88507
Done!
f([ 0.0097317 0.13818088 1.17634326 -0.04296336 0.00485813 -0.14767616]) = 0.885075
Test Accuracy: 81.81818
```

Now that we are familiar with how to manually optimize the weights of a Perceptron model, let’s look at how we can extend the example to optimize the weights of a Multilayer Perceptron (MLP) model.

A Multilayer Perceptron (MLP) model is a neural network with one or more layers, where each layer has one or more nodes.

It is an extension of a Perceptron model and is perhaps the most widely used neural network (deep learning) model.

In this section, we will build on what we learned in the previous section to optimize the weights of MLP models with an arbitrary number of layers and nodes per layer.

First, we will develop the model and test it with random weights, then use stochastic hill climbing to optimize the model weights.

When using MLPs for binary classification, it is common to use a sigmoid transfer function (also called the logistic function) instead of the step transfer function used in the Perceptron.

This function outputs a real value between 0 and 1 that represents a binomial probability distribution, e.g. the probability that an example belongs to class=1. The *transfer()* function below implements this.

```python
# transfer function
def transfer(activation):
    # sigmoid transfer function
    return 1.0 / (1.0 + exp(-activation))
```
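One practical caveat: with *math.exp()*, a sigmoid written this way raises an *OverflowError* when the activation is a large negative number, because *exp(-activation)* becomes too large to represent. The sketch below shows one common way to avoid that. The name *stable_transfer()* is hypothetical and not part of the tutorial; it is offered only as an optional alternative, not as a required change.

```python
from math import exp

# hypothetical drop-in alternative to transfer(); not part of the original tutorial
def stable_transfer(activation):
    # for non-negative inputs, exp(-activation) is at most 1 and cannot overflow
    if activation >= 0.0:
        return 1.0 / (1.0 + exp(-activation))
    # for negative inputs, use the equivalent form exp(a) / (1 + exp(a))
    z = exp(activation)
    return z / (1.0 + z)

# the midpoint and two extreme activations
print(stable_transfer(0.0))
print(stable_transfer(-1000.0))
print(stable_transfer(1000.0))
```

With the small initial weights and step sizes used in this tutorial, the simpler version is unlikely to overflow, so this matters mainly if the search is allowed to drift to very large weights.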

We can use the same *activate()* function from the previous section. Here, we will use it to calculate the activation for each node in a given layer.

The *predict_row()* function must be replaced with a more elaborate version.

The function takes a row of data and the network and returns the output of the network.

We will define our network as a list of lists. Each layer will be a list of nodes and each node will be a list or array of weights.

To calculate the prediction of the network, we simply enumerate the layers, then enumerate nodes, then calculate the activation and transfer output for each node. In this case, we will use the same transfer function for all nodes in the network, although this does not have to be the case.

For networks with more than one layer, the output from the previous layer is used as input to each node in the next layer. The output from the final layer in the network is then returned.

The *predict_row()* function below implements this.

```python
# activation function for a network
def predict_row(row, network):
    inputs = row
    # enumerate the layers in the network from input to output
    for layer in network:
        new_inputs = list()
        # enumerate nodes in the layer
        for node in layer:
            # activate the node
            activation = activate(inputs, node)
            # transfer activation
            output = transfer(activation)
            # store output
            new_inputs.append(output)
        # output from this layer is input to the next layer
        inputs = new_inputs
    return inputs[0]
```

That’s about it.

Finally, we need to define a network to use.

For example, we can define an MLP with a single hidden layer with a single node as follows:

```python
...
# create a one node network
node = rand(n_inputs + 1)
layer = [node]
network = [layer]
```

This is practically a Perceptron, although with a sigmoid transfer function. Quite boring.

Let’s define an MLP with one hidden layer and one output layer. The first hidden layer will have 10 nodes, and each node will take the input pattern from the dataset (e.g. five inputs). The output layer will have a single node that takes inputs from the outputs of the first hidden layer and then outputs a prediction.

```python
...
# one hidden layer and an output layer
n_hidden = 10
hidden1 = [rand(n_inputs + 1) for _ in range(n_hidden)]
output1 = [rand(n_hidden + 1)]
network = [hidden1, output1]
```

We can then use the model to make predictions on the dataset.

```python
...
# generate predictions for dataset
yhat = predict_dataset(X, network)
```

Before we calculate the classification accuracy, we must round the predictions to class labels 0 and 1.

```python
...
# round the predictions
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
print(score)
```

Tying this all together, the complete example of evaluating an MLP with random initial weights on our synthetic binary classification dataset is listed below.

```python
# develop an mlp model for classification
from math import exp
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# transfer function
def transfer(activation):
    # sigmoid transfer function
    return 1.0 / (1.0 + exp(-activation))

# activation function
def activate(row, weights):
    # add the bias, the last weight
    activation = weights[-1]
    # add the weighted input
    for i in range(len(row)):
        activation += weights[i] * row[i]
    return activation

# activation function for a network
def predict_row(row, network):
    inputs = row
    # enumerate the layers in the network from input to output
    for layer in network:
        new_inputs = list()
        # enumerate nodes in the layer
        for node in layer:
            # activate the node
            activation = activate(inputs, node)
            # transfer activation
            output = transfer(activation)
            # store output
            new_inputs.append(output)
        # output from this layer is input to the next layer
        inputs = new_inputs
    return inputs[0]

# use model weights to generate predictions for a dataset of rows
def predict_dataset(X, network):
    yhats = list()
    for row in X:
        yhat = predict_row(row, network)
        yhats.append(yhat)
    return yhats

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# determine the number of inputs
n_inputs = X.shape[1]
# one hidden layer and an output layer
n_hidden = 10
hidden1 = [rand(n_inputs + 1) for _ in range(n_hidden)]
output1 = [rand(n_hidden + 1)]
network = [hidden1, output1]
# generate predictions for dataset
yhat = predict_dataset(X, network)
# round the predictions
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
print(score)
```

Running the example generates a prediction for each example in the training dataset, then prints the classification accuracy for the predictions.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Again, we would expect about 50 percent accuracy given a set of random weights and a dataset with an equal number of examples in each class, and that is approximately what we see in this case.

0.499

Next, we can apply the stochastic hill climbing algorithm to the dataset.

It is very much the same as applying hill climbing to the Perceptron model, except in this case, a step requires a modification to all weights in the network.

For this, we will develop a new function that creates a copy of the network and mutates each weight in the network while making the copy.

The *step()* function below implements this.

```python
# take a step in the search space
def step(network, step_size):
    new_net = list()
    # enumerate layers in the network
    for layer in network:
        new_layer = list()
        # enumerate nodes in this layer
        for node in layer:
            # mutate the node
            new_node = node.copy() + randn(len(node)) * step_size
            # store node in layer
            new_layer.append(new_node)
        # store layer in network
        new_net.append(new_layer)
    return new_net
```

Modifying all weights in the network is aggressive.

A less aggressive step in the search space might be to make a small change to a subset of the weights in the model, perhaps controlled by a hyperparameter. This is left as an extension.
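One possible way to approach this extension is sketched below. The function name *step_subset()* and the hyperparameter *p_mutate* (the probability that any single weight is changed) are hypothetical names introduced here for illustration, and the small network at the bottom simply mirrors the 5-input, 10-hidden-node layout used in this tutorial.

```python
from numpy.random import rand, randn, seed

# hypothetical variant of step(): each weight is perturbed with probability p_mutate
def step_subset(network, step_size, p_mutate=0.2):
    new_net = list()
    # enumerate layers in the network
    for layer in network:
        new_layer = list()
        # enumerate nodes in this layer
        for node in layer:
            # boolean mask selecting which weights in this node to change
            mask = rand(len(node)) < p_mutate
            # Gaussian noise is applied only where the mask is True
            new_node = node.copy() + mask * randn(len(node)) * step_size
            new_layer.append(new_node)
        new_net.append(new_layer)
    return new_net

# small demonstration on a network with 5 inputs, 10 hidden nodes and 1 output node
seed(1)
hidden = [rand(5 + 1) for _ in range(10)]
output = [rand(10 + 1)]
network = [hidden, output]
candidate = step_subset(network, 0.1, p_mutate=0.2)
# count how many of the 71 weights were actually changed
changed = sum(int((a != b).sum()) for la, lb in zip(network, candidate) for a, b in zip(la, lb))
print('weights changed: %d of %d' % (changed, 71))
```

Setting *p_mutate* to 1.0 recovers the behavior of the original *step()* function, so the two can be compared directly by changing a single argument.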

We can then call this new *step()* function from the hillclimbing() function.

```python
# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
    # evaluate the initial point
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = step(solution, step_size)
        # evaluate candidate point
        candidte_eval = objective(X, y, candidate)
        # check if we should keep the new point
        if candidte_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidte_eval
            # report progress
            print('>%d %f' % (i, solution_eval))
    return [solution, solution_eval]
```

Tying this together, the complete example of applying stochastic hill climbing to optimize the weights of an MLP model for binary classification is listed below.

```python
# stochastic hill climbing to optimize a multilayer perceptron for classification
from math import exp
from numpy.random import randn
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# transfer function
def transfer(activation):
    # sigmoid transfer function
    return 1.0 / (1.0 + exp(-activation))

# activation function
def activate(row, weights):
    # add the bias, the last weight
    activation = weights[-1]
    # add the weighted input
    for i in range(len(row)):
        activation += weights[i] * row[i]
    return activation

# activation function for a network
def predict_row(row, network):
    inputs = row
    # enumerate the layers in the network from input to output
    for layer in network:
        new_inputs = list()
        # enumerate nodes in the layer
        for node in layer:
            # activate the node
            activation = activate(inputs, node)
            # transfer activation
            output = transfer(activation)
            # store output
            new_inputs.append(output)
        # output from this layer is input to the next layer
        inputs = new_inputs
    return inputs[0]

# use model weights to generate predictions for a dataset of rows
def predict_dataset(X, network):
    yhats = list()
    for row in X:
        yhat = predict_row(row, network)
        yhats.append(yhat)
    return yhats

# objective function
def objective(X, y, network):
    # generate predictions for dataset
    yhat = predict_dataset(X, network)
    # round the predictions
    yhat = [round(y) for y in yhat]
    # calculate accuracy
    score = accuracy_score(y, yhat)
    return score

# take a step in the search space
def step(network, step_size):
    new_net = list()
    # enumerate layers in the network
    for layer in network:
        new_layer = list()
        # enumerate nodes in this layer
        for node in layer:
            # mutate the node
            new_node = node.copy() + randn(len(node)) * step_size
            # store node in layer
            new_layer.append(new_node)
        # store layer in network
        new_net.append(new_layer)
    return new_net

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
    # evaluate the initial point
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = step(solution, step_size)
        # evaluate candidate point
        candidte_eval = objective(X, y, candidate)
        # check if we should keep the new point
        if candidte_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidte_eval
            # report progress
            print('>%d %f' % (i, solution_eval))
    return [solution, solution_eval]

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# define the total iterations
n_iter = 1000
# define the maximum step size
step_size = 0.1
# determine the number of inputs
n_inputs = X.shape[1]
# one hidden layer and an output layer
n_hidden = 10
hidden1 = [rand(n_inputs + 1) for _ in range(n_hidden)]
output1 = [rand(n_hidden + 1)]
network = [hidden1, output1]
# perform the hill climbing search
network, score = hillclimbing(X_train, y_train, objective, network, n_iter, step_size)
print('Done!')
print('Best: %f' % (score))
# generate predictions for the test dataset
yhat = predict_dataset(X_test, network)
# round the predictions
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %.5f' % (score * 100))
```

Running the example will report the iteration number and classification accuracy each time a candidate is accepted, that is, each time the new weights are as good as or better than the current weights.

At the end of the search, the performance of the best set of weights on the training dataset is reported and the performance of the same model on the test dataset is calculated and reported.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the optimization algorithm found a set of weights that achieved about 87.3 percent accuracy on the training dataset and about 85.1 percent accuracy on the test dataset.

```
...
>55 0.755224
>56 0.765672
>59 0.794030
>66 0.805970
>77 0.835821
>120 0.838806
>165 0.840299
>188 0.841791
>218 0.846269
>232 0.852239
>237 0.852239
>239 0.855224
>292 0.867164
>368 0.868657
>823 0.868657
>852 0.871642
>889 0.871642
>892 0.871642
>992 0.873134
Done!
Best: 0.873134
Test Accuracy: 85.15152
```

This section provides more resources on the topic if you are looking to go deeper.

- Train-Test Split for Evaluating Machine Learning Algorithms
- How To Implement The Perceptron Algorithm From Scratch In Python
- How to Code a Neural Network with Backpropagation In Python (from scratch)

- sklearn.datasets.make_classification API.
- sklearn.metrics.accuracy_score API.
- numpy.random.rand API.

In this tutorial, you discovered how to manually optimize the weights of neural network models.

Specifically, you learned:

- How to develop the forward inference pass for neural network models from scratch.
- How to optimize the weights of a Perceptron model for binary classification.
- How to optimize the weights of a Multilayer Perceptron model using stochastic hill climbing.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Manually Optimize Neural Network Models appeared first on Machine Learning Mastery.

]]>The post Books on Genetic Programming appeared first on Machine Learning Mastery.

Genetic programming is a type of automatic programming intended for challenging problems where the task is well defined and solutions can be checked easily at a low cost, although the search space of possible solutions is vast, and there is little intuition as to the best way to solve the problem.

This often includes open problems such as controller design, circuit design, as well as predictive modeling tasks such as feature selection, classification, and regression.

It can be difficult for a beginner to get started in the field as there is a vast amount of literature going back decades.

In this tutorial, you will discover the top books on genetic programming.

Let’s get started.

There are a number of books on genetic programming, which can be grouped by type.

We will explore the top books on genetic programming divided into three main groups; they are:

- Genetic Programming (Koza)
- Textbooks
- Conference Proceedings

John Koza is a computer scientist who studied under John Holland, the inventor of the genetic algorithm.

Koza is typically credited with unifying the nascent field of genetic programming in the late 1980s and early 1990s.

He is famous for applying genetic programming to circuit design, which resulted in new patentable inventions, and for describing genetic programming as being able to routinely generate “human competitive” results.

He wrote a series of four textbooks on genetic programming, as follows:

- Genetic Programming: On the Programming of Computers by Means of Natural Selection, 1992.
- Genetic Programming II: Automatic Discovery of Reusable Programs, 1994.
- Genetic Programming III: Darwinian Invention and Problem Solving, 1999.
- Genetic Programming IV: Routine Human-Competitive Machine Intelligence, 2003.

His most recent book, “Genetic Programming IV,” is an excellent place to get started.

A table at the beginning of the book summarizes the four key takeaways; they are:

1. Genetic programming now routinely delivers high-return human-competitive machine intelligence.

2. Genetic programming is an automated invention machine.

3. Genetic programming can automatically create a general solution to a problem in the form of a parameterized topology.

4. Genetic programming has delivered a progression of qualitatively more substantial results in synchrony with five approximately order-of-magnitude increases in the expenditure of computer time.

— Page 1, Genetic Programming IV: Routine Human-Competitive Machine Intelligence, 2003.

The table of contents for this book is as follows:

- Chapter 01: Introduction
- Chapter 02: Background on Genetic Programming
- Chapter 03: Automatic Synthesis of Controllers
- Chapter 04: Automatic Synthesis of Circuits
- Chapter 05: Automatic Synthesis of Circuit Topology, Sizing, Placement, and Routing
- Chapter 06: Automatic Synthesis of Antennas
- Chapter 07: Automatic Synthesis of Genetic Networks
- Chapter 08: Automatic Synthesis of Metabolic Pathways
- Chapter 09: Automatic Synthesis of Parameterized Topologies for Controllers
- Chapter 10: Automatic Synthesis of Parameterized Topologies for Circuits
- Chapter 11: Automatic Synthesis of Parameterized Topologies with Conditional Developmental Operators for Circuits
- Chapter 12: Automatic Synthesis of Improved Tuning Rules for PID Controllers
- Chapter 13: Automatic Synthesis of Parameterized Topologies for Improved Controllers
- Chapter 14: Reinvention of Negative Feedback
- Chapter 15: Automated Reinvention of Six Post-2000 Patented Circuits
- Chapter 16: Problems for Which Genetic Programming May Be Well Suited
- Chapter 17: Parallel Implementation and Computer Time
- Chapter 18: Historical Perspective on Moore’s Law and the Progression of Qualitatively More Substantial Results Produced by Genetic Programming
- Chapter 19: Conclusion

A number of textbooks have been published on genetic programming designed for undergraduate and postgraduate students interested in the field.

Perhaps the most popular books include the following:

- Genetic Programming: An Introduction, 1997.
- Genetic Programming and Data Structures, 1998.
- Foundations of Genetic Programming, 2002.

I would recommend the more recent “*Foundations of Genetic Programming*.”

So Foundations of Genetic Programming should not be viewed only as a collection of techniques that one needs to know in order to be able to do GP well but also as a first attempt to chart and explore the mechanisms and fundamental principles behind genetic programming as a search algorithm. In writing this book we hoped to cast a tiny bit of light onto the theoretical foundations of Artificial Intelligence as a whole.

— Page IIX, Foundations of Genetic Programming, 2002.

The table of contents for this book is as follows:

- Chapter 01: Introduction
- Chapter 02: Fitness Landscapes
- Chapter 03: Program Component Schema Theories
- Chapter 04: Pessimistic GP Schema Theories
- Chapter 05: Exact GP Schema Theorems
- Chapter 06: Lessons from GP Schema Theory
- Chapter 07: The Genetic Programming Search Space
- Chapter 08: The GP Search Space: Theoretical Analysis
- Chapter 09: Example I: The Artificial Ant
- Chapter 10: Example II: The Max Problem
- Chapter 11: GP Convergence and Bloat
- Chapter 12: Conclusions

Perhaps one of the more popular books on GP was self-published by top academics in the field and is intended for students and developers interested in applying genetic programming to their projects.

Here’s a snippet from the book:

Many books have been written which describe aspects of GP. Some provide general introductions to the field as a whole. However, no new introductory book on GP has been produced in the last decade, and anyone wanting to learn about GP is forced to map the terrain painfully on their own. This book attempts to fill that gap, by providing a modern field guide to GP for both newcomers and old-timers.

— A Field Guide to Genetic Programming, 2008.

The table of contents for this book is as follows:

- Chapter 01: Introduction
- Chapter 02: Representation, Initialization and Operations in Tree-based GP
- Chapter 03: Getting Ready to Run Genetic Programming
- Chapter 04: Example Genetic Programming Run
- Chapter 05: Alternative Initializations and Operations in Tree-based GP
- Chapter 06: Modular, Grammatical and Developmental Tree-based GP
- Chapter 07: Linear and Graph Genetic Programming
- Chapter 08: Probabilistic Genetic Programming
- Chapter 09: Multi-objective Genetic Programming
- Chapter 10: Fast and Distributed Genetic Programming
- Chapter 11: GP Theory and its Applications
- Chapter 12: Applications
- Chapter 13: Troubleshooting GP
- Chapter 14: Conclusions

It is common to refer to versions of genetic programming algorithms specialized for different applications and representations by new names, such as “*Linear Genetic Programming*,” “*Cartesian Genetic Programming*,” and “*Grammatical Evolution*.”

Some textbooks on these specialized types of genetic programming algorithms include the following:

- Linear Genetic Programming, 2006.
- Cartesian Genetic Programming, 2011.
- Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language, 2003.
- Foundations in Grammatical Evolution for Dynamic Environments, 2009.
- Handbook of Grammatical Evolution, 2018.

The main way that findings are shared in machine learning is via conferences, and conference proceedings provide a collection of top papers from a conference.

The papers presented at any given conference can jump around topics and be challenging to follow without some grounding in the field. Nevertheless, they can quickly get you up to speed with current and popular techniques.

I recommend focusing on the most recent issues of any proceedings. No need to go trawling back through the years.

There are three conference proceedings you may want to look at; they are:

- Genetic Programming Theory and Practice
- Genetic Programming European Conference
- Advances in Genetic Programming

Let’s take a closer look at each in turn:

The Genetic Programming Theory and Practice conference is held annually, and the proceedings are printed by Springer.

It is probably the premier conference on GP. It is up to issue 17 (XVII) at the time of writing.

The last three issues are as follows:

- Genetic Programming Theory and Practice XV, 2018.
- Genetic Programming Theory and Practice XVI, 2019.
- Genetic Programming Theory and Practice XVII, 2020.

The Genetic Programming European Conference, or EuroGP, is another major genetic programming conference.

Like Genetic Programming Theory and Practice, this conference and its published proceedings have been going for decades and are in their 23rd year at the time of writing.

The last three issues are as follows:

- Genetic Programming: 21st European Conference, EuroGP 2018
- Genetic Programming: 22nd European Conference, EuroGP 2019
- Genetic Programming: 23rd European Conference, EuroGP 2020

“*Advances in Genetic Programming*” is a volume published by MIT Press containing collected papers.

It was only published three times in the mid to late 1990s. Nevertheless, the contents may be useful for developing a deeper understanding of the field.

- Advances in Genetic Programming, 1994.
- Advances in Genetic Programming 2, 1996.
- Advances in Genetic Programming 3, 1999.

I have read most of the books listed.

If you are looking to get a single book on genetic programming, I would recommend the following:

It will introduce the field and show you how to get results quickly.

If you are looking for a fuller library of books, I would recommend the following three:

- Genetic Programming IV: Routine Human-Competitive Machine Intelligence, 2003.
- Foundations of Genetic Programming, 2002.
- A Field Guide to Genetic Programming, 2008.

I have these three on my bookshelf.

With these three books, you will have a solid theoretical foundation, an idea of how to apply the technique in practice, and an idea of the types of human competitive results that have been achieved and the algorithms used to achieve them.

In this tutorial, you discovered the top books on genetic programming.

**Have you read any of the above books?**

What did you think?

**Did I miss your favorite book?**

Let me know in the comments below.

The post Books on Genetic Programming appeared first on Machine Learning Mastery.

]]>The post Stochastic Hill Climbing in Python from Scratch appeared first on Machine Learning Mastery.

Stochastic hill climbing makes use of randomness as part of the search process. This makes the algorithm appropriate for nonlinear objective functions where other local search algorithms do not operate well.

It is also a local search algorithm, meaning that it modifies a single solution and searches the relatively local area of the search space until the local optima is located. This means that it is appropriate on unimodal optimization problems or for use after the application of a global optimization algorithm.

In this tutorial, you will discover the hill climbing optimization algorithm for function optimization.

After completing this tutorial, you will know:

- Hill climbing is a stochastic local search algorithm for function optimization.
- How to implement the hill climbing algorithm from scratch in Python.
- How to apply the hill climbing algorithm and inspect the results of the algorithm.

Let’s get started.

This tutorial is divided into three parts; they are:

- Hill Climbing Algorithm
- Hill Climbing Algorithm Implementation
- Example of Applying the Hill Climbing Algorithm

The stochastic hill climbing algorithm is a stochastic local search optimization algorithm.

It takes an initial point as input and a step size, where the step size is a distance within the search space.

The algorithm takes the initial point as the current best candidate solution and generates a new point within the step size distance of the provided point. The generated point is evaluated, and if it is equal or better than the current point, it is taken as the current point.

The new point is generated using randomness, which is why the algorithm is often referred to as stochastic hill climbing. This means that the algorithm can skip over bumpy, noisy, discontinuous, or deceptive regions of the response surface as part of the search.

Stochastic hill climbing chooses at random from among the uphill moves; the probability of selection can vary with the steepness of the uphill move.

— Page 124, Artificial Intelligence: A Modern Approach, 2009.

It is important that different points with equal evaluation are accepted as it allows the algorithm to continue to explore the search space, such as across flat regions of the response surface. It may also be helpful to put a limit on these so-called “*sideways*” moves to avoid an infinite loop.

If we always allow sideways moves when there are no uphill moves, an infinite loop will occur whenever the algorithm reaches a flat local maximum that is not a shoulder. One common solution is to put a limit on the number of consecutive sideways moves allowed. For example, we could allow up to, say, 100 consecutive sideways moves

— Page 123, Artificial Intelligence: A Modern Approach, 2009.
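The idea of limiting sideways moves can be sketched in code. The version below is a minimal illustration for minimization, not the implementation developed later in this tutorial: the function name *hillclimbing_sideways()*, the *max_sideways* argument, and the step-shaped *plateau()* objective are all hypothetical names chosen for the example. Here the search stops once the limit of consecutive sideways moves is reached, and only a strict improvement resets the counter.

```python
from numpy import asarray, floor
from numpy.random import rand, randn, seed

# hypothetical hill climbing variant (minimization) with a limit on sideways moves
def hillclimbing_sideways(objective, bounds, n_iterations, step_size, max_sideways=100):
    # generate and evaluate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    solution_eval = objective(solution)
    n_sideways, n_steps = 0, 0
    for _ in range(n_iterations):
        n_steps += 1
        # take a step and evaluate it
        candidate = solution + randn(len(bounds)) * step_size
        candidate_eval = objective(candidate)
        if candidate_eval < solution_eval:
            # strict improvement: accept and reset the sideways counter
            solution, solution_eval = candidate, candidate_eval
            n_sideways = 0
        elif candidate_eval == solution_eval:
            # sideways move: stop if the limit has already been reached
            if n_sideways >= max_sideways:
                break
            solution = candidate
            n_sideways += 1
    return [solution, solution_eval, n_steps]

# hypothetical objective made of flat steps, so equal evaluations are common
def plateau(x):
    return floor(abs(x[0]))

seed(1)
bounds = asarray([[-5.0, 5.0]])
best, score, n_steps = hillclimbing_sideways(plateau, bounds, 1000, 0.1, max_sideways=100)
print('f(%s) = %.1f after %d steps' % (best, score, n_steps))
```

An alternative design would keep searching after the limit is reached but accept only strict improvements; which behavior is preferable depends on how expensive each function evaluation is.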

This process continues until a stop condition is met, such as a maximum number of function evaluations or no improvement within a given number of function evaluations.

The algorithm takes its name from the fact that it will (stochastically) climb the hill of the response surface to the local optima. This does not mean it can only be used for maximizing objective functions; it is just a name. In fact, typically, we minimize functions instead of maximizing them.

The hill-climbing search algorithm (steepest-ascent version) […] is simply a loop that continually moves in the direction of increasing value—that is, uphill. It terminates when it reaches a “peak” where no neighbor has a higher value.

— Page 122, Artificial Intelligence: A Modern Approach, 2009.

As a local search algorithm, it can get stuck in local optima. Nevertheless, multiple restarts may allow the algorithm to locate the global optimum.

Random-restart hill climbing […] conducts a series of hill-climbing searches from randomly generated initial states, until a goal is found.

— Page 124, Artificial Intelligence: A Modern Approach, 2009.
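A rough sketch of random restarts is shown below. It differs from the quoted description in one respect that should be stated plainly: instead of running until a goal is found, it uses a fixed number of restarts and keeps the best result. The *random_restarts()* function, the quiet *hillclimbing()* helper, and the multimodal objective are hypothetical and written only for this illustration.

```python
from numpy import asarray, sin
from numpy.random import rand, randn, seed

# hypothetical multimodal objective with several local minima in [-5, 5]
def objective(x):
    return x[0]**2.0 + 10.0 * sin(3.0 * x[0])

# compact hill climbing for minimization, without progress reporting
def hillclimbing(objective, bounds, n_iterations, step_size):
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    solution_eval = objective(solution)
    for _ in range(n_iterations):
        candidate = solution + randn(len(bounds)) * step_size
        candidate_eval = objective(candidate)
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
    return [solution, solution_eval]

# run several independent searches from random starting points and keep the best
def random_restarts(objective, bounds, n_restarts, n_iterations, step_size):
    best, best_eval = None, None
    for _ in range(n_restarts):
        solution, solution_eval = hillclimbing(objective, bounds, n_iterations, step_size)
        if best_eval is None or solution_eval < best_eval:
            best, best_eval = solution, solution_eval
    return [best, best_eval]

seed(1)
bounds = asarray([[-5.0, 5.0]])
best, score = random_restarts(objective, bounds, 10, 200, 0.1)
print('f(%s) = %.5f' % (best, score))
```

Each restart is an independent local search, so the total cost is the number of restarts multiplied by the iterations per search.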

The step size must be large enough to allow better nearby points in the search space to be located, but not so large that the search jumps out of the region that contains the local optima.

At the time of writing, the SciPy library does not provide an implementation of stochastic hill climbing.

Nevertheless, we can implement it ourselves.

First, we must define our objective function and the bounds on each input variable to the objective function. The objective function is just a Python function we will name *objective()*. The bounds will be a 2D array with one dimension for each input variable that defines the minimum and maximum for the variable.

For example, a one-dimensional objective function and bounds would be defined as follows:

```python
# objective function
def objective(x):
    return 0

# define range for input
bounds = asarray([[-5.0, 5.0]])
```

Next, we can generate our initial solution as a random point within the bounds of the problem, then evaluate it using the objective function.

```python
...
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# evaluate the initial point
solution_eval = objective(solution)
```
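The same expression works unchanged for more than one input variable, because the bounds array has one row per variable. The short sketch below illustrates this with two variables; the two-dimensional objective and the second set of bounds are assumptions made up for the example.

```python
from numpy import asarray
from numpy.random import rand, seed

# hypothetical two-dimensional objective for illustration
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# one row per input variable: [minimum, maximum]
bounds = asarray([[-5.0, 5.0], [0.0, 10.0]])
# seed the generator so the example is repeatable
seed(1)
# scale uniform random values in [0, 1) into the range of each variable
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# evaluate the initial point
solution_eval = objective(solution)
print(solution, solution_eval)
```

Each element of *solution* falls inside the range given by its own row of *bounds*.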

Now we can loop over a predefined number of iterations of the algorithm, defined as “*n_iterations*”, such as 100 or 1,000.

```python
...
# run the hill climb
for i in range(n_iterations):
    ...
```

The first step of the algorithm iteration is to take a step.

This requires a predefined “*step_size*” parameter, which is relative to the bounds of the search space. We will take a random step with a Gaussian distribution where the mean is our current point and the standard deviation is defined by the “*step_size*”. That means that about 99 percent of the steps taken will be within (3 * step_size) of the current point.

```python
...
# take a step
candidate = solution + randn(len(bounds)) * step_size
```
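The three-standard-deviations claim can be checked empirically. The sketch below draws a large number of Gaussian steps for a single variable and measures the fraction that fall within (3 * step_size). The sample size and seed are arbitrary choices made for the illustration.

```python
from numpy import absolute
from numpy.random import randn, seed

# seed the generator so the check is repeatable
seed(1)
step_size = 0.1
# draw many one-dimensional Gaussian steps with standard deviation step_size
steps = randn(100000) * step_size
# fraction of steps no further than three standard deviations from the current point
fraction = (absolute(steps) <= 3.0 * step_size).mean()
print('fraction within 3 * step_size: %.4f' % fraction)
```

The reported fraction should be close to 0.997, the value given by the normal distribution for three standard deviations, and it applies to each input variable separately.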

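We don’t have to take steps in this way. You may wish to use a uniform distribution instead. Note that the step must be able to move in both directions, so the random values are scaled to the range between -step_size and step_size; sampling only between 0 and the step size would let each variable increase but never decrease. For example:

```python
...
# take a step
candidate = solution + (rand(len(bounds)) * 2.0 - 1.0) * step_size
```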

Next we need to evaluate the new candidate solution with the objective function.

```python
...
# evaluate candidate point
candidte_eval = objective(candidate)
```

We then need to check if the evaluation of this new point is as good as or better than the current best point, and if it is, replace our current best point with this new point.

```python
...
# check if we should keep the new point
if candidte_eval <= solution_eval:
    # store the new point
    solution, solution_eval = candidate, candidte_eval
    # report progress
    print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
```

And that’s it.

We can implement this hill climbing algorithm as a reusable function that takes the name of the objective function, the bounds of each input variable, the total number of iterations, and the step size as arguments, and returns the best solution found and its evaluation.

```python
# hill climbing local search algorithm
def hillclimbing(objective, bounds, n_iterations, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # evaluate the initial point
    solution_eval = objective(solution)
    # run the hill climb
    for i in range(n_iterations):
        # take a step
        candidate = solution + randn(len(bounds)) * step_size
        # evaluate candidate point
        candidte_eval = objective(candidate)
        # check if we should keep the new point
        if candidte_eval <= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidte_eval
            # report progress
            print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]
```

Now that we know how to implement the hill climbing algorithm in Python, let’s look at how we might use it to optimize an objective function.

In this section, we will apply the hill climbing optimization algorithm to an objective function.

First, let’s define our objective function.

We will use a simple one-dimensional x^2 objective function with the bounds [-5, 5].

The example below defines the function, then creates a line plot of the response surface of the function for a grid of input values and marks the optima at f(0.0) = 0.0 with a red line.

    # convex unimodal optimization function
    from numpy import arange
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return x[0]**2.0

    # define range for input
    r_min, r_max = -5.0, 5.0
    # sample input range uniformly at 0.1 increments
    inputs = arange(r_min, r_max, 0.1)
    # compute targets
    results = [objective([x]) for x in inputs]
    # create a line plot of input vs result
    pyplot.plot(inputs, results)
    # define optimal input value
    x_optima = 0.0
    # draw a vertical line at the optimal input
    pyplot.axvline(x=x_optima, ls='--', color='red')
    # show the plot
    pyplot.show()

Running the example creates a line plot of the objective function and clearly marks the function optima.

Next, we can apply the hill climbing algorithm to the objective function.

First, we will seed the pseudorandom number generator. This is not required in general, but in this case, I want to ensure we get the same results (same sequence of random numbers) each time we run the algorithm so we can plot the results later.

    ...
    # seed the pseudorandom number generator
    seed(5)
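As a quick illustration of what seeding buys us, the sketch below seeds the generator twice with the same value and confirms that *rand()* and *randn()* produce the same numbers both times. The number of values drawn is arbitrary and chosen only for the demonstration.

```python
# demonstrate that re-seeding reproduces the same random numbers
from numpy.random import seed
from numpy.random import rand
from numpy.random import randn

# first run: one uniform value and three Gaussian values
seed(5)
first = (rand(1), randn(3))
# second run with the same seed
seed(5)
second = (rand(1), randn(3))
# compare the two runs element by element
same = bool((first[0] == second[0]).all() and (first[1] == second[1]).all())
print(same)
```

Because the hill climbing search draws all of its randomness from these two functions, fixing the seed fixes the starting point and every step that follows.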

Next, we can define the configuration of the search.

In this case, we will search for 1,000 iterations of the algorithm and use a step size of 0.1. Given that we are using a Gaussian function for generating the step, this means that about 99 percent of all steps taken will be within a distance of (0.1 * 3) of a given point, i.e., three standard deviations.

    ...
    # define the total iterations
    n_iterations = 1000
    # define the step size (standard deviation of the Gaussian step)
    step_size = 0.1

Next, we can perform the search and report the results.

    ...
    # perform the hill climbing search
    best, score = hillclimbing(objective, bounds, n_iterations, step_size)
    print('Done!')
    print('f(%s) = %f' % (best, score))

Tying this all together, the complete example is listed below.

    # hill climbing search of a one-dimensional objective function
    from numpy import asarray
    from numpy.random import randn
    from numpy.random import rand
    from numpy.random import seed

    # objective function
    def objective(x):
        return x[0]**2.0

    # hill climbing local search algorithm
    def hillclimbing(objective, bounds, n_iterations, step_size):
        # generate an initial point
        solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # evaluate the initial point
        solution_eval = objective(solution)
        # run the hill climb
        for i in range(n_iterations):
            # take a step
            candidate = solution + randn(len(bounds)) * step_size
            # evaluate candidate point
            candidte_eval = objective(candidate)
            # check if we should keep the new point
            if candidte_eval <= solution_eval:
                # store the new point
                solution, solution_eval = candidate, candidte_eval
                # report progress
                print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
        return [solution, solution_eval]

    # seed the pseudorandom number generator
    seed(5)
    # define range for input
    bounds = asarray([[-5.0, 5.0]])
    # define the total iterations
    n_iterations = 1000
    # define the maximum step size
    step_size = 0.1
    # perform the hill climbing search
    best, score = hillclimbing(objective, bounds, n_iterations, step_size)
    print('Done!')
    print('f(%s) = %f' % (best, score))

Running the example reports the progress of the search, including the iteration number, the input to the function, and the response from the objective function each time an improvement was detected.

At the end of the search, the best solution is found and its evaluation is reported.

In this case, we can see 36 improvements over the 1,000 iterations of the algorithm and a solution that is very close to the optimal input of 0.0 that evaluates to f(0.0) = 0.0.

    >1 f([-2.74290923]) = 7.52355
    >3 f([-2.65873147]) = 7.06885
    >4 f([-2.52197291]) = 6.36035
    >5 f([-2.46450214]) = 6.07377
    >7 f([-2.44740961]) = 5.98981
    >9 f([-2.28364676]) = 5.21504
    >12 f([-2.19245939]) = 4.80688
    >14 f([-2.01001538]) = 4.04016
    >15 f([-1.86425287]) = 3.47544
    >22 f([-1.79913002]) = 3.23687
    >24 f([-1.57525573]) = 2.48143
    >25 f([-1.55047719]) = 2.40398
    >26 f([-1.51783757]) = 2.30383
    >27 f([-1.49118756]) = 2.22364
    >28 f([-1.45344116]) = 2.11249
    >30 f([-1.33055275]) = 1.77037
    >32 f([-1.17805016]) = 1.38780
    >33 f([-1.15189314]) = 1.32686
    >36 f([-1.03852644]) = 1.07854
    >37 f([-0.99135322]) = 0.98278
    >38 f([-0.79448984]) = 0.63121
    >39 f([-0.69837955]) = 0.48773
    >42 f([-0.69317313]) = 0.48049
    >46 f([-0.61801423]) = 0.38194
    >48 f([-0.48799625]) = 0.23814
    >50 f([-0.22149135]) = 0.04906
    >54 f([-0.20017144]) = 0.04007
    >57 f([-0.15994446]) = 0.02558
    >60 f([-0.15492485]) = 0.02400
    >61 f([-0.03572481]) = 0.00128
    >64 f([-0.03051261]) = 0.00093
    >66 f([-0.0074283]) = 0.00006
    >78 f([-0.00202357]) = 0.00000
    >119 f([0.00128373]) = 0.00000
    >120 f([-0.00040911]) = 0.00000
    >314 f([-0.00017051]) = 0.00000
    Done!
    f([-0.00017051]) = 0.000000

It can be interesting to review the progress of the search as a line plot that shows the change in the evaluation of the best solution each time there is an improvement.

We can update the *hillclimbing()* function to keep track of the objective function evaluations each time there is an improvement and return this list of scores.

    # hill climbing local search algorithm
    def hillclimbing(objective, bounds, n_iterations, step_size):
        # generate an initial point
        solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # evaluate the initial point
        solution_eval = objective(solution)
        # run the hill climb
        scores = list()
        scores.append(solution_eval)
        for i in range(n_iterations):
            # take a step
            candidate = solution + randn(len(bounds)) * step_size
            # evaluate candidate point
            candidte_eval = objective(candidate)
            # check if we should keep the new point
            if candidte_eval <= solution_eval:
                # store the new point
                solution, solution_eval = candidate, candidte_eval
                # keep track of scores
                scores.append(solution_eval)
                # report progress
                print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
        return [solution, solution_eval, scores]

We can then create a line plot of these scores to see the relative change in objective function for each improvement found during the search.

    ...
    # line plot of best scores
    pyplot.plot(scores, '.-')
    pyplot.xlabel('Improvement Number')
    pyplot.ylabel('Evaluation f(x)')
    pyplot.show()

Tying this together, the complete example of performing the search and plotting the objective function scores of the improved solutions during the search is listed below.

    # hill climbing search of a one-dimensional objective function
    from numpy import asarray
    from numpy.random import randn
    from numpy.random import rand
    from numpy.random import seed
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return x[0]**2.0

    # hill climbing local search algorithm
    def hillclimbing(objective, bounds, n_iterations, step_size):
        # generate an initial point
        solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # evaluate the initial point
        solution_eval = objective(solution)
        # run the hill climb
        scores = list()
        scores.append(solution_eval)
        for i in range(n_iterations):
            # take a step
            candidate = solution + randn(len(bounds)) * step_size
            # evaluate candidate point
            candidte_eval = objective(candidate)
            # check if we should keep the new point
            if candidte_eval <= solution_eval:
                # store the new point
                solution, solution_eval = candidate, candidte_eval
                # keep track of scores
                scores.append(solution_eval)
                # report progress
                print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
        return [solution, solution_eval, scores]

    # seed the pseudorandom number generator
    seed(5)
    # define range for input
    bounds = asarray([[-5.0, 5.0]])
    # define the total iterations
    n_iterations = 1000
    # define the maximum step size
    step_size = 0.1
    # perform the hill climbing search
    best, score, scores = hillclimbing(objective, bounds, n_iterations, step_size)
    print('Done!')
    print('f(%s) = %f' % (best, score))
    # line plot of best scores
    pyplot.plot(scores, '.-')
    pyplot.xlabel('Improvement Number')
    pyplot.ylabel('Evaluation f(x)')
    pyplot.show()

Running the example performs the search and reports the results as before.

A line plot is created showing the objective function evaluation for each improvement during the hill climbing search. We can see 36 changes to the objective function evaluation during the search, with large changes initially and very small to imperceptible changes towards the end of the search as the algorithm converged on the optima.

Given that the objective function is one-dimensional, it is straightforward to plot the response surface as we did above.

It can be interesting to review the progress of the search by plotting the best candidate solutions found during the search as points in the response surface. We would expect a sequence of points running down the response surface to the optima.

This can be achieved by first updating the *hillclimbing()* function to keep track of each best candidate solution as it is located during the search, then return a list of best solutions.

    # hill climbing local search algorithm
    def hillclimbing(objective, bounds, n_iterations, step_size):
        # generate an initial point
        solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # evaluate the initial point
        solution_eval = objective(solution)
        # run the hill climb
        solutions = list()
        solutions.append(solution)
        for i in range(n_iterations):
            # take a step
            candidate = solution + randn(len(bounds)) * step_size
            # evaluate candidate point
            candidte_eval = objective(candidate)
            # check if we should keep the new point
            if candidte_eval <= solution_eval:
                # store the new point
                solution, solution_eval = candidate, candidte_eval
                # keep track of solutions
                solutions.append(solution)
                # report progress
                print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
        return [solution, solution_eval, solutions]

We can then create a plot of the response surface of the objective function and mark the optima as before.

    ...
    # sample input range uniformly at 0.1 increments
    inputs = arange(bounds[0,0], bounds[0,1], 0.1)
    # create a line plot of input vs result
    pyplot.plot(inputs, [objective([x]) for x in inputs], '--')
    # draw a vertical line at the optimal input
    pyplot.axvline(x=0.0, ls='--', color='red')

Finally, we can plot the sequence of candidate solutions found by the search as black dots.

    ...
    # plot the sample as black circles
    pyplot.plot(solutions, [objective(x) for x in solutions], 'o', color='black')

Tying this together, the complete example of plotting the sequence of improved solutions on the response surface of the objective function is listed below.

    # hill climbing search of a one-dimensional objective function
    from numpy import asarray
    from numpy import arange
    from numpy.random import randn
    from numpy.random import rand
    from numpy.random import seed
    from matplotlib import pyplot

    # objective function
    def objective(x):
        return x[0]**2.0

    # hill climbing local search algorithm
    def hillclimbing(objective, bounds, n_iterations, step_size):
        # generate an initial point
        solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # evaluate the initial point
        solution_eval = objective(solution)
        # run the hill climb
        solutions = list()
        solutions.append(solution)
        for i in range(n_iterations):
            # take a step
            candidate = solution + randn(len(bounds)) * step_size
            # evaluate candidate point
            candidte_eval = objective(candidate)
            # check if we should keep the new point
            if candidte_eval <= solution_eval:
                # store the new point
                solution, solution_eval = candidate, candidte_eval
                # keep track of solutions
                solutions.append(solution)
                # report progress
                print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
        return [solution, solution_eval, solutions]

    # seed the pseudorandom number generator
    seed(5)
    # define range for input
    bounds = asarray([[-5.0, 5.0]])
    # define the total iterations
    n_iterations = 1000
    # define the maximum step size
    step_size = 0.1
    # perform the hill climbing search
    best, score, solutions = hillclimbing(objective, bounds, n_iterations, step_size)
    print('Done!')
    print('f(%s) = %f' % (best, score))
    # sample input range uniformly at 0.1 increments
    inputs = arange(bounds[0,0], bounds[0,1], 0.1)
    # create a line plot of input vs result
    pyplot.plot(inputs, [objective([x]) for x in inputs], '--')
    # draw a vertical line at the optimal input
    pyplot.axvline(x=0.0, ls='--', color='red')
    # plot the sample as black circles
    pyplot.plot(solutions, [objective(x) for x in solutions], 'o', color='black')
    pyplot.show()

Running the example performs the hill climbing search and reports the results as before.

A plot of the response surface is created as before showing the familiar bowl shape of the function with a vertical red line marking the optima of the function.

The sequence of best solutions found during the search is shown as black dots running down the bowl shape to the optima.
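Because *hillclimbing()* sizes the starting point and each step from *len(bounds)*, the same search should also work with more than one input variable. The sketch below is an illustration under assumptions, not part of the tutorial: the two-dimensional bowl objective and its bounds are made up for the example, and the progress printing is left out to keep the output short.

```python
# hill climbing search of a two-dimensional objective function (illustrative sketch)
from numpy import asarray
from numpy.random import randn
from numpy.random import rand
from numpy.random import seed

# assumed objective: a 2D bowl with its minimum at f(0, 0) = 0
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# same search logic as the one-dimensional case, without progress reports
def hillclimbing(objective, bounds, n_iterations, step_size):
    # generate and evaluate an initial point inside the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    solution_eval = objective(solution)
    for i in range(n_iterations):
        # take a Gaussian step in every input variable
        candidate = solution + randn(len(bounds)) * step_size
        candidate_eval = objective(candidate)
        # keep the candidate if it is as good as or better than the current point
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
    return [solution, solution_eval]

# seed the pseudorandom number generator
seed(5)
# one row of [min, max] per input variable
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# perform the search with the same configuration as before
best, score = hillclimbing(objective, bounds, 1000, 0.1)
print('f(%s) = %f' % (best, score))
```

Only the objective function and the bounds array change; the search itself is untouched.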


In this tutorial, you discovered the hill climbing optimization algorithm for function optimization.

Specifically, you learned:

- Hill climbing is a stochastic local search algorithm for function optimization.
- How to implement the hill climbing algorithm from scratch in Python.
- How to apply the hill climbing algorithm and inspect the results of the algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Stochastic Hill Climbing in Python from Scratch appeared first on Machine Learning Mastery.

The post Curve Fitting With Python appeared first on Machine Learning Mastery.

Unlike supervised learning, curve fitting requires that you define the function that maps examples of inputs to outputs.

The mapping function, also called the basis function, can have any form you like, including a straight line (linear regression), a curved line (polynomial regression), and much more. This provides the flexibility and control to define the form of the curve, where an optimization process is used to find the specific optimal parameters of the function.

In this tutorial, you will discover how to perform curve fitting in Python.

After completing this tutorial, you will know:

- Curve fitting involves finding the optimal parameters to a function that maps examples of inputs to outputs.
- The SciPy Python library provides an API to fit a curve to a dataset.
- How to use curve fitting in SciPy to fit a range of different curves to a set of observations.

Let’s get started.

This tutorial is divided into three parts; they are:

- Curve Fitting
- Curve Fitting Python API
- Curve Fitting Worked Example

Curve fitting is an optimization problem that finds a line that best fits a collection of observations.

It is easiest to think about curve fitting in two dimensions, such as a graph.

Consider that we have collected examples of data from the problem domain with inputs and outputs.

The x-axis is the independent variable or the input to the function. The y-axis is the dependent variable or the output of the function. We don’t know the form of the function that maps examples of inputs to outputs, but we suspect that we can approximate the function with a standard function form.

Curve fitting involves first defining the functional form of the mapping function (also called the basis function or objective function), then searching for the parameters to the function that result in the minimum error.

Error is calculated by using the observations from the domain and passing the inputs to our candidate mapping function and calculating the output, then comparing the calculated output to the observed output.
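As a minimal sketch of this calculation, the example below scores two candidate straight lines against a handful of made-up observations. It assumes the error is the sum of squared differences, which matches the least squares approach used later; *line()* and *sum_squared_error()* are hypothetical helper names.

```python
# score candidate parameters for a mapping function (illustrative sketch)
from numpy import asarray

# candidate mapping function: a straight line
def line(x, a, b):
    return a * x + b

# hypothetical helper: sum of squared differences between observed and calculated outputs
def sum_squared_error(func, params, x, y):
    predicted = func(x, *params)
    return float(((y - predicted) ** 2).sum())

# made-up observations that lie exactly on y = 2 * x + 1
x = asarray([0.0, 1.0, 2.0, 3.0])
y = asarray([1.0, 3.0, 5.0, 7.0])
# the generating parameters give zero error
error_good = sum_squared_error(line, (2.0, 1.0), x, y)
# a worse slope gives residuals of 0, 1, 2 and 3
error_bad = sum_squared_error(line, (1.0, 1.0), x, y)
print(error_good, error_bad)  # 0.0 14.0
```

Curve fitting is then the search for the parameter values that make this number as small as possible.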

Once fit, we can use the mapping function to interpolate or extrapolate new points in the domain. It is common to run a sequence of input values through the mapping function to calculate a sequence of outputs, then create a line plot of the result to show how output varies with input and how well the line fits the observed points.

The key to curve fitting is the form of the mapping function.

A straight line between inputs and outputs can be defined as follows:

- y = a * x + b

Where *y* is the calculated output, *x* is the input, and *a* and *b* are parameters of the mapping function found using an optimization algorithm.

This is called a linear equation because it is a weighted sum of the inputs.

In a linear regression model, these parameters are referred to as coefficients; in a neural network, they are referred to as weights.

This equation can be generalized to any number of inputs, meaning that the notion of curve fitting is not limited to two-dimensions (one input and one output), but could have many input variables.

For example, a line mapping function for two input variables may look as follows:

- y = a1 * x1 + a2 * x2 + b

The equation does not have to be a straight line.

We can add curves in the mapping function by adding exponents. For example, we can add a squared version of the input weighted by another parameter:

- y = a * x + b * x^2 + c

This is called polynomial regression, and the squared term means it is a second-degree polynomial.

The equations so far are all linear in their parameters (even the polynomial, which is nonlinear only in *x*), so they can be fit by minimizing least squares and the solution can be calculated analytically. This means we can find the optimal values of the parameters using a little linear algebra.
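As a minimal sketch of the analytical route, the example below fits the second-degree polynomial above with NumPy's *lstsq()* function. The data is synthetic and noise-free, generated from assumed parameter values of a=2, b=0.5 and c=1, so we know what the fit should recover.

```python
# fit y = a * x + b * x^2 + c by linear least squares (illustrative sketch)
from numpy import asarray
from numpy import column_stack
from numpy import ones_like
from numpy.linalg import lstsq

# synthetic observations from assumed parameters a=2, b=0.5, c=1
x = asarray([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 0.5 * x**2 + 1.0
# one column per term of the equation: x, x^2 and a constant
A = column_stack((x, x**2, ones_like(x)))
# solve the least squares problem directly
coef, _, _, _ = lstsq(A, y, rcond=None)
a, b, c = coef
print('y = %.5f * x + %.5f * x^2 + %.5f' % (a, b, c))
```

No initial guess and no iterations are needed, because each parameter only scales a fixed column of the matrix.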

We might also want to add other mathematical functions to the equation, such as sine, cosine, and more. Each term is weighted with a parameter and added to the whole to give the output; for example:

- y = a * sin(b * x) + c

Adding arbitrary mathematical functions to our mapping function generally means we cannot calculate the parameters analytically, and instead, we will need to use an iterative optimization algorithm.

This is called nonlinear least squares: the mapping function is nonlinear in its parameters, so the least squares problem may no longer be convex and is not as easy to solve.
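To make the iterative approach concrete, the sketch below fits the sine equation above to synthetic, noise-free data using SciPy's *curve_fit()* function, which is covered in the next part of the tutorial. The true parameter values (a=2, b=1.5, c=0.5) and the starting guess passed via the *p0* argument are assumptions chosen for the illustration.

```python
# fit y = a * sin(b * x) + c by nonlinear least squares (illustrative sketch)
from numpy import sin
from numpy import linspace
from scipy.optimize import curve_fit

# mapping function that is nonlinear in the parameter b
def objective(x, a, b, c):
    return a * sin(b * x) + c

# synthetic observations from assumed parameters a=2, b=1.5, c=0.5
x = linspace(0.0, 10.0, 200)
y = objective(x, 2.0, 1.5, 0.5)
# the search starts from an initial guess and refines it iteratively
popt, _ = curve_fit(objective, x, y, p0=[1.0, 1.45, 0.0])
a, b, c = popt
print('y = %.5f * sin(%.5f * x) + %.5f' % (a, b, c))
```

Because the problem may not be convex, a starting guess that is far from the true values, especially for *b*, may lead the search to a different and worse set of parameters.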

Now that we are familiar with curve fitting, let’s look at how we might perform curve fitting in Python.

We can perform curve fitting for our dataset in Python.

The SciPy open source library provides the curve_fit() function for curve fitting via nonlinear least squares.

The function takes the input and output data as arguments, as well as the name of the mapping function to use.

The mapping function must take examples of input data and some number of arguments. These remaining arguments will be the coefficients or weight constants that will be optimized by a nonlinear least squares optimization process.

For example, we may have some observations from our domain loaded as input variables *x* and output variables *y*.

    ...
    # load input variables from a file
    x_values = ...
    y_values = ...

Next, we need to design a mapping function to fit a line to the data and implement it as a Python function that takes inputs and the arguments.

It may be a straight line, in which case it would look as follows:

    # objective function
    def objective(x, a, b):
        return a * x + b

We can then call the curve_fit() function to fit a straight line to the dataset using our defined function.

The function *curve_fit()* returns the optimal values for the mapping function, e.g., the coefficient values. It also returns a covariance matrix for the estimated parameters, but we can ignore that for now.

    ...
    # fit curve
    popt, _ = curve_fit(objective, x_values, y_values)

Once fit, we can use the optimal parameters and our mapping function *objective()* to calculate the output for any arbitrary input.

This might include the output for the examples we have already collected from the domain, it might include new values that interpolate observed values, or it might include extrapolated values outside of the limits of what was observed.

    ...
    # define new input values
    x_new = ...
    # unpack optimal parameters for the objective function
    a, b = popt
    # use optimal parameters to calculate new values
    y_new = objective(x_new, a, b)
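If the covariance matrix is of interest, a common way to use it is to take the square root of its diagonal as a one-standard-deviation error estimate for each parameter. The sketch below is an illustration on synthetic data, not part of the worked example: the true parameters (0.5 and 8.0), the noise level, and the use of NumPy's *default_rng()* generator are all assumptions.

```python
# estimate parameter uncertainty from the covariance matrix (illustrative sketch)
from numpy import sqrt
from numpy import diag
from numpy import linspace
from numpy.random import default_rng
from scipy.optimize import curve_fit

# straight line mapping function
def objective(x, a, b):
    return a * x + b

# synthetic noisy observations from assumed parameters a=0.5, b=8.0
rng = default_rng(1)
x_values = linspace(0.0, 10.0, 50)
y_values = objective(x_values, 0.5, 8.0) + rng.normal(0.0, 0.2, size=x_values.shape)
# keep the covariance matrix this time
popt, pcov = curve_fit(objective, x_values, y_values)
# one standard deviation error for each parameter
perr = sqrt(diag(pcov))
print('a = %.5f +/- %.5f' % (popt[0], perr[0]))
print('b = %.5f +/- %.5f' % (popt[1], perr[1]))
```

Large errors relative to the parameter values can be a hint that the chosen mapping function has more parameters than the data can pin down.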

Now that we are familiar with using the curve fitting API, let’s look at a worked example.

We will develop a curve to fit some real world observations of economic data.

In this example, we will use the so-called “*Longley’s Economic Regression*” dataset; you can learn more about it here:

- Longley’s Economic Regression (longley.csv)
- Longley’s Economic Regression Description (longley.names)

We will download the dataset automatically as part of the worked example.

There are seven input variables and 16 rows of data, where each row defines a summary of economic details for a year from 1947 to 1962.

In this example, we will explore fitting a line between population size and the number of people employed for each year.

The example below loads the dataset from the URL, selects the input variable as “*population*,” and the output variable as “*employed*” and creates a scatter plot.

    # plot "Population" vs "Employed"
    from pandas import read_csv
    from matplotlib import pyplot
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/longley.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    # choose the input and output variables
    x, y = data[:, 4], data[:, -1]
    # plot input vs output
    pyplot.scatter(x, y)
    pyplot.show()

Running the example loads the dataset, selects the variables, and creates a scatter plot.

We can see that there is a relationship between the two variables. Specifically, as the population increases, the total number of employees increases.

It is not unreasonable to think we can fit a line to this data.

First, we will try fitting a straight line to this data, as follows:

    # define the true objective function
    def objective(x, a, b):
        return a * x + b

We can use curve fitting to find the optimal values of “*a*” and “*b*” and summarize the values that were found:

    ...
    # curve fit
    popt, _ = curve_fit(objective, x, y)
    # summarize the parameter values
    a, b = popt
    print('y = %.5f * x + %.5f' % (a, b))

We can then create a scatter plot as before.

    ...
    # plot input vs output
    pyplot.scatter(x, y)

On top of the scatter plot, we can draw a line for the function with the optimized parameter values.

This involves first defining a sequence of input values between the minimum and maximum values observed in the dataset (e.g. between about 120 and about 130).

    ...
    # define a sequence of inputs between the smallest and largest known inputs
    x_line = arange(min(x), max(x), 1)

We can then calculate the output value for each input value.

    ...
    # calculate the output for the range
    y_line = objective(x_line, a, b)

Then create a line plot of the inputs vs. the outputs to see a line:

    ...
    # create a line plot for the mapping function
    pyplot.plot(x_line, y_line, '--', color='red')

Tying this together, the example below uses curve fitting to find the parameters of a straight line for our economic data.

    # fit a straight line to the economic data
    from numpy import arange
    from pandas import read_csv
    from scipy.optimize import curve_fit
    from matplotlib import pyplot

    # define the true objective function
    def objective(x, a, b):
        return a * x + b

    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/longley.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    # choose the input and output variables
    x, y = data[:, 4], data[:, -1]
    # curve fit
    popt, _ = curve_fit(objective, x, y)
    # summarize the parameter values
    a, b = popt
    print('y = %.5f * x + %.5f' % (a, b))
    # plot input vs output
    pyplot.scatter(x, y)
    # define a sequence of inputs between the smallest and largest known inputs
    x_line = arange(min(x), max(x), 1)
    # calculate the output for the range
    y_line = objective(x_line, a, b)
    # create a line plot for the mapping function
    pyplot.plot(x_line, y_line, '--', color='red')
    pyplot.show()

Running the example performs curve fitting and finds the optimal parameters to our objective function.

First, the values of the parameters are reported.

y = 0.48488 * x + 8.38067

Next, a plot is created showing the original data and the line that was fit to the data.

We can see that it is a reasonably good fit.

So far, this is not very exciting as we could achieve the same effect by fitting a linear regression model on the dataset.

Let’s try a polynomial regression model by adding squared terms to the objective function.

    # define the true objective function
    def objective(x, a, b, c):
        return a * x + b * x**2 + c

Tying this together, the complete example is listed below.

    # fit a second degree polynomial to the economic data
    from numpy import arange
    from pandas import read_csv
    from scipy.optimize import curve_fit
    from matplotlib import pyplot

    # define the true objective function
    def objective(x, a, b, c):
        return a * x + b * x**2 + c

    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/longley.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    # choose the input and output variables
    x, y = data[:, 4], data[:, -1]
    # curve fit
    popt, _ = curve_fit(objective, x, y)
    # summarize the parameter values
    a, b, c = popt
    print('y = %.5f * x + %.5f * x^2 + %.5f' % (a, b, c))
    # plot input vs output
    pyplot.scatter(x, y)
    # define a sequence of inputs between the smallest and largest known inputs
    x_line = arange(min(x), max(x), 1)
    # calculate the output for the range
    y_line = objective(x_line, a, b, c)
    # create a line plot for the mapping function
    pyplot.plot(x_line, y_line, '--', color='red')
    pyplot.show()

First the optimal parameters are reported.

y = 3.25443 * x + -0.01170 * x^2 + -155.02783

Next, a plot is created showing the line in the context of the observed values from the domain.

We can see that the second-degree polynomial equation that we defined is visually a better fit for the data than the straight line that we tested first.
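A visual comparison can be backed up with a number. The sketch below is an illustration, not part of the worked example: it uses synthetic data with a known curve instead of the economic dataset, and *line()*, *quadratic()* and *rmse()* are hypothetical helpers that report the root mean squared error of each fit on the observations.

```python
# compare two fitted curves by root mean squared error (illustrative sketch)
from numpy import sqrt
from numpy import mean
from numpy import linspace
from scipy.optimize import curve_fit

# straight line mapping function
def line(x, a, b):
    return a * x + b

# second-degree polynomial mapping function
def quadratic(x, a, b, c):
    return a * x + b * x**2 + c

# hypothetical helper: root mean squared error of a fitted function
def rmse(func, params, x, y):
    return float(sqrt(mean((y - func(x, *params)) ** 2)))

# synthetic observations that follow a gentle curve (assumed parameters)
x = linspace(0.0, 10.0, 30)
y = 3.0 * x - 0.2 * x**2 + 5.0
# fit both mapping functions to the same observations
line_params, _ = curve_fit(line, x, y)
quad_params, _ = curve_fit(quadratic, x, y)
# summarize the error of each fit
rmse_line = rmse(line, line_params, x, y)
rmse_quad = rmse(quadratic, quad_params, x, y)
print('line RMSE: %.5f, quadratic RMSE: %.5f' % (rmse_line, rmse_quad))
```

Keep in mind that adding parameters will generally lower the error on the data used for fitting, so a smaller number here does not by itself mean the curve will generalize better to new points.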

We could keep going and add more polynomial terms to the equation to better fit the curve.

For example, below is a fifth-degree polynomial fit to the data.

    # fit a fifth degree polynomial to the economic data
    from numpy import arange
    from pandas import read_csv
    from scipy.optimize import curve_fit
    from matplotlib import pyplot

    # define the true objective function
    def objective(x, a, b, c, d, e, f):
        return (a * x) + (b * x**2) + (c * x**3) + (d * x**4) + (e * x**5) + f

    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/longley.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    # choose the input and output variables
    x, y = data[:, 4], data[:, -1]
    # curve fit
    popt, _ = curve_fit(objective, x, y)
    # summarize the parameter values
    a, b, c, d, e, f = popt
    # plot input vs output
    pyplot.scatter(x, y)
    # define a sequence of inputs between the smallest and largest known inputs
    x_line = arange(min(x), max(x), 1)
    # calculate the output for the range
    y_line = objective(x_line, a, b, c, d, e, f)
    # create a line plot for the mapping function
    pyplot.plot(x_line, y_line, '--', color='red')
    pyplot.show()

Running the example fits the curve and plots the result, again capturing slightly more nuance in how the relationship in the data changes over time.

Importantly, we are not limited to linear regression or polynomial regression. We can use any arbitrary basis function.

For example, perhaps we want a line that has wiggles to capture the short-term movement in observation. We could add a sine curve to the equation and find the parameters that best integrate this element in the equation.

For example, an arbitrary function that uses a sine wave and a second degree polynomial is listed below:

    # define the true objective function
    def objective(x, a, b, c, d):
        return a * sin(b - x) + c * x**2 + d

The complete example of fitting a curve using this basis function is listed below.

    # fit a line to the economic data
    from numpy import sin
    from numpy import sqrt
    from numpy import arange
    from pandas import read_csv
    from scipy.optimize import curve_fit
    from matplotlib import pyplot

    # define the true objective function
    def objective(x, a, b, c, d):
        return a * sin(b - x) + c * x**2 + d

    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/longley.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    # choose the input and output variables
    x, y = data[:, 4], data[:, -1]
    # curve fit
    popt, _ = curve_fit(objective, x, y)
    # summarize the parameter values
    a, b, c, d = popt
    print(popt)
    # plot input vs output
    pyplot.scatter(x, y)
    # define a sequence of inputs between the smallest and largest known inputs
    x_line = arange(min(x), max(x), 1)
    # calculate the output for the range
    y_line = objective(x_line, a, b, c, d)
    # create a line plot for the mapping function
    pyplot.plot(x_line, y_line, '--', color='red')
    pyplot.show()

Running the example fits a curve and plots the result.

We can see that adding a sine wave has the desired effect, showing a periodic wiggle with an upward trend that provides another way of capturing the relationships in the data.

**How do you choose the best fit?**

If you want the best fit, you would model the problem as a regression supervised learning problem and test a suite of algorithms in order to discover which is best at minimizing the error.

In this case, curve fitting is appropriate when you want to define the function explicitly, then discover the parameters of your function that best fit a line to the data.


In this tutorial, you discovered how to perform curve fitting in Python.

Specifically, you learned:

- Curve fitting involves finding the optimal parameters to a function that maps examples of inputs to outputs.
- Unlike supervised learning, curve fitting requires that you define the function that maps examples of inputs to outputs.
- How to use curve fitting in SciPy to fit a range of different curves to a set of observations.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.
