The post Gradient Descent With Adadelta from Scratch appeared first on Machine Learning Mastery.

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that it uses the same step size (learning rate) for each input variable. AdaGrad and RMSProp are extensions to gradient descent that add a self-adaptive learning rate for each parameter of the objective function.

**Adadelta** can be considered a further extension of gradient descent that builds upon AdaGrad and RMSProp and changes the calculation of the custom step size so that the units are consistent, in turn removing the need for an initial learning rate hyperparameter.

In this tutorial, you will discover how to develop the gradient descent with Adadelta optimization algorithm from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adadelta.
- How to implement the Adadelta optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gradient Descent
- Adadelta Algorithm
- Gradient Descent With Adadelta
    - Two-Dimensional Test Problem
    - Gradient Descent Optimization With Adadelta
    - Visualization of Adadelta

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “*derivative*,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x = x – step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**(*alpha*): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
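The update rule above can be sketched in a few lines of code. This is a minimal, illustrative example (not part of the tutorial's implementation) that minimizes the one-dimensional function f(x) = x^2, whose derivative is f'(x) = 2x; the starting point and step size are arbitrary choices.

```python
# minimal gradient descent sketch for f(x) = x^2 (illustrative values are ours)
def f_prime(x):
    return 2.0 * x

x = 1.0          # arbitrary starting point
step_size = 0.1  # alpha, chosen for illustration
for _ in range(50):
    # move against the gradient: x = x - step_size * f'(x)
    x = x - step_size * f_prime(x)

print(x)  # x approaches the minimum at 0.0
```

Each iteration shrinks x toward the minimum; a larger step size would converge faster here but can overshoot on other functions.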

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at Adadelta.

Adadelta (or “ADADELTA”) is an extension to the gradient descent optimization algorithm.

The algorithm was described in the 2012 paper by Matthew Zeiler titled “ADADELTA: An Adaptive Learning Rate Method.”

Adadelta is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

It is best understood as an extension of the AdaGrad and RMSProp algorithms.

AdaGrad is an extension of gradient descent that calculates a step size (learning rate) for each parameter of the objective function each time an update is made. The step size is calculated by first summing the partial derivatives for the parameter seen so far during the search, then dividing the initial step size hyperparameter by the square root of the sum of the squared partial derivatives.

The calculation of the custom step size for one parameter with AdaGrad is as follows:

- cust_step_size(t+1) = step_size / (1e-8 + sqrt(s(t)))

Where *cust_step_size(t+1)* is the calculated step size for an input variable for a given point during the search, *step_size* is the initial step size, *sqrt()* is the square root operation, and *s(t)* is the sum of the squared partial derivatives for the input variable seen during the search so far (including the current iteration).
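This calculation can be sketched directly from the equation. The gradient values below are made up purely for illustration; the variable names are ours, not from the AdaGrad paper.

```python
# sketch of the AdaGrad custom step size for one parameter (illustrative values)
from math import sqrt

step_size = 0.1  # initial step size hyperparameter
s = 0.0          # running sum of squared partial derivatives

# example partial derivatives seen so far during the search
grads = [0.5, -0.3, 0.2]
for g in grads:
    s += g**2.0

# cust_step_size(t+1) = step_size / (1e-8 + sqrt(s(t)))
cust_step_size = step_size / (1e-8 + sqrt(s))
```

Note that because s only ever grows, the AdaGrad step size shrinks monotonically over the run, which is one of the drawbacks Adadelta later addresses.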

RMSProp can be thought of as an extension of AdaGrad in that it uses a decaying average or moving average of the partial derivatives instead of the sum in the calculation of the step size for each parameter. This is achieved by adding a new hyperparameter “*rho*” that acts like a momentum for the partial derivatives.

The calculation of the decaying moving average squared partial derivative for one parameter is as follows:

- s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))

Where *s(t+1)* is the mean squared partial derivative for one parameter for the current iteration of the algorithm, *s(t)* is the decaying moving average squared partial derivative for the previous iteration, *f'(x(t))^2* is the squared partial derivative for the current parameter, and rho is a hyperparameter, typically with the value of 0.9 like momentum.
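The decaying moving average update can be sketched as follows; the repeated gradient value is an illustrative assumption, chosen so the behavior is easy to trace by hand.

```python
# sketch of the RMSProp-style decaying average of squared partial derivatives
rho = 0.9  # decay hyperparameter, typically 0.9
s = 0.0    # decaying moving average, initialized to zero

# example partial derivatives for one parameter (illustrative)
grads = [0.5, 0.5, 0.5]
for g in grads:
    # s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0 - rho))
    s = (s * rho) + (g**2.0 * (1.0 - rho))
```

With a constant gradient of 0.5, s climbs toward the true squared value of 0.25 but weights recent iterations most heavily, unlike AdaGrad's ever-growing sum.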

Adadelta is a further extension of RMSProp designed to improve the convergence of the algorithm and to remove the need for a manually specified initial learning rate.

The idea presented in this paper was derived from ADAGRAD in order to improve upon the two main drawbacks of the method: 1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate.

— ADADELTA: An Adaptive Learning Rate Method, 2012.

The decaying moving average of the squared partial derivative is calculated for each parameter, as with RMSProp. The key difference is in the calculation of the step size for a parameter that uses the decaying average of the delta or change in parameter.

This choice of numerator was to ensure that both parts of the calculation have the same units.

After independently deriving the RMSProp update, the authors noticed that the units in the update equations for gradient descent, momentum and Adagrad do not match. To fix this, they use an exponentially decaying average of the square updates

— Pages 78-79, Algorithms for Optimization, 2019.

First, the custom step size is calculated as the square root of the decaying moving average of the squared change to the parameter divided by the square root of the decaying moving average of the squared partial derivatives.

- cust_step_size(t+1) = (ep + sqrt(delta(t))) / (ep + sqrt(s(t)))

Where *cust_step_size(t+1)* is the custom step size for a parameter for a given update, *ep* is a hyperparameter that is added to the numerator and denominator to avoid a divide by zero error, *delta(t)* is the decaying moving average of the squared change to the parameter (calculated in the last iteration), and *s(t)* is the decaying moving average of the squared partial derivative (calculated in the current iteration).

The *ep* hyperparameter is set to a small value such as 1e-3 or 1e-8. In addition to avoiding a divide by zero error, it also helps with the first step of the algorithm when the decaying moving average squared change and decaying moving average squared gradient are zero.

Next, the change to the parameter is calculated as the custom step size multiplied by the partial derivative.

- change(t+1) = cust_step_size(t+1) * f'(x(t))

Next, the decaying average of the squared change to the parameter is updated.

- delta(t+1) = (delta(t) * rho) + (change(t+1)^2 * (1.0-rho))

Where *delta(t+1)* is the decaying average of the change to the variable to be used in the next iteration, *change(t+1)* was calculated in the step before and *rho* is a hyperparameter that acts like momentum and has a value like 0.9.

Finally, the new value for the variable is calculated using the change.

- x(t+1) = x(t) – change(t+1)

This process is then repeated for each variable for the objective function, then the entire process is repeated to navigate the search space for a fixed number of algorithm iterations.
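The four equations above can be tied together as a single-parameter sketch. This is an illustrative trace on f(x) = x^2 with values of our choosing, not the tutorial's full implementation (which follows later).

```python
# one-parameter Adadelta update loop, following the four equations above
from math import sqrt

rho, ep = 0.9, 1e-3  # decay and small-constant hyperparameters
x = 1.0              # current parameter value (illustrative)
s = 0.0              # decaying average of squared partial derivatives
delta = 0.0          # decaying average of squared parameter changes

for _ in range(10):
    g = 2.0 * x                                          # f'(x) for f(x) = x^2
    s = (s * rho) + (g**2.0 * (1.0 - rho))               # update squared-gradient average
    step = (ep + sqrt(delta)) / (ep + sqrt(s))           # custom step size
    change = step * g                                    # change to the parameter
    delta = (delta * rho) + (change**2.0 * (1.0 - rho))  # update squared-change average
    x = x - change                                       # apply the change
```

Note that on the very first iteration delta is zero, so the step size is tiny and governed by ep; the step then grows as the averages accumulate. No learning rate appears anywhere in the loop.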

Now that we are familiar with the Adadelta algorithm, let’s explore how we might implement it and evaluate its performance.

In this section, we will explore how to implement the gradient descent optimization algorithm with Adadelta.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this function.

```python
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
```

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

```python
# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a three dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

```python
# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the Adadelta optimization algorithm.

We can apply the gradient descent with Adadelta to the test problem.

First, we need a function that calculates the derivative for this function.

- f(x) = x^2
- f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function implements this below.

```python
# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
```

Next, we can implement gradient descent optimization.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

```python
...
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
```

Next, we need to initialize the decaying average of the squared partial derivatives and squared change for each dimension to 0.0 values.

```python
...
# list of the average square gradients for each variable
sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
# list of the average parameter updates
sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
```

We can then enumerate a fixed number of iterations of the search optimization algorithm defined by a “*n_iter*” hyperparameter.

```python
...
# run the gradient descent
for it in range(n_iter):
    ...
```

The first step is to calculate the gradient for the current solution using the *derivative()* function.

```python
...
# calculate gradient
gradient = derivative(solution[0], solution[1])
```

We then need to calculate the square of the partial derivative and update the decaying moving average of the squared partial derivatives with the “*rho*” hyperparameter.

```python
...
# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
    # calculate the squared gradient
    sg = gradient[i]**2.0
    # update the moving average of the squared gradient
    sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
```

We can then use the decaying moving average of the squared partial derivatives and gradient to calculate the step size for the next point. We will do this one variable at a time.

```python
...
# build solution
new_solution = list()
for i in range(solution.shape[0]):
    ...
```

First, we will calculate the custom step size for this variable on this iteration using the decaying moving average of the squared changes and squared partial derivatives, as well as the “ep” hyperparameter.

```python
...
# calculate the step size for this variable
alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
```

Next, we can use the custom step size and partial derivative to calculate the change to the variable.

```python
...
# calculate the change
change = alpha * gradient[i]
```

We can then use the change to update the decaying moving average of the squared change using the “*rho*” hyperparameter.

```python
...
# update the moving average of squared parameter changes
sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
```

Finally, we can change the variable and store the result before moving on to the next variable.

```python
...
# calculate the new position in this variable
value = solution[i] - change
# store this variable
new_solution.append(value)
```

This new solution can then be evaluated using the objective() function and the performance of the search can be reported.

```python
...
# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
```

And that’s it.

We can tie all of this together into a function named *adadelta()* that takes the names of the objective function and the derivative function, an array with the bounds of the domain and hyperparameter values for the total number of algorithm iterations and *rho*, and returns the final solution and its evaluation.

The *ep* hyperparameter can also be taken as an argument, although it has a sensible default value of 1e-3.

This complete function is listed below.

```python
# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change
            change = alpha * gradient[i]
            # update the moving average of squared parameter changes
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable
            value = solution[i] - change
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]
```

**Note**: we have intentionally used lists and an imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.

We can then define our hyperparameters and call the *adadelta()* function to optimize our test objective function.

In this case, we will use 120 iterations of the algorithm and a value of 0.99 for the rho hyperparameter, chosen after a little trial and error.

```python
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# momentum for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
best, score = adadelta(objective, derivative, bounds, n_iter, rho)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Tying all of this together, the complete example of gradient descent optimization with Adadelta is listed below.

```python
# gradient descent optimization with adadelta for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change
            change = alpha * gradient[i]
            # update the moving average of squared parameter changes
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable
            value = solution[i] - change
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# momentum for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
best, score = adadelta(objective, derivative, bounds, n_iter, rho)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Running the example applies the Adadelta optimization algorithm to our test problem and reports performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near optimal solution was found after perhaps 105 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

```
...
>100 f([-1.45142626e-07 2.71163181e-03]) = 0.00001
>101 f([-1.24898699e-07 2.56875692e-03]) = 0.00001
>102 f([-1.07454197e-07 2.43328237e-03]) = 0.00001
>103 f([-9.24253035e-08 2.30483111e-03]) = 0.00001
>104 f([-7.94803792e-08 2.18304501e-03]) = 0.00000
>105 f([-6.83329263e-08 2.06758392e-03]) = 0.00000
>106 f([-5.87354975e-08 1.95812477e-03]) = 0.00000
>107 f([-5.04744185e-08 1.85436071e-03]) = 0.00000
>108 f([-4.33652179e-08 1.75600036e-03]) = 0.00000
>109 f([-3.72486699e-08 1.66276699e-03]) = 0.00000
>110 f([-3.19873691e-08 1.57439783e-03]) = 0.00000
>111 f([-2.74627662e-08 1.49064334e-03]) = 0.00000
>112 f([-2.3572602e-08 1.4112666e-03]) = 0.00000
>113 f([-2.02286891e-08 1.33604264e-03]) = 0.00000
>114 f([-1.73549914e-08 1.26475787e-03]) = 0.00000
>115 f([-1.48859650e-08 1.19720951e-03]) = 0.00000
>116 f([-1.27651224e-08 1.13320504e-03]) = 0.00000
>117 f([-1.09437923e-08 1.07256172e-03]) = 0.00000
>118 f([-9.38004754e-09 1.01510604e-03]) = 0.00000
>119 f([-8.03777865e-09 9.60673346e-04]) = 0.00000
Done!
f([-8.03777865e-09 9.60673346e-04]) = 0.000001
```

We can plot the progress of the Adadelta search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *adadelta()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

```python
# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change
            change = alpha * gradient[i]
            # update the moving average of squared parameter changes
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable
            value = solution[i] - change
            # store this variable
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions
```

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

```python
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)
```

We can then create a contour plot of the objective function, as before.

```python
...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
```

Finally, we can plot each solution found during the search as a white dot connected by a line.

```python
...
# plot the sample as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
```

Tying this all together, the complete example of performing the Adadelta optimization on the test problem and plotting the results on a contour plot is listed below.

```python
# example of plotting the adadelta search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change
            change = alpha * gradient[i]
            # update the moving average of squared parameter changes
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable
            value = solution[i] - change
            # store this variable
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sample as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()
```

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.

In this tutorial, you discovered how to develop the gradient descent with Adadelta optimization algorithm from scratch.

Specifically, you learned:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adadelta.
- How to implement the Adadelta optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post What Is Semi-Supervised Learning appeared first on Machine Learning Mastery.

**Semi-supervised learning** is a learning problem that involves a small number of labeled examples and a large number of unlabeled examples.

Learning problems of this type are challenging as neither supervised nor unsupervised learning algorithms are able to make effective use of mixtures of labeled and unlabeled data. As such, specialized semi-supervised learning algorithms are required.

In this tutorial, you will discover a gentle introduction to the field of semi-supervised learning for machine learning.

After completing this tutorial, you will know:

- Semi-supervised learning is a type of machine learning that sits between supervised and unsupervised learning.
- Top books on semi-supervised learning designed to get you up to speed in the field.
- Additional resources on semi-supervised learning, such as review papers and APIs.

Let’s get started.

This tutorial is divided into three parts; they are:

- Semi-Supervised Learning
- Books on Semi-Supervised Learning
- Additional Resources

Semi-supervised learning is a type of machine learning.

It refers to a learning problem (and algorithms designed for the learning problem) that involves a small portion of labeled examples and a large number of unlabeled examples from which a model must learn and make predictions on new examples.

… dealing with the situation where relatively few labeled training points are available, but a large number of unlabeled points are given, it is directly relevant to a multitude of practical problems where it is relatively expensive to produce labeled data …

— Page xiii, Semi-Supervised Learning, 2006.

As such, it is a learning problem that sits between supervised learning and unsupervised learning.

Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some supervision information – but not necessarily for all examples. Often, this information will be the targets associated with some of the examples.

— Page 2, Semi-Supervised Learning, 2006.

We require semi-supervised learning algorithms when working with data where labeling examples is challenging or expensive.

Semi-supervised learning has tremendous practical value. In many tasks, there is a paucity of labeled data. The labels y may be difficult to obtain because they require human annotators, special devices, or expensive and slow experiments.

— Page 9, Introduction to Semi-Supervised Learning, 2009.

The sign of an effective semi-supervised learning algorithm is that it can achieve better performance than a supervised learning algorithm fit only on the labeled training examples.

Semi-supervised learning algorithms are generally able to clear this low bar.

… in comparison with a supervised algorithm that uses only labeled data, can one hope to have a more accurate prediction by taking into account the unlabeled points? […] in principle the answer is ‘yes.’

— Page 4, Semi-Supervised Learning, 2006.

Finally, semi-supervised learning may be framed in terms of, or contrasted with, inductive and transductive learning.

Generally, inductive learning refers to a learning algorithm that learns from labeled training data and generalizes to new data, such as a test dataset. Transductive learning refers to learning from labeled training data and generalizing to available unlabeled (training) data. Both types of learning tasks may be performed by a semi-supervised learning algorithm.

… there are two distinct goals. One is to predict the labels on future test data. The other goal is to predict the labels on the unlabeled instances in the training sample. We call the former inductive semi-supervised learning, and the latter transductive learning.

— Page 12, Introduction to Semi-Supervised Learning, 2009.

If you are new to the idea of transduction vs. induction, the following tutorial has more information:

Now that we are familiar with semi-supervised learning from a high-level, let’s take a look at top books on the topic.

Semi-supervised learning is a new and fast-moving field of study, and as such, there are very few books on the topic.

There are perhaps two key books on semi-supervised learning that you should consider if you are new to the topic; they are:

Let’s take a closer look at each in turn.

The book “Semi-Supervised Learning” was published in 2006 and was edited by Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien.

This book provides a large number of chapters, each written by top researchers in the field.

It is designed to take you on a tour of the field of research including intuitions, top techniques, and open problems.

The full table of contents is listed below.

- Chapter 01: Introduction to Semi-Supervised Learning
- Part I: Generative Models
  - Chapter 02: A Taxonomy for Semi-Supervised Learning Methods
  - Chapter 03: Semi-Supervised Text Classification Using EM
  - Chapter 04: Risks of Semi-Supervised Learning
  - Chapter 05: Probabilistic Semi-Supervised Clustering with Constraints
- Part II: Low-Density Separation
  - Chapter 06: Transductive Support Vector Machines
  - Chapter 07: Semi-Supervised Learning Using Semi-Definite Programming
  - Chapter 08: Gaussian Processes and the Null-Category Noise Model
  - Chapter 09: Entropy Regularization
  - Chapter 10: Data-Dependent Regularization
- Part III: Graph-Based Methods
  - Chapter 11: Label Propagation and Quadratic Criterion
  - Chapter 12: The Geometric Basis of Semi-Supervised Learning
  - Chapter 13: Discrete Regularization
  - Chapter 14: Semi-Supervised Learning with Conditional Harmonic Mixing
- Part IV: Change of Representation
  - Chapter 15: Graph Kernels by Spectral Transforms
  - Chapter 16: Spectral Methods for Dimensionality Reduction
  - Chapter 17: Modifying Distances
- Part V: Semi-Supervised Learning in Practice
  - Chapter 18: Large-Scale Algorithms
  - Chapter 19: Semi-Supervised Protein Classification Using Cluster Kernels
  - Chapter 20: Prediction of Protein Function from Networks
  - Chapter 21: Analysis of Benchmarks
- Part VI: Perspectives
  - Chapter 22: An Augmented PAC Model for Semi-Supervised Learning
  - Chapter 23: Metric-Based Approaches for Semi-Supervised Regression and Classification
  - Chapter 24: Transductive Inference and Semi-Supervised Learning
  - Chapter 25: A Discussion of Semi-Supervised Learning and Transduction

I highly recommend this book and reading it cover to cover if you are starting out in this field.

The book “Introduction to Semi-Supervised Learning” was published in 2009 and was written by Xiaojin Zhu and Andrew Goldberg.

This book is aimed at students, researchers, and engineers just getting started in the field.

The book is a beginner’s guide to semi-supervised learning. It is aimed at advanced undergraduates, entry-level graduate students and researchers in areas as diverse as Computer Science, Electrical Engineering, Statistics, and Psychology.

— Page xiii, Introduction to Semi-Supervised Learning, 2009.

It’s a shorter read than the above book and a great introduction.

The full table of contents is listed below.

- Chapter 01: Introduction to Statistical Machine Learning
- Chapter 02: Overview of Semi-Supervised Learning
- Chapter 03: Mixture Models and EM
- Chapter 04: Co-Training
- Chapter 05: Graph-Based Semi-Supervised Learning
- Chapter 06: Semi-Supervised Support Vector Machines
- Chapter 07: Human Semi-Supervised Learning
- Chapter 08: Theory and Outlook

I also recommend this book if you’re just starting out for a quick review of the key elements of the field.

There are some additional books on semi-supervised learning that you might also like to consider; they are:

- Semi-Supervised Learning: Background, Applications and Future Directions, 2018.
- Graph-Based Semi-Supervised Learning, 2014.

**Have you read any of the above books?**

What did you think?

**Did I miss your favorite book?**

Let me know in the comments below.

There are additional resources that may be helpful when getting started in the field of semi-supervised learning.

I would recommend reading some review papers.

Some examples of good review papers on semi-supervised learning include:

- Semi-Supervised Learning Literature Survey, 2005.
- Introduction to Semi-Supervised Learning, 2009.
- An Overview of Deep Semi-Supervised Learning, 2020.

In this paper, we provide a comprehensive overview of deep semi-supervised learning, starting with an introduction to the field, followed by a summarization of the dominant semi-supervised approaches in deep learning.

— An Overview of Deep Semi-Supervised Learning, 2020.

It is also a good idea to try out some of the algorithms.

The scikit-learn Python machine learning library provides a few graph-based semi-supervised learning algorithms that you can try:
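
As a hedged example of what trying one of these algorithms might look like (the synthetic dataset and baseline model here are illustrative choices, not from this tutorial), the sketch below fits scikit-learn's LabelPropagation on mostly unlabeled data and compares it to a supervised model fit only on the labeled subset:

```python
# Compare graph-based semi-supervised learning to a labeled-only baseline.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=500, random_state=2)
n_labeled = 50
y_masked = y.copy()
y_masked[n_labeled:] = -1  # -1 marks unlabeled examples

# semi-supervised: uses labeled and unlabeled points together
ssl = LabelPropagation().fit(X, y_masked)
ssl_acc = (ssl.transduction_[n_labeled:] == y[n_labeled:]).mean()

# supervised baseline: labeled points only
sup = LogisticRegression().fit(X[:n_labeled], y[:n_labeled])
sup_acc = sup.score(X[n_labeled:], y[n_labeled:])
print('semi-supervised: %.3f, supervised-only: %.3f' % (ssl_acc, sup_acc))
```

Whether the semi-supervised model actually wins depends on the data; the value of this framework is that it makes the comparison explicit.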

The Wikipedia article may also provide some useful links for further reading:

In this tutorial, you discovered a gentle introduction to the field of semi-supervised learning for machine learning.

Specifically, you learned:

- Semi-supervised learning is a type of machine learning that sits between supervised and unsupervised learning.
- Top books on semi-supervised learning designed to get you up to speed in the field.
- Additional resources on semi-supervised learning, such as review papers and APIs.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Develop a Neural Network for Cancer Survival Dataset appeared first on Machine Learning Mastery.

It can be challenging to develop a neural network predictive model for a new dataset.

One approach is to first inspect the dataset and develop ideas for what models might work, then explore the learning dynamics of simple models on the dataset, then finally develop and tune a model for the dataset with a robust test harness.

This process can be used to develop effective neural network models for classification and regression predictive modeling problems.

In this tutorial, you will discover how to develop a Multilayer Perceptron neural network model for the cancer survival binary classification dataset.

After completing this tutorial, you will know:

- How to load and summarize the cancer survival dataset and use the results to suggest data preparations and model configurations to use.
- How to explore the learning dynamics of simple MLP models on the dataset.
- How to develop robust estimates of model performance, tune model performance and make predictions on new data.

Let’s get started.

This tutorial is divided into four parts; they are:

- Haberman Breast Cancer Survival Dataset
- Neural Network Learning Dynamics
- Robust Model Evaluation
- Final Model and Make Predictions

The first step is to define and explore the dataset.

We will be working with the “*haberman*” standard binary classification dataset.

The dataset describes breast cancer patient data and the outcome is patient survival: specifically, whether the patient survived for five years or longer, or did not survive.

This is a standard dataset used in the study of imbalanced classification. According to the dataset description, the operations were conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital.

There are 306 examples in the dataset, and there are 3 input variables; they are:

- The age of the patient at the time of the operation.
- The two-digit year of the operation.
- The number of “*positive axillary nodes*” detected, a measure of whether cancer has spread.

As such, we have no control over the selection of cases that make up the dataset or features to use in those cases, other than what is available in the dataset.

Although the dataset describes breast cancer patient survival, given the small dataset size and the fact the data is based on breast cancer diagnosis and operations many decades ago, any models built on this dataset are not expected to generalize.

**Note: to be crystal clear**, we are NOT “*solving breast cancer*“. We are exploring a standard classification dataset.

Below is a sample of the first 5 rows of the dataset.

30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
31,65,4,1
...

You can learn more about the dataset here:

We can load the dataset as a pandas DataFrame directly from the URL; for example:

# load the haberman dataset and summarize the shape
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
# load the dataset
df = read_csv(url, header=None)
# summarize shape
print(df.shape)

Running the example loads the dataset directly from the URL and reports the shape of the dataset.

In this case, we can confirm that the dataset has 4 variables (3 input and one output) and that the dataset has 306 rows of data.

This is not many rows of data for a neural network and suggests that a small network, perhaps with regularization, would be appropriate.

It also suggests that k-fold cross-validation would be a good idea, given that it will provide a more reliable estimate of model performance than a train/test split, and because a single model will fit in seconds rather than the hours or days required for the largest datasets.

(306, 4)

Next, we can learn more about the dataset by looking at summary statistics and a plot of the data.

# show summary statistics and plots of the haberman dataset
from pandas import read_csv
from matplotlib import pyplot
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
# load the dataset
df = read_csv(url, header=None)
# show summary statistics
print(df.describe())
# plot histograms
df.hist()
pyplot.show()

Running the example first loads the data and then prints summary statistics for each variable.

We can see that values vary with different means and standard deviations; some normalization or standardization may be required prior to modeling.

                0           1           2           3
count  306.000000  306.000000  306.000000  306.000000
mean    52.457516   62.852941    4.026144    1.264706
std     10.803452    3.249405    7.189654    0.441899
min     30.000000   58.000000    0.000000    1.000000
25%     44.000000   60.000000    0.000000    1.000000
50%     52.000000   63.000000    1.000000    1.000000
75%     60.750000   65.750000    4.000000    2.000000
max     83.000000   69.000000   52.000000    2.000000

A histogram plot is then created for each variable.

We can see that perhaps the first variable has a Gaussian-like distribution and the next two input variables may have an exponential distribution.

We may have some benefit in using a power transform on each variable in order to make the probability distribution less skewed which will likely improve model performance.
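
As a rough sketch of this idea (using synthetic exponential data as a stand-in, since the transform is not actually applied in this tutorial), scikit-learn's PowerTransformer can make a skewed variable more Gaussian-like:

```python
# Reduce the skew of an exponential-looking variable with a power transform.
from numpy.random import default_rng
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = default_rng(1)
x = rng.exponential(scale=4.0, size=(306, 1))  # stand-in for a skewed input

pt = PowerTransformer(method='yeo-johnson')
x_t = pt.fit_transform(x)

print('skew before: %.2f, after: %.2f' % (skew(x[:, 0]), skew(x_t[:, 0])))
```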

We can see some skew in the distribution of examples between the two classes, meaning that the classification problem is not balanced. It is imbalanced.

It may be helpful to know how imbalanced the dataset actually is.

We can use the Counter object to count the number of examples in each class, then use those counts to summarize the distribution.

The complete example is listed below.

# summarize the class ratio of the haberman dataset
from pandas import read_csv
from collections import Counter
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
# define the dataset column names
columns = ['age', 'year', 'nodes', 'class']
# load the csv file as a data frame
dataframe = read_csv(url, header=None, names=columns)
# summarize the class distribution
target = dataframe['class'].values
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example summarizes the class distribution for the dataset.

We can see that class 1 for survival has the most examples at 225, or about 74 percent of the dataset. We can see class 2 for non-survival has fewer examples at 81, or about 26 percent of the dataset.

The class distribution is skewed, but it is not severely imbalanced.

Class=1, Count=225, Percentage=73.529%
Class=2, Count=81, Percentage=26.471%

This is helpful because if we use classification accuracy, then any model that achieves an accuracy less than about 73.5% does not have skill on this dataset.
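
This no-skill baseline can be sketched directly (the use of DummyClassifier here is an illustrative choice, not part of the tutorial's code): predicting the majority class for every example yields the roughly 73.5 percent figure.

```python
# The no-skill accuracy is just the majority-class proportion (225 vs. 81).
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([1] * 225 + [2] * 81)  # class labels as counted above
X = np.zeros((len(y), 1))           # features are irrelevant for this baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
print('no-skill accuracy: %.3f' % baseline.score(X, y))  # about 0.735
```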

Now that we are familiar with the dataset, let’s explore how we might develop a neural network model.

We will develop a Multilayer Perceptron (MLP) model for the dataset using TensorFlow.

We cannot know what model architecture or learning hyperparameters would be good or best for this dataset, so we must experiment and discover what works well.

Given that the dataset is small, a small batch size is probably a good idea, e.g. 16 or 32 rows. Using the Adam version of stochastic gradient descent is a good idea when getting started as it will automatically adapt the learning rate and works well on most datasets.

Before we evaluate models in earnest, it is a good idea to review the learning dynamics and tune the model architecture and learning configuration until we have stable learning dynamics, then look at getting the most out of the model.

We can do this by using a simple train/test split of the data and review plots of the learning curves. This will help us see if we are over-learning or under-learning; then we can adapt the configuration accordingly.

First, we must ensure all input variables are floating-point values and encode the target label as integer values 0 and 1.

...
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)

Next, we can split the dataset into input and output variables, then into equal-sized (50/50) train and test sets.

We must ensure that the split is stratified by the class label, ensuring that the train and test sets have the same distribution of class labels as the full dataset.

...
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=3)

We can define a minimal MLP model. In this case, we will use one hidden layer with 10 nodes and one output layer (chosen arbitrarily). We will use the ReLU activation function in the hidden layer and the “*he_normal*” weight initialization, as together, they are a good practice.

The output of the model is a sigmoid activation for binary classification and we will minimize binary cross-entropy loss.

...
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')

We will fit the model for 200 training epochs (chosen arbitrarily) with a batch size of 16 because it is a small dataset.

We are fitting the model on the raw data, which may not be ideal, but it is an important starting point.

...
# fit the model
history = model.fit(X_train, y_train, epochs=200, batch_size=16, verbose=0, validation_data=(X_test,y_test))

At the end of training, we will evaluate the model’s performance on the test dataset and report performance as the classification accuracy.

...
# predict test set (threshold probabilities; predict_classes was removed from recent Keras versions)
yhat = (model.predict(X_test) > 0.5).astype('int32').flatten()
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)

Finally, we will plot learning curves of the cross-entropy loss on the train and test sets during training.

...
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Tying this all together, the complete example of evaluating our first MLP on the cancer survival dataset is listed below.

# fit a simple mlp model on the haberman dataset and review learning curves
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=3)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model
history = model.fit(X_train, y_train, epochs=200, batch_size=16, verbose=0, validation_data=(X_test,y_test))
# predict test set (threshold probabilities; predict_classes was removed from recent Keras versions)
yhat = (model.predict(X_test) > 0.5).astype('int32').flatten()
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Running the example first fits the model on the training dataset, then reports the classification accuracy on the test dataset.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

In this case, we can see that the model performs better than a no-skill model, given that the accuracy is above about 73.5%.

Accuracy: 0.765

Line plots of the loss on the train and test sets are then created.

We can see that the model quickly finds a good fit on the dataset and does not appear to be over or underfitting.

Now that we have some idea of the learning dynamics for a simple MLP model on the dataset, we can look at developing a more robust evaluation of model performance on the dataset.

The k-fold cross-validation procedure can provide a more reliable estimate of MLP performance, although it can be very slow.

This is because k models must be fit and evaluated. This is not a problem when the dataset size is small, such as the cancer survival dataset.

We can use the StratifiedKFold class and enumerate each fold manually, fit the model, evaluate it, and then report the mean of the evaluation scores at the end of the procedure.

...
# prepare cross validation (stratified so each fold preserves the class ratio)
kfold = StratifiedKFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
    # fit and evaluate the model...
    ...
...
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

We can use this framework to develop a reliable estimate of MLP model performance with our base configuration, and even with a range of different data preparations, model architectures, and learning configurations.

It is important that we first developed an understanding of the learning dynamics of the model on the dataset in the previous section before using k-fold cross-validation to estimate the performance. If we started to tune the model directly, we might get good results, but if not, we might have no idea of why, e.g. that the model was over or under fitting.

If we make large changes to the model again, it is a good idea to go back and confirm that the model is converging appropriately.

The complete example of this framework to evaluate the base MLP model from the previous section is listed below.

# k-fold cross-validation of base model for the haberman dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# prepare cross validation (shuffle=True is required when setting random_state)
kfold = StratifiedKFold(10, shuffle=True, random_state=1)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
    # split data
    X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
    # determine the number of input features
    n_features = X.shape[1]
    # define model
    model = Sequential()
    model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
    model.add(Dense(1, activation='sigmoid'))
    # compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy')
    # fit the model
    model.fit(X_train, y_train, epochs=200, batch_size=16, verbose=0)
    # predict test set (threshold probabilities; predict_classes was removed from recent Keras versions)
    yhat = (model.predict(X_test) > 0.5).astype('int32').flatten()
    # evaluate predictions
    score = accuracy_score(y_test, yhat)
    print('>%.3f' % score)
    scores.append(score)
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the model performance for each iteration of the evaluation procedure and reports the mean and standard deviation of classification accuracy at the end of the run.


In this case, we can see that the MLP model achieved a mean accuracy of about 75.2 percent, which is pretty close to our rough estimate in the previous section.

This confirms our expectation that the base model configuration may work better than a naive model for this dataset.

>0.742
>0.774
>0.774
>0.806
>0.742
>0.710
>0.767
>0.800
>0.767
>0.633
Mean Accuracy: 0.752 (0.048)

Is this a good result?

In fact, this is a challenging classification problem and achieving a score above about 74.5% is good.

Next, let’s look at how we might fit a final model and use it to make predictions.

Once we choose a model configuration, we can train a final model on all available data and use it to make predictions on new data.

In this case, we will use the base model configuration with a small batch size as our final model.

We can prepare the data and fit the model as before, although on the entire dataset instead of a training subset of the dataset.

...
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
le = LabelEncoder()
y = le.fit_transform(y)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')

We can then use this model to make predictions on new data.

First, we can define a row of new data.

...
# define a row of new data
row = [30,64,1]

Note: I took this row from the first row of the dataset and the expected label is a ‘1’.

We can then make a prediction.

...
# make prediction (threshold probabilities; predict_classes was removed from recent Keras versions)
yhat = (model.predict([row]) > 0.5).astype('int32').flatten()

Then we invert the transform on the prediction, so we can interpret the result using the original class label (which is just an integer for this dataset).

...
# invert transform to get label for class
yhat = le.inverse_transform(yhat)

And in this case, we will simply report the prediction.

...
# report prediction
print('Predicted: %s' % (yhat[0]))

Tying this all together, the complete example of fitting a final model for the haberman dataset and using it to make a prediction on new data is listed below.

# fit a final model and make predictions on new data for the haberman dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
le = LabelEncoder()
y = le.fit_transform(y)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model
model.fit(X, y, epochs=200, batch_size=16, verbose=0)
# define a row of new data
row = [30,64,1]
# make prediction (threshold probabilities; predict_classes was removed from recent Keras versions)
yhat = (model.predict([row]) > 0.5).astype('int32').flatten()
# invert transform to get label for class
yhat = le.inverse_transform(yhat)
# report prediction
print('Predicted: %s' % (yhat[0]))

Running the example fits the model on the entire dataset and makes a prediction for a single row of new data.


In this case, we can see that the model predicted a “1” label for the input row.

Predicted: 1

This section provides more resources on the topic if you are looking to go deeper.

- How to Develop a Probabilistic Model of Breast Cancer Patient Survival
- How to Develop a Neural Net for Predicting Disturbances in the Ionosphere
- Best Results for Standard Machine Learning Datasets
- TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras
- A Gentle Introduction to k-fold Cross-Validation

In this tutorial, you discovered how to develop a Multilayer Perceptron neural network model for the cancer survival binary classification dataset.

Specifically, you learned:

- How to load and summarize the cancer survival dataset and use the results to suggest data preparations and model configurations to use.
- How to explore the learning dynamics of simple MLP models on the dataset.
- How to develop robust estimates of model performance, tune model performance and make predictions on new data.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Neural Network Models for Combined Classification and Regression appeared first on Machine Learning Mastery.

Some prediction problems require predicting both numeric values and a class label for the same input.

A simple approach is to develop both regression and classification predictive models on the same data and use the models sequentially.

An alternative and often more effective approach is to develop a single neural network model that can predict both a numeric and class label value from the same input. This is called a **multi-output model** and can be relatively easy to develop and evaluate using modern deep learning libraries such as Keras and TensorFlow.

In this tutorial, you will discover how to develop a neural network for combined regression and classification predictions.

After completing this tutorial, you will know:

- Some prediction problems require predicting both numeric and class label values for each input example.
- How to develop separate regression and classification models for problems that require multiple outputs.
- How to develop and evaluate a neural network model capable of making simultaneous regression and classification predictions.

Let’s get started.

This tutorial is divided into three parts; they are:

- Single Model for Regression and Classification
- Separate Regression and Classification Models
  - Abalone Dataset
  - Regression Model
  - Classification Model
- Combined Regression and Classification Models

It is common to develop a deep learning neural network model for a regression or classification problem, but on some predictive modeling tasks, we may want to develop a single model that can make both regression and classification predictions.

Regression refers to predictive modeling problems that involve predicting a numeric value given an input.

Classification refers to predictive modeling problems that involve predicting a class label or probability of class labels for a given input.

For more on the difference between classification and regression, see the tutorial:

There may be some problems where we want to predict both a numerical value and a classification value.

One approach to solving this problem is to develop a separate model for each prediction that is required.

The problem with this approach is that the predictions made by the separate models may diverge.

An alternate approach that can be used when using neural network models is to develop a single model capable of making separate predictions for a numeric and class output for the same input.

This is called a multi-output neural network model.

The benefit of this type of model is that we have a single model to develop and maintain instead of two models and that training and updating the model on both output types at the same time may offer more consistency in the predictions between the two output types.

We will develop a multi-output neural network model capable of making regression and classification predictions at the same time.

First, let’s select a dataset where this requirement makes sense and start by developing separate models for both regression and classification predictions.

In this section, we will start by selecting a real dataset where we may want regression and classification predictions at the same time, then develop separate models for each type of prediction.

We will use the “*abalone*” dataset.

Determining the age of an abalone is a time-consuming task and it is desirable to determine the age from physical details alone.

This is a dataset that describes the physical details of abalone and requires predicting the number of rings of the abalone, which is a proxy for the age of the creature.

You can learn more about the dataset from here:

The “*age*” can be predicted as either a numerical value (in years) or a class label (ordinal year as a class).

No need to download the dataset as we will download it automatically as part of the worked examples.

The dataset provides an example of a problem where we may want to predict both a numerical value and a class label for each input.

First, let’s develop an example to download and summarize the dataset.

```python
# load and summarize the abalone dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
```

Running the example first downloads and summarizes the shape of the dataset.

We can see that there are 4,177 examples (rows) that we can use to train and evaluate a model and 9 features (columns) including the target variable.

We can see that all input variables are numeric except the first, which is a string value.

To keep data preparation simple, we will drop the first column from our models and focus on modeling the numeric input values.

```
(4177, 9)
   0      1      2      3       4       5       6      7   8
0  M  0.455  0.365  0.095  0.5140  0.2245  0.1010  0.150  15
1  M  0.350  0.265  0.090  0.2255  0.0995  0.0485  0.070   7
2  F  0.530  0.420  0.135  0.6770  0.2565  0.1415  0.210   9
3  M  0.440  0.365  0.125  0.5160  0.2155  0.1140  0.155  10
4  I  0.330  0.255  0.080  0.2050  0.0895  0.0395  0.055   7
```

We can use the data as the basis for developing separate regression and classification Multilayer Perceptron (MLP) neural network models.

**Note**: we are not trying to develop an optimal model for this dataset; instead we are demonstrating a specific technique: developing a model that can make both regression and classification predictions.

In this section, we will develop a regression MLP model for the abalone dataset.

First, we must separate the columns into input and output elements and drop the first column that contains string values.

We will also force all loaded columns to have a float type (expected by neural network models) and record the number of input features, which will need to be known by the model later.

```python
...
# split into input (X) and output (y) variables
X, y = dataset[:, 1:-1], dataset[:, -1]
X, y = X.astype('float'), y.astype('float')
n_features = X.shape[1]
```

Next, we can split the dataset into a train and test dataset.

We will use a 67% random sample to train the model and the remaining 33% to evaluate the model.

```python
...
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
```

We can then define an MLP neural network model.

The model will have two hidden layers, the first with 20 nodes and the second with 10 nodes, both using ReLU activation and “*he normal*” weight initialization (a good practice). The number of layers and nodes were chosen arbitrarily.

The output layer will have a single node for predicting a numeric value and a linear activation function.

```python
...
# define the keras model
model = Sequential()
model.add(Dense(20, input_dim=n_features, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='linear'))
```

The model will be trained to minimize the mean squared error (MSE) loss function using the effective Adam version of stochastic gradient descent.

```python
...
# compile the keras model
model.compile(loss='mse', optimizer='adam')
```

We will train the model for 150 epochs with a mini-batch size of 32 samples, again chosen arbitrarily.

```python
...
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=2)
```

Finally, after the model is trained, we will evaluate it on the holdout test dataset and report the mean absolute error (MAE).

```python
...
# evaluate on test set
yhat = model.predict(X_test)
error = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % error)
```

Tying this all together, the complete example of an MLP neural network for the abalone dataset framed as a regression problem is listed below.

```python
# regression mlp model for the abalone dataset
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
dataset = dataframe.values
# split into input (X) and output (y) variables
X, y = dataset[:, 1:-1], dataset[:, -1]
X, y = X.astype('float'), y.astype('float')
n_features = X.shape[1]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the keras model
model = Sequential()
model.add(Dense(20, input_dim=n_features, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='linear'))
# compile the keras model
model.compile(loss='mse', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=2)
# evaluate on test set
yhat = model.predict(X_test)
error = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % error)
```

Running the example will prepare the dataset, fit the model, and report an estimate of model error.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an error of about 1.5 (rings).

```
...
Epoch 145/150
88/88 - 0s - loss: 4.6130
Epoch 146/150
88/88 - 0s - loss: 4.6182
Epoch 147/150
88/88 - 0s - loss: 4.6277
Epoch 148/150
88/88 - 0s - loss: 4.6437
Epoch 149/150
88/88 - 0s - loss: 4.6166
Epoch 150/150
88/88 - 0s - loss: 4.6132
MAE: 1.554
```

So far so good.

Next, let’s look at developing a similar model for classification.

The abalone dataset can be framed as a classification problem where each “*ring*” integer is taken as a separate class label.

The example and model are much the same as the above example for regression, with a few important changes.

This requires first assigning a separate integer for each “*ring*” value, starting at 0 and ending at the total number of “*classes*” minus one.

This can be achieved using the LabelEncoder.

We can also record the total number of classes as the total number of unique encoded class values, which will be needed by the model later.

```python
...
# encode strings to integer
y = LabelEncoder().fit_transform(y)
n_class = len(unique(y))
```
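As a concrete illustration of the encoding, consider the ring values from the dataset preview above. This snippet is illustrative only and not part of the tutorial code; it shows that the LabelEncoder maps the sorted unique values to contiguous integers starting at 0:

```python
from numpy import unique
from sklearn.preprocessing import LabelEncoder

# ring values taken from the first five rows of the dataset preview
rings = [15, 7, 9, 10, 7]
# LabelEncoder assigns 0..n-1 to the sorted unique values: 7->0, 9->1, 10->2, 15->3
encoded = LabelEncoder().fit_transform(rings)
print(list(encoded))        # [3, 0, 1, 2, 0]
print(len(unique(encoded))) # 4 classes
```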

After splitting the data into train and test sets as before, we can define the model and change the number of outputs from the model to equal the number of classes and use the softmax activation function, common for multi-class classification.

```python
...
# define the keras model
model = Sequential()
model.add(Dense(20, input_dim=n_features, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(n_class, activation='softmax'))
```

Given we have encoded class labels as integer values, we can fit the model by minimizing the sparse categorical cross-entropy loss function, appropriate for multi-class classification tasks with integer encoded class labels.

```python
...
# compile the keras model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```

After the model is fit on the training dataset as before, we can evaluate the performance of the model by calculating the classification accuracy on the hold-out test set.

```python
...
# evaluate on test set
yhat = model.predict(X_test)
yhat = argmax(yhat, axis=-1).astype('int')
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
```

Tying this all together, the complete example of an MLP neural network for the abalone dataset framed as a classification problem is listed below.

```python
# classification mlp model for the abalone dataset
from numpy import unique
from numpy import argmax
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
dataset = dataframe.values
# split into input (X) and output (y) variables
X, y = dataset[:, 1:-1], dataset[:, -1]
X, y = X.astype('float'), y.astype('float')
n_features = X.shape[1]
# encode strings to integer
y = LabelEncoder().fit_transform(y)
n_class = len(unique(y))
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the keras model
model = Sequential()
model.add(Dense(20, input_dim=n_features, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(n_class, activation='softmax'))
# compile the keras model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=2)
# evaluate on test set
yhat = model.predict(X_test)
yhat = argmax(yhat, axis=-1).astype('int')
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
```

Running the example will prepare the dataset, fit the model, and report an estimate of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 27%.

```
...
Epoch 145/150
88/88 - 0s - loss: 1.9271
Epoch 146/150
88/88 - 0s - loss: 1.9265
Epoch 147/150
88/88 - 0s - loss: 1.9265
Epoch 148/150
88/88 - 0s - loss: 1.9271
Epoch 149/150
88/88 - 0s - loss: 1.9262
Epoch 150/150
88/88 - 0s - loss: 1.9260
Accuracy: 0.274
```

So far so good.

Next, let’s look at developing a combined model capable of both regression and classification predictions.

In this section, we can develop a single MLP neural network model that can make both regression and classification predictions for a single input.

This is called a multi-output model and can be developed using the functional Keras API.

For more on this functional API, which can be tricky for beginners, see the tutorials:

- TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras
- How to Use the Keras Functional API for Deep Learning

First, the dataset must be prepared.

We can prepare the dataset as we did before for classification, although we should save the encoded target variable with a separate name to differentiate it from the raw target variable values.

```python
...
# encode strings to integer
y_class = LabelEncoder().fit_transform(y)
n_class = len(unique(y_class))
```

We can then split the input, raw output, and encoded output variables into train and test sets.

```python
...
# split data into train and test sets
X_train, X_test, y_train, y_test, y_train_class, y_test_class = train_test_split(X, y, y_class, test_size=0.33, random_state=1)
```

Next, we can define the model using the functional API.

The model takes the same number of inputs as before with the standalone models and uses two hidden layers configured in the same way.

```python
...
# input
visible = Input(shape=(n_features,))
hidden1 = Dense(20, activation='relu', kernel_initializer='he_normal')(visible)
hidden2 = Dense(10, activation='relu', kernel_initializer='he_normal')(hidden1)
```

We can then define two separate output layers that connect to the second hidden layer of the model.

The first is a regression output layer that has a single node and a linear activation function.

```python
...
# regression output
out_reg = Dense(1, activation='linear')(hidden2)
```

The second is a classification output layer that has one node for each class being predicted and uses a softmax activation function.

```python
...
# classification output
out_clas = Dense(n_class, activation='softmax')(hidden2)
```

We can then define the model with a single input layer and two output layers.

```python
...
# define model
model = Model(inputs=visible, outputs=[out_reg, out_clas])
```

Given the two output layers, we can compile the model with two loss functions, mean squared error loss for the first (regression) output layer and sparse categorical cross-entropy for the second (classification) output layer.

```python
...
# compile the keras model
model.compile(loss=['mse','sparse_categorical_crossentropy'], optimizer='adam')
```

We can also create a plot of the model for reference.

This requires that pydot and graphviz are installed. If this is a problem, you can comment out this line and the import statement for the *plot_model()* function.

```python
...
# plot graph of model
plot_model(model, to_file='model.png', show_shapes=True)
```

Each time the model makes a prediction, it will predict two values.

Similarly, when training the model, it will need one target variable per sample for each output.

As such, we can train the model, carefully providing both the regression target and classification target data to each output of the model.

```python
...
# fit the keras model on the dataset
model.fit(X_train, [y_train,y_train_class], epochs=150, batch_size=32, verbose=2)
```

The fit model can then make a regression and classification prediction for each example in the hold-out test set.

```python
...
# make predictions on test set
yhat1, yhat2 = model.predict(X_test)
```

The first array can be used to evaluate the regression predictions via mean absolute error.

```python
...
# calculate error for regression model
error = mean_absolute_error(y_test, yhat1)
print('MAE: %.3f' % error)
```

The second array can be used to evaluate the classification predictions via classification accuracy.

```python
...
# evaluate accuracy for classification model
yhat2 = argmax(yhat2, axis=-1).astype('int')
acc = accuracy_score(y_test_class, yhat2)
print('Accuracy: %.3f' % acc)
```

And that’s it.

Tying this together, the complete example of training and evaluating a multi-output model for combined regression and classification predictions on the abalone dataset is listed below.

```python
# mlp for combined regression and classification predictions on the abalone dataset
from numpy import unique
from numpy import argmax
from pandas import read_csv
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import plot_model
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
dataset = dataframe.values
# split into input (X) and output (y) variables
X, y = dataset[:, 1:-1], dataset[:, -1]
X, y = X.astype('float'), y.astype('float')
n_features = X.shape[1]
# encode strings to integer
y_class = LabelEncoder().fit_transform(y)
n_class = len(unique(y_class))
# split data into train and test sets
X_train, X_test, y_train, y_test, y_train_class, y_test_class = train_test_split(X, y, y_class, test_size=0.33, random_state=1)
# input
visible = Input(shape=(n_features,))
hidden1 = Dense(20, activation='relu', kernel_initializer='he_normal')(visible)
hidden2 = Dense(10, activation='relu', kernel_initializer='he_normal')(hidden1)
# regression output
out_reg = Dense(1, activation='linear')(hidden2)
# classification output
out_clas = Dense(n_class, activation='softmax')(hidden2)
# define model
model = Model(inputs=visible, outputs=[out_reg, out_clas])
# compile the keras model
model.compile(loss=['mse','sparse_categorical_crossentropy'], optimizer='adam')
# plot graph of model
plot_model(model, to_file='model.png', show_shapes=True)
# fit the keras model on the dataset
model.fit(X_train, [y_train,y_train_class], epochs=150, batch_size=32, verbose=2)
# make predictions on test set
yhat1, yhat2 = model.predict(X_test)
# calculate error for regression model
error = mean_absolute_error(y_test, yhat1)
print('MAE: %.3f' % error)
# evaluate accuracy for classification model
yhat2 = argmax(yhat2, axis=-1).astype('int')
acc = accuracy_score(y_test_class, yhat2)
print('Accuracy: %.3f' % acc)
```

Running the example will prepare the dataset, fit the model, and report estimates of model performance on both outputs.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

A plot of the multi-output model is created, clearly showing the regression (left) and classification (right) output layers connected to the second hidden layer of the model.

In this case, we can see that the model achieved both a reasonable error of about 1.495 (rings) and a similar accuracy as before of about 25.6%.

```
...
Epoch 145/150
88/88 - 0s - loss: 6.5707 - dense_2_loss: 4.5396 - dense_3_loss: 2.0311
Epoch 146/150
88/88 - 0s - loss: 6.5753 - dense_2_loss: 4.5466 - dense_3_loss: 2.0287
Epoch 147/150
88/88 - 0s - loss: 6.5970 - dense_2_loss: 4.5723 - dense_3_loss: 2.0247
Epoch 148/150
88/88 - 0s - loss: 6.5640 - dense_2_loss: 4.5389 - dense_3_loss: 2.0251
Epoch 149/150
88/88 - 0s - loss: 6.6053 - dense_2_loss: 4.5827 - dense_3_loss: 2.0226
Epoch 150/150
88/88 - 0s - loss: 6.5754 - dense_2_loss: 4.5524 - dense_3_loss: 2.0230
MAE: 1.495
Accuracy: 0.256
```

This section provides more resources on the topic if you are looking to go deeper.

- Difference Between Classification and Regression in Machine Learning
- TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras
- Best Results for Standard Machine Learning Datasets
- How to Use the Keras Functional API for Deep Learning

In this tutorial, you discovered how to develop a neural network for combined regression and classification predictions.

Specifically, you learned:

- Some prediction problems require predicting both numeric and class label values for each input example.
- How to develop separate regression and classification models for problems that require multiple outputs.
- How to develop and evaluate a neural network model capable of making simultaneous regression and classification predictions.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Neural Network Models for Combined Classification and Regression appeared first on Machine Learning Mastery.

]]>The post Iterated Local Search From Scratch in Python appeared first on Machine Learning Mastery.

]]>**Iterated Local Search** is a stochastic global optimization algorithm.

It involves the repeated application of a local search algorithm to modified versions of a good solution found previously. In this way, it is like a clever version of the stochastic hill climbing with random restarts algorithm.

The intuition behind the algorithm is that random restarts can help to locate many local optima in a problem and that better local optima are often close to other local optima. Therefore modest perturbations to existing local optima may locate better or even best solutions to an optimization problem.

In this tutorial, you will discover how to implement the iterated local search algorithm from scratch.

After completing this tutorial, you will know:

- Iterated local search is a stochastic global search optimization algorithm that is a smarter version of stochastic hill climbing with random restarts.
- How to implement stochastic hill climbing with random restarts from scratch.
- How to implement and apply the iterated local search algorithm to a nonlinear objective function.

Let’s get started.

This tutorial is divided into five parts; they are:

- What Is Iterated Local Search
- Ackley Objective Function
- Stochastic Hill Climbing Algorithm
- Stochastic Hill Climbing With Random Restarts
- Iterated Local Search Algorithm

Iterated Local Search, or ILS for short, is a stochastic global search optimization algorithm.

It is related to, or an extension of, stochastic hill climbing and stochastic hill climbing with random restarts.

It’s essentially a more clever version of Hill-Climbing with Random Restarts.

— Page 26, Essentials of Metaheuristics, 2011.

Stochastic hill climbing is a local search algorithm that involves making random modifications to an existing solution and accepting the modification only if it results in better results than the current working solution.

Local search algorithms in general can get stuck in local optima. One approach to address this problem is to restart the search from a new randomly selected starting point. The restart procedure can be performed many times and may be triggered after a fixed number of function evaluations or if no further improvement is seen for a given number of algorithm iterations. This algorithm is called stochastic hill climbing with random restarts.

The simplest possibility to improve upon a cost found by LocalSearch is to repeat the search from another starting point.

— Page 132, Handbook of Metaheuristics, 3rd edition, 2019.

Iterated local search is similar to stochastic hill climbing with random restarts, except rather than selecting a random starting point for each restart, a point is selected based on a modified version of the best point found so far during the broader search.

The perturbation of the best solution so far is like a large jump in the search space to a new region, whereas the perturbations made by the stochastic hill climbing algorithm are much smaller, confined to a specific region of the search space.

The heuristic here is that you can often find better local optima near to the one you’re presently in, and walking from local optimum to local optimum in this way often outperforms just trying new locations entirely at random.

— Page 26, Essentials of Metaheuristics, 2011.

This allows the search to be performed at two levels. The hill climbing algorithm is the local search for getting the most out of a specific candidate solution or region of the search space, and the restart approach allows different regions of the search space to be explored.

In this way, the iterated local search algorithm explores multiple local optima in the search space, increasing the likelihood of locating the global optimum.

Iterated local search was proposed for combinatorial optimization problems, such as the traveling salesman problem (TSP), although it can be applied to continuous function optimization by using different step sizes in the search space: smaller steps for the hill climbing and larger steps for the random restart.
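The two-level structure described above can be sketched compactly. The sketch below is illustrative only: it uses a simple sphere objective rather than the Ackley function developed in this tutorial, the helper names *local_search()* and *iterated_local_search()* are assumptions, and bounds checking is omitted for brevity.

```python
from numpy import asarray
from numpy.random import rand, randn, seed

# illustrative objective: 2D sphere with minimum 0.0 at [0, 0]
def objective(v):
	return v[0]**2 + v[1]**2

# local search: small Gaussian steps, keep only improvements
def local_search(objective, start, n_iter, step_size):
	best, best_eval = start, objective(start)
	for _ in range(n_iter):
		candidate = best + randn(len(best)) * step_size
		candidate_eval = objective(candidate)
		if candidate_eval <= best_eval:
			best, best_eval = candidate, candidate_eval
	return best, best_eval

# iterated local search: perturb the best-so-far, then refine with local search
def iterated_local_search(objective, bounds, n_restarts, n_iter, step_size, pert_size):
	# random initial point, followed by an initial local search
	start = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	best, best_eval = local_search(objective, start, n_iter, step_size)
	for _ in range(n_restarts):
		# larger jump from the current best (the perturbation)
		perturbed = best + randn(len(bounds)) * pert_size
		solution, solution_eval = local_search(objective, perturbed, n_iter, step_size)
		# accept only if the refined solution improves on the best so far
		if solution_eval < best_eval:
			best, best_eval = solution, solution_eval
	return best, best_eval

seed(1)
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
best, score = iterated_local_search(objective, bounds, 20, 100, 0.05, 1.0)
```

The key difference from plain random restarts is the perturbation line: each restart begins near the current best rather than at a fresh random point.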

Now that we are familiar with the Iterated Local Search algorithm, let’s explore how to implement the algorithm from scratch.

First, let’s define a challenging optimization problem as the basis for implementing the Iterated Local Search algorithm.

The Ackley function is an example of a multimodal objective function that has a single global optimum and multiple local optima in which a local search might get stuck.

As such, a global optimization technique is required. It is a two-dimensional objective function that has a global optimum at [0,0], which evaluates to 0.0.

The example below implements the Ackley function and creates a three-dimensional surface plot showing the global optimum and multiple local optima.

```python
# ackley multimodal function
from numpy import arange
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates the surface plot of the Ackley function showing the vast number of local optima.

We will use this as the basis for implementing and comparing a simple stochastic hill climbing algorithm, stochastic hill climbing with random restarts, and finally iterated local search.

We would expect a stochastic hill climbing algorithm to get stuck easily in local minima. We would expect stochastic hill climbing with restarts to find many local minima, and we would expect iterated local search to perform better than either method on this problem if configured appropriately.

Core to the Iterated Local Search algorithm is a local search, and in this tutorial, we will use the Stochastic Hill Climbing algorithm for this purpose.

The Stochastic Hill Climbing algorithm involves first generating a random starting point and current working solution, then generating perturbed versions of the current working solution and accepting them if they are better than the current working solution.

Given that we are working on a continuous optimization problem, a solution is a vector of values to be evaluated by the objective function, in this case, a point in a two-dimensional space bounded by -5 and 5.

We can generate a random point by sampling the search space with a uniform probability distribution. For example:

```python
...
# generate a random point in the search space
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
```

We can generate perturbed versions of a currently working solution using a Gaussian probability distribution with the mean of the current values in the solution and a standard deviation controlled by a hyperparameter that controls how far the search is allowed to explore from the current working solution.

We will refer to this hyperparameter as “*step_size*“, for example:

```python
...
# generate a perturbed version of a current working solution
candidate = solution + randn(len(bounds)) * step_size
```

Importantly, we must check that generated solutions are within the search space.

This can be achieved with a custom function named *in_bounds()* that takes a candidate solution and the bounds of the search space and returns *True* if the point is in the search space, *False* otherwise.

```python
# check if a point is within the bounds of the search
def in_bounds(point, bounds):
	# enumerate all dimensions of the point
	for d in range(len(bounds)):
		# check if out of bounds for this dimension
		if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
			return False
	return True
```

This function can then be called during the hill climb to confirm that new points are in the bounds of the search space, and if not, new points can be generated.

Tying this together, the function *hillclimbing()* below implements the stochastic hill climbing local search algorithm. It takes the name of the objective function, bounds of the problem, number of iterations, and step size as arguments and returns the best solution and its evaluation.

```python
# hill climbing local search algorithm
def hillclimbing(objective, bounds, n_iterations, step_size):
	# generate an initial point
	solution = None
	while solution is None or not in_bounds(solution, bounds):
		solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# evaluate the initial point
	solution_eval = objective(solution)
	# run the hill climb
	for i in range(n_iterations):
		# take a step
		candidate = None
		while candidate is None or not in_bounds(candidate, bounds):
			candidate = solution + randn(len(bounds)) * step_size
		# evaluate candidate point
		candidate_eval = objective(candidate)
		# check if we should keep the new point
		if candidate_eval <= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solution, solution_eval]
```

We can test this algorithm on the Ackley function.

We will fix the seed for the pseudorandom number generator to ensure we get the same results each time the code is run.

The algorithm will be run for 1,000 iterations and a step size of 0.05 units will be used; both hyperparameters were chosen after a little trial and error.

At the end of the run, we will report the best solution found.

```python
...
# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iterations = 1000
# define the maximum step size
step_size = 0.05
# perform the hill climbing search
best, score = hillclimbing(objective, bounds, n_iterations, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Tying this together, the complete example of applying the stochastic hill climbing algorithm to the Ackley objective function is listed below.

```python
# hill climbing search of the ackley objective function
from numpy import asarray
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy.random import randn
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(v):
	x, y = v
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# check if a point is within the bounds of the search
def in_bounds(point, bounds):
	# enumerate all dimensions of the point
	for d in range(len(bounds)):
		# check if out of bounds for this dimension
		if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
			return False
	return True

# hill climbing local search algorithm
def hillclimbing(objective, bounds, n_iterations, step_size):
	# generate an initial point
	solution = None
	while solution is None or not in_bounds(solution, bounds):
		solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# evaluate the initial point
	solution_eval = objective(solution)
	# run the hill climb
	for i in range(n_iterations):
		# take a step
		candidate = None
		while candidate is None or not in_bounds(candidate, bounds):
			candidate = solution + randn(len(bounds)) * step_size
		# evaluate candidate point
		candidate_eval = objective(candidate)
		# check if we should keep the new point
		if candidate_eval <= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solution, solution_eval]

# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iterations = 1000
# define the maximum step size
step_size = 0.05
# perform the hill climbing search
best, score = hillclimbing(objective, bounds, n_iterations, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Running the example performs the stochastic hill climbing search on the objective function. Each improvement found during the search is reported and the best solution is then reported at the end of the search.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see about 13 improvements during the search and a final solution of about f(-0.981, 1.965), resulting in an evaluation of about 5.381, which is far from f(0.0, 0.0) = 0.
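As a quick sanity check (not part of the original listing), we can confirm numerically that the Ackley function defined above is exactly zero at the optimum f(0, 0), since the two exponential terms reduce to -20 and -e, which are cancelled by the +e + 20 terms:

```python
# sanity-check the Ackley objective: the global optimum is f(0, 0) = 0
from numpy import exp, sqrt, cos, e, pi

# objective function (same definition as in the listing above)
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# value at the optimum is zero (up to floating-point error)
print(objective([0.0, 0.0]))
# a point near the local optimum found by the search evaluates far worse
print(objective([-0.981, 1.966]))
```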

>0 f([-0.85618854 2.1495965 ]) = 6.46986
>1 f([-0.81291816 2.03451957]) = 6.07149
>5 f([-0.82903902 2.01531685]) = 5.93526
>7 f([-0.83766043 1.97142393]) = 5.82047
>9 f([-0.89269139 2.02866012]) = 5.68283
>12 f([-0.8988359 1.98187164]) = 5.55899
>13 f([-0.9122303 2.00838942]) = 5.55566
>14 f([-0.94681334 1.98855174]) = 5.43024
>15 f([-0.98117198 1.94629146]) = 5.39010
>23 f([-0.97516403 1.97715161]) = 5.38735
>39 f([-0.98628044 1.96711371]) = 5.38241
>362 f([-0.9808789 1.96858459]) = 5.38233
>629 f([-0.98102417 1.96555308]) = 5.38194
Done!
f([-0.98102417 1.96555308]) = 5.381939

Next, we will modify the algorithm to perform random restarts and see if we can achieve better results.

The Stochastic Hill Climbing With Random Restarts algorithm involves the repeated running of the Stochastic Hill Climbing algorithm and keeping track of the best solution found.

First, let’s modify the *hillclimbing()* function to take the starting point of the search rather than generating it randomly. This will help later when we implement the Iterated Local Search algorithm.

# hill climbing local search algorithm
def hillclimbing(objective, bounds, n_iterations, step_size, start_pt):
    # store the initial point
    solution = start_pt
    # evaluate the initial point
    solution_eval = objective(solution)
    # run the hill climb
    for i in range(n_iterations):
        # take a step
        candidate = None
        while candidate is None or not in_bounds(candidate, bounds):
            candidate = solution + randn(len(bounds)) * step_size
        # evaluate candidate point
        candidate_eval = objective(candidate)
        # check if we should keep the new point
        if candidate_eval <= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
    return [solution, solution_eval]

Next, we can implement the random restart algorithm by repeatedly calling the *hillclimbing()* function a fixed number of times.

For each call, we will generate a new randomly selected starting point for the hill climbing search.

...
# generate a random initial point for the search
start_pt = None
while start_pt is None or not in_bounds(start_pt, bounds):
    start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# perform a stochastic hill climbing search
solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)

We can then inspect the result and keep it if it is better than any result of the search we have seen so far.

...
# check for new best
if solution_eval < best_eval:
    best, best_eval = solution, solution_eval
    print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))

Tying this together, the *random_restarts()* function below implements the stochastic hill climbing algorithm with random restarts.

# hill climbing with random restarts algorithm
def random_restarts(objective, bounds, n_iter, step_size, n_restarts):
    best, best_eval = None, 1e+10
    # enumerate restarts
    for n in range(n_restarts):
        # generate a random initial point for the search
        start_pt = None
        while start_pt is None or not in_bounds(start_pt, bounds):
            start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # perform a stochastic hill climbing search
        solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)
        # check for new best
        if solution_eval < best_eval:
            best, best_eval = solution, solution_eval
            print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))
    return [best, best_eval]

We can then apply this algorithm to the Ackley objective function. In this case, we will limit the number of random restarts to 30, chosen arbitrarily.

The complete example is listed below.

# hill climbing search with random restarts of the ackley objective function
from numpy import asarray
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy.random import randn
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# check if a point is within the bounds of the search
def in_bounds(point, bounds):
    # enumerate all dimensions of the point
    for d in range(len(bounds)):
        # check if out of bounds for this dimension
        if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
            return False
    return True

# hill climbing local search algorithm
def hillclimbing(objective, bounds, n_iterations, step_size, start_pt):
    # store the initial point
    solution = start_pt
    # evaluate the initial point
    solution_eval = objective(solution)
    # run the hill climb
    for i in range(n_iterations):
        # take a step
        candidate = None
        while candidate is None or not in_bounds(candidate, bounds):
            candidate = solution + randn(len(bounds)) * step_size
        # evaluate candidate point
        candidate_eval = objective(candidate)
        # check if we should keep the new point
        if candidate_eval <= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
    return [solution, solution_eval]

# hill climbing with random restarts algorithm
def random_restarts(objective, bounds, n_iter, step_size, n_restarts):
    best, best_eval = None, 1e+10
    # enumerate restarts
    for n in range(n_restarts):
        # generate a random initial point for the search
        start_pt = None
        while start_pt is None or not in_bounds(start_pt, bounds):
            start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # perform a stochastic hill climbing search
        solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)
        # check for new best
        if solution_eval < best_eval:
            best, best_eval = solution, solution_eval
            print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))
    return [best, best_eval]

# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iter = 1000
# define the maximum step size
step_size = 0.05
# total number of random restarts
n_restarts = 30
# perform the hill climbing search
best, score = random_restarts(objective, bounds, n_iter, step_size, n_restarts)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example will perform a stochastic hill climbing with random restarts search for the Ackley objective function. Each time an improved overall solution is discovered, it is reported and the final best solution found by the search is summarized.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see three improvements during the search and that the best solution found was approximately f(0.002, 0.002), which evaluated to about 0.009, which is much better than a single run of the hill climbing algorithm.

Restart 0, best: f([-0.98102417 1.96555308]) = 5.38194
Restart 2, best: f([1.96522236 0.98120013]) = 5.38191
Restart 4, best: f([0.00223194 0.00258853]) = 0.00998
Done!
f([0.00223194 0.00258853]) = 0.009978

Next, let’s look at how we can implement the iterated local search algorithm.

The Iterated Local Search algorithm is a modified version of the stochastic hill climbing with random restarts algorithm.

The important difference is that the starting point for each application of the stochastic hill climbing algorithm is a perturbed version of the best point found so far.

We can implement this algorithm by using the *random_restarts()* function as a starting point. Each restart iteration, we can generate a modified version of the best solution found so far instead of a random starting point.

This can be achieved by using a step size hyperparameter, much like is used in the stochastic hill climber. In this case, a larger step size value will be used given the need for larger perturbations in the search space.

...
# generate an initial point as a perturbed version of the last best
start_pt = None
while start_pt is None or not in_bounds(start_pt, bounds):
    start_pt = best + randn(len(bounds)) * p_size

Tying this together, the *iterated_local_search()* function is defined below.

# iterated local search algorithm
def iterated_local_search(objective, bounds, n_iter, step_size, n_restarts, p_size):
    # define starting point
    best = None
    while best is None or not in_bounds(best, bounds):
        best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # evaluate current best point
    best_eval = objective(best)
    # enumerate restarts
    for n in range(n_restarts):
        # generate an initial point as a perturbed version of the last best
        start_pt = None
        while start_pt is None or not in_bounds(start_pt, bounds):
            start_pt = best + randn(len(bounds)) * p_size
        # perform a stochastic hill climbing search
        solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)
        # check for new best
        if solution_eval < best_eval:
            best, best_eval = solution, solution_eval
            print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))
    return [best, best_eval]

We can then apply the algorithm to the Ackley objective function. In this case, we will use a larger step size value of 1.0 for the random restarts, chosen after a little trial and error.

The complete example is listed below.

# iterated local search of the ackley objective function
from numpy import asarray
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy.random import randn
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# check if a point is within the bounds of the search
def in_bounds(point, bounds):
    # enumerate all dimensions of the point
    for d in range(len(bounds)):
        # check if out of bounds for this dimension
        if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
            return False
    return True

# hill climbing local search algorithm
def hillclimbing(objective, bounds, n_iterations, step_size, start_pt):
    # store the initial point
    solution = start_pt
    # evaluate the initial point
    solution_eval = objective(solution)
    # run the hill climb
    for i in range(n_iterations):
        # take a step
        candidate = None
        while candidate is None or not in_bounds(candidate, bounds):
            candidate = solution + randn(len(bounds)) * step_size
        # evaluate candidate point
        candidate_eval = objective(candidate)
        # check if we should keep the new point
        if candidate_eval <= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
    return [solution, solution_eval]

# iterated local search algorithm
def iterated_local_search(objective, bounds, n_iter, step_size, n_restarts, p_size):
    # define starting point
    best = None
    while best is None or not in_bounds(best, bounds):
        best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # evaluate current best point
    best_eval = objective(best)
    # enumerate restarts
    for n in range(n_restarts):
        # generate an initial point as a perturbed version of the last best
        start_pt = None
        while start_pt is None or not in_bounds(start_pt, bounds):
            start_pt = best + randn(len(bounds)) * p_size
        # perform a stochastic hill climbing search
        solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)
        # check for new best
        if solution_eval < best_eval:
            best, best_eval = solution, solution_eval
            print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))
    return [best, best_eval]

# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iter = 1000
# define the maximum step size
s_size = 0.05
# total number of random restarts
n_restarts = 30
# perturbation step size
p_size = 1.0
# perform the hill climbing search
best, score = iterated_local_search(objective, bounds, n_iter, s_size, n_restarts, p_size)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example will perform an Iterated Local Search of the Ackley objective function.

Each time an improved overall solution is discovered, it is reported and the final best solution found by the search is summarized at the end of the run.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see four improvements during the search and that the best solution found was two very small inputs that are close to zero, which evaluated to about 0.0003, which is better than either a single run of the hill climber or the hill climber with restarts.

Restart 0, best: f([-0.96775653 0.96853129]) = 3.57447
Restart 3, best: f([-4.50618519e-04 9.51020713e-01]) = 2.57996
Restart 5, best: f([ 0.00137423 -0.00047059]) = 0.00416
Restart 22, best: f([ 1.16431936e-04 -3.31358206e-06]) = 0.00033
Done!
f([ 1.16431936e-04 -3.31358206e-06]) = 0.000330

This section provides more resources on the topic if you are looking to go deeper.

- Essentials of Metaheuristics, 2011.
- Handbook of Metaheuristics, 3rd edition 2019.

In this tutorial, you discovered how to implement the iterated local search algorithm from scratch.

Specifically, you learned:

- Iterated local search is a stochastic global search optimization algorithm that is a smarter version of stochastic hill climbing with random restarts.
- How to implement stochastic hill climbing with random restarts from scratch.
- How to implement and apply the iterated local search algorithm to a nonlinear objective function.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Iterated Local Search From Scratch in Python appeared first on Machine Learning Mastery.

]]>The post Develop a Neural Network for Woods Mammography Dataset appeared first on Machine Learning Mastery.

]]>It can be challenging to develop a neural network predictive model for a new dataset.

One approach is to first inspect the dataset and develop ideas for what models might work, then explore the learning dynamics of simple models on the dataset, then finally develop and tune a model for the dataset with a robust test harness.

This process can be used to develop effective neural network models for classification and regression predictive modeling problems.

In this tutorial, you will discover how to develop a Multilayer Perceptron neural network model for the Wood’s Mammography classification dataset.

After completing this tutorial, you will know:

- How to load and summarize the Wood’s Mammography dataset and use the results to suggest data preparations and model configurations to use.
- How to explore the learning dynamics of simple MLP models on the dataset.
- How to develop robust estimates of model performance, tune model performance and make predictions on new data.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Woods Mammography Dataset
- Neural Network Learning Dynamics
- Robust Model Evaluation
- Final Model and Make Predictions

The first step is to define and explore the dataset.

We will be working with the “*mammography*” standard binary classification dataset, sometimes called “*Woods Mammography*“.

The dataset is credited to Kevin Woods, et al. and the 1993 paper titled “Comparative Evaluation Of Pattern Recognition Techniques For Detection Of Microcalcifications In Mammography.”

The focus of the problem is on detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram.

There are two classes and the goal is to distinguish between microcalcifications and non-microcalcifications using the features for a given segmented object.

**Non-microcalcifications**: negative case, or majority class.**Microcalcifications**: positive case, or minority class.

The Mammography dataset is a widely used standard machine learning dataset, used to explore and demonstrate many techniques designed specifically for imbalanced classification.

**Note**: To be crystal clear, we are NOT “*solving breast cancer*“. We are exploring a standard classification dataset.

Below is a sample of the first five rows of the dataset.

0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223,'-1'
0.15549112,-0.16939038,0.67065219,-0.85955255,-0.37786573,-0.94572324,'-1'
-0.78441482,-0.44365372,5.6747053,-0.85955255,-0.37786573,-0.94572324,'-1'
0.54608818,0.13141457,-0.45638679,-0.85955255,-0.37786573,-0.94572324,'-1'
-0.10298725,-0.3949941,-0.14081588,0.97970269,-0.37786573,1.0135658,'-1'
...


We can load the dataset as a pandas DataFrame directly from the URL; for example:

# load the mammography dataset and summarize the shape
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv'
# load the dataset
df = read_csv(url, header=None)
# summarize shape
print(df.shape)

Running the example loads the dataset directly from the URL and reports the shape of the dataset.

In this case, we can confirm that the dataset has 7 variables (6 input and one output) and that the dataset has 11,183 rows of data.

This is a modestly sized dataset for a neural network and suggests that a small network would be appropriate.

It also suggests that using k-fold cross-validation would be a good idea given that it will give a more reliable estimate of model performance than a train/test split and because a single model will fit in seconds instead of hours or days with the largest datasets.

(11183, 7)

Next, we can learn more about the dataset by looking at summary statistics and a plot of the data.

# show summary statistics and plots of the mammography dataset
from pandas import read_csv
from matplotlib import pyplot
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv'
# load the dataset
df = read_csv(url, header=None)
# show summary statistics
print(df.describe())
# plot histograms
df.hist()
pyplot.show()

Running the example first loads the data and then prints summary statistics for each variable.

We can see that the values are generally small with means close to zero.

                  0             1  ...             4             5
count  1.118300e+04  1.118300e+04  ...  1.118300e+04  1.118300e+04
mean   1.096535e-10  1.297595e-09  ... -1.120680e-09  1.459483e-09
std    1.000000e+00  1.000000e+00  ...  1.000000e+00  1.000000e+00
min   -7.844148e-01 -4.701953e-01  ... -3.778657e-01 -9.457232e-01
25%   -7.844148e-01 -4.701953e-01  ... -3.778657e-01 -9.457232e-01
50%   -1.085769e-01 -3.949941e-01  ... -3.778657e-01 -9.457232e-01
75%    3.139489e-01 -7.649473e-02  ... -3.778657e-01  1.016613e+00
max    3.150844e+01  5.085849e+00  ...  2.361712e+01  1.949027e+00

A histogram plot is then created for each variable.

We can see that perhaps most variables have an exponential distribution, and perhaps variable 5 (the last input variable) is Gaussian with outliers/missing values.

We may have some benefit in using a power transform on each variable in order to make the probability distribution less skewed which will likely improve model performance.
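As a sketch of that idea (not part of the tutorial's pipeline), a power transform could be applied with scikit-learn's PowerTransformer; the synthetic exponential data below is an illustrative stand-in for the skewed input columns:

```python
# sketch: reduce skew in exponentially distributed inputs with a power transform
from numpy.random import exponential, seed
from sklearn.preprocessing import PowerTransformer

seed(1)
# stand-in for two skewed input columns (1,000 rows)
X = exponential(scale=1.0, size=(1000, 2))
# yeo-johnson handles zero and negative values, unlike box-cox
pt = PowerTransformer(method='yeo-johnson')
X_trans = pt.fit_transform(X)
# by default the transformer also standardizes to zero mean, unit variance
print(X_trans.mean(axis=0), X_trans.std(axis=0))
```

In practice the transform would be fit on the training folds only and applied to the held-out fold, e.g. inside a scikit-learn Pipeline.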

It may be helpful to know how imbalanced the dataset actually is.

We can use the Counter object to count the number of examples in each class, then use those counts to summarize the distribution.

The complete example is listed below.

# summarize the class ratio of the mammography dataset
from pandas import read_csv
from collections import Counter
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv'
# load the csv file as a data frame
dataframe = read_csv(url, header=None)
# summarize the class distribution
target = dataframe.values[:, -1]
counter = Counter(target)
for k, v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example summarizes the class distribution, confirming the severe class imbalance with approximately 98 percent for the majority class (no cancer) and approximately 2 percent for the minority class (cancer).

Class='-1', Count=10923, Percentage=97.675%
Class='1', Count=260, Percentage=2.325%

This is helpful because if we use classification accuracy, then any model that achieves an accuracy less than about 97.7% does not have skill on this dataset.
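That no-skill baseline follows directly from the class counts reported above; a quick check of the accuracy achieved by always predicting the majority class:

```python
# no-skill baseline: always predict the majority class
majority, minority = 10923, 260  # counts from the class distribution above
total = majority + minority
baseline = majority / total
print('Baseline accuracy: %.3f' % baseline)
```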

Now that we are familiar with the dataset, let’s explore how we might develop a neural network model.

We will develop a Multilayer Perceptron (MLP) model for the dataset using TensorFlow.

We cannot know what model architecture or learning hyperparameters would be good or best for this dataset, so we must experiment and discover what works well.

Given that the dataset is small, a small batch size is probably a good idea, e.g. 16 or 32 rows. Using the Adam version of stochastic gradient descent is a good idea when getting started as it will automatically adapt the learning rate and works well on most datasets.

Before we evaluate models in earnest, it is a good idea to review the learning dynamics and tune the model architecture and learning configuration until we have stable learning dynamics, then look at getting the most out of the model.

We can do this by using a simple train/test split of the data and review plots of the learning curves. This will help us see if we are over-learning or under-learning; then we can adapt the configuration accordingly.

First, we must ensure all input variables are floating-point values and encode the target label as integer values 0 and 1.

...
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)

Next, we can split the dataset into input and output variables, then into equal-sized (50/50) train and test sets.

We must ensure that the split is stratified by the class ensuring that the train and test sets have the same distribution of class labels as the main dataset.

...
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)

We can define a minimal MLP model.

In this case, we will use one hidden layer with 50 nodes and one output layer (chosen arbitrarily). We will use the ReLU activation function in the hidden layer and the “*he_normal*” weight initialization, as together, they are a good practice.

The output of the model is a sigmoid activation for binary classification and we will minimize binary cross-entropy loss.

...
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')

We will fit the model for 300 training epochs (chosen arbitrarily) with a batch size of 32 because it is a modestly sized dataset.

We are fitting the model on the raw data, which may not be ideal, but it is an important starting point.

...
# fit the model
history = model.fit(X_train, y_train, epochs=300, batch_size=32, verbose=0, validation_data=(X_test, y_test))

At the end of training, we will evaluate the model’s performance on the test dataset and report performance as the classification accuracy.

...
# predict crisp classes for the test set
# (predict_classes was removed in newer Keras; threshold the predicted probabilities instead)
yhat = (model.predict(X_test) > 0.5).astype('int32').flatten()
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)

Finally, we will plot learning curves of the cross-entropy loss on the train and test sets during training.

...
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Tying this all together, the complete example of evaluating our first MLP on the mammography dataset is listed below.

# fit a simple mlp model on the mammography dataset and review learning curves
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model
history = model.fit(X_train, y_train, epochs=300, batch_size=32, verbose=0, validation_data=(X_test, y_test))
# predict test set (threshold the predicted probabilities)
yhat = (model.predict(X_test) > 0.5).astype('int32').flatten()
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Running the example first fits the model on the training dataset, then reports the classification accuracy on the test dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case we can see that the model performs better than a no-skill model, given that the accuracy is above about 97.7 percent, in this case achieving an accuracy of about 98.8 percent.

Accuracy: 0.988

Line plots of the loss on the train and test sets are then created.

We can see that the model quickly finds a good fit on the dataset and does not appear to be over or underfitting.

Now that we have some idea of the learning dynamics for a simple MLP model on the dataset, we can look at developing a more robust evaluation of model performance on the dataset.

The k-fold cross-validation procedure can provide a more reliable estimate of MLP performance, although it can be very slow.

This is because k models must be fit and evaluated. This is not a problem when the dataset size is small, such as the mammography dataset.

We can use the StratifiedKFold class and enumerate each fold manually, fit the model, evaluate it, and then report the mean of the evaluation scores at the end of the procedure.

...
# prepare cross validation
kfold = StratifiedKFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
    # fit and evaluate the model...
    ...
...
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

We can use this framework to develop a reliable estimate of MLP model performance with our base configuration, and even with a range of different data preparations, model architectures, and learning configurations.

It is important that we first developed an understanding of the learning dynamics of the model on the dataset in the previous section before using k-fold cross-validation to estimate the performance. If we started to tune the model directly, we might get good results, but if not, we might have no idea why, e.g. whether the model was overfitting or underfitting.

If we make large changes to the model again, it is a good idea to go back and confirm that the model is converging appropriately.

The complete example of this framework to evaluate the base MLP model from the previous section is listed below.

# k-fold cross-validation of the base model for the mammography dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# prepare cross validation (shuffle is required when setting random_state)
kfold = StratifiedKFold(10, shuffle=True, random_state=1)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
    # split data
    X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
    # determine the number of input features
    n_features = X.shape[1]
    # define model
    model = Sequential()
    model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
    model.add(Dense(1, activation='sigmoid'))
    # compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy')
    # fit the model
    model.fit(X_train, y_train, epochs=300, batch_size=32, verbose=0)
    # predict test set (threshold the predicted probabilities)
    yhat = (model.predict(X_test) > 0.5).astype('int32').flatten()
    # evaluate predictions
    score = accuracy_score(y_test, yhat)
    print('>%.3f' % score)
    scores.append(score)
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the model performance each iteration of the evaluation procedure and reports the mean and standard deviation of classification accuracy at the end of the run.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model achieved a mean accuracy of about 98.7 percent, which is pretty close to our rough estimate in the previous section.

This confirms our expectation that the base model configuration may work better than a naive model for this dataset.

>0.987
>0.986
>0.989
>0.987
>0.986
>0.988
>0.989
>0.989
>0.983
>0.988
Mean Accuracy: 0.987 (0.002)

Next, let’s look at how we might fit a final model and use it to make predictions.

Once we choose a model configuration, we can train a final model on all available data and use it to make predictions on new data.

In this case, we will use the base model configuration evaluated above as our final model.

We can prepare the data and fit the model as before, although on the entire dataset instead of a training subset of the dataset.

...
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
le = LabelEncoder()
y = le.fit_transform(y)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')

We can then use this model to make predictions on new data.

First, we can define a row of new data.

```python
...
# define a row of new data
row = [0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223]
```

Note: I took this row from the first row of the dataset and the expected label is a ‘-1’.

We can then make a prediction.

```python
...
# make a crisp class prediction (predict_classes() was removed in recent versions of Keras;
# asarray is imported from numpy)
yhat = (model.predict(asarray([row])) > 0.5).astype('int32').ravel()
```

Then we can invert the transform on the prediction so that we can use or interpret the result as the original class label (which is just an integer for this dataset).

```python
...
# invert transform to get label for class
yhat = le.inverse_transform(yhat)
```

And in this case, we will simply report the prediction.

```python
...
# report prediction
print('Predicted: %s' % (yhat[0]))
```

Tying this all together, the complete example of fitting a final model for the mammography dataset and using it to make a prediction on new data is listed below.

```python
# fit a final model and make predictions on new data for the mammography dataset
from numpy import asarray
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
le = LabelEncoder()
y = le.fit_transform(y)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model
model.fit(X, y, epochs=300, batch_size=32, verbose=0)
# define a row of new data
row = [0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223]
# make a crisp class prediction (predict_classes() was removed in recent versions of Keras)
yhat = (model.predict(asarray([row])) > 0.5).astype('int32').ravel()
# invert transform to get label for class
yhat = le.inverse_transform(yhat)
# report prediction
print('Predicted: %s' % (yhat[0]))
```

Running the example fits the model on the entire dataset and makes a prediction for a single row of new data.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model predicted a “-1” label for the input row.

Predicted: '-1'

This section provides more resources on the topic if you are looking to go deeper.

- Imbalanced Classification Model to Detect Mammography Microcalcifications
- Best Results for Standard Machine Learning Datasets
- TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras
- A Gentle Introduction to k-fold Cross-Validation

In this tutorial, you discovered how to develop a Multilayer Perceptron neural network model for the Wood’s Mammography classification dataset.

Specifically, you learned:

- How to load and summarize the Wood’s Mammography dataset and use the results to suggest data preparations and model configurations to use.
- How to explore the learning dynamics of simple MLP models on the dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Develop a Neural Network for Woods Mammography Dataset appeared first on Machine Learning Mastery.

The post Tune XGBoost Performance With Learning Curves appeared first on Machine Learning Mastery.

XGBoost is a powerful and effective implementation of the gradient boosting ensemble algorithm.

It can be challenging to configure the hyperparameters of XGBoost models, which often leads to using large grid search experiments that are both time consuming and computationally expensive.

An alternate approach to configuring **XGBoost** models is to evaluate the performance of the model each iteration of the algorithm during training and to plot the results as **learning curves**. These learning curve plots provide a diagnostic tool that can be interpreted and suggest specific changes to model hyperparameters that may lead to improvements in predictive performance.

In this tutorial, you will discover how to plot and interpret learning curves for XGBoost models in Python.

After completing this tutorial, you will know:

- Learning curves provide a useful diagnostic tool for understanding the training dynamics of supervised learning models like XGBoost.
- How to configure XGBoost to evaluate datasets each iteration and plot the results as learning curves.
- How to interpret and use learning curve plots to improve XGBoost model performance.

Let’s get started.

This tutorial is divided into four parts; they are:

- Extreme Gradient Boosting
- Learning Curves
- Plot XGBoost Learning Curve
- Tune XGBoost Model Using Learning Curves

**Gradient boosting** refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “gradient boosting,” as the loss gradient is minimized as the model is fit, much like a neural network.
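The boosting idea described above can be sketched in a few lines. This is not the XGBoost algorithm itself, just a minimal illustration of boosting for regression under squared-error loss, where the negative gradient of the loss is simply the residual; the dataset and hyperparameter values are made up for the example:

```python
# minimal sketch of gradient boosting for regression with squared-error loss
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# a toy regression problem: learn y = sin(x)
rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

learning_rate = 0.5
pred = np.zeros_like(y)  # start from a zero model
trees = []
for _ in range(20):
    residual = y - pred  # negative gradient of squared error
    # fit a small tree to the residual, i.e. to the errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    # add the tree's (shrunken) prediction to the ensemble
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

# training error shrinks as trees are added
print(np.mean((y - pred) ** 2))
```

Each tree corrects the mistakes of the ensemble built so far, and the learning rate limits each tree's contribution, exactly the mechanism tuned later in this tutorial.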

For more on gradient boosting, see the tutorial:

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

The two main reasons to use XGBoost are execution speed and model performance.

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. […] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.

— XGBoost: A Scalable Tree Boosting System, 2016.

For more on XGBoost and how to install and use the XGBoost Python API, see the tutorial:

Now that we are familiar with what XGBoost is and why it is important, let’s take a closer look at learning curves.

Generally, a learning curve is a plot that shows time or experience on the x-axis and learning or improvement on the y-axis.

Learning curves are widely used in machine learning for algorithms that learn (optimize their internal parameters) incrementally over time, such as deep learning neural networks.

The metric used to evaluate learning could be maximizing, meaning that better scores (larger numbers) indicate more learning. An example would be classification accuracy.

It is more common to use a score that is minimizing, such as loss or error whereby better scores (smaller numbers) indicate more learning and a value of 0.0 indicates that the training dataset was learned perfectly and no mistakes were made.
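As a concrete example of a minimizing metric, log loss can be computed with the scikit-learn `log_loss()` function; the predicted probability values below are made up for illustration, with more confident and correct predictions scoring closer to 0.0:

```python
# compare log loss for increasingly poor predicted probabilities
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
near_perfect = [0.01, 0.99, 0.99, 0.01]  # confident and correct
good = [0.2, 0.8, 0.7, 0.3]              # correct but less confident
poor = [0.6, 0.4, 0.5, 0.5]              # wrong or uninformative

print(log_loss(y_true, near_perfect))  # closest to 0.0
print(log_loss(y_true, good))
print(log_loss(y_true, poor))
```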

During the training of a machine learning model, the current state of the model at each step of the training algorithm can be evaluated. It can be evaluated on the training dataset to give an idea of how well the model is “*learning*.” It can also be evaluated on a hold-out validation dataset that is not part of the training dataset. Evaluation on the validation dataset gives an idea of how well the model is “*generalizing*.”

It is common to create dual learning curves for a machine learning model during training on both the training and validation datasets.

The shape and dynamics of a learning curve can be used to diagnose the behavior of a machine learning model, and in turn, perhaps suggest the type of configuration changes that may be made to improve learning and/or performance.

There are three common dynamics that you are likely to observe in learning curves; they are:

- Underfit.
- Overfit.
- Good Fit.

Most commonly, learning curves are used to diagnose overfitting behavior of a model that can be addressed by tuning the hyperparameters of the model.

Overfitting refers to a model that has learned the training dataset too well, including the statistical noise or random fluctuations in the training dataset.

The problem with overfitting is that the more specialized the model becomes to training data, the less well it is able to generalize to new data, resulting in an increase in generalization error. This increase in generalization error can be measured by the performance of the model on the validation dataset.

For more on learning curves, see the tutorial:

Now that we are familiar with learning curves, let’s look at how we might plot learning curves for XGBoost models.

In this section, we will plot the learning curve for an XGBoost model.

First, we need a dataset to use as the basis for fitting and evaluating the model.

We will use a synthetic binary (two-class) classification dataset in this tutorial.

The make_classification() scikit-learn function can be used to create a synthetic classification dataset. In this case, we will use 50 input features (columns) and generate 10,000 samples (rows). The seed for the pseudo-random number generator is fixed to ensure the same base “*problem*” is used each time samples are generated.

The example below generates the synthetic classification dataset and summarizes the shape of the generated data.

```python
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
```

Running the example generates the data and reports the size of the input and output components, confirming the expected shape.

(10000, 50) (10000,)

Next, we can fit an XGBoost model on this dataset and plot learning curves.

First, we must split the dataset into one portion that will be used to train the model (train) and another portion that will not be used to train the model, but will be held back and used to evaluate the model each step of the training algorithm (test set or validation set).

```python
...
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
```

We can then define an XGBoost classification model with default hyperparameters.

```python
...
# define the model (in recent XGBoost versions, the metric to evaluate
# each iteration is set here rather than on fit())
model = XGBClassifier(eval_metric='logloss')
```

Next, the model can be fit on the dataset.

In this case, we must specify to the training algorithm that we want it to evaluate the performance of the model on the train and test sets each iteration (e.g. after each new tree is added to the ensemble).

To do this we must specify the datasets to evaluate and the metric to evaluate.

The datasets must be specified as a list of tuples, where each tuple contains the input and output elements of a dataset and each element in the list is a different dataset to evaluate, e.g. the train and the test sets.

```python
...
# define the datasets to evaluate each iteration
evalset = [(X_train, y_train), (X_test, y_test)]
```

There are many metrics we may want to evaluate, although given that it is a classification task, we will evaluate the log loss (cross-entropy) of the model which is a minimizing score (lower values are better).

The metric is specified via the “*eval_metric*” argument, providing it the name of the metric we will evaluate, ‘*logloss*‘; in recent versions of the XGBoost library this argument is set when constructing the model, while older versions accepted it on *fit()*. We can specify the datasets to evaluate via the “*eval_set*” argument when calling *fit()*. The *fit()* function takes the training dataset as the first two arguments as per normal.

```python
...
# fit the model, evaluating both datasets each iteration
model.fit(X_train, y_train, eval_set=evalset)
```

Once the model is fit, we can evaluate its performance as the classification accuracy on the test dataset.

```python
...
# evaluate performance
yhat = model.predict(X_test)
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
```

We can then retrieve the metrics calculated for each dataset via a call to the *evals_result()* function.

```python
...
# retrieve performance metrics
results = model.evals_result()
```

This returns a dictionary organized first by dataset (‘*validation_0*‘ and ‘*validation_1*‘) and then by metric (‘*logloss*‘).
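The structure of that dictionary looks like the following sketch; the loss values here are illustrative only, and in practice each list holds one value per boosting iteration:

```python
# illustrative shape of the dictionary returned by evals_result()
results = {
    'validation_0': {'logloss': [0.55, 0.41, 0.33]},  # first eval_set entry (train)
    'validation_1': {'logloss': [0.58, 0.47, 0.42]},  # second eval_set entry (test)
}
# each list is a learning curve, ready to plot as a line
train_curve = results['validation_0']['logloss']
test_curve = results['validation_1']['logloss']
print(len(train_curve), train_curve[-1])
```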

We can create line plots of metrics for each dataset.

```python
...
# plot learning curves
pyplot.plot(results['validation_0']['logloss'], label='train')
pyplot.plot(results['validation_1']['logloss'], label='test')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```

And that’s it.

Tying all of this together, the complete example of fitting an XGBoost model on the synthetic classification task and plotting learning curves is listed below.

```python
# plot learning curve of an xgboost model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1)
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
# define the model and the metric to evaluate each iteration
model = XGBClassifier(eval_metric='logloss')
# define the datasets to evaluate each iteration
evalset = [(X_train, y_train), (X_test, y_test)]
# fit the model
model.fit(X_train, y_train, eval_set=evalset)
# evaluate performance
yhat = model.predict(X_test)
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# retrieve performance metrics
results = model.evals_result()
# plot learning curves
pyplot.plot(results['validation_0']['logloss'], label='train')
pyplot.plot(results['validation_1']['logloss'], label='test')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```

Running the example fits the XGBoost model, retrieves the calculated metrics, and plots learning curves.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

First, the model performance is reported, showing that the model achieved a classification accuracy of about 94.5% on the hold-out test set.

Accuracy: 0.945

The plot shows learning curves for the train and test dataset where the x-axis is the number of iterations of the algorithm (or the number of trees added to the ensemble) and the y-axis is the logloss of the model. Each line shows the logloss per iteration for a given dataset.

From the learning curves, we can see that the performance of the model on the training dataset (blue line) is better or has lower loss than the performance of the model on the test dataset (orange line), as we might generally expect.

Now that we know how to plot learning curves for XGBoost models, let’s look at how we might use the curves to improve model performance.

We can use the learning curves as a diagnostic tool.

The curves can be interpreted and used as the basis for suggesting specific changes to the model configuration that might result in better performance.

The model and result in the previous section can be used as a baseline and starting point.

Looking at the plot, we can see that both curves are sloping down and suggest that more iterations (adding more trees) may result in a further decrease in loss.

Let’s try it out.

We can increase the number of iterations of the algorithm via the “*n_estimators*” hyperparameter that defaults to 100. Let’s increase it to 500.

```python
...
# define the model
model = XGBClassifier(n_estimators=500, eval_metric='logloss')
```

The complete example is listed below.

```python
# plot learning curve of an xgboost model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1)
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
# define the model and the metric to evaluate each iteration
model = XGBClassifier(n_estimators=500, eval_metric='logloss')
# define the datasets to evaluate each iteration
evalset = [(X_train, y_train), (X_test, y_test)]
# fit the model
model.fit(X_train, y_train, eval_set=evalset)
# evaluate performance
yhat = model.predict(X_test)
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# retrieve performance metrics
results = model.evals_result()
# plot learning curves
pyplot.plot(results['validation_0']['logloss'], label='train')
pyplot.plot(results['validation_1']['logloss'], label='test')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```

Running the example fits and evaluates the model and plots the learning curves of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that more iterations have resulted in a lift in accuracy from about 94.5% to about 95.8%.

Accuracy: 0.958

We can see from the learning curves that indeed the additional iterations of the algorithm caused the curves to continue to drop and then level out after perhaps 150 iterations, where they remain reasonably flat.

The long flat curves may suggest that the algorithm is learning too fast and we may benefit from slowing it down.

This can be achieved using the learning rate, which limits the contribution of each tree added to the ensemble. This can be controlled via the “*eta*” hyperparameter and defaults to the value of 0.3. We can try a smaller value, such as 0.05.

```python
...
# define the model
model = XGBClassifier(n_estimators=500, eta=0.05, eval_metric='logloss')
```

The complete example is listed below.

```python
# plot learning curve of an xgboost model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1)
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
# define the model and the metric to evaluate each iteration
model = XGBClassifier(n_estimators=500, eta=0.05, eval_metric='logloss')
# define the datasets to evaluate each iteration
evalset = [(X_train, y_train), (X_test, y_test)]
# fit the model
model.fit(X_train, y_train, eval_set=evalset)
# evaluate performance
yhat = model.predict(X_test)
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# retrieve performance metrics
results = model.evals_result()
# plot learning curves
pyplot.plot(results['validation_0']['logloss'], label='train')
pyplot.plot(results['validation_1']['logloss'], label='test')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```

Running the example fits and evaluates the model and plots the learning curves of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the smaller learning rate has made the accuracy worse, dropping from about 95.8% to about 95.1%.

Accuracy: 0.951

We can see from the learning curves that indeed learning has slowed right down. The curves suggest that we can continue to add more iterations and perhaps achieve better performance as the curves would have more opportunity to continue to decrease.

Let’s try increasing the number of iterations from 500 to 2,000.

```python
...
# define the model
model = XGBClassifier(n_estimators=2000, eta=0.05, eval_metric='logloss')
```

The complete example is listed below.

```python
# plot learning curve of an xgboost model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1)
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
# define the model and the metric to evaluate each iteration
model = XGBClassifier(n_estimators=2000, eta=0.05, eval_metric='logloss')
# define the datasets to evaluate each iteration
evalset = [(X_train, y_train), (X_test, y_test)]
# fit the model
model.fit(X_train, y_train, eval_set=evalset)
# evaluate performance
yhat = model.predict(X_test)
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# retrieve performance metrics
results = model.evals_result()
# plot learning curves
pyplot.plot(results['validation_0']['logloss'], label='train')
pyplot.plot(results['validation_1']['logloss'], label='test')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```

Running the example fits and evaluates the model and plots the learning curves of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that more iterations have given the algorithm more space to improve, achieving an accuracy of 96.1%, the best so far.

Accuracy: 0.961

The learning curves again show a stable convergence of the algorithm with a steep decrease and long flattening out.

We could repeat the process of decreasing the learning rate and increasing the number of iterations to see if further improvements are possible.

Another approach to slowing down learning is to add regularization in the form of reducing the number of samples and features (rows and columns) used to construct each tree in the ensemble.

In this case, we will try halving the number of samples and features respectively via the “*subsample*” and “*colsample_bytree*” hyperparameters.

```python
...
# define the model
model = XGBClassifier(n_estimators=2000, eta=0.05, subsample=0.5, colsample_bytree=0.5, eval_metric='logloss')
```

The complete example is listed below.

```python
# plot learning curve of an xgboost model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1)
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
# define the model and the metric to evaluate each iteration
model = XGBClassifier(n_estimators=2000, eta=0.05, subsample=0.5, colsample_bytree=0.5, eval_metric='logloss')
# define the datasets to evaluate each iteration
evalset = [(X_train, y_train), (X_test, y_test)]
# fit the model
model.fit(X_train, y_train, eval_set=evalset)
# evaluate performance
yhat = model.predict(X_test)
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# retrieve performance metrics
results = model.evals_result()
# plot learning curves
pyplot.plot(results['validation_0']['logloss'], label='train')
pyplot.plot(results['validation_1']['logloss'], label='test')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```

Running the example fits and evaluates the model and plots the learning curves of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the addition of regularization has resulted in a further improvement, bumping accuracy from about 96.1% to about 96.6%.

Accuracy: 0.966

The curves suggest that regularization has slowed learning and that perhaps increasing the number of iterations may result in further improvements.

This process can continue, and I am interested to see what you can come up with.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- Extreme Gradient Boosting (XGBoost) Ensemble in Python
- How to use Learning Curves to Diagnose Machine Learning Model Performance
- Avoid Overfitting By Early Stopping With XGBoost In Python

In this tutorial, you discovered how to plot and interpret learning curves for XGBoost models in Python.

Specifically, you learned:

- Learning curves provide a useful diagnostic tool for understanding the training dynamics of supervised learning models like XGBoost.
- How to configure XGBoost to evaluate datasets each iteration and plot the results as learning curves.
- How to interpret and use learning curve plots to improve XGBoost model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Tune XGBoost Performance With Learning Curves appeared first on Machine Learning Mastery.

The post Two-Dimensional (2D) Test Functions for Function Optimization appeared first on Machine Learning Mastery.

Function optimization is a field of study that seeks an input to a function that results in the maximum or minimum output of the function.

There are a large number of optimization algorithms and it is important to study and develop intuitions for optimization algorithms on simple and easy-to-visualize test functions.

**Two-dimensional functions** take two input values (x and y) and output a single evaluation of the input. They are among the simplest types of test functions to use when studying function optimization. The benefit of two-dimensional functions is that they can be visualized as a contour plot or surface plot that shows the topography of the problem domain with the optima and samples of the domain marked with points.

In this tutorial, you will discover standard two-dimensional functions you can use when studying function optimization.

Let’s get started.

A two-dimensional function is a function that takes two input variables and computes the objective value.

We can think of the two input variables as two axes on a graph, x and y. Each input to the function is a single point on the graph and the outcome of the function can be taken as the height on the graph.

This allows the function to be conceptualized as a surface and we can characterize the function based on the structure of the surface. For example, hills for input points that result in large relative outcomes of the objective function and valleys for input points that result in small relative outcomes of the objective function.

A surface may have one major feature or global optimum, or it may have many, with lots of places for an optimization algorithm to get stuck. The surface may be smooth, noisy, convex, and all manner of other properties that we may care about when testing optimization algorithms.
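As a simple example of this surface view, we can evaluate a candidate objective function on a grid of input points and locate the lowest point (the valley floor) directly; the bowl-shaped function x^2 + y^2 is used here as a stand-in objective:

```python
# locate the minimum of a two-dimensional function by evaluating a grid of points
from numpy import arange, argmin, meshgrid, unravel_index

# objective function (a simple bowl with its minimum at the origin)
def objective(x, y):
    return x**2.0 + y**2.0

# evaluate the function across a grid of input points
x, y = meshgrid(arange(-5.0, 5.0, 0.1), arange(-5.0, 5.0, 0.1))
z = objective(x, y)
# find the grid point with the smallest output
ix = unravel_index(argmin(z), z.shape)
print('f(%.1f, %.1f) = %.3f' % (x[ix], y[ix], z[ix]))
```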

There are many different types of simple two-dimensional test functions we could use.

Nevertheless, there are standard test functions that are commonly used in the field of function optimization. There are also specific properties of test functions that we may wish to select when testing different algorithms.

We will explore a small number of simple two-dimensional test functions in this tutorial and organize them by their properties into two groups; they are:

- Unimodal Functions
    - Unimodal Function 1
    - Unimodal Function 2
    - Unimodal Function 3
- Multimodal Functions
    - Multimodal Function 1
    - Multimodal Function 2
    - Multimodal Function 3

Each function will be presented using Python code with a function implementation of the target objective function and a sampling of the function that is shown as a surface plot.

All functions are presented as a minimization function, e.g. find the input that results in the minimum (smallest value) output of the function. Any maximizing function can be made a minimization function by adding a negative sign to all output. Similarly, any minimizing function can be made maximizing in the same way.
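For example, a maximizing function can be converted into an equivalent minimization problem by negating its output; a small sketch with a made-up one-dimensional function:

```python
# convert a maximizing objective into a minimizing one by negating the output
def maximize_me(x):
    return -(x - 2.0)**2 + 5.0  # maximum value of 5.0 at x=2.0

def minimize_me(x):
    return -maximize_me(x)  # minimum value of -5.0 at the same x=2.0

# the best input is unchanged, only the sign of the objective value flips
candidates = [i * 0.1 for i in range(0, 50)]
best = min(candidates, key=minimize_me)
print(best, maximize_me(best))
```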

I did not invent these functions; they are taken from the literature. See the further reading section for references.

You can then choose and copy-paste the code for one or more functions to use in your own project to study or compare the behavior of optimization algorithms.

Unimodal means that the function has a single global optimum.

A unimodal function may or may not be convex. A convex function is a function where a line segment can be drawn between any two points on the function and the segment remains on or above the function. For a two-dimensional function shown as a contour or surface plot, this means the function has a bowl shape and the line segment between any two points remains above or within the bowl.
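We can spot-check this convexity condition numerically: for a convex function f and any inputs a and b, f evaluated at a point on the line between a and b is never above the corresponding point on the straight line between f(a) and f(b). A small sketch using the bowl-shaped function x^2 + y^2:

```python
# numeric spot-check of convexity: f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b)
import random

def f(x, y):
    return x**2.0 + y**2.0

random.seed(1)
for _ in range(1000):
    ax, ay = random.uniform(-5, 5), random.uniform(-5, 5)
    bx, by = random.uniform(-5, 5), random.uniform(-5, 5)
    t = random.random()
    # a point on the line between inputs a and b
    px, py = t * ax + (1 - t) * bx, t * ay + (1 - t) * by
    # the function value never exceeds the chord between the two outputs
    assert f(px, py) <= t * f(ax, ay) + (1 - t) * f(bx, by) + 1e-9
print('convexity condition held on all sampled pairs')
```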

Let’s look at a few examples of unimodal functions.

The range is bounded to -5.0 and 5.0 and there is one global optimum at [0.0, 0.0].

```python
# unimodal test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
# (figure.gca(projection='3d') was removed in recent versions of Matplotlib)
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a surface plot of the function.

The range is bounded to -10.0 and 10.0 and there is one global optimum at [0.0, 0.0]. This function is known as Matyas’ function.

```python
# unimodal test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return 0.26 * (x**2 + y**2) - 0.48 * x * y

# define range for input
r_min, r_max = -10.0, 10.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
# (figure.gca(projection='3d') was removed in recent versions of Matplotlib)
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a surface plot of the function.

The range is bounded to -10.0 and 10.0 and there is one global optimum at [pi, pi]. This function is known as Easom’s function.

```python
# unimodal test function
from numpy import cos
from numpy import exp
from numpy import pi
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return -cos(x) * cos(y) * exp(-((x - pi)**2 + (y - pi)**2))

# define range for input
r_min, r_max = -10, 10
# sample input range uniformly at 0.01 increments
xaxis = arange(r_min, r_max, 0.01)
yaxis = arange(r_min, r_max, 0.01)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
# (figure.gca(projection='3d') was removed in recent versions of Matplotlib)
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a surface plot of the function.

A multi-modal function means a function with more than one “*mode*” or optimum (e.g. valley).

Multimodal functions are non-convex.

There may be one global optimum and one or more local or deceptive optima. Alternately, there may be multiple global optima, i.e. multiple different inputs that result in the same minimal output of the function.

Let’s look at a few examples of multimodal functions.

The range is bounded to -5.0 and 5.0 and there is one global optimum at [0.0, 0.0]. This function is known as Ackley’s function.

# multimodal test function
from numpy import arange
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a surface plot of the function.

The range is bounded to -5.0 and 5.0 and the function has four global optima at [3.0, 2.0], [-2.805118, 3.131312], [-3.779310, -3.283186], and [3.584428, -1.848126]. This function is known as Himmelblau’s function.

# multimodal test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a surface plot of the function.

The range is bounded to -10.0 and 10.0 and the function has four global optima at [8.05502, 9.66459], [-8.05502, 9.66459], [8.05502, -9.66459], and [-8.05502, -9.66459]. This function is known as the Holder table function.

# multimodal test function
from numpy import arange
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import sin
from numpy import pi
from numpy import absolute
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return -absolute(sin(x) * cos(y) * exp(absolute(1 - (sqrt(x**2 + y**2) / pi))))

# define range for input
r_min, r_max = -10.0, 10.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a surface plot of the function.

This section provides more resources on the topic if you are looking to go deeper.

- Test functions for optimization, Wikipedia.
- Virtual Library of Simulation Experiments: Test Functions and Datasets
- Test Functions Index
- GEA Toolbox – Examples of Objective Functions

In this tutorial, you discovered standard two-dimensional functions you can use when studying function optimization.

**Are you using any of the above functions?**

Let me know which one in the comments below.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Two-Dimensional (2D) Test Functions for Function Optimization appeared first on Machine Learning Mastery.

The post How to Manually Optimize Machine Learning Model Hyperparameters appeared first on Machine Learning Mastery.

Last Updated on March 29, 2021

Machine learning algorithms have hyperparameters that allow the algorithms to be tailored to specific datasets.

Although the impact of hyperparameters may be understood generally, their specific effect on a dataset and their interactions during learning may not be known. Therefore, it is important to tune the values of algorithm hyperparameters as part of a machine learning project.

It is common to use naive optimization algorithms to tune hyperparameters, such as a grid search and a random search. An alternate approach is to use a stochastic optimization algorithm, like a stochastic hill climbing algorithm.

In this tutorial, you will discover how to manually optimize the hyperparameters of machine learning algorithms.

After completing this tutorial, you will know:

- Stochastic optimization algorithms can be used instead of grid and random search for hyperparameter optimization.
- How to use a stochastic hill climbing algorithm to tune the hyperparameters of the Perceptron algorithm.
- How to manually optimize the hyperparameters of the XGBoost gradient boosting algorithm.

Let’s get started.

This tutorial is divided into three parts; they are:

- Manual Hyperparameter Optimization
- Perceptron Hyperparameter Optimization
- XGBoost Hyperparameter Optimization

Machine learning models have hyperparameters that you must set in order to customize the model to your dataset.

Often, the general effects of hyperparameters on a model are known, but how to best set a hyperparameter and combinations of interacting hyperparameters for a given dataset is challenging.

A better approach is to objectively search different values for model hyperparameters and choose a subset that results in a model that achieves the best performance on a given dataset. This is called hyperparameter optimization, or hyperparameter tuning.

A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search.

- **Random Search**. Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.
- **Grid Search**. Define a search space as a grid of hyperparameter values and evaluate every position in the grid.

Grid search is great for spot-checking combinations that are known to perform well generally. Random search is great for discovery and getting hyperparameter combinations that you would not have guessed intuitively, although it often requires more time to execute.
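The difference between the two strategies can be sketched in a few lines of plain Python. Here, `score()` is a hypothetical stand-in for evaluating a model with a given configuration, not a real library function:

```python
# sketch: grid search vs. random search over a two-hyperparameter space
from itertools import product
from random import seed, uniform

def score(lr, alpha):
    # hypothetical objective: best near lr=0.1, alpha=0.01
    return -((lr - 0.1)**2 + (alpha - 0.01)**2)

# grid search: evaluate every combination of predefined values
lrs = [0.001, 0.01, 0.1, 1.0]
alphas = [0.0, 0.01, 0.1]
best_grid = max(product(lrs, alphas), key=lambda cfg: score(*cfg))
print('grid best:', best_grid)

# random search: sample points uniformly from bounded ranges
seed(1)
candidates = [(uniform(0.0, 1.0), uniform(0.0, 0.1)) for _ in range(50)]
best_rand = max(candidates, key=lambda cfg: score(*cfg))
print('random best:', best_rand)
```

Grid search can only ever return a point on the predefined grid, whereas random search can land on combinations between the grid points, which is why it is often better at discovery.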

For more on grid and random search for hyperparameter tuning, see the tutorial:

Grid and random search are primitive optimization algorithms, and it is possible to use any optimization we like to tune the performance of a machine learning algorithm. For example, it is possible to use stochastic optimization algorithms. This might be desirable when good or great performance is required and there are sufficient resources available to tune the model.

Next, let’s look at how we might use a stochastic hill climbing algorithm to tune the performance of the Perceptron algorithm.

The Perceptron algorithm is the simplest type of artificial neural network.

It is a model of a single neuron that can be used for two-class classification problems and provides the foundation for later developing much larger networks.
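The single-neuron prediction rule can be sketched directly: a weighted sum of the inputs plus a bias, passed through a step function to produce a class label. The weights and bias below are illustrative hand-picked values, not learned ones:

```python
# sketch of the perceptron prediction rule (weights are illustrative)
def predict(row, weights, bias):
    # weighted sum of inputs plus bias, passed through a step function
    activation = bias + sum(w * x for w, x in zip(weights, row))
    return 1 if activation >= 0.0 else 0

# example: two inputs, hand-picked weights
print(predict([2.0, 1.0], weights=[0.5, -1.0], bias=0.1))  # activation 0.1 -> 1
print(predict([0.0, 1.0], weights=[0.5, -1.0], bias=0.1))  # activation -0.9 -> 0
```

Training the Perceptron amounts to adjusting the weights and bias based on prediction errors, which is what the scikit-learn implementation used below does for us.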

In this section, we will explore how to manually optimize the hyperparameters of the Perceptron model.

First, let’s define a synthetic binary classification problem that we can use as the focus of optimizing the model.

We can use the make_classification() function to define a binary classification problem with 1,000 rows and five input variables.

The example below creates the dataset and summarizes the shape of the data.

# define a binary classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example prints the shape of the created dataset, confirming our expectations.

(1000, 5) (1000,)

The scikit-learn library provides an implementation of the Perceptron model via the Perceptron class.

Before we tune the hyperparameters of the model, we can establish a baseline in performance using the default hyperparameters.

We will evaluate the model using good practices of repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class.

The complete example of evaluating the Perceptron model with default hyperparameters on our synthetic binary classification dataset is listed below.

# perceptron default hyperparameters for binary classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define model
model = Perceptron()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model and reports the mean and standard deviation of the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model with default hyperparameters achieved a classification accuracy of about 78.5 percent.

We would hope that we can achieve better performance than this with optimized hyperparameters.

Mean Accuracy: 0.786 (0.069)

Next, we can optimize the hyperparameters of the Perceptron model using a stochastic hill climbing algorithm.

There are many hyperparameters that we could optimize, although we will focus on two that perhaps have the most impact on the learning behavior of the model; they are:

- Learning Rate (*eta0*).
- Regularization (*alpha*).

The learning rate controls the amount the model is updated based on prediction errors and controls the speed of learning. The default value of *eta0* is 1.0. Reasonable values are larger than zero (e.g. larger than 1e-8 or 1e-10) and probably less than 1.0.

By default, the Perceptron does not use any regularization, but we will enable “*elastic net*” regularization which applies both L1 and L2 regularization during learning. This will encourage the model to seek small model weights and, in turn, often better performance.

We will tune the “*alpha*” hyperparameter that controls the weighting of the regularization, e.g. the amount it impacts the learning. If set to 0.0, it is as though no regularization is being used. Reasonable values are between 0.0 and 1.0.

First, we need to define the objective function for the optimization algorithm. We will evaluate a configuration using mean classification accuracy with repeated stratified k-fold cross-validation, and we will seek the configuration that maximizes this accuracy.

The *objective()* function below implements this, taking the dataset and a list of config values. The config values (learning rate and regularization weighting) are unpacked, used to configure the model, which is then evaluated, and the mean accuracy is returned.

# objective function
def objective(X, y, cfg):
    # unpack config
    eta, alpha = cfg
    # define model
    model = Perceptron(penalty='elasticnet', alpha=alpha, eta0=eta)
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # calculate mean accuracy
    result = mean(scores)
    return result

Next, we need a function to take a step in the search space.

The search space is defined by two variables (*eta* and *alpha*). A step in the search space must have some relationship to the previous values and must be bound to sensible values (e.g. between 0 and 1).

We will use a “*step size*” hyperparameter that controls how far the algorithm is allowed to move from the existing configuration. A new configuration will be chosen probabilistically using a Gaussian distribution with the current value as the mean of the distribution and the step size as the standard deviation of the distribution.

We can use the randn() NumPy function to generate random numbers with a Gaussian distribution.

The *step()* function below implements this and will take a step in the search space and generate a new configuration using an existing configuration.

# take a step in the search space
def step(cfg, step_size):
    # unpack the configuration
    eta, alpha = cfg
    # step eta
    new_eta = eta + randn() * step_size
    # check the bounds of eta
    if new_eta <= 0.0:
        new_eta = 1e-8
    # step alpha
    new_alpha = alpha + randn() * step_size
    # check the bounds of alpha
    if new_alpha < 0.0:
        new_alpha = 0.0
    # return the new configuration
    return [new_eta, new_alpha]

Next, we need to implement the stochastic hill climbing algorithm that will call our *objective()* function to evaluate candidate solutions and our *step()* function to take a step in the search space.

The search first generates a random initial solution, in this case with eta and alpha values in the range 0 to 1. The initial solution is then evaluated and taken as the current best working solution.

...
# starting point for the search
solution = [rand(), rand()]
# evaluate the initial point
solution_eval = objective(X, y, solution)

Next, the algorithm iterates for a fixed number of iterations provided as a hyperparameter to the search. Each iteration involves taking a step and evaluating the new candidate solution.

...
# take a step
candidate = step(solution, step_size)
# evaluate candidate point
candidate_eval = objective(X, y, candidate)

If the new solution is better than the current working solution, it is taken as the new current working solution.

...
# check if we should keep the new point
if candidate_eval >= solution_eval:
    # store the new point
    solution, solution_eval = candidate, candidate_eval
    # report progress
    print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))

At the end of the search, the best solution and its performance are then returned.

Tying this together, the *hillclimbing()* function below implements the stochastic hill climbing algorithm for tuning the Perceptron algorithm, taking the dataset, objective function, number of iterations, and step size as arguments.

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, step_size):
    # starting point for the search
    solution = [rand(), rand()]
    # evaluate the initial point
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = step(solution, step_size)
        # evaluate candidate point
        candidate_eval = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]

We can then call the algorithm and report the results of the search.

In this case, we will run the algorithm for 100 iterations and use a step size of 0.1, chosen after a little trial and error.

...
# define the total iterations
n_iter = 100
# step size in the search space
step_size = 0.1
# perform the hill climbing search
cfg, score = hillclimbing(X, y, objective, n_iter, step_size)
print('Done!')
print('cfg=%s: Mean Accuracy: %f' % (cfg, score))

Tying this together, the complete example of manually tuning the Perceptron algorithm is listed below.

# manually search perceptron hyperparameters for binary classification
from numpy import mean
from numpy.random import randn
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron

# objective function
def objective(X, y, cfg):
    # unpack config
    eta, alpha = cfg
    # define model
    model = Perceptron(penalty='elasticnet', alpha=alpha, eta0=eta)
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # calculate mean accuracy
    result = mean(scores)
    return result

# take a step in the search space
def step(cfg, step_size):
    # unpack the configuration
    eta, alpha = cfg
    # step eta
    new_eta = eta + randn() * step_size
    # check the bounds of eta
    if new_eta <= 0.0:
        new_eta = 1e-8
    # step alpha
    new_alpha = alpha + randn() * step_size
    # check the bounds of alpha
    if new_alpha < 0.0:
        new_alpha = 0.0
    # return the new configuration
    return [new_eta, new_alpha]

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, step_size):
    # starting point for the search
    solution = [rand(), rand()]
    # evaluate the initial point
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = step(solution, step_size)
        # evaluate candidate point
        candidate_eval = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define the total iterations
n_iter = 100
# step size in the search space
step_size = 0.1
# perform the hill climbing search
cfg, score = hillclimbing(X, y, objective, n_iter, step_size)
print('Done!')
print('cfg=%s: Mean Accuracy: %f' % (cfg, score))

Running the example reports the configuration and result each time an improvement is seen during the search. At the end of the run, the best configuration and result are reported.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best result involved using a learning rate slightly above 1 at about 1.004 and a regularization weight of about 0.002, achieving a mean accuracy of about 79.1 percent, better than the default configuration that achieved an accuracy of about 78.5 percent.

**Can you get a better result?**

Let me know in the comments below.

>0, cfg=[0.5827274503894747, 0.260872709578015] 0.70533
>4, cfg=[0.5449820307807399, 0.3017271170801444] 0.70567
>6, cfg=[0.6286475606495414, 0.17499090243915086] 0.71933
>7, cfg=[0.5956196828965779, 0.0] 0.78633
>8, cfg=[0.5878361167354715, 0.0] 0.78633
>10, cfg=[0.6353507984485595, 0.0] 0.78633
>13, cfg=[0.5690530537610675, 0.0] 0.78633
>17, cfg=[0.6650936023999641, 0.0] 0.78633
>22, cfg=[0.9070451625704087, 0.0] 0.78633
>23, cfg=[0.9253366187387938, 0.0] 0.78633
>26, cfg=[0.9966143540220266, 0.0] 0.78633
>31, cfg=[1.0048613895650054, 0.002162219228449132] 0.79133
Done!
cfg=[1.0048613895650054, 0.002162219228449132]: Mean Accuracy: 0.791333

Now that we are familiar with how to use a stochastic hill climbing algorithm to tune the hyperparameters of a simple machine learning algorithm, let’s look at tuning a more advanced algorithm, such as XGBoost.

XGBoost is short for Extreme Gradient Boosting and is an efficient implementation of the stochastic gradient boosting machine learning algorithm.

The stochastic gradient boosting algorithm, also called gradient boosting machines or tree boosting, is a powerful machine learning technique that performs well or even best on a wide range of challenging machine learning problems.

First, the XGBoost library must be installed.

You can install it using pip, as follows:

sudo pip install xgboost

Once installed, you can confirm that it was installed successfully and that you are using a modern version by running the following code:

# xgboost
import xgboost
print("xgboost", xgboost.__version__)

Running the code, you should see the following version number or higher.

xgboost 1.0.1

Although the XGBoost library has its own Python API, we can use XGBoost models with the scikit-learn API via the XGBClassifier wrapper class.

An instance of the model can be created and used just like any other scikit-learn class for model evaluation. For example:

...
# define model
model = XGBClassifier()

Before we tune the hyperparameters of XGBoost, we can establish a baseline in performance using the default hyperparameters.

We will use the same synthetic binary classification dataset from the previous section and the same test harness of repeated stratified k-fold cross-validation.

The complete example of evaluating the performance of XGBoost with default hyperparameters is listed below.

# xgboost with default hyperparameters for binary classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define model
model = XGBClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model and reports the mean and standard deviation of the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model with default hyperparameters achieved a classification accuracy of about 84.9 percent.

We would hope that we can achieve better performance than this with optimized hyperparameters.

Mean Accuracy: 0.849 (0.040)

Next, we can adapt the stochastic hill climbing optimization algorithm to tune the hyperparameters of the XGBoost model.

There are many hyperparameters that we may want to optimize for the XGBoost model.

For an overview of how to tune the XGBoost model, see the tutorial:

We will focus on four key hyperparameters; they are:

- Learning Rate (*learning_rate*)
- Number of Trees (*n_estimators*)
- Subsample Percentage (*subsample*)
- Tree Depth (*max_depth*)

The **learning rate** controls the contribution of each tree to the ensemble. Sensible values are less than 1.0 and slightly above 0.0 (e.g. 1e-8).

The **number of trees** controls the size of the ensemble, and often, more trees is better to a point of diminishing returns. Sensible values are between 1 tree and hundreds or thousands of trees.

The **subsample** percentage defines the random sample size used to train each tree, defined as a percentage of the size of the original dataset. Values are between a value slightly above 0.0 (e.g. 1e-8) and 1.0.

The **tree depth** is the number of levels in each tree. Deeper trees are more specific to the training dataset and perhaps overfit. Shorter trees often generalize better. Sensible values are between 1 and 10 or 20.

First, we must update the *objective()* function to unpack the hyperparameters of the XGBoost model, configure it, and then evaluate the mean classification accuracy.

# objective function
def objective(X, y, cfg):
    # unpack config
    lrate, n_tree, subsam, depth = cfg
    # define model
    model = XGBClassifier(learning_rate=lrate, n_estimators=n_tree, subsample=subsam, max_depth=depth)
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # calculate mean accuracy
    result = mean(scores)
    return result

Next, we need to define the *step()* function used to take a step in the search space.

Each hyperparameter has quite a different range; therefore, we will define the step size (standard deviation of the distribution) separately for each hyperparameter. We will also define the step sizes inline rather than as arguments to the function, to keep things simple.

The number of trees and the depth are integers, so the stepped values are rounded.

The step sizes are arbitrary, chosen after a little trial and error.

The updated step function is listed below.

# take a step in the search space
def step(cfg):
    # unpack config
    lrate, n_tree, subsam, depth = cfg
    # learning rate
    lrate = lrate + randn() * 0.01
    if lrate <= 0.0:
        lrate = 1e-8
    if lrate > 1:
        lrate = 1.0
    # number of trees
    n_tree = round(n_tree + randn() * 50)
    if n_tree <= 0.0:
        n_tree = 1
    # subsample percentage
    subsam = subsam + randn() * 0.1
    if subsam <= 0.0:
        subsam = 1e-8
    if subsam > 1:
        subsam = 1.0
    # max tree depth
    depth = round(depth + randn() * 7)
    if depth <= 1:
        depth = 1
    # return new config
    return [lrate, n_tree, subsam, depth]

Finally, the *hillclimbing()* algorithm must be updated to define an initial solution with appropriate values.

In this case, we will define the initial solution with sensible defaults, matching the default hyperparameters, or close to them.

...
# starting point for the search
solution = step([0.1, 100, 1.0, 7])

Tying this together, the complete example of manually tuning the hyperparameters of the XGBoost algorithm using a stochastic hill climbing algorithm is listed below.

# xgboost manual hyperparameter optimization for binary classification
from numpy import mean
from numpy.random import randn
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier

# objective function
def objective(X, y, cfg):
    # unpack config
    lrate, n_tree, subsam, depth = cfg
    # define model
    model = XGBClassifier(learning_rate=lrate, n_estimators=n_tree, subsample=subsam, max_depth=depth)
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # calculate mean accuracy
    result = mean(scores)
    return result

# take a step in the search space
def step(cfg):
    # unpack config
    lrate, n_tree, subsam, depth = cfg
    # learning rate
    lrate = lrate + randn() * 0.01
    if lrate <= 0.0:
        lrate = 1e-8
    if lrate > 1:
        lrate = 1.0
    # number of trees
    n_tree = round(n_tree + randn() * 50)
    if n_tree <= 0.0:
        n_tree = 1
    # subsample percentage
    subsam = subsam + randn() * 0.1
    if subsam <= 0.0:
        subsam = 1e-8
    if subsam > 1:
        subsam = 1.0
    # max tree depth
    depth = round(depth + randn() * 7)
    if depth <= 1:
        depth = 1
    # return new config
    return [lrate, n_tree, subsam, depth]

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter):
    # starting point for the search
    solution = step([0.1, 100, 1.0, 7])
    # evaluate the initial point
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = step(solution)
        # evaluate candidate point
        candidate_eval = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d, cfg=[%s] %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define the total iterations
n_iter = 200
# perform the hill climbing search
cfg, score = hillclimbing(X, y, objective, n_iter)
print('Done!')
print('cfg=[%s]: Mean Accuracy: %f' % (cfg, score))

Running the example reports the configuration and result each time an improvement is seen during the search. At the end of the run, the best configuration and result are reported.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best result involved using a learning rate of about 0.02, 52 trees, a subsample rate of about 50 percent, and a large depth of 53 levels.

This configuration resulted in a mean accuracy of about 87.3 percent, better than the default configuration that achieved an accuracy of about 84.9 percent.

**Can you get a better result?**

Let me know in the comments below.

>0, cfg=[[0.1058242692126418, 67, 0.9228490731610172, 12]] 0.85933
>1, cfg=[[0.11060813799692253, 51, 0.859353656735739, 13]] 0.86100
>4, cfg=[[0.11890247679234153, 58, 0.7135275461723894, 12]] 0.86167
>5, cfg=[[0.10226257987735601, 61, 0.6086462443373852, 17]] 0.86400
>15, cfg=[[0.11176962034280596, 106, 0.5592742266405146, 13]] 0.86500
>19, cfg=[[0.09493587069112454, 153, 0.5049124222437619, 34]] 0.86533
>23, cfg=[[0.08516531024154426, 88, 0.5895201311518876, 31]] 0.86733
>46, cfg=[[0.10092590898175327, 32, 0.5982811365027455, 30]] 0.86867
>75, cfg=[[0.099469211050998, 20, 0.36372573610040404, 32]] 0.86900
>96, cfg=[[0.09021536590375884, 38, 0.4725379807796971, 20]] 0.86900
>100, cfg=[[0.08979482274655906, 65, 0.3697395430835758, 14]] 0.87000
>110, cfg=[[0.06792737273465625, 89, 0.33827505722318224, 17]] 0.87000
>118, cfg=[[0.05544969684589669, 72, 0.2989721608535262, 23]] 0.87200
>122, cfg=[[0.050102976159097, 128, 0.2043203965148931, 24]] 0.87200
>123, cfg=[[0.031493266763680444, 120, 0.2998819062922256, 30]] 0.87333
>128, cfg=[[0.023324201169625292, 84, 0.4017169945431015, 42]] 0.87333
>140, cfg=[[0.020224220443108752, 52, 0.5088096815056933, 53]] 0.87367
Done!
cfg=[[0.020224220443108752, 52, 0.5088096815056933, 53]]: Mean Accuracy: 0.873667

This section provides more resources on the topic if you are looking to go deeper.

- Hyperparameter Optimization With Random Search and Grid Search
- How to Configure the Gradient Boosting Algorithm
- How To Implement The Perceptron Algorithm From Scratch In Python

- sklearn.datasets.make_classification APIs.
- sklearn.metrics.accuracy_score APIs.
- numpy.random.rand API.
- sklearn.linear_model.Perceptron API.

In this tutorial, you discovered how to manually optimize the hyperparameters of machine learning algorithms.

Specifically, you learned:

- Stochastic optimization algorithms can be used instead of grid and random search for hyperparameter optimization.
- How to use a stochastic hill climbing algorithm to tune the hyperparameters of the Perceptron algorithm.
- How to manually optimize the hyperparameters of the XGBoost gradient boosting algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


]]>The post A Gentle Introduction to XGBoost Loss Functions appeared first on Machine Learning Mastery.

]]>**XGBoost** is a powerful and popular implementation of the gradient boosting ensemble algorithm.

An important aspect in configuring XGBoost models is the choice of loss function that is minimized during the training of the model.

The **loss function** must be matched to the predictive modeling problem type, in the same way we must choose appropriate loss functions based on problem types with deep learning neural networks.

In this tutorial, you will discover how to configure loss functions for XGBoost ensemble models.

After completing this tutorial, you will know:

- Specifying loss functions used when training XGBoost ensembles is a critical step, much like neural networks.
- How to configure XGBoost loss functions for binary and multi-class classification tasks.
- How to configure XGBoost loss functions for regression predictive modeling tasks.

Let’s get started.

This tutorial is divided into three parts; they are:

- XGBoost and Loss Functions
- XGBoost Loss for Classification
- XGBoost Loss for Regression

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

XGBoost supports a range of different predictive modeling problems, most notably classification and regression.

XGBoost is trained by minimizing loss of an objective function against a dataset. As such, the choice of loss function is a critical hyperparameter and tied directly to the type of problem being solved, much like deep learning neural networks.

The implementation allows the objective function to be specified via the “*objective*” hyperparameter, and sensible defaults are used that work for most cases.

Nevertheless, there remains some confusion by beginners as to what loss function to use when training XGBoost models.

We will take a closer look at how to configure the loss function for XGBoost in this tutorial.

Before we get started, let’s get setup.

XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.

The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:

sudo pip install xgboost

You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.

# check xgboost version
import xgboost
print(xgboost.__version__)

Running the script will print the version of the XGBoost library you have installed.

Your version should be the same or higher. If not, you must upgrade your version of the XGBoost library.

1.1.1

It is possible that you may have problems with the latest version of the library. It is not your fault.

Sometimes, the most recent version of the library imposes additional requirements or may be less stable.

If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install to the pip command, as follows:

sudo pip install xgboost==1.0.1

If you see a warning message, you can safely ignore it for now. For example, below is an example of a warning message that you may see and can ignore:

FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.

If you require specific instructions for your development environment, see the tutorial:

The XGBoost library has its own custom API, although we will use it via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

Both models operate the same way and take the same arguments that influence how the decision trees are created and added to the ensemble.

For more on how to use the XGBoost API with scikit-learn, see the tutorial:

Next, let’s take a closer look at how to configure the loss function for XGBoost on classification problems.

Classification tasks involve predicting a label or probability for each possible class, given an input sample.

There are two main types of classification tasks with mutually exclusive labels: binary classification, which has two class labels, and multi-class classification, which has more than two class labels.

- **Binary Classification**: Classification task with two class labels.
- **Multi-Class Classification**: Classification task with more than two class labels.

For more on the different types of classification tasks, see the tutorial:

XGBoost provides loss functions for each of these problem types.

It is typical in machine learning to train a model to predict the probability of class membership and, if the task requires crisp class labels, to post-process the predicted probabilities (e.g. using argmax).

This approach is used when training deep learning neural networks for classification, and is also recommended when using XGBoost for classification.

The loss function used for predicting probabilities for binary classification problems is “*binary:logistic*” and the loss function for predicting class probabilities for multi-class problems is “*multi:softprob*“.

- “*binary:logistic*“: XGBoost loss function for binary classification.
- “*multi:softprob*“: XGBoost loss function for multi-class classification.

These string values can be specified via the “*objective*” hyperparameter when configuring your XGBClassifier model.

For example, for binary classification:

...
# define the model for binary classification
model = XGBClassifier(objective='binary:logistic')

And, for multi-class classification:

...
# define the model for multi-class classification
model = XGBClassifier(objective='multi:softprob')

Importantly, if you do not specify the “*objective*” hyperparameter, the *XGBClassifier* will automatically choose one of these loss functions based on the data provided during training.

We can make this concrete with a worked example.

The example below creates a synthetic binary classification dataset, fits an *XGBClassifier* on the dataset with default hyperparameters, then prints the model objective configuration.

# example of automatically choosing the loss function for binary classification
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define the model
model = XGBClassifier()
# fit the model
model.fit(X, y)
# summarize the model loss function
print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model automatically chose a loss function for binary classification.

binary:logistic

Alternately, we can specify the objective and fit the model, confirming the loss function was used.

# example of manually specifying the loss function for binary classification
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define the model
model = XGBClassifier(objective='binary:logistic')
# fit the model
model.fit(X, y)
# summarize the model loss function
print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model used the specified loss function for binary classification.

binary:logistic

Let’s repeat this example on a dataset with more than two classes. In this case, three classes.

The complete example is listed below.

# example of automatically choosing the loss function for multi-class classification
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
# define the model
model = XGBClassifier()
# fit the model
model.fit(X, y)
# summarize the model loss function
print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model automatically chose a loss function for multi-class classification.

multi:softprob

Alternately, we can manually specify the loss function and confirm it was used to train the model.

# example of manually specifying the loss function for multi-class classification
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
# define the model
model = XGBClassifier(objective="multi:softprob")
# fit the model
model.fit(X, y)
# summarize the model loss function
print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model used the specified loss function for multi-class classification.

multi:softprob

Finally, there are other loss functions you can use for classification, including: “*binary:logitraw*” and “*binary:hinge*” for binary classification and “*multi:softmax*” for multi-class classification.

You can see a full list here:

Next, let’s take a look at XGBoost loss functions for regression.

Regression refers to predictive modeling problems where a numerical value is predicted given an input sample.

Although predicting a probability sounds like a regression problem (i.e. a probability is a numerical value), it is generally not considered a regression type predictive modeling problem.

The XGBoost objective function used when predicting numerical values is the “*reg:squarederror*” loss function.

*“reg:squarederror”*: Loss function for regression predictive modeling problems.

This string value can be specified via the “*objective*” hyperparameter when configuring your *XGBRegressor* model.

For example:

...
# define the model for regression
model = XGBRegressor(objective='reg:squarederror')

Importantly, if you do not specify the “*objective*” hyperparameter, the *XGBRegressor* will automatically choose this objective function for you.

We can make this concrete with a worked example.

The example below creates a synthetic regression dataset, fits an *XGBRegressor* on the dataset, then prints the model objective configuration.

# example of automatically choosing the loss function for regression
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# define the model
model = XGBRegressor()
# fit the model
model.fit(X, y)
# summarize the model loss function
print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model automatically chose a loss function for regression.

reg:squarederror

Alternately, we can specify the objective and fit the model, confirming the loss function was used.

# example of manually specifying the loss function for regression
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# define the model
model = XGBRegressor(objective='reg:squarederror')
# fit the model
model.fit(X, y)
# summarize the model loss function
print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model used the specified loss function for regression.

reg:squarederror

Finally, there are other loss functions you can use for regression, including: “*reg:squaredlogerror*“, “*reg:logistic*“, “*reg:pseudohubererror*“, “*reg:gamma*“, and “*reg:tweedie*“.

You can see a full list here:

This section provides more resources on the topic if you are looking to go deeper.

- Extreme Gradient Boosting (XGBoost) Ensemble in Python
- Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost
- 4 Types of Classification Tasks in Machine Learning

In this tutorial, you discovered how to configure loss functions for XGBoost ensemble models.

Specifically, you learned:

- Specifying loss functions used when training XGBoost ensembles is a critical step much like neural networks.
- How to configure XGBoost loss functions for binary and multi-class classification tasks.
- How to configure XGBoost loss functions for regression predictive modeling tasks.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to XGBoost Loss Functions appeared first on Machine Learning Mastery.

]]>