The post Differential Evolution from Scratch in Python appeared first on Machine Learning Mastery.

]]>The differential evolution algorithm belongs to a broader family of evolutionary computing algorithms. Similar to other popular direct search approaches, such as genetic algorithms and evolution strategies, the differential evolution algorithm starts with an initial population of candidate solutions. These candidate solutions are iteratively improved by introducing mutations into the population, and retaining the fittest candidate solutions that yield a lower objective function value.

The differential evolution algorithm is advantageous over the aforementioned popular approaches because it can handle nonlinear and non-differentiable multi-dimensional objective functions, while requiring very few control parameters to steer the minimisation. These characteristics make the algorithm easier and more practical to use.

In this tutorial, you will discover the differential evolution algorithm for global optimisation.

After completing this tutorial, you will know:

- Differential evolution is a heuristic approach for the global optimisation of nonlinear and non- differentiable continuous space functions.
- How to implement the differential evolution algorithm from scratch in Python.
- How to apply the differential evolution algorithm to a real-valued 2D objective function.

Let’s get started.

**June/2021**: Fixed mutation operation in the code to match the description.

This tutorial is divided into three parts; they are:

- Differential Evolution
- Differential Evolution Algorithm From Scratch
- Differential Evolution Algorithm on the Sphere Function

Differential evolution is a heuristic approach for the global optimisation of nonlinear and non- differentiable continuous space functions.

For a minimisation algorithm to be considered practical, it is expected to fulfil five different requirements:

(1) Ability to handle non-differentiable, nonlinear and multimodal cost functions.

(2) Parallelizability to cope with computation intensive cost functions.

(3) Ease of use, i.e. few control variables to steer the minimization. These variables should

also be robust and easy to choose.

(4) Good convergence properties, i.e. consistent convergence to the global minimum in

consecutive independent trials.

— A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.

The strength of the differential evolution algorithm stems from the fact that it was designed to fulfil all of the above requirements.

Differential Evolution (DE) is arguably one of the most powerful and versatile evolutionary optimizers for the continuous parameter spaces in recent times.

— Recent advances in differential evolution: An updated survey, 2016.

The algorithm begins by randomly initiating a population of real-valued decision vectors, also known as genomes or chromosomes. These represent the candidates solutions to the multi- dimensional optimisation problem.

At each iteration, the algorithm introduces mutations into the population to generate new candidate solutions. The mutation process adds the weighted difference between two population vectors to a third vector, to produce a mutated vector. The parameters of the mutated vector are again mixed with the parameters of another predetermined vector, the target vector, during a process known as crossover that aims to increase the diversity of the perturbed parameter vectors. The resulting vector is known as the trial vector.

DE generates new parameter vectors by adding the weighted difference between two population vectors to a third vector. Let this operation be called mutation.

In order to increase the diversity of the perturbed parameter vectors, crossover is introduced.

— A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.

These mutations are generated according to a mutation strategy, which follows a general naming convention of DE/x/y/z, where DE stands for Differential Evolution, while x denotes the vector to be mutated, y denotes the number of difference vectors considered for the mutation of x, and z is the type of crossover in use. For instance, the popular strategies:

- DE/rand/1/bin
- DE/best/2/bin

Specify that vector x can either be picked randomly (rand) from the population, or else the vector with the lowest cost (best) is selected; that the number of difference vectors under consideration is either 1 or 2; and that crossover is performed according to independent binomial (bin) experiments. The DE/best/2/bin strategy, in particular, appears to be highly beneficial in improving the diversity of the population if the population size is large enough.

The usage of two difference vectors seems to improve the diversity of the population if the number of population vectors NP is high enough.

— A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.

A final selection operation replaces the target vector, or the parent, by the trial vector, its offspring, if the latter yields a lower objective function value. Hence, the fitter offspring now becomes a member of the newly generated population, and subsequently participates in the mutation of further population members. These iterations continue until a termination criterion is reached.

The iterations continue till a termination criterion (such as exhaustion of maximum functional evaluations) is satisfied.

— Recent advances in differential evolution: An updated survey, 2016.

The differential evolution algorithm requires very few parameters to operate, namely the population size, NP, a real and constant scale factor, F ∈ [0, 2], that weights the differential variation during the mutation process, and a crossover rate, CR ∈ [0, 1], that is determined experimentally. This makes the algorithm easy and practical to use.

In addition, the canonical DE requires very few control parameters (3 to be precise: the scale factor, the crossover rate and the population size) — a feature that makes it easy to use for the practitioners.

— Recent advances in differential evolution: An updated survey, 2016.

There have been further variants to the canonical differential evolution algorithm described above,

which one may read on in Recent advances in differential evolution – An updated survey, 2016.

Now that we are familiar with the differential evolution algorithm, let’s look at how to implement it from scratch.

In this section, we will explore how to implement the differential evolution algorithm from scratch.

The differential evolution algorithm begins by generating an initial population of candidate solutions. For this purpose, we shall use the rand() function to generate an array of random values sampled from a uniform distribution over the range, [0, 1).

We will then scale these values to change the range of their distribution to (lower bound, upper bound), where the bounds are specified in the form of a 2D array with each dimension corresponding to each input variable.

... # initialise population of candidate solutions randomly within the specified bounds pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0]))

It is within these same bounds that the objective function will also be evaluated. An objective function of choice and the bounds on each input variable may, therefore, be defined as follows:

# define objective function def obj(x): return 0 # define lower and upper bounds bounds = asarray([-5.0, 5.0])

We can evaluate our initial population of candidate solutions by passing it to the objective function as input argument.

... # evaluate initial population of candidate solutions obj_all = [obj(ind) for ind in pop]

We shall be replacing the values in obj_all with better ones as the population evolves and converges towards an optimal solution.

We can then loop over a predefined number of iterations of the algorithm, such as 100 or 1,000, as specified by parameter, iter, as well as over all candidate solutions.

... # run iterations of the algorithm for i in range(iter): # iterate over all candidate solutions for j in range(pop_size): ...

The first step of the algorithm iteration performs a mutation process. For this purpose, three random candidates, a, b and c, that are not the current one, are randomly selected from the population and a mutated vector is generated by computing: a + F * (b – c). Recall that F ∈ [0, 2] and denotes the mutation scale factor.

... # choose three candidates, a, b and c, that are not the current one candidates = [candidate for candidate in range(pop_size) if candidate != j] a, b, c = pop[choice(candidates, 3, replace=False)]

The mutation process is performed by the function, mutation, to which we pass a, b, c and F as input arguments.

# define mutation operation def mutation(x, F): return x[0] + F * (x[1] - x[2]) ... # perform mutation mutated = mutation([a, b, c], F) ...

Since we are operating within a bounded range of values, we need to check whether the newly mutated vector is also within the specified bounds, and if not, clip its values to the upper or lower limits as necessary. This check is carried out by the function, check_bounds.

# define boundary check operation def check_bounds(mutated, bounds): mutated_bound = [clip(mutated[i], bounds[i, 0], bounds[i, 1]) for i in range(len(bounds))] return mutated_bound

The next step performs crossover, where specific values of the current, target, vector are replaced by the corresponding values in the mutated vector, to create a trial vector. The decision of which values to replace is based on whether a uniform random value generated for each input variable falls below a crossover rate. If it does, then the corresponding values from the mutated vector are copied to the target vector.

The crossover process is implemented by the crossover() function, which takes the mutated and target vectors as input, as well as the crossover rate, cr ∈ [0, 1], and the number of input variables.

# define crossover operation def crossover(mutated, target, dims, cr): # generate a uniform random value for every dimension p = rand(dims) # generate trial vector by binomial crossover trial = [mutated[i] if p[i] < cr else target[i] for i in range(dims)] return trial ... # perform crossover trial = crossover(mutated, pop[j], len(bounds), cr) ...

A final selection step replaces the target vector by the trial vector if the latter yields a lower objective function value. For this purpose, we evaluate both vectors on the objective function and subsequently perform selection, storing the new objective function value in obj_all if the trial vector is found to be the fittest of the two.

... # compute objective function value for target vector obj_target = obj(pop[j]) # compute objective function value for trial vector obj_trial = obj(trial) # perform selection if obj_trial < obj_target: # replace the target vector with the trial vector pop[j] = trial # store the new objective function value obj_all[j] = obj_trial

We can tie all steps together into a differential_evolution() function that takes as input arguments the population size, the bounds of each input variable, the total number of iterations, the mutation scale factor and the crossover rate, and returns the best solution found and its evaluation.

def differential_evolution(pop_size, bounds, iter, F, cr): # initialise population of candidate solutions randomly within the specified bounds pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0])) # evaluate initial population of candidate solutions obj_all = [obj(ind) for ind in pop] # find the best performing vector of initial population best_vector = pop[argmin(obj_all)] best_obj = min(obj_all) prev_obj = best_obj # run iterations of the algorithm for i in range(iter): # iterate over all candidate solutions for j in range(pop_size): # choose three candidates, a, b and c, that are not the current one candidates = [candidate for candidate in range(pop_size) if candidate != j] a, b, c = pop[choice(candidates, 3, replace=False)] # perform mutation mutated = mutation([a, b, c], F) # check that lower and upper bounds are retained after mutation mutated = check_bounds(mutated, bounds) # perform crossover trial = crossover(mutated, pop[j], len(bounds), cr) # compute objective function value for target vector obj_target = obj(pop[j]) # compute objective function value for trial vector obj_trial = obj(trial) # perform selection if obj_trial < obj_target: # replace the target vector with the trial vector pop[j] = trial # store the new objective function value obj_all[j] = obj_trial # find the best performing vector at each iteration best_obj = min(obj_all) # store the lowest objective function value if best_obj < prev_obj: best_vector = pop[argmin(obj_all)] prev_obj = best_obj # report progress at each iteration print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj)) return [best_vector, best_obj]

Now that we have implemented the differential evolution algorithm, let’s investigate how to use it to optimise an objective function.

In this section, we will apply the differential evolution algorithm to an objective function.

We will use a simple two-dimensional sphere objective function specified within the bounds, [-5, 5]. The sphere function is continuous, convex and unimodal, and is characterised by a single global minimum at f(0, 0) = 0.0.

# define objective function def obj(x): return x[0]**2.0 + x[1]**2.0

We will minimise this objective function with the differential evolution algorithm, based on the strategy DE/rand/1/bin.

In order to do so, we must define values for the algorithm parameters, specifically for the population size, the number of iterations, the mutation scale factor and the crossover rate. We set these values empirically to, 10, 100, 0.5 and 0.7 respectively.

... # define population size pop_size = 10 # define number of iterations iter = 100 # define scale factor for mutation F = 0.5 # define crossover rate for recombination cr = 0.7

We also define the bounds of each input variable.

... # define lower and upper bounds for every dimension bounds = asarray([(-5.0, 5.0), (-5.0, 5.0)])

Next, we carry out the search and report the results.

... # perform differential evolution solution = differential_evolution(pop_size, bounds, iter, F, cr)

Tying this all together, the complete example is listed below.

# differential evolution search of the two-dimensional sphere objective function from numpy.random import rand from numpy.random import choice from numpy import asarray from numpy import clip from numpy import argmin from numpy import min from numpy import around # define objective function def obj(x): return x[0]**2.0 + x[1]**2.0 # define mutation operation def mutation(x, F): return x[0] + F * (x[1] - x[2]) # define boundary check operation def check_bounds(mutated, bounds): mutated_bound = [clip(mutated[i], bounds[i, 0], bounds[i, 1]) for i in range(len(bounds))] return mutated_bound # define crossover operation def crossover(mutated, target, dims, cr): # generate a uniform random value for every dimension p = rand(dims) # generate trial vector by binomial crossover trial = [mutated[i] if p[i] < cr else target[i] for i in range(dims)] return trial def differential_evolution(pop_size, bounds, iter, F, cr): # initialise population of candidate solutions randomly within the specified bounds pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0])) # evaluate initial population of candidate solutions obj_all = [obj(ind) for ind in pop] # find the best performing vector of initial population best_vector = pop[argmin(obj_all)] best_obj = min(obj_all) prev_obj = best_obj # run iterations of the algorithm for i in range(iter): # iterate over all candidate solutions for j in range(pop_size): # choose three candidates, a, b and c, that are not the current one candidates = [candidate for candidate in range(pop_size) if candidate != j] a, b, c = pop[choice(candidates, 3, replace=False)] # perform mutation mutated = mutation([a, b, c], F) # check that lower and upper bounds are retained after mutation mutated = check_bounds(mutated, bounds) # perform crossover trial = crossover(mutated, pop[j], len(bounds), cr) # compute objective function value for target vector obj_target = obj(pop[j]) # compute objective function value for trial vector obj_trial = obj(trial) # perform selection if obj_trial < obj_target: # replace the target vector with the trial vector pop[j] = trial # store the new objective function value obj_all[j] = obj_trial # find the best performing vector at each iteration best_obj = min(obj_all) # store the lowest objective function value if best_obj < prev_obj: best_vector = pop[argmin(obj_all)] prev_obj = best_obj # report progress at each iteration print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj)) return [best_vector, best_obj] # define population size pop_size = 10 # define lower and upper bounds for every dimension bounds = asarray([(-5.0, 5.0), (-5.0, 5.0)]) # define number of iterations iter = 100 # define scale factor for mutation F = 0.5 # define crossover rate for recombination cr = 0.7 # perform differential evolution solution = differential_evolution(pop_size, bounds, iter, F, cr) print('\nSolution: f([%s]) = %.5f' % (around(solution[0], decimals=5), solution[1]))

Running the example reports the progress of the search including the iteration number, and the response from the objective function each time an improvement is detected.

At the end of the search, the best solution is found and its evaluation is reported.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm converges very close to f(0.0, 0.0) = 0.0 in about 33 improvements out of 100 iterations.

Iteration: 1 f([[ 0.89709 -0.45082]]) = 1.00800 Iteration: 2 f([[-0.5382 0.29676]]) = 0.37773 Iteration: 3 f([[ 0.41884 -0.21613]]) = 0.22214 Iteration: 4 f([[0.34737 0.29676]]) = 0.20873 Iteration: 5 f([[ 0.20692 -0.1747 ]]) = 0.07334 Iteration: 7 f([[-0.23154 -0.00557]]) = 0.05364 Iteration: 8 f([[ 0.11956 -0.02632]]) = 0.01499 Iteration: 11 f([[ 0.01535 -0.02632]]) = 0.00093 Iteration: 15 f([[0.01918 0.01603]]) = 0.00062 Iteration: 18 f([[0.01706 0.00775]]) = 0.00035 Iteration: 20 f([[0.00467 0.01275]]) = 0.00018 Iteration: 21 f([[ 0.00288 -0.00175]]) = 0.00001 Iteration: 27 f([[ 0.00286 -0.00175]]) = 0.00001 Iteration: 30 f([[-0.00059 0.00044]]) = 0.00000 Iteration: 37 f([[-1.5e-04 8.0e-05]]) = 0.00000 Iteration: 41 f([[-1.e-04 -8.e-05]]) = 0.00000 Iteration: 43 f([[-4.e-05 6.e-05]]) = 0.00000 Iteration: 48 f([[-2.e-05 6.e-05]]) = 0.00000 Iteration: 49 f([[-6.e-05 0.e+00]]) = 0.00000 Iteration: 50 f([[-4.e-05 1.e-05]]) = 0.00000 Iteration: 51 f([[1.e-05 1.e-05]]) = 0.00000 Iteration: 55 f([[1.e-05 0.e+00]]) = 0.00000 Iteration: 64 f([[-0. -0.]]) = 0.00000 Iteration: 68 f([[ 0. -0.]]) = 0.00000 Iteration: 72 f([[-0. 0.]]) = 0.00000 Iteration: 77 f([[-0. 0.]]) = 0.00000 Iteration: 79 f([[0. 0.]]) = 0.00000 Iteration: 84 f([[ 0. -0.]]) = 0.00000 Iteration: 86 f([[-0. -0.]]) = 0.00000 Iteration: 87 f([[-0. -0.]]) = 0.00000 Iteration: 95 f([[-0. 0.]]) = 0.00000 Iteration: 98 f([[-0. 0.]]) = 0.00000 Solution: f([[-0. 0.]]) = 0.00000

We can plot the objective function values returned at every improvement by modifying the differential_evolution() function slightly to keep track of the objective function values and return this in the list, obj_iter.

def differential_evolution(pop_size, bounds, iter, F, cr): # initialise population of candidate solutions randomly within the specified bounds pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0])) # evaluate initial population of candidate solutions obj_all = [obj(ind) for ind in pop] # find the best performing vector of initial population best_vector = pop[argmin(obj_all)] best_obj = min(obj_all) prev_obj = best_obj # initialise list to store the objective function value at each iteration obj_iter = list() # run iterations of the algorithm for i in range(iter): # iterate over all candidate solutions for j in range(pop_size): # choose three candidates, a, b and c, that are not the current one candidates = [candidate for candidate in range(pop_size) if candidate != j] a, b, c = pop[choice(candidates, 3, replace=False)] # perform mutation mutated = mutation([a, b, c], F) # check that lower and upper bounds are retained after mutation mutated = check_bounds(mutated, bounds) # perform crossover trial = crossover(mutated, pop[j], len(bounds), cr) # compute objective function value for target vector obj_target = obj(pop[j]) # compute objective function value for trial vector obj_trial = obj(trial) # perform selection if obj_trial < obj_target: # replace the target vector with the trial vector pop[j] = trial # store the new objective function value obj_all[j] = obj_trial # find the best performing vector at each iteration best_obj = min(obj_all) # store the lowest objective function value if best_obj < prev_obj: best_vector = pop[argmin(obj_all)] prev_obj = best_obj obj_iter.append(best_obj) # report progress at each iteration print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj)) return [best_vector, best_obj, obj_iter]

We can then create a line plot of these objective function values to see the relative changes at every improvement during the search.

from matplotlib import pyplot ... # perform differential evolution solution = differential_evolution(pop_size, bounds, iter, F, cr) ... # line plot of best objective function values pyplot.plot(solution[2], '.-') pyplot.xlabel('Improvement Number') pyplot.ylabel('Evaluation f(x)') pyplot.show()

Tying this together, the complete example is listed below.

# differential evolution search of the two-dimensional sphere objective function from numpy.random import rand from numpy.random import choice from numpy import asarray from numpy import clip from numpy import argmin from numpy import min from numpy import around from matplotlib import pyplot # define objective function def obj(x): return x[0]**2.0 + x[1]**2.0 # define mutation operation def mutation(x, F): return x[0] + F * (x[1] - x[2]) # define boundary check operation def check_bounds(mutated, bounds): mutated_bound = [clip(mutated[i], bounds[i, 0], bounds[i, 1]) for i in range(len(bounds))] return mutated_bound # define crossover operation def crossover(mutated, target, dims, cr): # generate a uniform random value for every dimension p = rand(dims) # generate trial vector by binomial crossover trial = [mutated[i] if p[i] < cr else target[i] for i in range(dims)] return trial def differential_evolution(pop_size, bounds, iter, F, cr): # initialise population of candidate solutions randomly within the specified bounds pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0])) # evaluate initial population of candidate solutions obj_all = [obj(ind) for ind in pop] # find the best performing vector of initial population best_vector = pop[argmin(obj_all)] best_obj = min(obj_all) prev_obj = best_obj # initialise list to store the objective function value at each iteration obj_iter = list() # run iterations of the algorithm for i in range(iter): # iterate over all candidate solutions for j in range(pop_size): # choose three candidates, a, b and c, that are not the current one candidates = [candidate for candidate in range(pop_size) if candidate != j] a, b, c = pop[choice(candidates, 3, replace=False)] # perform mutation mutated = mutation([a, b, c], F) # check that lower and upper bounds are retained after mutation mutated = check_bounds(mutated, bounds) # perform crossover trial = crossover(mutated, pop[j], len(bounds), cr) # compute objective function value for target vector obj_target = obj(pop[j]) # compute objective function value for trial vector obj_trial = obj(trial) # perform selection if obj_trial < obj_target: # replace the target vector with the trial vector pop[j] = trial # store the new objective function value obj_all[j] = obj_trial # find the best performing vector at each iteration best_obj = min(obj_all) # store the lowest objective function value if best_obj < prev_obj: best_vector = pop[argmin(obj_all)] prev_obj = best_obj obj_iter.append(best_obj) # report progress at each iteration print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj)) return [best_vector, best_obj, obj_iter] # define population size pop_size = 10 # define lower and upper bounds for every dimension bounds = asarray([(-5.0, 5.0), (-5.0, 5.0)]) # define number of iterations iter = 100 # define scale factor for mutation F = 0.5 # define crossover rate for recombination cr = 0.7 # perform differential evolution solution = differential_evolution(pop_size, bounds, iter, F, cr) print('\nSolution: f([%s]) = %.5f' % (around(solution[0], decimals=5), solution[1])) # line plot of best objective function values pyplot.plot(solution[2], '.-') pyplot.xlabel('Improvement Number') pyplot.ylabel('Evaluation f(x)') pyplot.show()

Running the example creates a line plot.

The line plot shows the objective function evaluation for each improvement, with large changes initially and very small changes towards the end of the search as the algorithm converged on the optima.

This section provides more resources on the topic if you are looking to go deeper.

- A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.
- Recent advances in differential evolution: An updated survey, 2016.

- Algorithms for Optimization, 2019.

In this tutorial, you discovered the differential evolution algorithm.

Specifically, you learned:

- Differential evolution is a heuristic approach for the global optimisation of nonlinear and non- differentiable continuous space functions.
- How to implement the differential evolution algorithm from scratch in Python.
- How to apply the differential evolution algorithm to a real-valued 2D objective function.

The post Differential Evolution from Scratch in Python appeared first on Machine Learning Mastery.

]]>The post Modeling Pipeline Optimization With scikit-learn appeared first on Machine Learning Mastery.

]]>A machine learning pipeline can be created by putting together a sequence of steps involved in training a machine learning model. It can be used to automate a machine learning workflow. The pipeline can involve pre-processing, feature selection, classification/regression, and post-processing. More complex applications may need to fit in other necessary steps within this pipeline.

By optimization, we mean tuning the model for the best performance. The success of any learning model rests on the selection of the best parameters that give the best possible results. Optimization can be looked at in terms of a search algorithm, which walks through a space of parameters and hunts down the best out of them.

After completing this tutorial, you should:

- Appreciate the significance of a pipeline and its optimization.
- Be able to set up a machine learning pipeline.
- Be able to optimize the pipeline.
- Know techniques to analyze the results of optimization.

The tutorial is simple and easy to follow. It should not take you too long to go through it. So enjoy!

This tutorial will show you how to

- Set up a pipeline using the Pipeline object from sklearn.pipeline.
- Perform a grid search for the best parameters using GridSearchCV() from sklearn.model_selection
- Analyze the results from the GridSearchCV() and visualize them

Before we demonstrate all the above, let’s write the import section:

from pandas import read_csv # For dataframes from pandas import DataFrame # For dataframes from numpy import ravel # For matrices import matplotlib.pyplot as plt # For plotting data import seaborn as sns # For plotting data from sklearn.model_selection import train_test_split # For train/test splits from sklearn.neighbors import KNeighborsClassifier # The k-nearest neighbor classifier from sklearn.feature_selection import VarianceThreshold # Feature selector from sklearn.pipeline import Pipeline # For setting up pipeline # Various pre-processing steps from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder from sklearn.model_selection import GridSearchCV # For optimization

We’ll use the Ecoli Dataset from the UCI Machine Learning Repository to demonstrate all the concepts of this tutorial. This dataset is maintained by Kenta Nakai. Let’s first load the Ecoli dataset in a Pandas DataFrame and view the first few rows.

# Read ecoli dataset from the UCI ML Repository and store in # dataframe df df = read_csv( 'https://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data', sep = '\s+', header=None) print(df.head())

Running the example you should see the following:

0 1 2 3 4 5 6 7 8 0 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp 1 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp 2 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp 3 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp 4 ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35 cp

We’ll ignore the first column, which specifies the sequence name. The last column is the class label. Let’s separate the features from the class label and split the dataset into 2/3 training instances and 1/3 test examples.

... # The data matrix X X = df.iloc[:,1:-1] # The labels y = (df.iloc[:,-1:]) # Encode the labels into unique integers encoder = LabelEncoder() y = encoder.fit_transform(ravel(y)) # Split the data into test and train X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=1/3, random_state=0) print(X_train.shape) print(X_test.shape)

Running the example you should see the following:

(224, 7) (112, 7)

Great! Now we have 224 samples in the training set and 112 samples in the test set. We have chosen a small dataset so that we can focus on the concepts, rather than the data itself.

For this tutorial, we have chosen the k-nearest neighbor classifier to perform the classification of this dataset.

First, let’s just check how the k-nearest neighbor performs on the training and test sets. This would give us a baseline for performance.

... knn = KNeighborsClassifier().fit(X_train, y_train) print('Training set score: ' + str(knn.score(X_train,y_train))) print('Test set score: ' + str(knn.score(X_test,y_test)))

Running the example you should see the following:

Training set score: 0.9017857142857143 Test set score: 0.8482142857142857

We should keep in mind that the true judge of a classifier’s performance is the test set score and not the training set score. The test set score reflects the generalization ability of a classifier.

For this tutorial, we’ll set up a very basic pipeline that consists of the following sequence:

**Scaler**: For pre-processing data, i.e., transform the data to zero mean and unit variance using the StandardScaler().**Feature selector**: Use VarianceThreshold() for selecting features whose variance is less than a certain defined threshold.**Classifier**: KNeighborsClassifier(), which implements the k-nearest neighbor classifier and selects the class of the majority k points, which are closest to the test example.

... pipe = Pipeline([ ('scaler', StandardScaler()), ('selector', VarianceThreshold()), ('classifier', KNeighborsClassifier()) ])

The pipe object is simple to understand. It says, scale first, select features second and classify in the end. Let’s call fit() method of the pipe object on our training data and get the training and test scores.

... pipe.fit(X_train, y_train) print('Training set score: ' + str(pipe.score(X_train,y_train))) print('Test set score: ' + str(pipe.score(X_test,y_test)))

Running the example you should see the following:

Training set score: 0.8794642857142857 Test set score: 0.8392857142857143

So it looks like the performance of this pipeline is worse than the single classifier performance on raw data. Not only did we add extra processing, but it was all in vain. Don’t despair, the real benefit of the pipeline comes from its tuning. The next section explains how to do that.

In the code below, we’ll show the following:

- We can search for the best scalers. Instead of just the StandardScaler(), we can try MinMaxScaler(), Normalizer() and MaxAbsScaler().
- We can search for the best variance threshold to use in the selector, i.e., VarianceThreshold().
- We can search for the best value of k for the KNeighborsClassifier().

The parameters variable below is a dictionary that specifies the key:value pairs. Note the key must be written, with a double underscore __ separating the module name that we selected in the Pipeline() and its parameter. Note the following:

- The scaler has no double underscore, as we have specified a list of objects there.
- We would search for the best threshold for the selector, i.e., VarianceThreshold(). Hence we have specified a list of values [0, 0.0001, 0.001, 0.5] to choose from.
- Different values are specified for the n_neighbors, p and leaf_size parameters of the KNeighborsClassifier().

... parameters = {'scaler': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()], 'selector__threshold': [0, 0.001, 0.01], 'classifier__n_neighbors': [1, 3, 5, 7, 10], 'classifier__p': [1, 2], 'classifier__leaf_size': [1, 5, 10, 15] }

The pipe along with the above list of parameters are then passed to a GridSearchCV() object, that searches the parameters space for the best set of parameters as shown below:

... grid = GridSearchCV(pipe, parameters, cv=2).fit(X_train, y_train) print('Training set score: ' + str(grid.score(X_train, y_train))) print('Test set score: ' + str(grid.score(X_test, y_test)))

Running the example you should see the following:

Training set score: 0.8928571428571429 Test set score: 0.8571428571428571

By tuning the pipeline, we achieved quite an improvement over a simple classifier and a non-optimized pipeline. It is important to analyze the results of the optimization process.

Don’t worry too much about the warning that you get by running the code above. It is generated because we have very few training samples and the cross-validation object does not have enough samples for a class for one of its folds.

Let’s look at the tuned grid object and gain an understanding of the GridSearchCV() object.

The object is so named because it sets up a multi-dimensional grid, with each corner representing a combination of parameters to try. This defines a parameter space. As an example if we have three values of n_neighbors, i.e., {1,3,5}, two values of leaf_size, i.e., {1,5} and two values of threshold, i.e., {0,0.0001}, then we have a 3D grid with 3x2x2=12 corners. Each corner represents a different combination.

For each corner of the above grid, the GridSearchCV() object computes the mean cross-validation score on the unseen examples and selects the corner/combination of parameters that give the best result. The code below shows how to access the best parameters of the grid and the best pipeline for our task.

... # Access the best set of parameters best_params = grid.best_params_ print(best_params) # Stores the optimum model in best_pipe best_pipe = grid.best_estimator_ print(best_pipe)

Running the example you should see the following:

{'classifier__leaf_size': 1, 'classifier__n_neighbors': 7, 'classifier__p': 2, 'scaler': StandardScaler(), 'selector__threshold': 0} Pipeline(steps=[('scaler', StandardScaler()), ('selector', VarianceThreshold(threshold=0)), ('classifier', KNeighborsClassifier(leaf_size=1, n_neighbors=7))])

Another useful technique for analyzing the results is to construct a DataFrame from the grid.cv_results_. Let’s view the columns of this data frame.

... result_df = DataFrame.from_dict(grid.cv_results_, orient='columns') print(result_df.columns)

Running the example you should see the following:

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_classifier__leaf_size', 'param_classifier__n_neighbors', 'param_classifier__p', 'param_scaler', 'param_selector__threshold', 'params', 'split0_test_score', 'split1_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'], dtype='object')

This DataFrame is very valuable as it shows us the scores for different parameters. The column with the mean_test_score is the average of the scores on the test set for all the folds during cross-validation. The DataFrame may be too big to visualize manually, hence, it is always a good idea to plot the results. Let’s see how n_neighbors affect the performance for different scalers and for different values of p.

... sns.relplot(data=result_df, kind='line', x='param_classifier__n_neighbors', y='mean_test_score', hue='param_scaler', col='param_classifier__p') plt.show()

Running the example you should see the following:

The plots clearly show that using StandardScaler(), with n_neighbors=7 and p=2, gives the best result. Let’s make one more set of plots with leaf_size.

... sns.relplot(data=result_df, kind='line', x='param_classifier__n_neighbors', y='mean_test_score', hue='param_scaler', col='param_classifier__leaf_size') plt.show()

Running the example you should see the following:

Tying this all together, the complete code example is listed below.

from pandas import read_csv # For dataframes from pandas import DataFrame # For dataframes from numpy import ravel # For matrices import matplotlib.pyplot as plt # For plotting data import seaborn as sns # For plotting data from sklearn.model_selection import train_test_split # For train/test splits from sklearn.neighbors import KNeighborsClassifier # The k-nearest neighbor classifier from sklearn.feature_selection import VarianceThreshold # Feature selector from sklearn.pipeline import Pipeline # For setting up pipeline # Various pre-processing steps from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder from sklearn.model_selection import GridSearchCV # For optimization # Read ecoli dataset from the UCI ML Repository and store in # dataframe df df = read_csv( 'https://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data', sep = '\s+', header=None) print(df.head()) # The data matrix X X = df.iloc[:,1:-1] # The labels y = (df.iloc[:,-1:]) # Encode the labels into unique integers encoder = LabelEncoder() y = encoder.fit_transform(ravel(y)) # Split the data into test and train X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=1/3, random_state=0) print(X_train.shape) print(X_test.shape) knn = KNeighborsClassifier().fit(X_train, y_train) print('Training set score: ' + str(knn.score(X_train,y_train))) print('Test set score: ' + str(knn.score(X_test,y_test))) pipe = Pipeline([ ('scaler', StandardScaler()), ('selector', VarianceThreshold()), ('classifier', KNeighborsClassifier()) ]) pipe.fit(X_train, y_train) print('Training set score: ' + str(pipe.score(X_train,y_train))) print('Test set score: ' + str(pipe.score(X_test,y_test))) parameters = {'scaler': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()], 'selector__threshold': [0, 0.001, 0.01], 'classifier__n_neighbors': [1, 3, 5, 7, 10], 'classifier__p': [1, 2], 'classifier__leaf_size': [1, 5, 10, 15] } grid = GridSearchCV(pipe, parameters, cv=2).fit(X_train, y_train) print('Training set score: ' + str(grid.score(X_train, y_train))) print('Test set score: ' + str(grid.score(X_test, y_test))) # Access the best set of parameters best_params = grid.best_params_ print(best_params) # Stores the optimum model in best_pipe best_pipe = grid.best_estimator_ print(best_pipe) result_df = DataFrame.from_dict(grid.cv_results_, orient='columns') print(result_df.columns) sns.relplot(data=result_df, kind='line', x='param_classifier__n_neighbors', y='mean_test_score', hue='param_scaler', col='param_classifier__p') plt.show() sns.relplot(data=result_df, kind='line', x='param_classifier__n_neighbors', y='mean_test_score', hue='param_scaler', col='param_classifier__leaf_size') plt.show()

In this tutorial we learned the following:

- How to build a machine learning pipeline.
- How to optimize the pipeline using GridSearchCV.
- How to analyze and compare the results attained by using different sets of parameters.

The dataset used for this tutorial is quite small with a few example points but still the results are better than a simple classifier.

For the interested readers, here are a few resources:

- On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation by Gavin Cawley and Nicola Talbot
- A Gentle Introduction to k-fold Cross-Validation
- Model Selection
- k-nearest Neighbors Algorithm

- sklearn.model_selection.ParameterSampler API
- sklearn.model_selection.RandomizedSearchCV API
- sklearn.model_selection.KFold API
- sklearn.model_selection.ParameterGrid API

- UCI Machine Learning Repository maintained by Dua and Graff.
- Ecoli Dataset maintained by Kenta Nakai. Please see this paper for more information.

The post Modeling Pipeline Optimization With scikit-learn appeared first on Machine Learning Mastery.

]]>The post Gradient Descent With AdaGrad From Scratch appeared first on Machine Learning Mastery.

]]>A limitation of gradient descent is that it uses the same step size (learning rate) for each input variable. This can be a problem on objective functions that have different amounts of curvature in different dimensions, and in turn, may require a different sized step to a new point.

**Adaptive Gradients**, or **AdaGrad** for short, is an extension of the gradient descent optimization algorithm that allows the step size in each dimension used by the optimization algorithm to be automatically adapted based on the gradients seen for the variable (partial derivatives) seen over the course of the search.

In this tutorial, you will discover how to develop the gradient descent with adaptive gradients optimization algorithm from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable in the objective function, called adaptive gradients or AdaGrad.
- How to implement the AdaGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gradient Descent
- Adaptive Gradient (AdaGrad)
- Gradient Descent With AdaGrad
- Two-Dimensional Test Problem
- Gradient Descent Optimization With AdaGrad
- Visualization of AdaGrad

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first order derivative, or simply the “*derivative*,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the “gradient.”

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x = x – step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**(*alpha*): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at AdaGrad.

The Adaptive Gradient algorithm, or AdaGrad for short, is an extension to the gradient descent optimization algorithm.

The algorithm was described by John Duchi, et al. in their 2011 paper titled “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.”

It is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.

— Page 307, Deep Learning, 2016.

A problem with the gradient descent algorithm is that the step size (learning rate) is the same for each variable or dimension in the search space. It is possible that better performance can be achieved using a step size that is tailored to each variable, allowing larger movements in dimensions with a consistently steep gradient and smaller movements in dimensions with less steep gradients.

AdaGrad is designed to specifically explore the idea of automatically tailoring the step size for each dimension in the search space.

The adaptive subgradient method, or Adagrad, adapts a learning rate for each component of x

— Page 77, Algorithms for Optimization, 2019.

This is achieved by first calculating a step size for a given dimension, then using the calculated step size to make a movement in that dimension using the partial derivative. This process is then repeated for each dimension in the search space.

Adagrad dulls the influence of parameters with consistently high gradients, thereby increasing the influence of parameters with infrequent updates.

— Page 77, Algorithms for Optimization, 2019.

AdaGrad is suited to objective functions where the curvature of the search space is different in different dimensions, allowing a more effective optimization given the customization of the step size in each dimension.

The algorithm requires that you set an initial step size for all input variables as per normal, such as 0.1 or 0.001, or similar. Although, the benefit of the algorithm is that it is not as sensitive to the initial learning rate as the gradient descent algorithm.

Adagrad is far less sensitive to the learning rate parameter alpha. The learning rate parameter is typically set to a default value of 0.01.

— Page 77, Algorithms for Optimization, 2019.

An internal variable is then maintained for each input variable that is the sum of the squared partial derivatives for the input variable observed during the search.

This sum of the squared partial derivatives is then used to calculate the step size for the variable by dividing the initial step size value (e.g. hyperparameter value specified at the start of the run) divided by the square root of the sum of the squared partial derivatives.

- cust_step_size = step_size / sqrt(s)

It is possible for the square root of the sum of squared partial derivatives to result in a value of 0.0, resulting in a divide by zero error. Therefore, a tiny value can be added to the denominator to avoid this possibility, such as 1e-8.

- cust_step_size = step_size / (1e-8 + sqrt(s))

Where *cust_step_size* is the calculated step size for an input variable for a given point during the search, *step_size* is the initial step size, *sqrt()* is the square root operation, and *s* is the sum of the squared partial derivatives for the input variable seen during the search so far.

The custom step size is then used to calculate the value for the variable in the next point or solution in the search.

- x(t+1) = x(t) – cust_step_size * f'(x(t))

This process is then repeated for each input variable until a new point in the search space is created and can be evaluated.

Importantly, the partial derivative for the current solution (iteration of the search) is included in the sum of the square root of partial derivatives.

We could maintain an array of partial derivatives or squared partial derivatives for each input variable, but this is not necessary. Instead, we simply maintain the sum of the squared partial derivatives and add new values to this sum along the way.

Now that we are familiar with the AdaGrad algorithm, let’s explore how we might implement it and evaluate its performance.

In this section, we will explore how to implement the gradient descent optimization algorithm with adaptive gradients.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this function.

# objective function def objective(x, y): return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function from numpy import asarray from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # show the plot pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the AdaGrad optimization algorithm.

We can apply the gradient descent with adaptive gradient algorithm to the test problem.

First, we need a function that calculates the derivative for this function.

- f(x) = x^2
- f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension.

The *derivative()* function implements this below.

# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent with adaptive gradients.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

... # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the sum of the squared partial derivatives for each dimension to 0.0 values.

... # list of the sum square gradients for each variable sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

We can then enumerate a fixed number of iterations of the search optimization algorithm defined by a “*n_iter*” hyperparameter.

... # run the gradient descent for it in range(n_iter): ...

The first step is to calculate the gradient for the current solution using the *derivative()* function.

... # calculate gradient gradient = derivative(solution[0], solution[1])

We then need to calculate the square of the partial derivative of each variable and add them to the running sum of these values.

... # update the sum of the squared partial derivatives for i in range(gradient.shape[0]): sq_grad_sums[i] += gradient[i]**2.0

We can then use the sum squared partial derivatives and gradient to calculate the next point.

We will do this one variable at a time, first calculating the step size for the variable, then the new value for the variable. These values are built up in an array until we have a completely new solution that is in the steepest descent direction from the current point using the custom step sizes.

... # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the step size for this variable alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i])) # calculate the new position in this variable value = solution[i] - alpha * gradient[i] # store this variable new_solution.append(value)

This new solution can then be evaluated using the *objective()* function and the performance of the search can be reported.

... # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

And that’s it.

We can tie all of this together into a function named *adagrad()* that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and the initial learning rate, and returns the final solution and its evaluation.

This complete function is listed below.

# gradient descent algorithm with adagrad def adagrad(objective, derivative, bounds, n_iter, step_size): # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # list of the sum square gradients for each variable sq_grad_sums = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate gradient gradient = derivative(solution[0], solution[1]) # update the sum of the squared partial derivatives for i in range(gradient.shape[0]): sq_grad_sums[i] += gradient[i]**2.0 # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the step size for this variable alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i])) # calculate the new position in this variable value = solution[i] - alpha * gradient[i] # store this variable new_solution.append(value) # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval)) return [solution, solution_eval]

**Note**: we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.

We can then define our hyperparameters and call the *adagrad()* function to optimize our test objective function.

In this case, we will use 50 iterations of the algorithm and an initial learning rate of 0.1, both chosen after a little trial and error.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # define the step size step_size = 0.1 # perform the gradient descent search with adagrad best, score = adagrad(objective, derivative, bounds, n_iter, step_size) print('Done!') print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with adaptive gradients is listed below.

# gradient descent optimization with adagrad for a two-dimensional test function from math import sqrt from numpy import asarray from numpy.random import rand from numpy.random import seed # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with adagrad def adagrad(objective, derivative, bounds, n_iter, step_size): # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # list of the sum square gradients for each variable sq_grad_sums = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate gradient gradient = derivative(solution[0], solution[1]) # update the sum of the squared partial derivatives for i in range(gradient.shape[0]): sq_grad_sums[i] += gradient[i]**2.0 # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the step size for this variable alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i])) # calculate the new position in this variable value = solution[i] - alpha * gradient[i] # store this variable new_solution.append(value) # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval)) return [solution, solution_eval] # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # define the step size step_size = 0.1 # perform the gradient descent search with adagrad best, score = adagrad(objective, derivative, bounds, n_iter, step_size) print('Done!') print('f(%s) = %f' % (best, score))

Running the example applies the AdaGrad optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

>0 f([-0.06595599 0.34064899]) = 0.12039 >1 f([-0.02902286 0.27948766]) = 0.07896 >2 f([-0.0129815 0.23463749]) = 0.05522 >3 f([-0.00582483 0.1993997 ]) = 0.03979 >4 f([-0.00261527 0.17071256]) = 0.02915 >5 f([-0.00117437 0.14686138]) = 0.02157 >6 f([-0.00052736 0.12676134]) = 0.01607 >7 f([-0.00023681 0.10966762]) = 0.01203 >8 f([-0.00010634 0.09503809]) = 0.00903 >9 f([-4.77542704e-05 8.24607972e-02]) = 0.00680 >10 f([-2.14444463e-05 7.16123835e-02]) = 0.00513 >11 f([-9.62980437e-06 6.22327049e-02]) = 0.00387 >12 f([-4.32434258e-06 5.41085063e-02]) = 0.00293 >13 f([-1.94188148e-06 4.70624414e-02]) = 0.00221 >14 f([-8.72017797e-07 4.09453989e-02]) = 0.00168 >15 f([-3.91586740e-07 3.56309531e-02]) = 0.00127 >16 f([-1.75845235e-07 3.10112252e-02]) = 0.00096 >17 f([-7.89647442e-08 2.69937139e-02]) = 0.00073 >18 f([-3.54597657e-08 2.34988084e-02]) = 0.00055 >19 f([-1.59234984e-08 2.04577993e-02]) = 0.00042 >20 f([-7.15057749e-09 1.78112581e-02]) = 0.00032 >21 f([-3.21102543e-09 1.55077005e-02]) = 0.00024 >22 f([-1.44193729e-09 1.35024688e-02]) = 0.00018 >23 f([-6.47513760e-10 1.17567908e-02]) = 0.00014 >24 f([-2.90771361e-10 1.02369798e-02]) = 0.00010 >25 f([-1.30573263e-10 8.91375193e-03]) = 0.00008 >26 f([-5.86349941e-11 7.76164047e-03]) = 0.00006 >27 f([-2.63305247e-11 6.75849105e-03]) = 0.00005 >28 f([-1.18239380e-11 5.88502652e-03]) = 0.00003 >29 f([-5.30963626e-12 5.12447017e-03]) = 0.00003 >30 f([-2.38433568e-12 4.46221948e-03]) = 0.00002 >31 f([-1.07070548e-12 3.88556303e-03]) = 0.00002 >32 f([-4.80809073e-13 3.38343471e-03]) = 0.00001 >33 f([-2.15911255e-13 2.94620023e-03]) = 0.00001 >34 f([-9.69567190e-14 2.56547145e-03]) = 0.00001 >35 f([-4.35392094e-14 2.23394494e-03]) = 0.00000 >36 f([-1.95516389e-14 1.94526160e-03]) = 0.00000 >37 f([-8.77982370e-15 1.69388439e-03]) = 0.00000 >38 f([-3.94265180e-15 1.47499203e-03]) = 0.00000 >39 f([-1.77048011e-15 1.28438640e-03]) = 0.00000 >40 f([-7.95048604e-16 1.11841198e-03]) = 0.00000 >41 f([-3.57023093e-16 9.73885702e-04]) = 0.00000 >42 f([-1.60324146e-16 8.48035867e-04]) = 0.00000 >43 f([-7.19948720e-17 7.38448972e-04]) = 0.00000 >44 f([-3.23298874e-17 6.43023418e-04]) = 0.00000 >45 f([-1.45180009e-17 5.59929193e-04]) = 0.00000 >46 f([-6.51942732e-18 4.87572776e-04]) = 0.00000 >47 f([-2.92760228e-18 4.24566574e-04]) = 0.00000 >48 f([-1.31466380e-18 3.69702307e-04]) = 0.00000 >49 f([-5.90360555e-19 3.21927835e-04]) = 0.00000 Done! f([-5.90360555e-19 3.21927835e-04]) = 0.000000

We can plot the progress of the search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *adagrad()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with adagrad def adagrad(objective, derivative, bounds, n_iter, step_size): # track all solutions solutions = list() # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # list of the sum square gradients for each variable sq_grad_sums = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate gradient gradient = derivative(solution[0], solution[1]) # update the sum of the squared partial derivatives for i in range(gradient.shape[0]): sq_grad_sums[i] += gradient[i]**2.0 # build solution new_solution = list() for i in range(solution.shape[0]): # calculate the learning rate for this variable alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i])) # calculate the new position in this variable value = solution[i] - alpha * gradient[i] new_solution.append(value) # store the new solution solution = asarray(new_solution) solutions.append(solution) # evaluate candidate point solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval)) return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # define the step size step_size = 0.1 # perform the gradient descent search solutions = adagrad(objective, derivative, bounds, n_iter, step_size)

We can then create a contour plot of the objective function, as before.

... # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

... # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the AdaGrad optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the adagrad search on a contour plot of the test function from math import sqrt from numpy import asarray from numpy import arange from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with adagrad def adagrad(objective, derivative, bounds, n_iter, step_size): # track all solutions solutions = list() # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # list of the sum square gradients for each variable sq_grad_sums = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate gradient gradient = derivative(solution[0], solution[1]) # update the sum of the squared partial derivatives for i in range(gradient.shape[0]): sq_grad_sums[i] += gradient[i]**2.0 # build solution new_solution = list() for i in range(solution.shape[0]): # calculate the learning rate for this variable alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i])) # calculate the new position in this variable value = solution[i] - alpha * gradient[i] new_solution.append(value) # store the new solution solution = asarray(new_solution) solutions.append(solution) # evaluate candidate point solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval)) return solutions # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # define the step size step_size = 0.1 # perform the gradient descent search solutions = adagrad(objective, derivative, bounds, n_iter, step_size) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w') # show the plot pyplot.show()

Running the example performs the search as before, except in this case, a contour plot of the objective function is created and a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.

In this tutorial, you discovered how to develop the gradient descent with adaptive gradients optimization algorithm from scratch.

Specifically, you learned:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable in the objective function, called adaptive gradients or AdaGrad.
- How to implement the AdaGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent With AdaGrad From Scratch appeared first on Machine Learning Mastery.

]]>The post Gradient Descent Optimization With AMSGrad From Scratch appeared first on Machine Learning Mastery.

]]>A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like the Adaptive Movement Estimation (Adam) algorithm use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

**AMSGrad** is an extension to the Adam version of gradient descent that attempts to improve the convergence properties of the algorithm, avoiding large abrupt changes in the learning rate for each input variable.

In this tutorial, you will discover how to develop gradient descent optimization with AMSGrad from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- AMSGrad is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AMSGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gradient Descent
- AMSGrad Optimization Algorithm
- Gradient Descent With AMSGrad
- Two-Dimensional Test Problem
- Gradient Descent Optimization With AMSGrad
- Visualization of AMSGrad Optimization

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t) = x(t-1) – step_size * f'(x(t))

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**: Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the AMSGrad algorithm.

AMSGrad algorithm is an extension to the Adaptive Movement Estimation (Adam) optimization algorithm. More broadly, is an extension to the Gradient Descent Optimization algorithm.

The algorithm was described in the 2018 paper by J. Sashank, et al. titled “On the Convergence of Adam and Beyond.”

Generally, Adam automatically adapts a separate step size (learning rate) for each parameter in the optimization problem.

A limitation of Adam is that it can both decrease the step size when getting close to the optima, which is good, but it also increases the step size in some cases, which is bad.

AdaGrad addresses this specifically.

… ADAM aggressively increases the learning rate, however, […] this can be detrimental to the overall performance of the algorithm. […] In contrast, AMSGRAD neither increases nor decreases the learning rate and furthermore, decreases vt which can potentially lead to non-decreasing learning rate even if gradient is large in the future iterations.

— On the Convergence of Adam and Beyond, 2018.

AdaGrad is an extension to Adam that maintains a maximum of the second moment vector and uses it to bias-correct the gradient used to update the parameter, instead of the moment vector itself. This helps to stop the optimization slowing down too quickly (e.g. premature convergence).

The key difference of AMSGRAD with ADAM is that it maintains the maximum of all vt until the present time step and uses this maximum value for normalizing the running average of the gradient instead of vt in ADAM.

— On the Convergence of Adam and Beyond, 2018.

Let’s step through each element of the algorithm.

First, we must maintain a first and second moment vector as well as a max second moment vector for each parameter being optimized as part of the search, referred to as *m* and *v* (Geek letter nu but we will use v), and *vhat* respectively.

They are initialized to 0.0 at the start of the search.

- m = 0
- v = 0
- vhat = 0

The algorithm is executed iteratively over time t starting at *t=1*, and each iteration involves calculating a new set of parameter values *x*, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradients (partial derivatives) are calculated for the current time step.

- g(t) = f'(x(t-1))

Next, the first moment vector is updated using the gradient and a hyperparameter *beta1*.

- m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)

The *beta1* hyperparameter can be held constant or can be decayed exponentially over the course of the search, such as:

- beta1(t) = beta1^(t)

Or, alternately:

- beta1(t) = beta1 / t

The second moment vector is updated using the square of the gradient and a hyperparameter *beta2*.

- v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

Next, the maximum for the second moment vector is updated.

- vhat(t) = max(vhat(t-1), v(t))

Where *max()* calculates the maximum of the provided values.

The parameter value can then be updated using the calculated terms and the step size hyperparameter *alpha*.

- x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t)))

Where *sqrt()* is the square root function.

The step size may also be held constant or decayed exponentially.

To review, there are three hyperparameters for the algorithm; they are:

**alpha**: Initial step size (learning rate), a typical value is 0.002.**beta1**: Decay factor for first momentum, a typical value is 0.9.**beta2**: Decay factor for infinity norm, a typical value is 0.999.

And that’s it.

For full derivation of the AMSGrad algorithm in the context of the Adam algorithm, I recommend reading the paper.

Next, let’s look at how we might implement the algorithm from scratch in Python.

In this section, we will explore how to implement the gradient descent optimization algorithm with AMSGrad Momentum.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this.

# objective function def objective(x, y): return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function from numpy import asarray from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # show the plot pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the AMSGrad optimization algorithm.

We can apply gradient descent with AMSGrad to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of *x^2* is *x * 2* in each dimension.

- f(x) = x^2
- f'(x) = x * 2

The *derivative()* function implements this below.

# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent optimization with AMSGrad.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

... # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the moment vectors.

... # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm defined by the “*n_iter*” hyperparameter.

... # run iterations of gradient descent for t in range(n_iter): ...

The first step is to calculate the derivative for the current set of parameters.

... # calculate gradient g(t) g = derivative(x[0], x[1])

Next, we need to perform the AMSGrad update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

... # build a solution one variable at a time for i in range(x.shape[0]): ...

First, we need to calculate the first moment vector.

... # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]

Next, we need to calculate the second moment vector.

... # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2

Then the maximum of the second moment vector with the previous iteration and the current value.

... # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i])

Finally, we can calculate the new value for the variable.

... # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / sqrt(vhat[i])

We may want to add a small value to the denominator to avoid a divide by zero error; for example:

... # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)

This is then repeated for each parameter that is being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

... # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function named *amsgrad()* that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

# gradient descent algorithm with amsgrad def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # update variables one at a time for i in range(x.shape[0]): # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return [x, score]

We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 100 iterations with an initial learning rate of 0.007, beta of 0.9, and a beta2 of 0.99, found after a little trial and error.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 100 # steps size alpha = 0.007 # factor for average gradient beta1 = 0.9 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with amsgrad best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

At the end of the run, we will report the best solution found.

... # summarize the result print('Done!') print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of AMSGrad gradient descent applied to our test problem is listed below.

# gradient descent optimization with amsgrad for a two-dimensional test function from math import sqrt from numpy import asarray from numpy.random import rand from numpy.random import seed # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with amsgrad def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # update variables one at a time for i in range(x.shape[0]): # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return [x, score] # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 100 # steps size alpha = 0.007 # factor for average gradient beta1 = 0.9 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with amsgrad best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2) print('Done!') print('f(%s) = %f' % (best, score))

Running the example applies the optimization algorithm with AMSGrad to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 90 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

... >90 f([-5.74880707e-11 2.16227707e-03]) = 0.00000 >91 f([-4.53359947e-11 2.03974264e-03]) = 0.00000 >92 f([-3.57526928e-11 1.92415218e-03]) = 0.00000 >93 f([-2.81951584e-11 1.81511216e-03]) = 0.00000 >94 f([-2.22351711e-11 1.71225138e-03]) = 0.00000 >95 f([-1.75350316e-11 1.61521966e-03]) = 0.00000 >96 f([-1.38284262e-11 1.52368665e-03]) = 0.00000 >97 f([-1.09053366e-11 1.43734076e-03]) = 0.00000 >98 f([-8.60013947e-12 1.35588802e-03]) = 0.00000 >99 f([-6.78222208e-12 1.27905115e-03]) = 0.00000 Done! f([-6.78222208e-12 1.27905115e-03]) = 0.000002

We can plot the progress of the AMSGrad search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *amsgrad()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with amsgrad def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # update variables one at a time for i in range(x.shape[0]): # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) # evaluate candidate point score = objective(x[0], x[1]) # keep track of all solutions solutions.append(x.copy()) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 100 # steps size alpha = 0.007 # factor for average gradient beta1 = 0.9 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with amsgrad solutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

... # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

... # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the AMSGrad optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the amsgrad search on a contour plot of the test function from math import sqrt from numpy import asarray from numpy import arange from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with amsgrad def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # update variables one at a time for i in range(x.shape[0]): # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) # evaluate candidate point score = objective(x[0], x[1]) # keep track of all solutions solutions.append(x.copy()) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return solutions # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 100 # steps size alpha = 0.007 # factor for average gradient beta1 = 0.9 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with amsgrad solutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w') # show the plot pyplot.show()

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- On the Convergence of Adam and Beyond, 2018.
- An Overview Of Gradient Descent Optimization Algorithms, 2016.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.
- Experiments with AMSGrad, 2017.

In this tutorial, you discovered how to develop gradient descent optimization with AMSGrad from scratch.

Specifically, you learned:

- AMSGrad is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AMSGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent Optimization With AMSGrad From Scratch appeared first on Machine Learning Mastery.

]]>The post Gradient Descent Optimization With AdaMax From Scratch appeared first on Machine Learning Mastery.

]]>A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent, like the Adaptive Movement Estimation (Adam) algorithm, use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

**AdaMax** is an extension to the Adam version of gradient descent that generalizes the approach to the infinite norm (max) and may result in a more effective optimization on some problems.

In this tutorial, you will discover how to develop gradient descent optimization with AdaMax from scratch.

After completing this tutorial, you will know:

- AdaMax is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AdaMax optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gradient Descent
- AdaMax Optimization Algorithm
- Gradient Descent With AdaMax
- Two-Dimensional Test Problem
- Gradient Descent Optimization With AdaMax
- Visualization of AdaMax Optimization

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t) = x(t-1) – step_size * f'(x(t))

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**: Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the AdaMax algorithm.

AdaMax algorithm is an extension to the Adaptive Movement Estimation (Adam) Optimization algorithm. More broadly, is an extension to the Gradient Descent Optimization algorithm.

The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A Method for Stochastic Optimization.”

Adam can be understood as updating weights inversely proportional to the scaled L2 norm (squared) of past gradients. AdaMax extends this to the so-called infinite norm (max) of past gradients.

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L^2 norm of their individual current and past gradients

— Adam: A Method for Stochastic Optimization, 2014.

Generally, AdaMax automatically adapts a separate step size (learning rate) for each parameter in the optimization problem.

Let’s step through each element of the algorithm.

First, we must maintain a moment vector and exponentially weighted infinity norm for each parameter being optimized as part of the search, referred to as *m* and *u* respectively.

They are initialized to 0.0 at the start of the search.

- m = 0
- u = 0

The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradient (partial derivatives) are calculated for the current time step.

- g(t) = f'(x(t-1))

Next, the moment vector is updated using the gradient and a hyperparameter *beta1*.

- m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

The exponentially weighted infinity norm is updated using the *beta2* hyperparameter.

- u(t) = max(beta2 * u(t-1), abs(g(t)))

Where *max()* selects the maximum of the parameters and *abs()* calculates the absolute value.

We can then update the parameter value. This can be broken down into three pieces; the first calculates the step size parameter, the second the gradient, and the third uses the step size and gradient to calculate the new parameter value.

Let’s start with calculating the step size for the parameter using an initial step size hyperparameter called *alpha* and a version of *beta1* that is decaying over time with a specific value for this time step *beta1(t)*:

- step_size(t) = alpha / (1 – beta1(t))

The gradient used for updating the parameter is calculated as follows:

- delta(t) = m(t) / u(t)

Finally, we can calculate the value for the parameter for this iteration.

- x(t) = x(t-1) – step_size(t) * delta(t)

Or the complete update equation can be stated as:

- x(t) = x(t-1) – (alpha / (1 – beta1(t))) * m(t) / u(t)

To review, there are three hyperparameters for the algorithm; they are:

**alpha**: Initial step size (learning rate), a typical value is 0.002.**beta1**: Decay factor for first momentum, a typical value is 0.9.**beta2**: Decay factor for infinity norm, a typical value is 0.999.

The decay schedule for beta1(t) suggested in the paper is calculated using the initial beta1 value raised to the power t, although other decay schedules could be used such as holding the value constant or decaying more aggressively.

- beta1(t) = beta1^t

And that’s it.

For the full derivation of the AdaMax algorithm in the context of the Adam algorithm, I recommend reading the paper:

Next, let’s look at how we might implement the algorithm from scratch in Python.

In this section, we will explore how to implement the gradient descent optimization algorithm with AdaMax Momentum.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this.

# objective function def objective(x, y): return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function from numpy import asarray from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # show the plot pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the AdaMax optimization algorithm.

We can apply the gradient descent with AdaMax to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of x^2 is x * 2 in each dimension.

- f(x) = x^2
- f'(x) = x * 2

The derivative() function implements this below.

# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent optimization with AdaMax.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

... # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the moment vector and exponentially weighted infinity norm.

... # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm defined by the “*n_iter*” hyperparameter.

... # run iterations of gradient descent for t in range(n_iter): ...

The first step is to calculate the derivative for the current set of parameters.

... # calculate gradient g(t) g = derivative(x[0], x[1])

Next, we need to perform the AdaMax update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

... # build a solution one variable at a time for i in range(x.shape[0]): ...

First, we need to calculate the moment vector.

... # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t) m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]

Next, we need to calculate the exponentially weighted infinity norm.

... # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i]))

Then the step size used in the update.

... # step_size(t) = alpha / (1 - beta1(t)) step_size = alpha / (1.0 - beta1**(t+1))

And the change in variable.

... # delta(t) = m(t) / u(t) delta = m[i] / u[i]

Finally, we can calculate the new value for the variable.

... # x(t) = x(t-1) - step_size(t) * delta(t) x[i] = x[i] - step_size * delta

This is then repeated for each parameter that is being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

... # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function named *adamax()* that takes the names of the objective and derivative functions as well as the algorithm hyperparameters and returns the best solution found at the end of the search and its evaluation.

# gradient descent algorithm with adamax def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] # run iterations of gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(x.shape[0]): # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t) m[i] = beta1 * m[i] + (1.0 - beta1) * g[i] # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) # step_size(t) = alpha / (1 - beta1(t)) step_size = alpha / (1.0 - beta1**(t+1)) # delta(t) = m(t) / u(t) delta = m[i] / u[i] # x(t) = x(t-1) - step_size(t) * delta(t) x[i] = x[i] - step_size * delta # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return [x, score]

We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 60 iterations with an initial learning rate of 0.02, beta of 0.8, and a beta2 of 0.99, found after a little trial and error.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 60 # steps size alpha = 0.02 # factor for average gradient beta1 = 0.8 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with adamax best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

At the end of the run, we will report the best solution found.

... # summarize the result print('Done!') print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of AdaMax gradient descent applied to our test problem is listed below.

# gradient descent optimization with adamax for a two-dimensional test function from numpy import asarray from numpy.random import rand from numpy.random import seed # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with adamax def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] # run iterations of gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(x.shape[0]): # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t) m[i] = beta1 * m[i] + (1.0 - beta1) * g[i] # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) # step_size(t) = alpha / (1 - beta1(t)) step_size = alpha / (1.0 - beta1**(t+1)) # delta(t) = m(t) / u(t) delta = m[i] / u[i] # x(t) = x(t-1) - step_size(t) * delta(t) x[i] = x[i] - step_size * delta # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return [x, score] # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 60 # steps size alpha = 0.02 # factor for average gradient beta1 = 0.8 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with adamax best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2) # summarize the result print('Done!') print('f(%s) = %f' % (best, score))

Running the example applies the optimization algorithm with AdaMax to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

... >33 f([-0.00122185 0.00427944]) = 0.00002 >34 f([-0.00045147 0.00289913]) = 0.00001 >35 f([0.00022176 0.00165754]) = 0.00000 >36 f([0.00073314 0.00058534]) = 0.00000 >37 f([ 0.00105092 -0.00030082]) = 0.00000 >38 f([ 0.00117382 -0.00099624]) = 0.00000 >39 f([ 0.00112512 -0.00150609]) = 0.00000 >40 f([ 0.00094497 -0.00184321]) = 0.00000 >41 f([ 0.00068206 -0.002026 ]) = 0.00000 >42 f([ 0.00038579 -0.00207647]) = 0.00000 >43 f([ 9.99977780e-05 -2.01849176e-03]) = 0.00000 >44 f([-0.00014145 -0.00187632]) = 0.00000 >45 f([-0.00031698 -0.00167338]) = 0.00000 >46 f([-0.00041753 -0.00143134]) = 0.00000 >47 f([-0.00044531 -0.00116942]) = 0.00000 >48 f([-0.00041125 -0.00090399]) = 0.00000 >49 f([-0.00033193 -0.00064834]) = 0.00000 Done! f([-0.00033193 -0.00064834]) = 0.000001

We can plot the progress of the AdaMax search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the adamax() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with adamax def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] # run iterations of gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(x.shape[0]): # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t) m[i] = beta1 * m[i] + (1.0 - beta1) * g[i] # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) # step_size(t) = alpha / (1 - beta1(t)) step_size = alpha / (1.0 - beta1**(t+1)) # delta(t) = m(t) / u(t) delta = m[i] / u[i] # x(t) = x(t-1) - step_size(t) * delta(t) x[i] = x[i] - step_size * delta # evaluate candidate point score = objective(x[0], x[1]) solutions.append(x.copy()) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 60 # steps size alpha = 0.02 # factor for average gradient beta1 = 0.8 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with adamax solutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

... # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

... # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the AdaMax optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the adamax search on a contour plot of the test function from numpy import asarray from numpy import arange from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with adamax def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] # run iterations of gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(x.shape[0]): # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t) m[i] = beta1 * m[i] + (1.0 - beta1) * g[i] # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) # step_size(t) = alpha / (1 - beta1(t)) step_size = alpha / (1.0 - beta1**(t+1)) # delta(t) = m(t) / u(t) delta = m[i] / u[i] # x(t) = x(t-1) - step_size(t) * delta(t) x[i] = x[i] - step_size * delta # evaluate candidate point score = objective(x[0], x[1]) solutions.append(x.copy()) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return solutions # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 60 # steps size alpha = 0.02 # factor for average gradient beta1 = 0.8 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with adamax solutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w') # show the plot pyplot.show()

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- Adam: A Method for Stochastic Optimization, 2014.
- An Overview Of Gradient Descent Optimization Algorithms, 2016.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.

In this tutorial, you discovered how to develop the gradient descent optimization with AdaMax from scratch.

Specifically, you learned:

- AdaMax is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AdaMax optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent Optimization With AdaMax From Scratch appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Premature Convergence appeared first on Machine Learning Mastery.

]]>It can also be a useful empirical tool when exploring the learning dynamics of an optimization algorithm, and machine learning algorithms trained using an optimization algorithm, such as deep learning neural networks. This motivates the investigation of learning curves and techniques, such as early stopping.

If optimization is a process that generates candidate solutions, then convergence represents a stable point at the end of the process when no further changes or improvements are expected. **Premature convergence** refers to a failure mode for an optimization algorithm where the process stops at a stable point that does not represent a globally optimal solution.

In this tutorial, you will discover a gentle introduction to premature convergence in machine learning.

After completing this tutorial, you will know:

- Convergence refers to the stable point found at the end of a sequence of solutions via an iterative optimization algorithm.
- Premature convergence refers to a stable point found too soon, perhaps close to the starting point of the search, and with a worse evaluation than expected.
- Greediness of an optimization algorithm provides a control over the rate of convergence of an algorithm.

Let’s get started.

This tutorial is divided into three parts; they are:

- Convergence in Machine Learning
- Premature Convergence
- Addressing Premature Convergence

Convergence generally refers to the values of a process that have a tendency in behavior over time.

It is a useful idea when working with optimization algorithms.

Optimization refers to a type of problem that requires finding a set of inputs that result in the maximum or minimum value from an objective function. Optimization is an iterative process that produces a sequence of candidate solutions until ultimately arriving upon a final solution at the end of the process.

This behavior or dynamics of the optimization algorithm arriving on a stable-point final solution is referred to as convergence, e.g. the convergence of the optimization algorithms. In this way, convergence defines the termination of the optimization algorithm.

Local descent involves iteratively choosing a descent direction and then taking a step in that direction and repeating that process until convergence or some termination condition is met.

— Page 13, Algorithms for Optimization, 2019.

**Convergence**: Stop condition for an optimization algorithm where a stable point is located and further iterations of the algorithm are unlikely to result in further improvement.

We might measure and explore the convergence of an optimization algorithm empirically, such as using learning curves. Additionally, we might also explore the convergence of an optimization algorithm analytically, such as a convergence proof or average case computational complexity.

Strong selection pressure results in rapid, but possibly premature, convergence. Weakening the selection pressure slows down the search process …

— Page 78, Evolutionary Computation: A Unified Approach, 2002.

Optimization, and the convergence of optimization algorithms, is an important concept in machine learning for those algorithms that fit (learn) on a training dataset via an iterative optimization algorithm, such as logistic regression and artificial neural networks.

As such, we may choose optimization algorithms that result in better convergence behavior than other algorithms, or spend a lot of time tuning the convergence dynamics (learning dynamics) of an optimization algorithm via the hyperparameters of the optimization (e.g. learning rate).

Convergence behavior can be compared, often in terms of the number of iterations of an algorithm required until convergence, to the objective function evaluation of the stable point found at convergence, and combinations of these concerns.

Premature convergence refers to the convergence of a process that has occurred too soon.

In optimization, it refers to the algorithm converging upon a stable point that has worse performance than expected.

Premature convergence typically afflicts complex optimization tasks where the objective function is non-convex, meaning that the response surface contains many different good solutions (stable points), perhaps with one (or a few) best solutions.

If we consider the response surface of an objective function under optimization as a geometrical landscape and we are seeking a minimum of the function, then premature optimization refers to finding a valley close to the starting point of the search that has less depth than the deepest valley in the problem domain.

For problems that exhibit highly multi-modal (rugged) fitness landscapes or landscapes that change over time, too much exploitation generally results in premature convergence to suboptimal peaks in the space.

— Page 60, Evolutionary Computation: A Unified Approach, 2002.

In this way, premature convergence is described as finding a locally optimal solution instead of the globally optimal solution for an optimization algorithm. It is a specific failure case for an optimization algorithm.

**Premature Convergence**: Convergence of an optimization algorithm to a worse than optimal stable point that is likely close to the starting point.

Put another way, convergence signifies the end of the search process, e.g. a stable point was located and further iterations of the algorithm are not likely to improve upon the solution. Premature convergence refers to reaching this stop condition of an optimization algorithm at a less than desirable stationary point.

Premature convergence may be a relevant concern on any reasonably challenging optimization task.

For example, a majority of research into the field of evolutionary computation and genetic algorithms involves identifying and overcoming the premature convergence of the algorithm on an optimization task.

If selection focuses on the most-fit individuals, the selection pressure may cause premature convergence due to reduced diversity of the new populations.

— Page 139, Computational Intelligence: An Introduction, 2nd edition, 2007.

Population-based optimization algorithms, like evolutionary algorithms and swarm intelligence, often describe their dynamics in terms of the interplay between selective pressures and convergence. For example, strong selective pressures result in faster convergence and likely premature convergence. Weaker selective pressures may result in a slower convergence (greater computational cost) although perhaps locate a better or even global optima.

An operator with a high selective pressure decreases diversity in the population more rapidly than operators with a low selective pressure, which may lead to premature convergence to suboptimal solutions. A high selective pressure limits the exploration abilities of the population.

— Page 135, Computational Intelligence: An Introduction, 2nd edition, 2007.

This idea of selective pressure is helpful more generally in understanding the learning dynamics of optimization algorithms. For example, an optimization that is configured to be too greedy (e.g. via hyperparameters such as the step size or learning rate) may fail due to premature convergence, whereas the same algorithm that is configured to be less greedy may overcome premature convergence and discover a better or globally optimal solution.

Premature convergence may be encountered when using stochastic gradient descent to train a neural network model, signified by a learning curve that drops exponentially quickly then stops improving.

The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set.

— Page 153, Deep Learning, 2016.

The fact that fitting neural networks are subject to premature convergence motivates the use of methods such as learning curves to monitor and diagnose issues with the convergence of a model on a training dataset, and the use of regularization, such as early stopping, that halts the optimization algorithm prior to finding a stable point comes at the expense of worse performance on a holdout dataset.

As such, much research into deep learning neural networks is ultimately directed at overcoming premature convergence.

Empirically, it is often found that ‘tanh’ activation functions give rise to faster convergence of training algorithms than logistic functions.

— Page 127, Neural Networks for Pattern Recognition, 1995.

This includes techniques such as work on weight initialization, which is critical because the initial weights of a neural network define the starting point of the optimization process, and poor initialization can lead to premature convergence.

The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether.

— Page 301, Deep Learning, 2016.

This also includes the vast number of variations and extensions of the stochastic gradient descent optimization algorithm, such as the addition of momentum so that the algorithm does not overshoot the optima (stable point), and Adam that adds an automatically adapted step size hyperparameter (learning rate) for each parameter that is being optimized, dramatically speeding up convergence.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- An Introduction to Genetic Algorithms, 1998.
- Computational Intelligence: An Introduction, 2nd edition, 2007.
- Deep Learning, 2016.
- Evolutionary Computation: A Unified Approach, 2002.
- Neural Networks for Pattern Recognition, 1995.
- Probabilistic Machine Learning: An Introduction 2020.

- Limit of a sequence, Wikipedia.
- Convergence of random variables, Wikipedia.
- Premature convergence, Wikipedia.

In this tutorial, you discovered a gentle introduction to premature convergence in machine learning.

Specifically, you learned:

- Convergence refers to the stable point found at the end of a sequence of solutions via an iterative optimization algorithm.
- Premature convergence refers to a stable point found too soon, perhaps close to the starting point of the search, and with a worse evaluation than expected.
- Greediness of an optimization algorithm provides a control over the rate of convergence of an algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Premature Convergence appeared first on Machine Learning Mastery.

]]>The post Why Optimization Is Important in Machine Learning appeared first on Machine Learning Mastery.

]]>This problem can be described as approximating a function that maps examples of inputs to examples of outputs. Approximating a function can be solved by framing the problem as function optimization. This is where a machine learning algorithm defines a parameterized mapping function (e.g. a weighted sum of inputs) and an optimization algorithm is used to fund the values of the parameters (e.g. model coefficients) that minimize the error of the function when used to map inputs to outputs.

This means that each time we fit a machine learning algorithm on a training dataset, we are solving an optimization problem.

In this tutorial, you will discover the central role of optimization in machine learning.

After completing this tutorial, you will know:

- Machine learning algorithms perform function approximation, which is solved using function optimization.
- Function optimization is the reason why we minimize error, cost, or loss when fitting a machine learning algorithm.
- Optimization is also performed during data preparation, hyperparameter tuning, and model selection in a predictive modeling project.

Let’s get started.

This tutorial is divided into three parts; they are:

- Machine Learning and Optimization
- Learning as Optimization
- Optimization in a Machine Learning Project
- Data Preparation as Optimization
- Hyperparameter Tuning as Optimization
- Model Selection as Optimization

Function optimization is the problem of finding the set of inputs to a target objective function that result in the minimum or maximum of the function.

It can be a challenging problem as the function may have tens, hundreds, thousands, or even millions of inputs, and the structure of the function is unknown, and often non-differentiable and noisy.

**Function Optimization**: Find the set of inputs that results in the minimum or maximum of an objective function.

Machine learning can be described as function approximation. That is, approximating the unknown underlying function that maps examples of inputs to outputs in order to make predictions on new data.

It can be challenging as there is often a limited number of examples from which we can approximate the function, and the structure of the function that is being approximated is often nonlinear, noisy, and may even contain contradictions.

**Function Approximation**: Generalize from specific examples to a reusable mapping function for making predictions on new examples.

Function optimization is often simpler than function approximation.

Importantly, in machine learning, we often solve the problem of function approximation using function optimization.

At the core of nearly all machine learning algorithms is an optimization algorithm.

In addition, the process of working through a predictive modeling problem involves optimization at multiple steps in addition to learning a model, including:

- Choosing the hyperparameters of a model.
- Choosing the transforms to apply to the data prior to modeling
- Choosing the modeling pipeline to use as the final model.

Now that we know that optimization plays a central role in machine learning, let’s look at some examples of learning algorithms and how they use optimization.

Predictive modeling problems involve making a prediction from an example of input.

A numeric quantity must be predicted in the case of a regression problem, whereas a class label must be predicted in the case of a classification problem.

The problem of predictive modeling is sufficiently challenging that we cannot write code to make predictions. Instead, we must use a learning algorithm applied to historical data to learn a “*program*” called a predictive model that we can use to make predictions on new data.

In statistical learning, a statistical perspective on machine learning, the problem is framed as the learning of a mapping function (f) given examples of input data (*X*) and associated output data (*y*).

- y = f(X)

Given new examples of input (*Xhat*), we must map each example onto the expected output value (*yhat*) using our learned function (*fhat*).

- yhat = fhat(Xhat)

The learned mapping will be imperfect. No model is perfect, and some prediction error is expected given the difficulty of the problem, noise in the observed data, and the choice of learning algorithm.

Mathematically, learning algorithms solve the problem of approximating the mapping function by solving a function optimization problem.

Specifically, given examples of inputs and outputs, find the set of inputs to the mapping function that results in the minimum loss, minimum cost, or minimum prediction error.

The more biased or constrained the choice of mapping function, the easier the optimization is to solve.

Let’s look at some examples to make this clear.

A linear regression (for regression problems) is a highly constrained model and can be solved analytically using linear algebra. The inputs to the mapping function are the coefficients of the model.

We can use an optimization algorithm, like a quasi-Newton local search algorithm, but it will almost always be less efficient than the analytical solution.

**Linear Regression**: Function inputs are model coefficients, optimization problems that can be solved analytically.

A logistic regression (for classification problems) is slightly less constrained and must be solved as an optimization problem, although something about the structure of the optimization function being solved is known given the constraints imposed by the model.

This means a local search algorithm like a quasi-Newton method can be used. We could use a global search like stochastic gradient descent, but it will almost always be less efficient.

**Logistic Regression**: Function inputs are model coefficients, optimization problems that require an iterative local search algorithm.

A neural network model is a very flexible learning algorithm that imposes few constraints. The inputs to the mapping function are the network weights. A local search algorithm cannot be used given the search space is multimodal and highly nonlinear; instead, a global search algorithm must be used.

A global optimization algorithm is commonly used, specifically stochastic gradient descent, and the updates are made in a way that is aware of the structure of the model (backpropagation and the chain rule). We could use a global search algorithm that is oblivious of the structure of the model, like a genetic algorithm, but it will almost always be less efficient.

**Neural Network**: Function inputs are model weights, optimization problems that require an iterative global search algorithm.

We can see that each algorithm makes different assumptions about the form of the mapping function, which influences the type of optimization problem to be solved.

We can also see that the default optimization algorithm used for each machine learning algorithm is not arbitrary; it represents the most efficient algorithm for solving the specific optimization problem framed by the algorithm, e.g. stochastic gradient descent for neural nets instead of a genetic algorithm. Deviating from these defaults requires a good reason.

Not all machine learning algorithms solve an optimization problem. A notable example is the k-nearest neighbors algorithm that stores the training dataset and does a lookup for the k best matches to each new example in order to make a prediction.

Now that we are familiar with learning in machine learning algorithms as optimization, let’s look at some related examples of optimization in a machine learning project.

Optimization plays an important part in a machine learning project in addition to fitting the learning algorithm on the training dataset.

The step of preparing the data prior to fitting the model and the step of tuning a chosen model also can be framed as an optimization problem. In fact, an entire predictive modeling project can be thought of as one large optimization problem.

Let’s take a closer look at each of these cases in turn.

Data preparation involves transforming raw data into a form that is most appropriate for the learning algorithms.

This might involve scaling values, handling missing values, and changing the probability distribution of variables.

Transforms can be made to change representation of the historical data to meet the expectations or requirements of specific learning algorithms. Yet, sometimes good or best results can be achieved when the expectations are violated or when an unrelated transform to the data is performed.

We can think of choosing transforms to apply to the training data as a search or optimization problem of best exposing the unknown underlying structure of the data to the learning algorithm.

**Data Preparation**: Function inputs are sequences of transforms, optimization problems that require an iterative global search algorithm.

This optimization problem is often performed manually with human-based trial and error. Nevertheless, it is possible to automate this task using a global optimization algorithm where the inputs to the function are the types and order of transforms applied to the training data.

The number and permutations of data transforms are typically quite limited and it may be possible to perform an exhaustive search or a grid search of commonly used sequences.

For more on this topic, see the tutorial:

Machine learning algorithms have hyperparameters that can be configured to tailor the algorithm to a specific dataset.

Although the dynamics of many hyperparameters are known, the specific effect they will have on the performance of the resulting model on a given dataset is not known. As such, it is a standard practice to test a suite of values for key algorithm hyperparameters for a chosen machine learning algorithm.

This is called hyperparameter tuning or hyperparameter optimization.

It is common to use a naive optimization algorithm for this purpose, such as a random search algorithm or a grid search algorithm.

**Hyperparameter Tuning**: Function inputs are algorithm hyperparameters, optimization problems that require an iterative global search algorithm.

For more on this topic, see the tutorial:

Nevertheless, it is becoming increasingly common to use an iterative global search algorithm for this optimization problem. A popular choice is a Bayesian optimization algorithm that is capable of simultaneously approximating the target function that is being optimized (using a surrogate function) while optimizing it.

This is desirable as evaluating a single combination of model hyperparameters is expensive, requiring fitting the model on the entire training dataset one or many times, depending on the choice of model evaluation procedure (e.g. repeated k-fold cross-validation).

For more on this topic, see the tutorial:

Model selection involves choosing one from among many candidate machine learning models for a predictive modeling problem.

Really, it involves choosing the machine learning algorithm or machine learning pipeline that produces a model. This is then used to train a final model that can then be used in the desired application to make predictions on new data.

This process of model selection is often a manual process performed by a machine learning practitioner involving tasks such as preparing data, evaluating candidate models, tuning well-performing models, and finally choosing the final model.

This can be framed as an optimization problem that subsumes part of or the entire predictive modeling project.

**Model Selection**: Function inputs are data transform, machine learning algorithm, and algorithm hyperparameters; optimization problem that requires an iterative global search algorithm.

Increasingly, this is the case with automated machine learning (AutoML) algorithms being used to choose an algorithm, an algorithm and hyperparameters, or data preparation, algorithm and hyperparameters, with very little user intervention.

For more on AutoML see the tutorial:

Like hyperparameter tuning, it is common to use a global search algorithm that also approximates the objective function, such as Bayesian optimization, given that each function evaluation is expensive.

This automated optimization approach to machine learning also underlies modern machine learning as a service (MLaaS) products provided by companies such as Google, Microsoft, and Amazon.

Although fast and efficient, such approaches are still unable to outperform hand-crafted models prepared by highly skilled experts, such as those participating in machine learning competitions.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Applied Machine Learning as a Search Problem
- How to Grid Search Data Preparation Techniques
- Hyperparameter Optimization With Random Search and Grid Search
- How to Implement Bayesian Optimization from Scratch in Python
- Automated Machine Learning (AutoML) Libraries for Python

- Mathematical optimization, Wikipedia.
- Function approximation, Wikipedia.
- Least-squares function approximation, Wikipedia.
- Hyperparameter optimization, Wikipedia.
- Model selection, Wikipedia.

In this tutorial, you discovered the central role of optimization in machine learning.

Specifically, you learned:

- Machine learning algorithms perform function approximation, which is solved using function optimization.
- Function optimization is the reason why we minimize error, cost, or loss when fitting a machine learning algorithm.
- Optimization is also performed during data preparation, hyperparameter tuning, and model selection in a predictive modeling project.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Why Optimization Is Important in Machine Learning appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Function Optimization appeared first on Machine Learning Mastery.

]]>Importantly, function optimization is central to almost all machine learning algorithms, and predictive modeling projects. As such, it is critical to understand what function optimization is, the terminology used in the field, and the elements that constitute a function optimization problem.

In this tutorial, you will discover a gentle introduction to function optimization.

After completing this tutorial, you will know:

- The three elements of function optimization as candidate solutions, objective functions, and cost.
- The conceptualization of function optimization as navigating a search space and response surface.
- The difference between global optima and local optima when solving a function optimization problem.

Let’s get started.

This tutorial is divided into four parts; they are:

- Function Optimization
- Candidate Solutions
- Objective Functions
- Evaluation Costs

Function optimization is a subfield of mathematics, and in modern times is addressed using numerical computing methods.

**Continuous function optimization** (“*function optimization*” here for short) belongs to a broader field of study called mathematical optimization.

It is distinct from other types of optimization as it involves finding optimal candidate solutions composed of numeric input variables, as opposed to candidate solutions composed of sequences or combinations (e.g. combinatorial optimization).

Function optimization is a widely used tool bag of techniques employed in practically all scientific and engineering disciplines.

People optimize. Investors seek to create portfolios that avoid excessive risk while achieving a high rate of return. […] Optimization is an important tool in decision science and in the analysis of physical systems.

— Page 2, Numerical Optimization, 2006.

It plays a central role in machine learning, as almost all machine learning algorithms use function optimization to fit a model to a training dataset.

For example, fitting a line to a collection of points requires solving an optimization problem. As does fitting a linear regression or a neural network model on a training dataset.

In this way, optimization provides a tool to adapt a general model to a specific situation. Learning is treated as an optimization or search problem.

Practically, function optimization describes a class of problems for finding the input to a given function that results in the minimum or maximum output from the function.

The objective depends on certain characteristics of the system, called variables or unknowns. Our goal is to find values of the variables that optimize the objective.

— Page 2, Numerical Optimization, 2006.

Function Optimization involves three elements: the input to the function (e.g. *x*), the objective function itself (e.g. *f()*) and the output from the function (e.g. *cost*).

**Input (x)**: The input to the function to be evaluated, e.g. a candidate solution.**Function (f())**: The objective function or target function that evaluates inputs.**Cost**: The result of evaluating a candidate solution with the objective function, minimized or maximized.

Let’s take a closer look at each element in turn.

A candidate solution is a single input to the objective function.

The form of a candidate solution depends on the specifics of the objective function. It may be a single floating point number, a vector of numbers, a matrix of numbers, or as complex as needed for the specific problem domain.

Most commonly, vectors of numbers. For a test problem, the vector represents the specific values of each input variable to the function (*x = x1, x2, x3, …, xn*). For a machine learning model, the vector may represent model coefficients or weights.

Mathematically speaking, optimization is the minimization or maximization of a function subject to constraints on its variables.

— Page 2, Numerical Optimization, 2006.

There may be constraints imposed by the problem domain or the objective function on the candidate solutions. This might include aspects such as:

- The number of variables (1, 20, 1,000,000, etc.)
- The data type of variables (integer, binary, real-valued, etc.)
- The range of accepted values (between 0 and 1, etc.)

Importantly, candidate solutions are discrete and there are many of them.

The universe of candidate solutions may be vast, too large to enumerate. Instead, the best we can do is sample candidate solutions in the search space. As a practitioner, we seek an optimization algorithm that makes the best use of the information available about the problem to effectively sample the search space and locate a good or best candidate solution.

**Search Space**: Universe of candidate solutions defined by the number, type, and range of accepted inputs to the objective function.

Finally, candidate solutions can be rank-ordered based on their evaluation by the objective function, meaning that some are better than others.

The objective function is specific to the problem domain.

It may be a test function, e.g. a well-known equation with a specific number of input variables, the calculation of which returns the cost of the input. The optima of test functions are known, allowing algorithms to be compared based on their ability to navigate the search space efficiently.

In machine learning, the objective function may involve plugging the candidate solution into a model and evaluating it against a portion of the training dataset, and the cost may be an error score, often called the loss of the model.

The objective function is easy to define, although expensive to evaluate. Efficiency in function optimization refers to minimizing the total number of function evaluations.

Although the objective function is easy to define, it may be challenging to optimize. The difficulty of an objective function may range from being able to analytically solve the function directly using calculus or linear algebra (easy), to using a local search algorithm (moderate), to using a global search algorithm (hard).

The difficulty of an objective function is based on how much is known about the function. This often cannot be determined by simply reviewing the equation or code for evaluating candidate solutions. Instead, it refers to the structure of the response surface.

The response surface (or search landscape) is the geometrical structure of the cost in relation to the search space of candidate solutions. For example, a smooth response surface suggests that small changes to the input (candidate solutions) result in small changes to the output (cost) from the objective function.

**Response Surface**: Geometrical properties of the cost from the objective function in response to changes to the candidate solutions.

The response surface can be visualized in low dimensions, e.g. for candidate solutions with one or two input variables. A one-dimensional input can be plotted as a 2D scatter plot with input values on the x-axis and the cost on the y-axis. A two-dimensional input can be plotted as a 3D surface plot with input variables on the x and y-axis, and the height of the surface representing the cost.

In a minimization problem, poor solutions would be represented as hills in the response surface and good solutions would be represented by valleys. This would be inverted for maximizing problems.

The structure and shape of this response surface determine the difficulty an algorithm will have in navigating the search space to a solution.

The complexity of real objective functions means we cannot analyze the surface analytically, and the high dimensionality of the inputs and computational cost of function evaluations makes mapping and plotting real objective functions intractable.

The cost of a candidate solution is almost always a single real-valued number.

The scale of the cost values will vary depending on the specifics of the objective function. In general, the only meaningful comparison of cost values is to other cost values calculated by the same objective function.

The minimum or maximum output from the function is called the optima of the function, typically simplified to simply the minimum. Any function we wish to maximize can be converted to minimizing by adding a negative sign to the front of the cost returned from the function.

In global optimization, the true global solution of the optimization problem is found; the compromise is efficiency. The worst-case complexity of global optimization methods grows exponentially with the problem sizes …

— Page 10, Convex Optimization, 2004.

An objective function may have a single best solution, referred to as the global optimum of the objective function. Alternatively, the objective function may have many global optima, in which case we may be interested in locating one or all of them.

Many numerical optimization methods seek local minima. Local minima are locally optimal, but we do not generally know whether a local minimum is a global minimum.

— Page 8, Algorithms for Optimization, 2019.

In addition to a global optima, a function may have local optima, which are good candidate solutions that may be relatively easy to locate, but not as good as the global optima. Local optima may appear to be global optima to a search algorithm, e.g. may be in a valley of the response surface, in which case we might refer to them as deceptive as the algorithm will easily locate them and get stuck, failing to locate the global optima.

**Global Optima**: The candidate solution with the best cost from the objective function.**Local Optima**. Candidate solutions are good but not as good as the global optima.

The relative nature of cost values means that a baseline in performance on challenging problems can be established using a naive search algorithm (e.g. random) and “goodness” of optimal solutions found by more sophisticated search algorithms can be compared relative to the baseline.

Candidate solutions are often very simple to describe and very easy to construct. The challenging part of function optimization is evaluating candidate solutions.

Solving a function optimization problem or objective function refers to finding the optima. The whole goal of the project is to locate a specific candidate solution with a good or best cost, give the time and resources available. In simple and moderate problems, we may be able to locate the optimal candidate solution exactly and have some confidence that we have done so.

Many algorithms for nonlinear optimization problems seek only a local solution, a point at which the objective function is smaller than at all other feasible nearby points. They do not always find the global solution, which is the point with lowest function value among all feasible points. Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate.

— Page 6, Numerical Optimization, 2006.

On more challenging problems, we may be happy with a relatively good candidate solution (e.g. good enough) given the time available for the project.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Convex Optimization, 2004.
- Numerical Optimization, 2006.

In this tutorial, you discovered a gentle introduction to function optimization.

Specifically, you learned:

- The three elements of function optimization as candidate solutions, objective functions and cost.
- The conceptualization of function optimization as navigating a search space and response surface.
- The difference between global optima and local optima when solving a function optimization problem.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Function Optimization appeared first on Machine Learning Mastery.

]]>The post One-Dimensional (1D) Test Functions for Function Optimization appeared first on Machine Learning Mastery.

]]>There are a large number of optimization algorithms and it is important to study and develop intuitions for optimization algorithms on simple and easy-to-visualize test functions.

**One-dimensional functions** take a single input value and output a single evaluation of the input. They may be the simplest type of test function to use when studying function optimization.

The benefit of one-dimensional functions is that they can be visualized as a two-dimensional plot with inputs to the function on the x-axis and outputs of the function on the y-axis. The known optima of the function and any sampling of the function can also be drawn on the same plot.

In this tutorial, you will discover standard one-dimensional functions you can use when studying function optimization.

Let’s get started.

There are many different types of simple one-dimensional test functions we could use.

Nevertheless, there are standard test functions that are commonly used in the field of function optimization. There are also specific properties of test functions that we may wish to select when testing different algorithms.

We will explore a small number of simple one-dimensional test functions in this tutorial and organize them by their properties with five different groups; they are:

- Convex Unimodal Functions
- Non-Convex Unimodal Functions
- Multimodal Functions
- Discontinuous Functions (Non-Smooth)
- Noisy Functions

Each function will be presented using Python code with a function implementation of the target objective function and a sampling of the function that is shown as a line plot with the optima of the function clearly marked.

All functions are presented as a minimization problem, e.g. find the input that results in the minimum (smallest value) output of the function. Any maximizing function can be made a minimization function by adding a negative sign to all output. Similarly, any minimizing function can be made maximizing in the same way.

I did not invent these functions; they are taken from the literature. See the further reading section for references.

You can then choose and copy-paste the code of one or more functions to use in your own project to study or compare the behavior of optimization algorithms.

A convex function is a function where a line can be drawn between any two points in the domain and the line remains in the domain.

For a one-dimensional function shown as a two-dimensional plot, this means the function has a bowl shape and the line between two remains above the bowl.

Unimodal means that the function has a single optima. A convex function may or may not be unimodal; similarly, a unimodal function may or may not be convex.

The range for the function below is bounded to -5.0 and 5.0 and the optimal input value is 0.0.

# convex unimodal optimization function from numpy import arange from matplotlib import pyplot # objective function def objective(x): return x**2.0 # define range for input r_min, r_max = -5.0, 5.0 # sample input range uniformly at 0.1 increments inputs = arange(r_min, r_max, 0.1) # compute targets results = objective(inputs) # create a line plot of input vs result pyplot.plot(inputs, results) # define optimal input value x_optima = 0.0 # draw a vertical line at the optimal input pyplot.axvline(x=x_optima, ls='--', color='red') # show the plot pyplot.show()

Running the example creates a line plot of the function and marks the optima with a red line.

This function can be shifted forward or backward on the number line by adding or subtracting a constant value, e.g. 5 + x^2.

This can be useful if there is a desire to move the optimal input away from a value of 0.0.

A function is non-convex if a line cannot be drawn between two points in the domain and the line remains in the domain.

This means it is possible to find two points in the domain where a line between them crosses a line plot of the function.

Typically, if a plot of a one-dimensional function has more than one hill or valley, then we know immediately that the function is non-convex. Nevertheless, a non-convex function may or may not be unimodal.

Most real functions that we’re interested in optimizing are non-convex.

The range for the function below is bounded to -10.0 and 10.0 and the optimal input value is 0.67956.

# non-convex unimodal optimization function from numpy import arange from numpy import sin from numpy import exp from matplotlib import pyplot # objective function def objective(x): return -(x + sin(x)) * exp(-x**2.0) # define range for input r_min, r_max = -10.0, 10.0 # sample input range uniformly at 0.1 increments inputs = arange(r_min, r_max, 0.1) # compute targets results = objective(inputs) # create a line plot of input vs result pyplot.plot(inputs, results) # define optimal input value x_optima = 0.67956 # draw a vertical line at the optimal input pyplot.axvline(x=x_optima, ls='--', color='red') # show the plot pyplot.show()

Running the example creates a line plot of the function and marks the optima with a red line.

A multi-modal function means a function with more than one “*mode*” or optima (e.g. valley).

Multimodal functions are non-convex.

There may be one global optima and one or more local or deceptive optima. Alternately, there may be multiple global optima, i.e. multiple different inputs that result in the same minimal output of the function.

Let’s look at a few examples of multi-modal functions.

The range is bounded to -2.7 and 7.5 and the optimal input value is 5.145735.

# multimodal function from numpy import sin from numpy import arange from matplotlib import pyplot # objective function def objective(x): return sin(x) + sin((10.0 / 3.0) * x) # define range for input r_min, r_max = -2.7, 7.5 # sample input range uniformly at 0.1 increments inputs = arange(r_min, r_max, 0.1) # compute targets results = objective(inputs) # create a line plot of input vs result pyplot.plot(inputs, results) # define optimal input value x_optima = 5.145735 # draw a vertical line at the optimal input pyplot.axvline(x=x_optima, ls='--', color='red') # show the plot pyplot.show()

Running the example creates a line plot of the function and marks the optima with a red line.

The range is bounded to 0.0 and 1.2 and the optimal input value is 0.96609.

# multimodal function from numpy import sin from numpy import arange from matplotlib import pyplot # objective function def objective(x): return -(1.4 - 3.0 * x) * sin(18.0 * x) # define range for input r_min, r_max = 0.0, 1.2 # sample input range uniformly at 0.01 increments inputs = arange(r_min, r_max, 0.01) # compute targets results = objective(inputs) # create a line plot of input vs result pyplot.plot(inputs, results) # define optimal input value x_optima = 0.96609 # draw a vertical line at the optimal input pyplot.axvline(x=x_optima, ls='--', color='red') # show the plot pyplot.show()

Running the example creates a line plot of the function and marks the optima with a red line.

The range is bounded to 0.0 and 10.0 and the optimal input value is 7.9787.

# multimodal function from numpy import sin from numpy import arange from matplotlib import pyplot # objective function def objective(x): return -x * sin(x) # define range for input r_min, r_max = 0.0, 10.0 # sample input range uniformly at 0.1 increments inputs = arange(r_min, r_max, 0.1) # compute targets results = objective(inputs) # create a line plot of input vs result pyplot.plot(inputs, results) # define optimal input value x_optima = 7.9787 # draw a vertical line at the optimal input pyplot.axvline(x=x_optima, ls='--', color='red') # show the plot pyplot.show()

Running the example creates a line plot of the function and marks the optima with a red line.

A function may have a discontinuity, meaning that the smooth change in inputs to the function may result in non-smooth changes in the output.

We might refer to functions with this property as non-smooth functions or discontinuous functions.

There are many different types of discontinuity, although one common example is a jump or acute change in direction in the output values of the function, which is easy to see in a plot of the function.

Discontinuous Function

The range is bounded to -2.0 and 2.0 and the optimal input value is 1.0.

# non-smooth optimization function from numpy import arange from matplotlib import pyplot # objective function def objective(x): if x > 1.0: return x**2.0 elif x == 1.0: return 0.0 return 2.0 - x # define range for input r_min, r_max = -2.0, 2.0 # sample input range uniformly at 0.1 increments inputs = arange(r_min, r_max, 0.1) # compute targets results = [objective(x) for x in inputs] # create a line plot of input vs result pyplot.plot(inputs, results) # define optimal input value x_optima = 1.0 # draw a vertical line at the optimal input pyplot.axvline(x=x_optima, ls='--', color='red') # show the plot pyplot.show()

Running the example creates a line plot of the function and marks the optima with a red line.

A function may have noise, meaning that each evaluation may have a stochastic component, which changes the output of the function slightly each time.

Any non-noisy function can be made noisy by adding small Gaussian random numbers to the input values.

The range for the function below is bounded to -5.0 and 5.0 and the optimal input value is 0.0.

# noisy optimization function from numpy import arange from numpy.random import randn from matplotlib import pyplot # objective function def objective(x): return (x + randn(len(x))*0.3)**2.0 # define range for input r_min, r_max = -5.0, 5.0 # sample input range uniformly at 0.1 increments inputs = arange(r_min, r_max, 0.1) # compute targets results = objective(inputs) # create a line plot of input vs result pyplot.plot(inputs, results) # define optimal input value x_optima = 0.0 # draw a vertical line at the optimal input pyplot.axvline(x=x_optima, ls='--', color='red') # show the plot pyplot.show()

Running the example creates a line plot of the function and marks the optima with a red line.

This section provides more resources on the topic if you are looking to go deeper.

- Test functions for optimization, Wikipedia.
- Virtual Library of Simulation Experiments: Test Functions and Datasets
- Global Optimization Benchmarks and AMPGO, 1-D Test Functions

In this tutorial, you discovered standard one-dimensional functions you can use when studying function optimization.

**Are you using any of the above functions?**

Let me know which one in the comments below.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post One-Dimensional (1D) Test Functions for Function Optimization appeared first on Machine Learning Mastery.

]]>The post Line Search Optimization With Python appeared first on Machine Learning Mastery.

]]>It provides a way to use a univariate optimization algorithm, like a bisection search on a multivariate objective function, by using the search to locate the optimal step size in each dimension from a known point to the optima.

In this tutorial, you will discover how to perform a line search optimization in Python.

After completing this tutorial, you will know:

- Linear search is an optimization algorithm for univariate and multivariate optimization problems.
- The SciPy library provides an API for performing a line search that requires that you know how to calculate the first derivative of your objective function.
- How to perform a line search on an objective function and use the result.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is a Line Search
- Line Search in Python
- How to Perform a Line Search
- Define the Objective Function
- Perform the Line Search
- Handling Line Search Failure Cases

Line search is an optimization algorithm for univariate or multivariate optimization.

The algorithm requires an initial position in the search space and a direction along which to search. It will then choose the next position in the search space from the initial position that results in a better or best objective function evaluation.

The direction is a magnitude indicating both the sign (positive or negative) along the line and the maximum extent to which to search. Therefore, the direction is better thought of as the candidate search region and must be large enough to encompass the optima, or a point better than the starting point.

The line search will automatically choose the scale factor called alpha for the step size (the direction) from the current position that minimizes the objective function. This involves using another univariate optimization algorithm to find the optimal point in the chosen direction in order to select the appropriate alpha.

One approach is to use line search, which selects the step factor that minimizes the one-dimensional function […] We can apply the univariate optimization method of our choice.

— Page 54, Algorithms for Optimization, 2019.

Alpha is a scale factor for the direction, as such only values in the range between 0.0 and 1.0 are considered in the search. A single step of the line search solves a minimization problem that minimizes the objective function for the current position plus the scaled direction:

- minimize objective(position + alpha * direction)

As such, the line search operates in one dimension at a time and returns the distance to move in a chosen direction.

Each iteration of a line search method computes a search direction pk and then decides how far to move along that direction.

— Page 30, Numerical Optimization, 2006.

The line search can be called repeatedly to navigate a search space to a solution and can fail if the chosen direction does not contain a point with a lower objective function value, e.g. if the algorithm is directed to search uphill.

The solution is approximate or inexact and may not be the global solution depending on the shape of the search space. The conditions under which this algorithm is appropriate are referred to as the Wolf conditions.

Now that we are familiar with the line search, let’s explore how we might perform a line search in Python.

We can perform a line search manually in Python using the line_search() function.

It supports univariate optimization, as well as multivariate optimization problems.

This function takes the name of the objective function and the name of the gradient for the objective function, as well as the current position in the search space and the direction to move.

As such, you must know the first derivative for your objective function. You must also have some idea of where to start the search and the extent to which to perform the search. Recall, you can perform the search multiple times with different directions (sign and magnitude).

... result = line_search(objective, gradient, point, direction)

The function returns a tuple of six elements, including the scale factor for the direction called alpha and the number of function evaluations that were performed, among other values.

The first element in the result tuple contains the alpha. If the search fails to converge, the alpha will have the value *None*.

... # retrieve the alpha value found as part of the line search alpha = result[0]

The *alpha*, starting point, and *direction* can be used to construct the endpoint of a single line search.

... # construct the end point of a line search end = point + alpha * direction

For optimization problems with more than one input variable, e.g. multivariate optimization, the line_search() function will return a single alpha value for all dimensions.

This means the function assumes that the optima is equidistant from the starting point in all dimensions, which is a significant limitation.

Now that we are familiar with how to perform a line search in Python, let’s explore a worked example.

We can demonstrate how to use the line search with a simple univariate objective function and its derivative.

This section is divided into multiple parts, including defining the test function, performing the line search, and handling failing cases where the optima is not located.

First, we can define the objective function.

In this case, we will use a one-dimensional objective function, specifically x^2 shifted by a small amount away from zero. This is a convex function and was chosen because it is easy to understand and to calculate the first derivative.

- objective(x) = (-5 + x)^2

Note that the line search is not limited to one-dimensional functions or convex functions.

The implementation of this function is listed below.

# objective function def objective(x): return (-5.0 + x)**2.0

The first derivative for this function can be calculated analytically, as follows:

- gradient(x) = 2 * (-5 + x)

The gradient for each input value just indicates the slope towards the optima at each point. The implementation of this function is listed below.

# gradient for the objective function def gradient(x): return 2.0 * (-5.0 + x)

We can define an input range for x from -10 to 20 and calculate the objective value for each input.

... # define range r_min, r_max = -10.0, 20.0 # prepare inputs inputs = arange(r_min, r_max, 0.1) # compute targets targets = [objective(x) for x in inputs]

We can then plot the input values versus the objective values to get an idea of the shape of the function.

... # plot inputs vs objective pyplot.plot(inputs, targets, '-', label='objective') pyplot.legend() pyplot.show()

Tying this together, the complete example is listed below.

# plot a convex objective function from numpy import arange from matplotlib import pyplot # objective function def objective(x): return (-5.0 + x)**2.0 # gradient for the objective function def gradient(x): return 2.0 * (-5.0 + x) # define range r_min, r_max = -10.0, 20.0 # prepare inputs inputs = arange(r_min, r_max, 0.1) # compute targets targets = [objective(x) for x in inputs] # plot inputs vs objective pyplot.plot(inputs, targets, '-', label='objective') pyplot.legend() pyplot.show()

Running the example evaluates input values (x) in the range from -10 to 20 and creates a plot showing the familiar parabola U-shape.

The optima for the function appears to be at x=5.0 with an objective value of 0.0.

Next, we can perform a line search on the function.

First, we must define the starting point for the search and the direction to search.

In this case, we will use a starting point of x=-5, which is about 10 units from the optima. We will take a large step to the right, e.g. the positive direction, in this case 100 units, which will greatly overshoot the optima.

Recall the direction is like the step size and the search will scale the step size to find the optima.

... # define the starting point point = -5.0 # define the direction to move direction = 100.0 # print the initial conditions print('start=%.1f, direction=%.1f' % (point, direction)) # perform the line search result = line_search(objective, gradient, point, direction)

The search will then seek out the optima and return the alpha or distance to modify the direction.

We can retrieve the alpha from the result, as well as the number of function evaluations that were performed.

... # summarize the result alpha = result[0] print('Alpha: %.3f' % alpha) print('Function evaluations: %d' % result[1])

We can use the alpha, along with our starting point and the step size to calculate the location of the optima and calculate the objective function at that point (which we would expect would equal 0.0).

... # define objective function minima end = point + alpha * direction # evaluate objective function minima print('f(end) = %.3f' % objective(end))

Then, for fun, we can plot the function again and show the starting point as a green square and the endpoint as a red square.

... # define range r_min, r_max = -10.0, 20.0 # prepare inputs inputs = arange(r_min, r_max, 0.1) # compute targets targets = [objective(x) for x in inputs] # plot inputs vs objective pyplot.plot(inputs, targets, '--', label='objective') # plot start and end of the search pyplot.plot([point], [objective(point)], 's', color='g') pyplot.plot([end], [objective(end)], 's', color='r') pyplot.legend() pyplot.show()

Tying this together, the complete example of performing a line search on the convex objective function is listed below.

# perform a line search on a convex objective function from numpy import arange from scipy.optimize import line_search from matplotlib import pyplot # objective function def objective(x): return (-5.0 + x)**2.0 # gradient for the objective function def gradient(x): return 2.0 * (-5.0 + x) # define the starting point point = -5.0 # define the direction to move direction = 100.0 # print the initial conditions print('start=%.1f, direction=%.1f' % (point, direction)) # perform the line search result = line_search(objective, gradient, point, direction) # summarize the result alpha = result[0] print('Alpha: %.3f' % alpha) print('Function evaluations: %d' % result[1]) # define objective function minima end = point + alpha * direction # evaluate objective function minima print('f(end) = f(%.3f) = %.3f' % (end, objective(end))) # define range r_min, r_max = -10.0, 20.0 # prepare inputs inputs = arange(r_min, r_max, 0.1) # compute targets targets = [objective(x) for x in inputs] # plot inputs vs objective pyplot.plot(inputs, targets, '--', label='objective') # plot start and end of the search pyplot.plot([point], [objective(point)], 's', color='g') pyplot.plot([end], [objective(end)], 's', color='r') pyplot.legend() pyplot.show()

Running the example first reports the starting point and the direction.

The search is performed and an alpha is located that modifies the direction to locate the optima, in this case, 0.1, which was found after three function evaluations.

The point for the optima is located at 5.0, which evaluates to 0.0, as expected.

start=-5.0, direction=100.0 Alpha: 0.100 Function evaluations: 3 f(end) = f(5.000) = 0.000

Finally, a plot of the function is created showing both the starting point (green) and the objective (red).

The search is not guaranteed to find the optima of the function.

This can happen if a direction is specified that is not large enough to encompass the optima.

For example, if we use a direction of three, then the search will fail to find the optima. We can demonstrate this with a complete example, listed below.

# perform a line search on a convex objective function with a direction that is too small from numpy import arange from scipy.optimize import line_search from matplotlib import pyplot # objective function def objective(x): return (-5.0 + x)**2.0 # gradient for the objective function def gradient(x): return 2.0 * (-5.0 + x) # define the starting point point = -5.0 # define the direction to move direction = 3.0 # print the initial conditions print('start=%.1f, direction=%.1f' % (point, direction)) # perform the line search result = line_search(objective, gradient, point, direction) # summarize the result alpha = result[0] print('Alpha: %.3f' % alpha) # define objective function minima end = point + alpha * direction # evaluate objective function minima print('f(end) = f(%.3f) = %.3f' % (end, objective(end)))

Running the example, the search reaches a limit of an alpha of 1.0 which gives an end point of -2 evaluating to 49. A long way from the optima at f(5) = 0.0.

start=-5.0, direction=3.0 Alpha: 1.000 f(end) = f(-2.000) = 49.000

Additionally, we can choose the wrong direction that only results in worse evaluations than the starting point.

In this case, the wrong direction would be negative away from the optima, e.g. all uphill from the starting point.

... # define the starting point point = -5.0 # define the direction to move direction = -3.0

The expectation is that the search would not converge as it is unable to locate any points better than the starting point.

The complete example of the search that fails to converge is listed below.

# perform a line search on a convex objective function that does not converge from numpy import arange from scipy.optimize import line_search from matplotlib import pyplot # objective function def objective(x): return (-5.0 + x)**2.0 # gradient for the objective function def gradient(x): return 2.0 * (-5.0 + x) # define the starting point point = -5.0 # define the direction to move direction = -3.0 # print the initial conditions print('start=%.1f, direction=%.1f' % (point, direction)) # perform the line search result = line_search(objective, gradient, point, direction) # summarize the result print('Alpha: %s' % result[0])

Running the example results in a LineSearchWarning indicating that the search could not converge, as expected.

The alpha value returned from the search is None.

start=-5.0, direction=-3.0 LineSearchWarning: The line search algorithm did not converge warn('The line search algorithm did not converge', LineSearchWarning) Alpha: None

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Numerical Optimization, 2006.

- Optimization and root finding (scipy.optimize)
- Optimization (scipy.optimize)
- scipy.optimize.line_search API.

In this tutorial, you discovered how to perform a line search optimization in Python.

Specifically, you learned:

- Linear search is an optimization algorithm for univariate and multivariate optimization problems.
- The SciPy library provides an API for performing a line search that requires that you know how to calculate the first derivative of your objective function.
- How to perform a line search on an objective function and use the result.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Line Search Optimization With Python appeared first on Machine Learning Mastery.

]]>