Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like the Adaptive Movement Estimation (Adam) algorithm use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

**AMSGrad** is an extension to the Adam version of gradient descent that attempts to improve the convergence properties of the algorithm, avoiding large abrupt changes in the learning rate for each input variable.

In this tutorial, you will discover how to develop gradient descent optimization with AMSGrad from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- AMSGrad is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AMSGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

## Tutorial Overview

This tutorial is divided into three parts; they are:

- Gradient Descent
- AMSGrad Optimization Algorithm
- Gradient Descent With AMSGrad
- Two-Dimensional Test Problem
- Gradient Descent Optimization With AMSGrad
- Visualization of AMSGrad Optimization

## Gradient Descent

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t) = x(t-1) – step_size * f'(x(t))

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**: Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the AMSGrad algorithm.

### Want to Get Started With Optimization Algorithms?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## AMSGrad Optimization Algorithm

AMSGrad algorithm is an extension to the Adaptive Movement Estimation (Adam) optimization algorithm. More broadly, is an extension to the Gradient Descent Optimization algorithm.

The algorithm was described in the 2018 paper by J. Sashank, et al. titled “On the Convergence of Adam and Beyond.”

Generally, Adam automatically adapts a separate step size (learning rate) for each parameter in the optimization problem.

A limitation of Adam is that it can both decrease the step size when getting close to the optima, which is good, but it also increases the step size in some cases, which is bad.

AdaGrad addresses this specifically.

… ADAM aggressively increases the learning rate, however, […] this can be detrimental to the overall performance of the algorithm. […] In contrast, AMSGRAD neither increases nor decreases the learning rate and furthermore, decreases vt which can potentially lead to non-decreasing learning rate even if gradient is large in the future iterations.

— On the Convergence of Adam and Beyond, 2018.

AdaGrad is an extension to Adam that maintains a maximum of the second moment vector and uses it to bias-correct the gradient used to update the parameter, instead of the moment vector itself. This helps to stop the optimization slowing down too quickly (e.g. premature convergence).

The key difference of AMSGRAD with ADAM is that it maintains the maximum of all vt until the present time step and uses this maximum value for normalizing the running average of the gradient instead of vt in ADAM.

— On the Convergence of Adam and Beyond, 2018.

Let’s step through each element of the algorithm.

First, we must maintain a first and second moment vector as well as a max second moment vector for each parameter being optimized as part of the search, referred to as *m* and *v* (Geek letter nu but we will use v), and *vhat* respectively.

They are initialized to 0.0 at the start of the search.

- m = 0
- v = 0
- vhat = 0

The algorithm is executed iteratively over time t starting at *t=1*, and each iteration involves calculating a new set of parameter values *x*, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradients (partial derivatives) are calculated for the current time step.

- g(t) = f'(x(t-1))

Next, the first moment vector is updated using the gradient and a hyperparameter *beta1*.

- m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)

The *beta1* hyperparameter can be held constant or can be decayed exponentially over the course of the search, such as:

- beta1(t) = beta1^(t)

Or, alternately:

- beta1(t) = beta1 / t

The second moment vector is updated using the square of the gradient and a hyperparameter *beta2*.

- v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

Next, the maximum for the second moment vector is updated.

- vhat(t) = max(vhat(t-1), v(t))

Where *max()* calculates the maximum of the provided values.

The parameter value can then be updated using the calculated terms and the step size hyperparameter *alpha*.

- x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t)))

Where *sqrt()* is the square root function.

The step size may also be held constant or decayed exponentially.

To review, there are three hyperparameters for the algorithm; they are:

**alpha**: Initial step size (learning rate), a typical value is 0.002.**beta1**: Decay factor for first momentum, a typical value is 0.9.**beta2**: Decay factor for infinity norm, a typical value is 0.999.

And that’s it.

For full derivation of the AMSGrad algorithm in the context of the Adam algorithm, I recommend reading the paper.

Next, let’s look at how we might implement the algorithm from scratch in Python.

## Gradient Descent With AMSGrad

In this section, we will explore how to implement the gradient descent optimization algorithm with AMSGrad Momentum.

### Two-Dimensional Test Problem

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this.

1 2 3 |
# objective function def objective(x, y): return x**2.0 + y**2.0 |

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
# 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show() |

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
# contour plot of the test function from numpy import asarray from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # show the plot pyplot.show() |

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the AMSGrad optimization algorithm.

### Gradient Descent Optimization With AMSGrad

We can apply gradient descent with AMSGrad to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of *x^2* is *x * 2* in each dimension.

- f(x) = x^2
- f'(x) = x * 2

The *derivative()* function implements this below.

1 2 3 |
# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) |

Next, we can implement gradient descent optimization with AMSGrad.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

1 2 3 |
... # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) |

Next, we need to initialize the moment vectors.

1 2 3 4 5 |
... # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] |

We then run a fixed number of iterations of the algorithm defined by the “*n_iter*” hyperparameter.

1 2 3 4 |
... # run iterations of gradient descent for t in range(n_iter): ... |

The first step is to calculate the derivative for the current set of parameters.

1 2 3 |
... # calculate gradient g(t) g = derivative(x[0], x[1]) |

Next, we need to perform the AMSGrad update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

1 2 3 4 |
... # build a solution one variable at a time for i in range(x.shape[0]): ... |

First, we need to calculate the first moment vector.

1 2 3 |
... # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] |

Next, we need to calculate the second moment vector.

1 2 3 |
... # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 |

Then the maximum of the second moment vector with the previous iteration and the current value.

1 2 3 |
... # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) |

Finally, we can calculate the new value for the variable.

1 2 3 |
... # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / sqrt(vhat[i]) |

We may want to add a small value to the denominator to avoid a divide by zero error; for example:

1 2 3 |
... # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) |

This is then repeated for each parameter that is being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

1 2 3 4 5 |
... # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) |

We can tie all of this together into a function named *amsgrad()* that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
# gradient descent algorithm with amsgrad def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # update variables one at a time for i in range(x.shape[0]): # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return [x, score] |

We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 100 iterations with an initial learning rate of 0.007, beta of 0.9, and a beta2 of 0.99, found after a little trial and error.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 100 # steps size alpha = 0.007 # factor for average gradient beta1 = 0.9 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with amsgrad best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2) |

At the end of the run, we will report the best solution found.

1 2 3 4 |
... # summarize the result print('Done!') print('f(%s) = %f' % (best, score)) |

Tying all of this together, the complete example of AMSGrad gradient descent applied to our test problem is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 |
# gradient descent optimization with amsgrad for a two-dimensional test function from math import sqrt from numpy import asarray from numpy.random import rand from numpy.random import seed # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with amsgrad def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # update variables one at a time for i in range(x.shape[0]): # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return [x, score] # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 100 # steps size alpha = 0.007 # factor for average gradient beta1 = 0.9 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with amsgrad best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2) print('Done!') print('f(%s) = %f' % (best, score)) |

Running the example applies the optimization algorithm with AMSGrad to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 90 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

1 2 3 4 5 6 7 8 9 10 11 12 13 |
... >90 f([-5.74880707e-11 2.16227707e-03]) = 0.00000 >91 f([-4.53359947e-11 2.03974264e-03]) = 0.00000 >92 f([-3.57526928e-11 1.92415218e-03]) = 0.00000 >93 f([-2.81951584e-11 1.81511216e-03]) = 0.00000 >94 f([-2.22351711e-11 1.71225138e-03]) = 0.00000 >95 f([-1.75350316e-11 1.61521966e-03]) = 0.00000 >96 f([-1.38284262e-11 1.52368665e-03]) = 0.00000 >97 f([-1.09053366e-11 1.43734076e-03]) = 0.00000 >98 f([-8.60013947e-12 1.35588802e-03]) = 0.00000 >99 f([-6.78222208e-12 1.27905115e-03]) = 0.00000 Done! f([-6.78222208e-12 1.27905115e-03]) = 0.000002 |

### Visualization of AMSGrad Optimization

We can plot the progress of the AMSGrad search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *amsgrad()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
# gradient descent algorithm with amsgrad def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # update variables one at a time for i in range(x.shape[0]): # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) # evaluate candidate point score = objective(x[0], x[1]) # keep track of all solutions solutions.append(x.copy()) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return solutions |

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 100 # steps size alpha = 0.007 # factor for average gradient beta1 = 0.9 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with amsgrad solutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2) |

We can then create a contour plot of the objective function, as before.

1 2 3 4 5 6 7 8 9 10 |
... # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') |

Finally, we can plot each solution found during the search as a white dot connected by a line.

1 2 3 4 |
... # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w') |

Tying this all together, the complete example of performing the AMSGrad optimization on the test problem and plotting the results on a contour plot is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 |
# example of plotting the amsgrad search on a contour plot of the test function from math import sqrt from numpy import asarray from numpy import arange from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with amsgrad def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # initialize moment vectors m = [0.0 for _ in range(bounds.shape[0])] v = [0.0 for _ in range(bounds.shape[0])] vhat = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # update variables one at a time for i in range(x.shape[0]): # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t) m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i] # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2 # vhat(t) = max(vhat(t-1), v(t)) vhat[i] = max(vhat[i], v[i]) # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))) x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8) # evaluate candidate point score = objective(x[0], x[1]) # keep track of all solutions solutions.append(x.copy()) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return solutions # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 100 # steps size alpha = 0.007 # factor for average gradient beta1 = 0.9 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with amsgrad solutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w') # show the plot pyplot.show() |

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

- On the Convergence of Adam and Beyond, 2018.
- An Overview Of Gradient Descent Optimization Algorithms, 2016.

### Books

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

### APIs

### Articles

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.
- Experiments with AMSGrad, 2017.

## Summary

In this tutorial, you discovered how to develop gradient descent optimization with AMSGrad from scratch.

Specifically, you learned:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- AMSGrad is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AMSGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

Hi Jason,

Small typo in content under title “AMSGrad Optimization Algorithm”, used AdaGrad instead of AMSGrad at 2 places.

Not a typo as AMSGrad is an extension of AdaGrad, therefore we introduce AMSGrad first.

Thanks for Enlightenment!

You’re welcome!