Gradient Descent With AdaGrad From Scratch

By Jason Brownlee on October 12, 2021 in Optimization 2

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that it uses the same step size (learning rate) for each input variable. This can be a problem on objective functions that have different amounts of curvature in different dimensions, and in turn, may require a different sized step to a new point.

Adaptive Gradients, or AdaGrad for short, is an extension of the gradient descent optimization algorithm that allows the step size in each dimension used by the optimization algorithm to be automatically adapted based on the gradients seen for the variable (partial derivatives) seen over the course of the search.

In this tutorial, you will discover how to develop the gradient descent with adaptive gradients optimization algorithm from scratch.

After completing this tutorial, you will know:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable in the objective function, called adaptive gradients or AdaGrad.
How to implement the AdaGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Kick-start your project with my new book Optimization for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Gradient Descent With AdaGrad From Scratch
Photo by Maurits Verbiest, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Gradient Descent
Adaptive Gradient (AdaGrad)
Gradient Descent With AdaGrad
1. Two-Dimensional Test Problem
2. Gradient Descent Optimization With AdaGrad
3. Visualization of AdaGrad

Gradient Descent

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the “gradient.”

Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

x = x – step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at AdaGrad.

Want to Get Started With Optimization Algorithms?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Adaptive Gradient (AdaGrad)

The Adaptive Gradient algorithm, or AdaGrad for short, is an extension to the gradient descent optimization algorithm.

The algorithm was described by John Duchi, et al. in their 2011 paper titled “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.”

It is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.

— Page 307, Deep Learning, 2016.

A problem with the gradient descent algorithm is that the step size (learning rate) is the same for each variable or dimension in the search space. It is possible that better performance can be achieved using a step size that is tailored to each variable, allowing larger movements in dimensions with a consistently steep gradient and smaller movements in dimensions with less steep gradients.

AdaGrad is designed to specifically explore the idea of automatically tailoring the step size for each dimension in the search space.

The adaptive subgradient method, or Adagrad, adapts a learning rate for each component of x

— Page 77, Algorithms for Optimization, 2019.

This is achieved by first calculating a step size for a given dimension, then using the calculated step size to make a movement in that dimension using the partial derivative. This process is then repeated for each dimension in the search space.

Adagrad dulls the influence of parameters with consistently high gradients, thereby increasing the influence of parameters with infrequent updates.

— Page 77, Algorithms for Optimization, 2019.

AdaGrad is suited to objective functions where the curvature of the search space is different in different dimensions, allowing a more effective optimization given the customization of the step size in each dimension.

The algorithm requires that you set an initial step size for all input variables as per normal, such as 0.1 or 0.001, or similar. Although, the benefit of the algorithm is that it is not as sensitive to the initial learning rate as the gradient descent algorithm.

Adagrad is far less sensitive to the learning rate parameter alpha. The learning rate parameter is typically set to a default value of 0.01.

— Page 77, Algorithms for Optimization, 2019.

An internal variable is then maintained for each input variable that is the sum of the squared partial derivatives for the input variable observed during the search.

This sum of the squared partial derivatives is then used to calculate the step size for the variable by dividing the initial step size value (e.g. hyperparameter value specified at the start of the run) divided by the square root of the sum of the squared partial derivatives.

cust_step_size = step_size / sqrt(s)

It is possible for the square root of the sum of squared partial derivatives to result in a value of 0.0, resulting in a divide by zero error. Therefore, a tiny value can be added to the denominator to avoid this possibility, such as 1e-8.

cust_step_size = step_size / (1e-8 + sqrt(s))

Where cust_step_size is the calculated step size for an input variable for a given point during the search, step_size is the initial step size, sqrt() is the square root operation, and s is the sum of the squared partial derivatives for the input variable seen during the search so far.

The custom step size is then used to calculate the value for the variable in the next point or solution in the search.

x(t+1) = x(t) – cust_step_size * f'(x(t))

This process is then repeated for each input variable until a new point in the search space is created and can be evaluated.

Importantly, the partial derivative for the current solution (iteration of the search) is included in the sum of the square root of partial derivatives.

We could maintain an array of partial derivatives or squared partial derivatives for each input variable, but this is not necessary. Instead, we simply maintain the sum of the squared partial derivatives and add new values to this sum along the way.

Now that we are familiar with the AdaGrad algorithm, let’s explore how we might implement it and evaluate its performance.

Gradient Descent With AdaGrad

In this section, we will explore how to implement the gradient descent optimization algorithm with adaptive gradients.

Two-Dimensional Test Problem

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# objective function

def objective(x, y):

return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.gca(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

# 3d plot of the test function

from numpy import arange

from numpy import meshgrid

from matplotlib import pyplot

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# define range for input

r_min, r_max = -1.0, 1.0

# sample input range uniformly at 0.1 increments

xaxis = arange(r_min, r_max, 0.1)

yaxis = arange(r_min, r_max, 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a surface plot with the jet color scheme

figure = pyplot.figure()

axis = figure.gca(projection='3d')

axis.plot_surface(x, y, results, cmap='jet')

# show the plot

pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

Three-Dimensional Plot of the Test Objective Function

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

# contour plot of the test function

from numpy import asarray

from numpy import arange

from numpy import meshgrid

from matplotlib import pyplot

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap='jet')

# show the plot

pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Two-Dimensional Contour Plot of the Test Objective Function

Now that we have a test objective function, let’s look at how we might implement the AdaGrad optimization algorithm.

Gradient Descent Optimization With AdaGrad

We can apply the gradient descent with adaptive gradient algorithm to the test problem.

First, we need a function that calculates the derivative for this function.

f(x) = x^2
f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension.

The derivative() function implements this below.

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent with adaptive gradients.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

...
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

...

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the sum of the squared partial derivatives for each dimension to 0.0 values.

...
# list of the sum square gradients for each variable
sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

...

# list of the sum square gradients for each variable

sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

We can then enumerate a fixed number of iterations of the search optimization algorithm defined by a “n_iter” hyperparameter.

...
# run the gradient descent
for it in range(n_iter):
	...

...

# run the gradient descent

for it in range(n_iter):

...

The first step is to calculate the gradient for the current solution using the derivative() function.

...
# calculate gradient
gradient = derivative(solution[0], solution[1])

...

# calculate gradient

gradient = derivative(solution[0], solution[1])

We then need to calculate the square of the partial derivative of each variable and add them to the running sum of these values.

...
# update the sum of the squared partial derivatives
for i in range(gradient.shape[0]):
	sq_grad_sums[i] += gradient[i]**2.0

...

# update the sum of the squared partial derivatives

for i in range(gradient.shape[0]):

sq_grad_sums[i] += gradient[i]**2.0

We can then use the sum squared partial derivatives and gradient to calculate the next point.

We will do this one variable at a time, first calculating the step size for the variable, then the new value for the variable. These values are built up in an array until we have a completely new solution that is in the steepest descent direction from the current point using the custom step sizes.

...
# build a solution one variable at a time
new_solution = list()
for i in range(solution.shape[0]):
	# calculate the step size for this variable
	alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
	# calculate the new position in this variable
	value = solution[i] - alpha * gradient[i]
	# store this variable
	new_solution.append(value)

...

# build a solution one variable at a time

new_solution = list()

for i in range(solution.shape[0]):

# calculate the step size for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

# store this variable

new_solution.append(value)

This new solution can then be evaluated using the objective() function and the performance of the search can be reported.

...
# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

...

# evaluate candidate point

solution = asarray(new_solution)

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

And that’s it.

We can tie all of this together into a function named adagrad() that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and the initial learning rate, and returns the final solution and its evaluation.

This complete function is listed below.

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the sum square gradients for each variable
	sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the sum of the squared partial derivatives
		for i in range(gradient.shape[0]):
			sq_grad_sums[i] += gradient[i]**2.0
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
			# calculate the new position in this variable
			value = solution[i] - alpha * gradient[i]
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

# gradient descent algorithm with adagrad

def adagrad(objective, derivative, bounds, n_iter, step_size):

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

# list of the sum square gradients for each variable

sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

# run the gradient descent

for it in range(n_iter):

# calculate gradient

gradient = derivative(solution[0], solution[1])

# update the sum of the squared partial derivatives

for i in range(gradient.shape[0]):

sq_grad_sums[i] += gradient[i]**2.0

# build a solution one variable at a time

new_solution = list()

for i in range(solution.shape[0]):

# calculate the step size for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

# store this variable

new_solution.append(value)

# evaluate candidate point

solution = asarray(new_solution)

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

return [solution, solution_eval]

Note: we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.

We can then define our hyperparameters and call the adagrad() function to optimize our test objective function.

In this case, we will use 50 iterations of the algorithm and an initial learning rate of 0.1, both chosen after a little trial and error.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))

...

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# define the step size

step_size = 0.1

# perform the gradient descent search with adagrad

best, score = adagrad(objective, derivative, bounds, n_iter, step_size)

print('Done!')

print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with adaptive gradients is listed below.

# gradient descent optimization with adagrad for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the sum square gradients for each variable
	sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the sum of the squared partial derivatives
		for i in range(gradient.shape[0]):
			sq_grad_sums[i] += gradient[i]**2.0
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
			# calculate the new position in this variable
			value = solution[i] - alpha * gradient[i]
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))

# gradient descent optimization with adagrad for a two-dimensional test function

from math import sqrt

from numpy import asarray

from numpy.random import rand

from numpy.random import seed

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad

def adagrad(objective, derivative, bounds, n_iter, step_size):

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

# list of the sum square gradients for each variable

sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

# run the gradient descent

for it in range(n_iter):

# calculate gradient

gradient = derivative(solution[0], solution[1])

# update the sum of the squared partial derivatives

for i in range(gradient.shape[0]):

sq_grad_sums[i] += gradient[i]**2.0

# build a solution one variable at a time

new_solution = list()

for i in range(solution.shape[0]):

# calculate the step size for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

# store this variable

new_solution.append(value)

# evaluate candidate point

solution = asarray(new_solution)

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

return [solution, solution_eval]

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# define the step size

step_size = 0.1

# perform the gradient descent search with adagrad

best, score = adagrad(objective, derivative, bounds, n_iter, step_size)

print('Done!')

print('f(%s) = %f' % (best, score))

Running the example applies the AdaGrad optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

>0 f([-0.06595599  0.34064899]) = 0.12039
>1 f([-0.02902286  0.27948766]) = 0.07896
>2 f([-0.0129815   0.23463749]) = 0.05522
>3 f([-0.00582483  0.1993997 ]) = 0.03979
>4 f([-0.00261527  0.17071256]) = 0.02915
>5 f([-0.00117437  0.14686138]) = 0.02157
>6 f([-0.00052736  0.12676134]) = 0.01607
>7 f([-0.00023681  0.10966762]) = 0.01203
>8 f([-0.00010634  0.09503809]) = 0.00903
>9 f([-4.77542704e-05  8.24607972e-02]) = 0.00680
>10 f([-2.14444463e-05  7.16123835e-02]) = 0.00513
>11 f([-9.62980437e-06  6.22327049e-02]) = 0.00387
>12 f([-4.32434258e-06  5.41085063e-02]) = 0.00293
>13 f([-1.94188148e-06  4.70624414e-02]) = 0.00221
>14 f([-8.72017797e-07  4.09453989e-02]) = 0.00168
>15 f([-3.91586740e-07  3.56309531e-02]) = 0.00127
>16 f([-1.75845235e-07  3.10112252e-02]) = 0.00096
>17 f([-7.89647442e-08  2.69937139e-02]) = 0.00073
>18 f([-3.54597657e-08  2.34988084e-02]) = 0.00055
>19 f([-1.59234984e-08  2.04577993e-02]) = 0.00042
>20 f([-7.15057749e-09  1.78112581e-02]) = 0.00032
>21 f([-3.21102543e-09  1.55077005e-02]) = 0.00024
>22 f([-1.44193729e-09  1.35024688e-02]) = 0.00018
>23 f([-6.47513760e-10  1.17567908e-02]) = 0.00014
>24 f([-2.90771361e-10  1.02369798e-02]) = 0.00010
>25 f([-1.30573263e-10  8.91375193e-03]) = 0.00008
>26 f([-5.86349941e-11  7.76164047e-03]) = 0.00006
>27 f([-2.63305247e-11  6.75849105e-03]) = 0.00005
>28 f([-1.18239380e-11  5.88502652e-03]) = 0.00003
>29 f([-5.30963626e-12  5.12447017e-03]) = 0.00003
>30 f([-2.38433568e-12  4.46221948e-03]) = 0.00002
>31 f([-1.07070548e-12  3.88556303e-03]) = 0.00002
>32 f([-4.80809073e-13  3.38343471e-03]) = 0.00001
>33 f([-2.15911255e-13  2.94620023e-03]) = 0.00001
>34 f([-9.69567190e-14  2.56547145e-03]) = 0.00001
>35 f([-4.35392094e-14  2.23394494e-03]) = 0.00000
>36 f([-1.95516389e-14  1.94526160e-03]) = 0.00000
>37 f([-8.77982370e-15  1.69388439e-03]) = 0.00000
>38 f([-3.94265180e-15  1.47499203e-03]) = 0.00000
>39 f([-1.77048011e-15  1.28438640e-03]) = 0.00000
>40 f([-7.95048604e-16  1.11841198e-03]) = 0.00000
>41 f([-3.57023093e-16  9.73885702e-04]) = 0.00000
>42 f([-1.60324146e-16  8.48035867e-04]) = 0.00000
>43 f([-7.19948720e-17  7.38448972e-04]) = 0.00000
>44 f([-3.23298874e-17  6.43023418e-04]) = 0.00000
>45 f([-1.45180009e-17  5.59929193e-04]) = 0.00000
>46 f([-6.51942732e-18  4.87572776e-04]) = 0.00000
>47 f([-2.92760228e-18  4.24566574e-04]) = 0.00000
>48 f([-1.31466380e-18  3.69702307e-04]) = 0.00000
>49 f([-5.90360555e-19  3.21927835e-04]) = 0.00000
Done!
f([-5.90360555e-19  3.21927835e-04]) = 0.000000

>0 f([-0.06595599 0.34064899]) = 0.12039

>1 f([-0.02902286 0.27948766]) = 0.07896

>2 f([-0.0129815 0.23463749]) = 0.05522

>3 f([-0.00582483 0.1993997 ]) = 0.03979

>4 f([-0.00261527 0.17071256]) = 0.02915

>5 f([-0.00117437 0.14686138]) = 0.02157

>6 f([-0.00052736 0.12676134]) = 0.01607

>7 f([-0.00023681 0.10966762]) = 0.01203

>8 f([-0.00010634 0.09503809]) = 0.00903

>9 f([-4.77542704e-05 8.24607972e-02]) = 0.00680

>10 f([-2.14444463e-05 7.16123835e-02]) = 0.00513

>11 f([-9.62980437e-06 6.22327049e-02]) = 0.00387

>12 f([-4.32434258e-06 5.41085063e-02]) = 0.00293

>13 f([-1.94188148e-06 4.70624414e-02]) = 0.00221

>14 f([-8.72017797e-07 4.09453989e-02]) = 0.00168

>15 f([-3.91586740e-07 3.56309531e-02]) = 0.00127

>16 f([-1.75845235e-07 3.10112252e-02]) = 0.00096

>17 f([-7.89647442e-08 2.69937139e-02]) = 0.00073

>18 f([-3.54597657e-08 2.34988084e-02]) = 0.00055

>19 f([-1.59234984e-08 2.04577993e-02]) = 0.00042

>20 f([-7.15057749e-09 1.78112581e-02]) = 0.00032

>21 f([-3.21102543e-09 1.55077005e-02]) = 0.00024

>22 f([-1.44193729e-09 1.35024688e-02]) = 0.00018

>23 f([-6.47513760e-10 1.17567908e-02]) = 0.00014

>24 f([-2.90771361e-10 1.02369798e-02]) = 0.00010

>25 f([-1.30573263e-10 8.91375193e-03]) = 0.00008

>26 f([-5.86349941e-11 7.76164047e-03]) = 0.00006

>27 f([-2.63305247e-11 6.75849105e-03]) = 0.00005

>28 f([-1.18239380e-11 5.88502652e-03]) = 0.00003

>29 f([-5.30963626e-12 5.12447017e-03]) = 0.00003

>30 f([-2.38433568e-12 4.46221948e-03]) = 0.00002

>31 f([-1.07070548e-12 3.88556303e-03]) = 0.00002

>32 f([-4.80809073e-13 3.38343471e-03]) = 0.00001

>33 f([-2.15911255e-13 2.94620023e-03]) = 0.00001

>34 f([-9.69567190e-14 2.56547145e-03]) = 0.00001

>35 f([-4.35392094e-14 2.23394494e-03]) = 0.00000

>36 f([-1.95516389e-14 1.94526160e-03]) = 0.00000

>37 f([-8.77982370e-15 1.69388439e-03]) = 0.00000

>38 f([-3.94265180e-15 1.47499203e-03]) = 0.00000

>39 f([-1.77048011e-15 1.28438640e-03]) = 0.00000

>40 f([-7.95048604e-16 1.11841198e-03]) = 0.00000

>41 f([-3.57023093e-16 9.73885702e-04]) = 0.00000

>42 f([-1.60324146e-16 8.48035867e-04]) = 0.00000

>43 f([-7.19948720e-17 7.38448972e-04]) = 0.00000

>44 f([-3.23298874e-17 6.43023418e-04]) = 0.00000

>45 f([-1.45180009e-17 5.59929193e-04]) = 0.00000

>46 f([-6.51942732e-18 4.87572776e-04]) = 0.00000

>47 f([-2.92760228e-18 4.24566574e-04]) = 0.00000

>48 f([-1.31466380e-18 3.69702307e-04]) = 0.00000

>49 f([-5.90360555e-19 3.21927835e-04]) = 0.00000

Done!

f([-5.90360555e-19 3.21927835e-04]) = 0.000000

Visualization of AdaGrad

We can plot the progress of the search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the adagrad() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the sum square gradients for each variable
	sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the sum of the squared partial derivatives
		for i in range(gradient.shape[0]):
			sq_grad_sums[i] += gradient[i]**2.0
		# build solution
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the learning rate for this variable
			alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
			# calculate the new position in this variable
			value = solution[i] - alpha * gradient[i]
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

# gradient descent algorithm with adagrad

def adagrad(objective, derivative, bounds, n_iter, step_size):

# track all solutions

solutions = list()

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

# list of the sum square gradients for each variable

sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

# run the gradient descent

for it in range(n_iter):

# calculate gradient

gradient = derivative(solution[0], solution[1])

# update the sum of the squared partial derivatives

for i in range(gradient.shape[0]):

sq_grad_sums[i] += gradient[i]**2.0

# build solution

new_solution = list()

for i in range(solution.shape[0]):

# calculate the learning rate for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

new_solution.append(value)

# store the new solution

solution = asarray(new_solution)

solutions.append(solution)

# evaluate candidate point

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)

...

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# define the step size

step_size = 0.1

# perform the gradient descent search

solutions = adagrad(objective, derivative, bounds, n_iter, step_size)

We can then create a contour plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

...

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

...
# plot the sample as black circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

...

# plot the sample as black circles

solutions = asarray(solutions)

pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the AdaGrad optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the adagrad search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the sum square gradients for each variable
	sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the sum of the squared partial derivatives
		for i in range(gradient.shape[0]):
			sq_grad_sums[i] += gradient[i]**2.0
		# build solution
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the learning rate for this variable
			alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
			# calculate the new position in this variable
			value = solution[i] - alpha * gradient[i]
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sample as black circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

# example of plotting the adagrad search on a contour plot of the test function

from math import sqrt

from numpy import asarray

from numpy import arange

from numpy.random import rand

from numpy.random import seed

from numpy import meshgrid

from matplotlib import pyplot

from mpl_toolkits.mplot3d import Axes3D

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad

def adagrad(objective, derivative, bounds, n_iter, step_size):

# track all solutions

solutions = list()

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

# list of the sum square gradients for each variable

sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

# run the gradient descent

for it in range(n_iter):

# calculate gradient

gradient = derivative(solution[0], solution[1])

# update the sum of the squared partial derivatives

for i in range(gradient.shape[0]):

sq_grad_sums[i] += gradient[i]**2.0

# build solution

new_solution = list()

for i in range(solution.shape[0]):

# calculate the learning rate for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

new_solution.append(value)

# store the new solution

solution = asarray(new_solution)

solutions.append(solution)

# evaluate candidate point

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

return solutions

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# define the step size

step_size = 0.1

# perform the gradient descent search

solutions = adagrad(objective, derivative, bounds, n_iter, step_size)

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap='jet')

# plot the sample as black circles

solutions = asarray(solutions)

pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

# show the plot

pyplot.show()

Running the example performs the search as before, except in this case, a contour plot of the objective function is created and a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

Contour Plot of the Test Objective Function With AdaGrad Search Results Shown

Summary

In this tutorial, you discovered how to develop the gradient descent with adaptive gradients optimization algorithm from scratch.

Specifically, you learned:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable in the objective function, called adaptive gradients or AdaGrad.
How to implement the AdaGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

2 Responses to Gradient Descent With AdaGrad From Scratch

Marc November 20, 2023 at 1:30 am #

Thank you for the article.

Concerning following part:
“[…] allowing larger movements in dimensions with a consistently steep gradient and smaller movements in dimensions with less steep gradients.”

Shouldn’t AdaGrad use bigger steps in flat directions (less steep gradients) and smaller steps along steep directions? That way more progress can be made in flat areas where the likelihood of jumping over minima is low and moving slowly at steep directions will not accidentally jump out of minima.

- James Carmichael November 20, 2023 at 9:15 am #
  
  Hi Marc…You are very welcocome! The following resource will allow you to investigate AdaGrad on a deeper level.
  
  https://optimization.cbe.cornell.edu/index.php?title=AdaGrad

Navigation

Gradient Descent With AdaGrad From Scratch

Tutorial Overview

Gradient Descent

Want to Get Started With Optimization Algorithms?

Adaptive Gradient (AdaGrad)

Gradient Descent With AdaGrad

Two-Dimensional Test Problem

Gradient Descent Optimization With AdaGrad

Visualization of AdaGrad

Further Reading

Papers

Books

APIs

Articles

Summary

Get a Handle on Modern Optimization Algorithms!

Develop Your Understanding of Optimization

Bring Modern Optimization Algorithms to
Your Machine Learning Projects

More On This Topic

2 Responses to Gradient Descent With AdaGrad From Scratch

Leave a Reply Click here to cancel reply.

Navigation

Tutorial Overview

Gradient Descent

Want to Get Started With Optimization Algorithms?

Adaptive Gradient (AdaGrad)

Gradient Descent With AdaGrad

Two-Dimensional Test Problem

Gradient Descent Optimization With AdaGrad

Visualization of AdaGrad

Further Reading

Papers

Books

APIs

Articles

Summary

Get a Handle on Modern Optimization Algorithms!

Develop Your Understanding of Optimization

Bring Modern Optimization Algorithms to Your Machine Learning Projects

More On This Topic

2 Responses to Gradient Descent With AdaGrad From Scratch

Leave a Reply Click here to cancel reply.

Bring Modern Optimization Algorithms to
Your Machine Learning Projects