How to Implement Gradient Descent Optimization from Scratch

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

It is a simple and effective technique that can be implemented with just a few lines of code. It also provides the basis for many extensions and modifications that can result in better performance, including the widely used stochastic gradient descent variant used to train deep learning neural networks.

In this tutorial, you will discover how to implement gradient descent optimization from scratch.

After completing this tutorial, you will know:

  • Gradient descent is a general procedure for optimizing a differentiable objective function.
  • How to implement the gradient descent algorithm from scratch in Python.
  • How to apply the gradient descent algorithm to an objective function.

Kick-start your project with my new book Optimization for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Photo by Bernd Thaller, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Gradient Descent Optimization
  2. Gradient Descent Algorithm
  3. Gradient Descent Worked Example

Gradient Descent Optimization

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the “gradient.”

  • Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for an input.

The gradient points in the direction of steepest ascent of the tangent hyperplane …

— Page 21, Algorithms for Optimization, 2019.

Specifically, the sign of the gradient tells you if the target function is increasing or decreasing at that point.

  • Positive Gradient: Function is increasing at that point.
  • Negative Gradient: Function is decreasing at that point.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill to locate the minimum of the target function.

Similarly, we may refer to gradient ascent for the maximization version of the optimization algorithm that follows the gradient uphill to the maximum of the target function.

  • Gradient Descent: Minimization optimization that follows the negative of the gradient to the minimum of the target function.
  • Gradient Ascent: Maximization optimization that follows the gradient to the maximum of the target function.

Central to gradient descent algorithms is the idea of following the gradient of the target function.

By definition, the optimization algorithm is only appropriate for target functions where the derivative function is available and can be calculated for all input values. This does not apply to all target functions, only so-called differentiable functions.

The main benefit of the gradient descent algorithm is that it is easy to implement and effective on a wide range of optimization problems.

Gradient methods are simple to implement and often perform well.

— Page 115, An Introduction to Optimization, 2001.

Gradient descent refers to a family of algorithms that use the first-order derivative to navigate to an optimum (minimum or maximum) of a target function.

There are many extensions to the main approach that are typically named for the feature added to the algorithm, such as gradient descent with momentum, gradient descent with adaptive gradients, and so on.

Gradient descent is also the basis for the optimization algorithm used to train deep learning neural networks, referred to as stochastic gradient descent, or SGD. In this variation, the target function is an error function and the function gradient is approximated from prediction error on samples from the problem domain.

Now that we are familiar with a high-level idea of gradient descent optimization, let’s look at how we might implement the algorithm.


Gradient Descent Algorithm

In this section, we will take a closer look at the gradient descent algorithm.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the target function.

The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.

  • Objective Function: Calculates a score for a given set of input parameters.
  • Derivative Function: Calculates the derivative (gradient) of the objective function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

  • x_new = x - alpha * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space.

The size of the step taken is scaled using a step size hyperparameter.

  • Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

We have the option of either taking very small steps and re-evaluating the gradient at every step, or we can take large steps each time. The first approach results in a laborious method of reaching the minimizer, whereas the second approach may result in a more zigzag path to the minimizer.

— Page 114, An Introduction to Optimization, 2001.

Finding a good step size may take some trial and error for the specific target function.

The difficulty of choosing the step size can make finding the exact optima of the target function hard. Many extensions involve adapting the learning rate over time to take smaller steps, or different-sized steps in different dimensions, and so on, to allow the algorithm to home in on the function optima.

The process of calculating the derivative of a point and calculating a new point in the input space is repeated until some stop condition is met. This might be a fixed number of steps or target function evaluations, a lack of improvement in target function evaluation over some number of iterations, or the identification of a flat (stationary) area of the search space signified by a gradient of zero.

  • Stop Condition: Decision when to end the search procedure.

Let’s look at how we might implement the gradient descent algorithm in Python.

First, we can define an initial point as a randomly selected point in the input space defined by the bounds of the problem.

The bounds can be defined along with an objective function as an array with a min and max value for each dimension. The rand() NumPy function can be used to generate a vector of random numbers in the range 0-1.
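As a minimal sketch (assuming bounds is a NumPy array with one [min, max] row per dimension), this might look like:

# generate a random initial point within the bounds of the search space
from numpy.random import rand
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])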

We can then calculate the derivative of the point using a function named derivative().

And take a step in the search space to a new point down the hill of the current point.

The new position is calculated using the calculated gradient and the step_size hyperparameter.

We can then evaluate this point and report the performance.
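Sketched as code, one iteration of this process might look as follows (assuming solution holds the current point as a NumPy array):

# calculate the gradient at the current point
gradient = derivative(solution)
# take a step against the gradient, scaled by the step size
solution = solution - step_size * gradient
# evaluate the new point and report progress
solution_eval = objective(solution)
print('f(%s) = %.5f' % (solution, solution_eval))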

This process can be repeated for a fixed number of iterations controlled via an n_iter hyperparameter.

We can tie all of this together into a function named gradient_descent().

The function takes the name of the objective and gradient functions, as well as the bounds on the inputs to the objective function, number of iterations and step size, then returns the solution and its evaluation at the end of the search.

The complete gradient descent optimization algorithm implemented as a function is listed below.
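A minimal sketch of such a function, assuming NumPy and the objective and derivative functions described above:

from numpy.random import rand

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    # generate a random initial point within the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # run a fixed number of gradient descent updates
    for i in range(n_iter):
        # calculate the gradient at the current point
        gradient = derivative(solution)
        # take a step against the gradient
        solution = solution - step_size * gradient
        # evaluate the candidate point
        solution_eval = objective(solution)
        # report progress
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    # return the final solution and its evaluation
    return [solution, solution_eval]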

Now that we are familiar with the gradient descent algorithm, let’s look at a worked example.

Gradient Descent Worked Example

In this section, we will work through an example of applying gradient descent to a simple test optimization function.

First, let’s define an optimization function.

We will use a simple one-dimensional function that squares the input and defines the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.
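A sketch of this function:

# objective function: f(x) = x^2
def objective(x):
    return x**2.0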

We can then sample all inputs in the range and calculate the objective function value for each.

Finally, we can create a line plot of the inputs (x-axis) versus the objective function values (y-axis) to get an intuition for the shape of the objective function that we will be searching.

The example below ties this together and provides an example of plotting the one-dimensional test function.
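A sketch of the complete example, assuming NumPy and Matplotlib are available:

# line plot of a simple one-dimensional function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
    return x**2.0

# define the range for input
r_min, r_max = -1.0, 1.0
# sample the input range uniformly at 0.1 increments
inputs = arange(r_min, r_max + 0.1, 0.1)
# compute the objective function value for each input
results = objective(inputs)
# line plot of input vs result
pyplot.plot(inputs, results)
pyplot.show()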

Running the example creates a line plot of the inputs to the function (x-axis) and the calculated output of the function (y-axis).

We can see the familiar U-shape called a parabola.

Line Plot of Simple One-Dimensional Function

Next, we can apply the gradient descent algorithm to the problem.

First, we need a function that calculates the derivative for this function.

The derivative of x^2 is x * 2 and the derivative() function implements this below.
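One way to write it:

# derivative of the objective function: f'(x) = 2 * x
def derivative(x):
    return x * 2.0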

We can then define the bounds of the objective function, the step size, and the number of iterations for the algorithm.

We will use a step size of 0.1 and 30 iterations, both found after a little experimentation.

Tying this together, the complete example of applying gradient descent optimization to our one-dimensional test function is listed below.
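One way the complete example might look (a sketch, assuming NumPy):

# example of gradient descent on a one-dimensional test function
from numpy import asarray
from numpy.random import rand

# objective function
def objective(x):
    return x**2.0

# derivative of the objective function
def derivative(x):
    return x * 2.0

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    # generate a random initial point within the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # run a fixed number of gradient descent updates
    for i in range(n_iter):
        gradient = derivative(solution)
        solution = solution - step_size * gradient
        solution_eval = objective(solution)
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]

# define the range for input, one [min, max] row per dimension
bounds = asarray([[-1.0, 1.0]])
# define the total iterations and the step size
n_iter = 30
step_size = 0.1
# perform the gradient descent search
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %.5f' % (best, score))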

Running the example starts with a random point in the search space then applies the gradient descent algorithm, reporting performance along the way.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm finds a good solution after about 20-30 iterations, with a function evaluation of about 0.0. Note that the optimum for this function is at f(0.0) = 0.0.

Now, let’s get a feeling for the importance of a good step size.

Set the step size to a large value, such as 1.0, and re-run the search.
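Only the step size line in the example needs to change, for instance:

# define a larger step size
step_size = 1.0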

Run the example with the larger step size and inspect the results.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the search does not find the optimum, and instead bounces around the domain, in this case between the values 0.64820935 and -0.64820935.

Now, try a much smaller step size, such as 1e-8.
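Again, only the step size changes:

# define a much smaller step size
step_size = 1e-8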

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Re-running the search, we can see that the algorithm moves very slowly down the slope of the objective function from the starting point.

These two quick examples highlight the problems in selecting a step size that is too large or too small and the general importance of testing many different step size values for a given objective function.

Finally, we can change the learning rate back to 0.1 and visualize the progress of the search on a plot of the target function.

First, we can update the gradient_descent() function to store all solutions and their score found during the optimization as lists and return them at the end of the search instead of the best solution found.
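A sketch of the modified function:

# gradient descent algorithm, modified to track all solutions and scores
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    # lists to track all solutions and their scores
    solutions, scores = list(), list()
    # generate a random initial point within the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # run a fixed number of gradient descent updates
    for i in range(n_iter):
        # calculate the gradient and take a step
        gradient = derivative(solution)
        solution = solution - step_size * gradient
        # evaluate and record the candidate point
        solution_eval = objective(solution)
        solutions.append(solution)
        scores.append(solution_eval)
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    # return the lists of solutions and scores instead of the final point
    return [solutions, scores]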

The function can be called, and we can get the lists of the solutions and their scores found during the search.

We can create a line plot of the objective function, as before.

Finally, we can plot each solution found as a red dot and connect the dots with a line so we can see how the search moved downhill.

Tying this all together, the complete example of plotting the result of the gradient descent search on the one-dimensional test function is listed below.
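A sketch of the complete example:

# plot a gradient descent search on a one-dimensional test function
from numpy import asarray, arange
from numpy.random import rand
from matplotlib import pyplot

# objective function
def objective(x):
    return x**2.0

# derivative of the objective function
def derivative(x):
    return x * 2.0

# gradient descent algorithm that tracks all solutions and scores
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    solutions, scores = list(), list()
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    for i in range(n_iter):
        gradient = derivative(solution)
        solution = solution - step_size * gradient
        solution_eval = objective(solution)
        solutions.append(solution)
        scores.append(solution_eval)
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solutions, scores]

# define the range for input
bounds = asarray([[-1.0, 1.0]])
# define the total iterations and the step size
n_iter = 30
step_size = 0.1
# perform the search and collect the solutions and scores
solutions, scores = gradient_descent(objective, derivative, bounds, n_iter, step_size)
# line plot of the objective function
inputs = arange(bounds[0, 0], bounds[0, 1] + 0.1, 0.1)
pyplot.plot(inputs, objective(inputs))
# plot each solution found as a red dot, joined by a line
pyplot.plot(solutions, scores, '.-', color='red')
pyplot.show()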

Running the example performs the gradient descent search on the objective function as before, except in this case, each point found during the search is plotted.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the search started about halfway up the left part of the function and stepped downhill to the bottom of the basin.

We can see that in the parts of the objective function where the slope is steeper, the derivative (gradient) is larger, and in turn, larger steps are taken. Similarly, the gradient is smaller as we get closer to the optimum, and in turn, smaller steps are taken.

This highlights that the step size is used as a scale factor on the magnitude of the gradient (slope) of the objective function.

Plot of the Progress of Gradient Descent on a One-Dimensional Objective Function

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Algorithms for Optimization, 2019.
  • An Introduction to Optimization, 2001.

Summary

In this tutorial, you discovered how to implement gradient descent optimization from scratch.

Specifically, you learned:

  • Gradient descent is a general procedure for optimizing a differentiable objective function.
  • How to implement the gradient descent algorithm from scratch in Python.
  • How to apply the gradient descent algorithm to an objective function.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

17 Responses to How to Implement Gradient Descent Optimization from Scratch

  1. Justin, April 16, 2021 at 12:53 pm

    Good stuff. Your “from scratch” blogs are my favorite!

  2. sara, April 17, 2021 at 1:52 pm

    Thank you Jason. When you have time, could you please write a blog about using a pre-trained model for time series forecasting/prediction? I saw your post about CNNs. I was wondering if you could do a post for a pre-trained LSTM, random forest, or MLP model.

  3. Scott, April 27, 2021 at 5:46 am

    This is a good overview. What I would like to see is a discussion of what to do when you don’t know what the derivative of your target function is. After all, finding the minimum of a parabola is rather trivial. What happens when you have a large, multidimensional space to search and computation time is of the essence? How do you determine the gradient? How do you decide what the step size should be in each dimension? Thank you for this tutorial.

    • Jason Brownlee, April 28, 2021 at 5:52 am

      Thanks.

      You can use a direct search method (Nelder-Mead) or a stochastic method like a genetic algorithm, simulated annealing, differential evolution, etc. if you don’t have a derivative.

      Large space means more searching. If eval is slow, you can use a surrogate function/proxy function.

      • julien G, September 7, 2021 at 7:33 am

        Isn’t it possible to use gradient descent anyway and estimate the gradient numerically? Using finite differences, for instance…

  4. Can, May 10, 2021 at 3:05 pm

    Hello Jason,

    You did f(x) = x ^ 2 and f’ is 2x. When x > 0 gradient is positive and you decrease x thereby minimizing loss.

    When f(x) = x ^ 2 - 4 though, just adding a constant: gradient also becomes 2x.

    Minima will be at x = 2 which is zero. When x = 1, gradient becomes positive and
    x := x - alpha * (2x)
    will decrease the x instead of increasing it. How do we solve this? (without taking 2nd order derivative)

    • Jason Brownlee, May 11, 2021 at 6:38 am

      The gradient may be positive or negative, ensuring we move the coefficients in the correct direction to reduce loss.

  5. Can, May 10, 2021 at 4:59 pm

    Oh it’s f = (x-2)^2, not x^2 - 4, my bad.

  6. Henry, June 19, 2021 at 4:08 pm

    step by step, very clearly

  7. Gordon, August 13, 2021 at 6:52 pm

    uhhh where did you find a function named derivative?

    • Adrian Tam, August 14, 2021 at 3:12 am

      You have to define it according to your objective function. For example, if the objective function is f(x)=x*x, then the derivative function is f'(x)=2*x.

  8. Henrik, September 29, 2022 at 11:38 pm

    But if you have a different objective function, returning rmse or roc_auc or some other metric, how do you derive that?
