Last Updated on April 8, 2023
The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it has only recently been applied at scale to deep learning.
Gradient descent is an iterative optimization method that finds the minimum of an objective function by updating the parameters at each step. With each iteration, it takes a small step in the direction of the negative gradient until convergence, or until a stopping criterion is met.
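Concretely, if $L(\theta)$ is the objective function and $\eta$ is the learning rate, each iteration applies the update

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t),$$

moving the parameters $\theta$ a small step against the gradient.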
In this tutorial, you will train a simple linear regression model with two trainable parameters and explore how gradient descent works and how to implement it in PyTorch. Particularly, you’ll learn about:
- Gradient Descent algorithm and its implementation in PyTorch
- Batch Gradient Descent and its implementation in PyTorch
- Stochastic Gradient Descent and its implementation in PyTorch
- How Batch Gradient Descent and Stochastic Gradient Descent are different from each other
- How loss decreases in Batch Gradient Descent and Stochastic Gradient Descent during training
Kick-start your project with my book Deep Learning with PyTorch. It provides self-study tutorials with working code.
So, let’s get started.

Implementing Gradient Descent in PyTorch.
Picture by Michael Behrens. Some rights reserved.
Overview
This tutorial is in four parts; they are:
- Preparing Data
- Batch Gradient Descent
- Stochastic Gradient Descent
- Plotting Graphs for Comparison
Preparing Data
To keep the model simple for illustration, we will use the linear regression problem as in the last tutorial. The data is synthetic and generated as follows:
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

# Creating a function f(X) with a slope of -5
X = torch.arange(-5, 5, 0.1).view(-1, 1)
func = -5 * X

# Adding Gaussian noise to the function f(X) and saving it in Y
Y = func + 0.4 * torch.randn(X.size())
```
Same as in the previous tutorial, we initialize a variable `X` with values ranging from $-5$ to $5$, and create a linear function with a slope of $-5$. Then, Gaussian noise is added to create the variable `Y`.
We can plot the data using matplotlib to visualize the pattern:
```python
...
# Plot and visualize the data points in blue
plt.plot(X.numpy(), Y.numpy(), 'b+', label='Y')
plt.plot(X.numpy(), func.numpy(), 'r', label='func')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, color='y')
plt.show()
```

Data points for regression model
Batch Gradient Descent
Now that we have created the data for our model, we'll build a forward function based on a simple linear regression equation. The model has two trainable parameters ($w$ and $b$). We will also need a loss criterion; because this is a regression problem on continuous values, MSE loss is appropriate.
```python
...
# defining the function for forward pass for prediction
def forward(x):
    return w * x + b

# evaluating data points with Mean Square Error (MSE)
def criterion(y_pred, y):
    return torch.mean((y_pred - y) ** 2)
```
Before we train our model, let's learn about batch gradient descent. In batch gradient descent, all the samples in the training data are considered in a single step. The parameters are updated by taking the mean gradient over all the training examples. In other words, there is only one gradient descent step per epoch.
While Batch Gradient Descent is the best choice for smooth error manifolds, it's relatively slow and computationally expensive, especially if you have a large dataset for training.
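For the linear model and MSE loss used in this tutorial, one batch gradient descent step with learning rate $\eta$ over $N$ samples works out to

$$w \leftarrow w - \eta \cdot \frac{2}{N}\sum_{i=1}^{N}\big(w x_i + b - y_i\big)\,x_i, \qquad b \leftarrow b - \eta \cdot \frac{2}{N}\sum_{i=1}^{N}\big(w x_i + b - y_i\big),$$

which is exactly the mean gradient that autograd computes for us in the code below.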
Training with Batch Gradient Descent
Let's randomly initialize the trainable parameters $w$ and $b$, and define some training settings such as the learning rate (step size), an empty list to store the loss, and the number of epochs for training.
```python
w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

step_size = 0.1
loss_BGD = []
n_iter = 20
```
We'll train our model for 20 epochs using the lines of code below. Here, the `forward()` function generates the prediction, while the `criterion()` function measures the loss and stores it in the `loss` variable. The `backward()` method performs the gradient computations, and the updated parameters are stored in `w.data` and `b.data`.
```python
for i in range(n_iter):
    # making predictions with forward pass
    Y_pred = forward(X)
    # calculating the loss between original and predicted data points
    loss = criterion(Y_pred, Y)
    # storing the calculated loss in a list
    loss_BGD.append(loss.item())
    # backward pass for computing the gradients of the loss w.r.t. the learnable parameters
    loss.backward()
    # updating the parameters after each iteration
    w.data = w.data - step_size * w.grad.data
    b.data = b.data - step_size * b.grad.data
    # zeroing gradients after each iteration
    w.grad.data.zero_()
    b.grad.data.zero_()
    # printing the values for understanding
    print('{}, \t{}, \t{}, \t{}'.format(i, loss.item(), w.item(), b.item()))
```
Here is how the output looks, showing how the loss and the parameters are updated after every epoch when we apply batch gradient descent.
```
0, 596.7191162109375, -1.8527469635009766, -16.062074661254883
1, 343.426513671875, -7.247585773468018, -12.83026123046875
2, 202.7098388671875, -3.616910219192505, -10.298759460449219
3, 122.16651153564453, -6.0132551193237305, -8.237251281738281
4, 74.85094451904297, -4.394278526306152, -6.6120076179504395
5, 46.450958251953125, -5.457883358001709, -5.295622825622559
6, 29.111614227294922, -4.735295295715332, -4.2531514167785645
7, 18.386211395263672, -5.206836700439453, -3.4119482040405273
8, 11.687058448791504, -4.883906364440918, -2.7437009811401367
9, 7.4728569984436035, -5.092618465423584, -2.205873966217041
10, 4.808231830596924, -4.948029518127441, -1.777699589729309
11, 3.1172332763671875, -5.040188312530518, -1.4337140321731567
12, 2.0413269996643066, -4.975278854370117, -1.159447193145752
13, 1.355530858039856, -5.0158305168151855, -0.9393846988677979
14, 0.9178376793861389, -4.986582279205322, -0.7637402415275574
15, 0.6382412910461426, -5.004333972930908, -0.6229321360588074
16, 0.45952412486076355, -4.991086006164551, -0.5104631781578064
17, 0.34523946046829224, -4.998797416687012, -0.42035552859306335
18, 0.27213525772094727, -4.992753028869629, -0.3483465909957886
19, 0.22536347806453705, -4.996064186096191, -0.2906789183616638
```
Putting it all together, the following is the complete code:
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

X = torch.arange(-5, 5, 0.1).view(-1, 1)
func = -5 * X
Y = func + 0.4 * torch.randn(X.size())

# defining the function for forward pass for prediction
def forward(x):
    return w * x + b

# evaluating data points with Mean Square Error (MSE)
def criterion(y_pred, y):
    return torch.mean((y_pred - y) ** 2)

w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

step_size = 0.1
loss_BGD = []
n_iter = 20

for i in range(n_iter):
    # making predictions with forward pass
    Y_pred = forward(X)
    # calculating the loss between original and predicted data points
    loss = criterion(Y_pred, Y)
    # storing the calculated loss in a list
    loss_BGD.append(loss.item())
    # backward pass for computing the gradients of the loss w.r.t. the learnable parameters
    loss.backward()
    # updating the parameters after each iteration
    w.data = w.data - step_size * w.grad.data
    b.data = b.data - step_size * b.grad.data
    # zeroing gradients after each iteration
    w.grad.data.zero_()
    b.grad.data.zero_()
    # printing the values for understanding
    print('{}, \t{}, \t{}, \t{}'.format(i, loss.item(), w.item(), b.item()))
```
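As an aside, the same full-batch update can be written with PyTorch's built-in `torch.optim.SGD` optimizer instead of assigning to `w.data` and `b.data` by hand. The following is a minimal sketch of that variant; it assumes `X`, `Y`, `forward()`, and `criterion()` are already defined as above.

```python
# a sketch of the same batch gradient descent loop using torch.optim.SGD
w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)

for i in range(20):
    Y_pred = forward(X)           # forward pass on the full batch
    loss = criterion(Y_pred, Y)   # MSE over all samples
    optimizer.zero_grad()         # clear old gradients
    loss.backward()               # compute gradients of the loss w.r.t. w and b
    optimizer.step()              # apply w -= lr * w.grad and b -= lr * b.grad
    print(i, loss.item(), w.item(), b.item())
```

Here `optimizer.step()` applies exactly the update `w -= lr * w.grad` (and likewise for `b`), and `optimizer.zero_grad()` replaces the manual calls to `w.grad.data.zero_()` and `b.grad.data.zero_()`.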
Stochastic Gradient Descent
As we have learned, batch gradient descent is not a suitable choice for very large training datasets. However, deep learning algorithms are data hungry and often require large quantities of data for training. For instance, with a dataset of millions of training examples, batch gradient descent would require the model to compute the gradient over all of the data in a single step.
This is not efficient, and the alternative is stochastic gradient descent (SGD). Stochastic gradient descent considers only a single sample from the training data at a time, computes the gradient from it, and takes a step to update the weights. Therefore, if we have $N$ samples in the training data, there will be $N$ steps in each epoch.
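In contrast to the batch update above, a stochastic gradient descent step uses the squared error of a single sample $(x_i, y_i)$, so the corresponding updates are

$$w \leftarrow w - \eta \cdot 2\big(w x_i + b - y_i\big)\,x_i, \qquad b \leftarrow b - \eta \cdot 2\big(w x_i + b - y_i\big),$$

applied once per sample rather than once per epoch.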
Training with Stochastic Gradient Descent
To train our model with stochastic gradient descent, we’ll randomly initialize the trainable parameters $w$ and $b$ as we did for the batch gradient descent above. Here we’ll define an empty list to store the loss for stochastic gradient descent and train the model for 20 epochs. The following is the complete code modified from the previous example:
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

X = torch.arange(-5, 5, 0.1).view(-1, 1)
func = -5 * X
Y = func + 0.4 * torch.randn(X.size())

# defining the function for forward pass for prediction
def forward(x):
    return w * x + b

# evaluating data points with Mean Square Error (MSE)
def criterion(y_pred, y):
    return torch.mean((y_pred - y) ** 2)

w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

step_size = 0.1
loss_SGD = []
n_iter = 20

for i in range(n_iter):
    # calculating the true loss and storing it
    Y_pred = forward(X)
    # store the loss in the list
    loss_SGD.append(criterion(Y_pred, Y).tolist())

    for x, y in zip(X, Y):
        # making a prediction in forward pass
        y_hat = forward(x)
        # calculating the loss between original and predicted data points
        loss = criterion(y_hat, y)
        # backward pass for computing the gradients of the loss w.r.t. the learnable parameters
        loss.backward()
        # updating the parameters after each iteration
        w.data = w.data - step_size * w.grad.data
        b.data = b.data - step_size * b.grad.data
        # zeroing gradients after each iteration
        w.grad.data.zero_()
        b.grad.data.zero_()
        # printing the values for understanding
        print('{}, \t{}, \t{}, \t{}'.format(i, loss.item(), w.item(), b.item()))
```
This prints a long list of values, such as the following:
```
0, 24.73763084411621, -5.02630615234375, -20.994739532470703
0, 455.0946960449219, -25.93259620666504, -16.7281494140625
0, 6968.82666015625, 54.207733154296875, -33.424049377441406
0, 97112.9140625, -238.72393798828125, 28.901844024658203
....
19, 8858971136.0, -1976796.625, 8770213.0
19, 271135948800.0, -1487331.875, 8874354.0
19, 3010866446336.0, -3153109.5, 8527317.0
19, 47926483091456.0, 3631328.0, 9911896.0
```
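Notice that in the run above the per-sample loss does not settle down; it grows by orders of magnitude. With `step_size = 0.1`, a single noisy sample can produce a gradient large enough to overshoot the minimum, and the per-sample updates then diverge. A common remedy is to use a much smaller learning rate for per-sample updates and to shuffle the data at the start of every epoch. The sketch below illustrates this; the learning rate of 0.001 and the shuffling step are illustrative choices rather than part of the original example, and it assumes `X`, `Y`, `forward()`, and `criterion()` from the code above.

```python
...
# Re-initialize the parameters and retry SGD with a smaller (illustrative) learning rate
w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)
step_size = 0.001  # assumed value; much smaller than the 0.1 used for batch gradient descent

for i in range(20):
    # shuffle the samples at the start of each epoch
    idx = torch.randperm(Y.shape[0])
    for x, y in zip(X[idx], Y[idx]):
        loss = criterion(forward(x), y)
        loss.backward()
        # per-sample parameter update
        w.data = w.data - step_size * w.grad.data
        b.data = b.data - step_size * b.grad.data
        w.grad.data.zero_()
        b.grad.data.zero_()
    # report the full-data loss once per epoch
    print(i, criterion(forward(X), Y).item(), w.item(), b.item())
```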
Plotting Graphs for Comparison
Now that we have trained our model using batch gradient descent and stochastic gradient descent, let's visualize how the loss decreases for both methods during training. The graph for batch gradient descent is produced as follows.
```python
...
plt.plot(loss_BGD, label="Batch Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()
plt.show()
```

The loss history of batch gradient descent
Similarly, here is how the graph for stochastic gradient descent looks.
```python
plt.plot(loss_SGD, label="Stochastic Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()
plt.show()
```

Loss history of stochastic gradient descent
As you can see, the loss decreases smoothly for batch gradient descent. On the other hand, you'll observe fluctuations in the graph for stochastic gradient descent. As mentioned earlier, the reason is simple: in batch gradient descent, the parameters are updated once after all the training samples are processed, while stochastic gradient descent updates the parameters after every single training sample, so the loss recorded at each epoch reflects many small, noisy steps.
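If you'd like to see the two loss histories side by side, you can also draw them on the same axes. This is a small optional sketch that simply reuses the `loss_BGD` and `loss_SGD` lists collected above:

```python
...
# Plot both loss histories on the same axes for a direct comparison
plt.plot(loss_BGD, label="Batch Gradient Descent")
plt.plot(loss_SGD, label="Stochastic Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()
plt.show()
```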
Putting everything together, below is the complete code:
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

# Creating a function f(X) with a slope of -5
X = torch.arange(-5, 5, 0.1).view(-1, 1)
func = -5 * X
# Adding Gaussian noise to the function f(X) and saving it in Y
Y = func + 0.4 * torch.randn(X.size())

# Plot and visualize the data points in blue
plt.plot(X.numpy(), Y.numpy(), 'b+', label='Y')
plt.plot(X.numpy(), func.numpy(), 'r', label='func')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, color='y')
plt.show()

# defining the function for forward pass for prediction
def forward(x):
    return w * x + b

# evaluating data points with Mean Square Error (MSE)
def criterion(y_pred, y):
    return torch.mean((y_pred - y) ** 2)

# Batch gradient descent
w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)
step_size = 0.1
loss_BGD = []
n_iter = 20

for i in range(n_iter):
    # making predictions with forward pass
    Y_pred = forward(X)
    # calculating the loss between original and predicted data points
    loss = criterion(Y_pred, Y)
    # storing the calculated loss in a list
    loss_BGD.append(loss.item())
    # backward pass for computing the gradients of the loss w.r.t. the learnable parameters
    loss.backward()
    # updating the parameters after each iteration
    w.data = w.data - step_size * w.grad.data
    b.data = b.data - step_size * b.grad.data
    # zeroing gradients after each iteration
    w.grad.data.zero_()
    b.grad.data.zero_()
    # printing the values for understanding
    print('{}, \t{}, \t{}, \t{}'.format(i, loss.item(), w.item(), b.item()))

# Stochastic gradient descent
w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)
step_size = 0.1
loss_SGD = []
n_iter = 20

for i in range(n_iter):
    # calculating the true loss and storing it
    Y_pred = forward(X)
    # store the loss in the list
    loss_SGD.append(criterion(Y_pred, Y).tolist())
    for x, y in zip(X, Y):
        # making a prediction in forward pass
        y_hat = forward(x)
        # calculating the loss between original and predicted data points
        loss = criterion(y_hat, y)
        # backward pass for computing the gradients of the loss w.r.t. the learnable parameters
        loss.backward()
        # updating the parameters after each iteration
        w.data = w.data - step_size * w.grad.data
        b.data = b.data - step_size * b.grad.data
        # zeroing gradients after each iteration
        w.grad.data.zero_()
        b.grad.data.zero_()
        # printing the values for understanding
        print('{}, \t{}, \t{}, \t{}'.format(i, loss.item(), w.item(), b.item()))

# Plot graphs
plt.plot(loss_BGD, label="Batch Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()
plt.show()

plt.plot(loss_SGD, label="Stochastic Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()
plt.show()
```
Summary
In this tutorial, you learned about gradient descent, some of its variations, and how to implement them in PyTorch. Particularly, you learned about:
- Gradient Descent algorithm and its implementation in PyTorch
- Batch Gradient Descent and its implementation in PyTorch
- Stochastic Gradient Descent and its implementation in PyTorch
- How Batch Gradient Descent and Stochastic Gradient Descent are different from each other
- How loss decreases in Batch Gradient Descent and Stochastic Gradient Descent during training