A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size

Stochastic gradient descent is the dominant method used to train deep learning models.

There are three main variants of gradient descent and it can be confusing which one to use.

In this post, you will discover the one type of gradient descent you should use in general and how to configure it.

After completing this post, you will know:

  • What gradient descent is and how it works from a high level.
  • What batch, stochastic, and mini-batch gradient descent are and the benefits and limitations of each method.
  • That mini-batch gradient descent is the go-to method and how to configure it for your applications.

Let’s get started.


Tutorial Overview

This tutorial is divided into 3 parts; they are:

  1. What is Gradient Descent?
  2. Contrasting the 3 Types of Gradient Descent
  3. How to Configure Mini-Batch Gradient Descent

What is Gradient Descent?

Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression.

It works by having the model make predictions on training data and using the error on the predictions to update the model in such a way as to reduce the error.

The goal of the algorithm is to find model parameters (e.g. coefficients or weights) that minimize the error of the model on the training dataset. It does this by making changes to the model that move it along a gradient or slope of errors down toward a minimum error value. This gives the algorithm its name of “gradient descent.”

The pseudocode sketch below summarizes the gradient descent algorithm:
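
As one concrete illustration, a minimal Python sketch of this loop might look like the following. It assumes a linear model and a mean squared error loss, and the names `predict`, `gradient_descent`, `learning_rate`, and `n_epochs` are illustrative choices rather than anything prescribed by the algorithm.

```python
import numpy as np

def predict(X, weights):
    """Linear model (an assumption for this sketch): one prediction per row of X."""
    return X @ weights

def gradient_descent(X, y, learning_rate=0.1, n_epochs=200):
    """Repeatedly predict, measure the error, and step the weights downhill."""
    weights = np.zeros(X.shape[1])              # initial model
    for _ in range(n_epochs):
        errors = predict(X, weights) - y        # error of the predictions
        gradient = X.T @ errors / len(y)        # slope of half the mean squared error
        weights -= learning_rate * gradient     # update the model to reduce the error
    return weights

# Synthetic, noise-free data so the true weights are known.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_weights = np.array([2.0, -3.0])
y = X @ true_weights

weights = gradient_descent(X, y)
print(weights)  # should be close to true_weights
```

Real models usually replace the hand-written gradient with one computed by a deep learning library, but the predict, measure, update cycle is the same.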

Contrasting the 3 Types of Gradient Descent

Gradient descent can vary in terms of the number of training patterns used to calculate the error that is, in turn, used to update the model.

The number of patterns used to calculate the error influences how stable the gradient is that is used to update the model. We will see that gradient descent configurations involve a tension between computational efficiency and the fidelity of the error gradient.

The three main flavors of gradient descent are batch, stochastic, and mini-batch.

Let’s take a closer look at each.

What is Stochastic Gradient Descent?

Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.

The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.

Upsides

  • The frequent updates immediately give an insight into the performance of the model and the rate of improvement.
  • This variant of gradient descent may be the simplest to understand and implement, especially for beginners.
  • The increased model update frequency can result in faster learning on some problems.
  • The noisy update process can allow the model to avoid local minima (e.g. premature convergence).

Downsides

  • Updating the model so frequently is more computationally expensive than other configurations of gradient descent, taking significantly longer to train models on large datasets.
  • The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the model error to jump around (have a higher variance over training epochs).
  • The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error minimum for the model.
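
A minimal sketch of the per-example update described above, assuming a linear model with squared error; the function name `stochastic_gradient_descent` and its default settings are illustrative assumptions.

```python
import numpy as np

def stochastic_gradient_descent(X, y, learning_rate=0.01, n_epochs=20, seed=0):
    """Update the weights after every single training example."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(X.shape[1])
    n_updates = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):             # visit examples in random order
            error = X[i] @ weights - y[i]             # error on one example
            weights -= learning_rate * error * X[i]   # immediate model update
            n_updates += 1
    return weights, n_updates

# Synthetic, noise-free data so the true weights are known.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_weights = np.array([2.0, -3.0])
y = X @ true_weights

weights, n_updates = stochastic_gradient_descent(X, y)
print(n_updates)  # 2000: one update per example per epoch (20 * 100)
```

The update count grows with the dataset size, which is where both the fast feedback and the computational cost come from.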

What is Batch Gradient Descent?

Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.

One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.

Upsides

  • Fewer updates to the model means this variant of gradient descent is more computationally efficient than stochastic gradient descent.
  • The decreased update frequency results in a more stable error gradient and may result in a more stable convergence on some problems.
  • The separation of the calculation of prediction errors and the model update lends the algorithm to parallel processing based implementations.

Downsides

  • The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.
  • The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all training examples.
  • Commonly, batch gradient descent is implemented in such a way that it requires the entire training dataset in memory and available to the algorithm.
  • Model updates, and in turn training speed, may become very slow for large datasets.
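
For contrast, here is a sketch of batch gradient descent under the same assumptions (linear model, squared error, illustrative names). The gradient is accumulated over every training example and the weights change only once per epoch.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, n_epochs=200):
    """Accumulate the gradient over every example, then update once per epoch."""
    weights = np.zeros(X.shape[1])
    n_updates = 0
    for _ in range(n_epochs):
        gradient_sum = np.zeros_like(weights)
        for x_i, y_i in zip(X, y):                # evaluate every training example
            error = x_i @ weights - y_i
            gradient_sum += error * x_i           # accumulate, but do not update yet
        weights -= learning_rate * gradient_sum / len(y)   # single update per epoch
        n_updates += 1
    return weights, n_updates

# Synthetic, noise-free data so the true weights are known.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_weights = np.array([2.0, -3.0])
y = X @ true_weights

weights, n_updates = batch_gradient_descent(X, y)
print(n_updates)  # 200: one update per epoch
```

Because the inner loop only accumulates, it could be split across workers or replaced by a single matrix product, which is the parallelism benefit noted in the upsides.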

What is Mini-Batch Gradient Descent?

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.

Implementations may choose to sum the gradient over the mini-batch or to take the average of the gradient, which further reduces the variance of the gradient.
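
To make the two options concrete, write $m$ for the mini-batch size, $B$ for the set of examples in the mini-batch, and $L_i(\theta)$ for the loss on example $i$ with parameters $\theta$. This notation is introduced here for illustration. The variance line assumes the examples are drawn independently and that each per-example gradient component has variance $\sigma^2$.

```latex
g_{\text{sum}} = \sum_{i \in B} \nabla_\theta L_i(\theta),
\qquad
g_{\text{avg}} = \frac{1}{m} \sum_{i \in B} \nabla_\theta L_i(\theta)
\\[4pt]
\operatorname{Var}(g_{\text{sum}}) = m\,\sigma^2,
\qquad
\operatorname{Var}(g_{\text{avg}}) = \frac{\sigma^2}{m}
```

The two gradients differ only by the factor $m$, so a summed gradient with learning rate $\eta / m$ produces the same update as an averaged gradient with learning rate $\eta$.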

Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.

Upsides

  • The model update frequency is higher than in batch gradient descent, which allows for more robust convergence and helps to avoid local minima.
  • The batched updates provide a computationally more efficient process than stochastic gradient descent.
  • Batching brings efficiency both in memory, since not all training data must be loaded at once, and in algorithm implementations.

Downsides

  • Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.
  • Error information must be accumulated across the training examples in each mini-batch, as in batch gradient descent.
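
The sketch below shows one way to split a dataset into mini-batches and update on each, again assuming a linear model with squared error. The helper names `iterate_minibatches` and `minibatch_gradient_descent` are illustrative. Note that when the dataset size is not a multiple of the batch size, this sketch simply uses a smaller final batch: 1,000 examples with a batch size of 128 give seven full batches and one batch of 104.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield shuffled index arrays; the final batch may be smaller."""
    indices = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

def minibatch_gradient_descent(X, y, batch_size=32, learning_rate=0.1,
                               n_epochs=50, seed=0):
    """Update the weights once per mini-batch, using the averaged gradient."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for batch in iterate_minibatches(len(y), batch_size, rng):
            errors = X[batch] @ weights - y[batch]
            gradient = X[batch].T @ errors / len(batch)   # average over the batch
            weights -= learning_rate * gradient
    return weights

rng = np.random.default_rng(2)
batch_sizes = [len(b) for b in iterate_minibatches(1000, 128, rng)]
print(batch_sizes)  # [128, 128, 128, 128, 128, 128, 128, 104]

# Synthetic, noise-free data so the true weights are known.
X = rng.normal(size=(1000, 2))
true_weights = np.array([2.0, -3.0])
y = X @ true_weights
weights = minibatch_gradient_descent(X, y)
```

Setting `batch_size=1` in this sketch recovers stochastic gradient descent, and setting it to the dataset size recovers batch gradient descent.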

How to Configure Mini-Batch Gradient Descent

Mini-batch gradient descent is the recommended variant of gradient descent for most applications, especially in deep learning.

Mini-batch sizes, commonly called “batch sizes” for brevity, are often tuned to an aspect of the computational architecture on which the implementation is being executed, such as a power of two that fits the memory requirements of the GPU or CPU hardware, like 32, 64, 128, 256, and so on.

Batch size is a slider on the learning process.

  • Small values give a learning process that converges quickly at the cost of noise in the training process.
  • Large values give a learning process that converges slowly with accurate estimates of the error gradient.
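
The accuracy side of this trade-off can be checked empirically. The sketch below, on assumed synthetic data with an assumed linear model, draws many random mini-batches at a fixed set of weights and measures how much the gradient estimate varies for each batch size. The helper names are illustrative.

```python
import numpy as np

def minibatch_gradient(X, y, weights, batch):
    """Averaged squared-error gradient of a linear model on one mini-batch."""
    errors = X[batch] @ weights - y[batch]
    return X[batch].T @ errors / len(batch)

def gradient_noise(X, y, weights, batch_size, n_trials=500, seed=0):
    """Std. dev. of the first gradient component across random mini-batches."""
    rng = np.random.default_rng(seed)
    estimates = [
        minibatch_gradient(
            X, y, weights,
            rng.choice(len(y), size=batch_size, replace=False))[0]
        for _ in range(n_trials)
    ]
    return float(np.std(estimates))

# Synthetic data with some label noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.5, size=2000)
weights = np.zeros(2)   # measure the noise at the starting point

noise = {b: gradient_noise(X, y, weights, b) for b in (1, 32, 256)}
for batch_size, std in noise.items():
    print(f"batch size {batch_size:>3}: gradient std {std:.3f}")
```

Larger batches should show a smaller spread, at the price of more computation per update.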

Tip 1: A good default for batch size might be 32.

… [batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value, with values above 10 taking advantage of the speedup of matrix-matrix products over matrix-vector products.

Practical recommendations for gradient-based training of deep architectures, 2012

Tip 2: It is a good idea to review learning curves of model validation error against training time with different batch sizes when tuning the batch size.

… it can be optimized separately of the other hyperparameters, by comparing training curves (training and validation error vs amount of training time), after the other hyper-parameters (except learning rate) have been selected.
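
A minimal sketch of collecting such curves, assuming a linear model, synthetic data, and a simple hold-out split; the function name `learning_curve` and all settings are illustrative. Each curve records the validation error against elapsed training time, one point per epoch, for a given batch size.

```python
import time
import numpy as np

def learning_curve(X_train, y_train, X_val, y_val, batch_size,
                   learning_rate=0.05, n_epochs=10, seed=0):
    """Return a list of (elapsed_seconds, validation_mse), one entry per epoch."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(X_train.shape[1])
    curve = []
    start_time = time.perf_counter()
    for _ in range(n_epochs):
        indices = rng.permutation(len(y_train))
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = X_train[batch] @ weights - y_train[batch]
            weights -= learning_rate * X_train[batch].T @ errors / len(batch)
        val_mse = float(np.mean((X_val @ weights - y_val) ** 2))
        curve.append((time.perf_counter() - start_time, val_mse))
    return curve

# Synthetic data with label noise, split into training and validation sets.
rng = np.random.default_rng(4)
X = rng.normal(size=(1200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.5]) + rng.normal(scale=0.5, size=1200)
X_train, y_train, X_val, y_val = X[:1000], y[:1000], X[1000:], y[1000:]

curves = {b: learning_curve(X_train, y_train, X_val, y_val, batch_size=b)
          for b in (8, 32, 128)}
for batch_size, curve in curves.items():
    elapsed, val_mse = curve[-1]
    print(f"batch size {batch_size:>3}: final validation MSE {val_mse:.3f}")
```

Plotting each curve with time on the horizontal axis and validation error on the vertical axis gives the comparison described in the tip.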

Tip 3: Tune batch size and learning rate after tuning all other hyperparameters.

… [batch size] and [learning rate] may slightly interact with other hyper-parameters so both should be re-optimized at the end. Once [batch size] is selected, it can generally be fixed while the other hyper-parameters can be further optimized (except for a momentum hyper-parameter, if one is used).

Summary

In this post, you discovered the gradient descent algorithm and the version that you should use in practice.

Specifically, you learned:

  • What gradient descent is and how it works from a high level.
  • What batch, stochastic, and mini-batch gradient descent are and the benefits and limitations of each method.
  • That mini-batch gradient descent is the go-to method and how to configure it for your applications.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

9 Responses to A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size

  1. Jie July 28, 2017 at 2:19 pm #

    In mini-batch part, “The model update frequency is lower than batch gradient descent which allows for a more robust convergence, avoiding local minima.”
    I think this is lower than SGD, rather than BGD, am I wrong?

  2. Darlington July 31, 2017 at 12:29 pm #

    Wait, so won’t that make Adam a mini-batch gradient descent algorithm, instead of stochastic gradient descent? (At least, in Keras’ implementation)
    Since in Keras, when using Adam, you can still set batch size, rather than have it update weights per each data point

    • Jason Brownlee July 31, 2017 at 3:50 pm #

      The idea of batches in SGD and the Adam optimizations of SGD are orthogonal.

      You can use batches with or without Adam.

      More on Adam here:
      http://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/

      • Darlington August 1, 2017 at 3:00 am #

        Oh ok, and also isn’t SGD called so because gradient descent is a greedy algorithm that searches for a minimum along a slope, which can lead to it getting stuck in local minima, and to prevent that, stochastic gradient descent uses various random iterations and then approximates the global minimum from all slopes, hence the “stochastic”?

        • Jason Brownlee August 1, 2017 at 8:04 am #

          Yes, right on, it adds noise to the process which allows the process to escape local optima in search of something better.

  3. Sabih August 22, 2017 at 3:33 am #

    Suppose my training data size is 1000 and the batch size I selected is 128.
    I would like to know how the algorithm deals with the last batch, which is smaller than the batch size.
    In this case, 7 weight updates will be done by the time the algorithm reaches 896 training samples.

    Now what happens to the remaining 104 training samples?
    Will it ignore the last batch, or will it use 24 samples from the next epoch?

    • Jason Brownlee August 22, 2017 at 6:47 am #

      It uses a smaller batch size for the last batch. The samples are still used.

      • Sabih August 22, 2017 at 9:02 am #

        Thanks for the clarification.
