Stochastic gradient descent is a learning algorithm that has a number of hyperparameters.
Two hyperparameters that often confuse beginners are the batch size and number of epochs. They are both integer values and seem to do the same thing.
In this post, you will discover the difference between batches and epochs in stochastic gradient descent.
After reading this post, you will know:
- Stochastic gradient descent is an iterative learning algorithm that uses a training dataset to update a model.
- The batch size is a hyperparameter of gradient descent that controls the number of training samples to work through before the model’s internal parameters are updated.
- The number of epochs is a hyperparameter of gradient descent that controls the number of complete passes through the training dataset.
Discover how to develop deep learning models for a range of predictive modeling problems with just a few lines of code in my new book, with 18 step-by-step tutorials and 9 projects.
Let’s get started.

What is the Difference Between a Batch and an Epoch in a Neural Network?
Photo by Graham Cook, some rights reserved.
Overview
This post is divided into five parts; they are:
- Stochastic Gradient Descent
- What Is a Sample?
- What Is a Batch?
- What Is an Epoch?
- What Is the Difference Between Batch and Epoch?
Stochastic Gradient Descent
Stochastic Gradient Descent, or SGD for short, is an optimization algorithm used to train machine learning models, most notably the artificial neural networks used in deep learning.
The job of the algorithm is to find a set of internal model parameters that perform well against some performance measure such as logarithmic loss or mean squared error.
Optimization is a type of searching process and you can think of this search as learning. The optimization algorithm is called “gradient descent“, where “gradient” refers to the calculation of an error gradient or slope of error and “descent” refers to the moving down along that slope towards some minimum level of error.
The algorithm is iterative. This means that the search process occurs over multiple discrete steps, each step hopefully slightly improving the model parameters.
Each step involves using the model with the current set of internal parameters to make predictions on some samples, comparing the predictions to the real expected outcomes, calculating the error, and using the error to update the internal model parameters.
This update procedure is different for different algorithms, but in the case of artificial neural networks, the backpropagation update algorithm is used.
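To make the idea of a single update step concrete, here is a minimal sketch, assuming a simple linear model with a mean squared error so that the gradient can be written in one line (for a neural network, backpropagation would compute this gradient instead):

```python
import numpy as np

# Minimal sketch of one gradient descent update on a batch of samples.
# Assumes a simple linear model (predictions = X . weights) and mean squared error.
def sgd_update(weights, X_batch, y_batch, learning_rate=0.01):
    predictions = X_batch.dot(weights)               # predictions with the current parameters
    errors = predictions - y_batch                   # compare predictions to expected outcomes
    gradient = X_batch.T.dot(errors) / len(X_batch)  # slope of the error for this batch
    return weights - learning_rate * gradient        # step down along the error gradient
```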
Before we dive into batches and epochs, let’s take a look at what we mean by sample.
Learn more about gradient descent here:
What Is a Sample?
A sample is a single row of data.
It contains inputs that are fed into the algorithm and an output that is used to compare to the prediction and calculate an error.
A training dataset comprises many rows of data, i.e. many samples. A sample may also be called an instance, an observation, an input vector, or a feature vector.
Now that we know what a sample is, let’s define a batch.
What Is a Batch?
The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
Think of a batch as a for-loop iterating over one or more samples and making predictions. At the end of the batch, the predictions are compared to the expected output variables and an error is calculated. From this error, the update algorithm is used to improve the model, e.g. move down along the error gradient.
A training dataset can be divided into one or more batches.
When all training samples are used to create one batch, the learning algorithm is called batch gradient descent. When the batch is the size of one sample, the learning algorithm is called stochastic gradient descent. When the batch size is more than one sample and less than the size of the training dataset, the learning algorithm is called mini-batch gradient descent.
- Batch Gradient Descent. Batch Size = Size of Training Set
- Stochastic Gradient Descent. Batch Size = 1
- Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
In the case of mini-batch gradient descent, popular batch sizes include 32, 64, and 128 samples. You may see these values used in models in the literature and in tutorials.
What if the dataset does not divide evenly by the batch size?
This can and does happen often when training a model. It simply means that the final batch has fewer samples than the other batches.
Alternately, you can remove some samples from the dataset or change the batch size such that the number of samples in the dataset does divide evenly by the batch size.
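For example, here is a quick calculation of how many batches you get and how small the final batch will be (the numbers are purely illustrative):

```python
import math

n_samples = 1050
batch_size = 100

n_batches = math.ceil(n_samples / batch_size)      # 11 batches in total
last_batch = n_samples % batch_size or batch_size  # the final batch has only 50 samples
print(n_batches, last_batch)                       # 11 50
```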
For more on the differences between these variations of gradient descent, see the post:
For more on the effect of batch size on the learning process, see the post:
A batch involves an update to the model using samples; next, let’s look at an epoch.
What Is an Epoch?
The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.
One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. An epoch is composed of one or more batches. For example, as above, an epoch that has one batch is called the batch gradient descent learning algorithm.
You can think of a for-loop over the number of epochs where each loop proceeds over the training dataset. Within this for-loop is another nested for-loop that iterates over each batch of samples, where one batch has the specified “batch size” number of samples.
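The sketch below makes those two nested loops explicit. It uses a toy linear regression problem rather than a neural network so that it stays small and runnable; the idea of an outer epoch loop and an inner batch loop is the same:

```python
import numpy as np

# Toy data: 200 samples, 3 input features, known true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X.dot(np.array([1.5, -2.0, 0.5]))

w = np.zeros(3)
n_epochs, batch_size, lr = 100, 5, 0.1

for epoch in range(n_epochs):                      # one complete pass through the dataset
    for start in range(0, len(X), batch_size):     # one batch of "batch size" samples
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = Xb.T.dot(Xb.dot(w) - yb) / len(Xb)  # error gradient for this batch
        w -= lr * grad                             # update the internal model parameters

print(w)  # approaches [1.5, -2.0, 0.5]
```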
The number of epochs is traditionally large, often hundreds or thousands, allowing the learning algorithm to run until the error from the model has been sufficiently minimized. You may see examples of the number of epochs in the literature and in tutorials set to 10, 100, 500, 1000, and larger.
It is common to create line plots that show epochs along the x-axis as time and the error or skill of the model on the y-axis. These plots are sometimes called learning curves. They can help to diagnose whether the model has overlearned (overfit), underlearned (underfit), or is suitably fit to the training dataset.
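As a hedged sketch of what this looks like in practice with Keras (the toy data and tiny model below are made up purely for illustration), the history returned by fit() can be plotted to produce such a learning curve:

```python
import numpy as np
from matplotlib import pyplot
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Toy data and model purely for illustration.
X = np.random.rand(200, 4)
y = (X.sum(axis=1) > 2).astype(int)

model = Sequential([Dense(8, activation='relu', input_shape=(4,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='sgd', loss='binary_crossentropy')

# Train, capturing per-epoch loss, then plot the learning curves.
history = model.fit(X, y, epochs=100, batch_size=32, validation_split=0.2, verbose=0)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.xlabel('epoch')
pyplot.ylabel('loss')
pyplot.legend()
pyplot.show()
```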
For more on diagnostics via learning curves with LSTM networks, see the post:
In case it is still not clear, let’s look at the differences between batches and epochs.
What Is the Difference Between Batch and Epoch?
The batch size is the number of samples processed before the model is updated.
The number of epochs is the number of complete passes through the training dataset.
The size of a batch must be greater than or equal to one and less than or equal to the number of samples in the training dataset.
The number of epochs can be set to an integer value between one and infinity. You can run the algorithm for as long as you like and even stop it using other criteria besides a fixed number of epochs, such as a change (or lack of change) in model error over time.
They are both integer values and they are both hyperparameters for the learning algorithm, i.e. parameters for the learning process, not internal model parameters found by the learning process.
You must specify the batch size and number of epochs for a learning algorithm.
There are no magic rules for how to configure these parameters. You must try different values and see what works best for your problem.
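For example, one simple (and purely illustrative) way to try a few combinations is a small grid search; the build_model() helper and the train/validation arrays here are hypothetical placeholders for your own code:

```python
# Hypothetical sketch: try a few batch sizes and epoch counts and compare validation loss.
for batch_size in [16, 32, 64]:
    for epochs in [50, 100, 200]:
        model = build_model()  # build_model() is assumed to create a fresh, compiled model
        history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
                            validation_data=(X_val, y_val), verbose=0)
        print(batch_size, epochs, min(history.history['val_loss']))
```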
Worked Example
Finally, let’s make this concrete with a small example.
Assume you have a dataset with 200 samples (rows of data) and you choose a batch size of 5 and 1,000 epochs.
This means that the dataset will be divided into 40 batches, each with five samples. The model weights will be updated after each batch of five samples.
This also means that one epoch will involve 40 batches or 40 updates to the model.
With 1,000 epochs, the model will be exposed to or pass through the whole dataset 1,000 times. That is a total of 40,000 batches during the entire training process.
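The same arithmetic, written out:

```python
samples = 200
batch_size = 5
epochs = 1000

batches_per_epoch = samples // batch_size   # 40 batches, i.e. 40 weight updates, per epoch
total_batches = batches_per_epoch * epochs  # 40,000 batches over the whole training run
print(batches_per_epoch, total_batches)     # 40 40000
```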
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- Gradient Descent For Machine Learning
- How to Control the Speed and Stability of Training Neural Networks Batch Size
- A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size
- A Gentle Introduction to Learning Curves for Diagnosing Model Performance
- Stochastic gradient descent on Wikipedia
- Backpropagation on Wikipedia
Summary
In this post, you discovered the difference between batches and epochs in stochastic gradient descent.
Specifically, you learned:
- Stochastic gradient descent is an iterative learning algorithm that uses a training dataset to update a model.
- The batch size is a hyperparameter of gradient descent that controls the number of training samples to work through before the model’s internal parameters are updated.
- The number of epochs is a hyperparameter of gradient descent that controls the number of complete passes through the training dataset.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Very informative and well explained.
Thanks.
Great explanation, Jason. I have been a big fan of yours and have read all of your books; it's great to learn from you.
Thanks.
I love your articles, good explanations, and I enjoyed reading them.
Thanks.
You nailed it with the last paragraph, a small simple toy example always trumps a description
Thanks Mark.
Does the content of batches change from one epoch to another?
Yes. The samples are shuffled at the end of each epoch and batches across epochs differ in terms of the samples they contain.
Good explanation and good example. Thank you and keep up the good work, sir!
Thanks.
Very well explained and in the simplest possible way!
I’m glad it helped.
Absolutely, thanks for making this, Dr. Jason. This makes learning easier without having to spend time and effort digging through several sources.
I have a quick question based on the excerpt from your post below: could you please name or point to other procedures used to update parameters in the case of other algorithms?
*******************************************************************************************
Each step involves using the model with the current set of internal parameters to make predictions on some samples, comparing the predictions to the real expected outcomes, calculating the error, and using the error to update the internal model parameters.
This update procedure is different for different algorithms, but in the case of artificial neural networks, the backpropagation update algorithm is used.
*******************************************************************************************
Glad it helped.
You can learn more about other algorithms here:
https://machinelearningmastery.com/start-here/#algorithms
Very good. But I would have liked to see more examples. In any case, a BIG THANK YOU!
Thanks. Did the example at the end help?
In modern deep learning approaches, I almost always encounter people saving their models after some number of epochs (or some time period) while visualizing some kind of performance metric to evaluate the next values for the hyperparameters; only then do they carry out their experiments for the next epochs. So we could call this procedure 'mini-epoch stochastic deep learning'. Thanks.
Thanks for sharing.
This is brilliant and straightforward. Thanks for the mini course, Dr. Brownlee.
I’m glad it helped.
Hello Dr Jason,
Thank you again for a great blog post. For time series data in LSTM, does it ever make sense to have the size of a batch more than one?
I have searched and searched and I could not find any example where the batch size is more than one but I have also not found anyone saying that it does not make sense.
Yes, when you want the model to learn across multiple subsequences.
I have some posts scheduled that demonstrate this.
Thank you for your explanation, really very clear. Thanks again.
I’m happy that it helped.
I’ve read many blogs written by you about such things. They help me a lot, thank you! (fist salute)
Thanks, I’m glad to hear that the post helped.
Thanks, great explanation. So far your blog is the best source for learning ML I’ve found (for beginners like me).
Thanks!
It is very clear. Thank you.
I also see ‘steps_per_epoch’ in some cases; what does that mean? Is it the same as batches?
The number of batches to retrieve from a generator in order to define an epoch.
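For example, a rough sketch (the generator and model here are assumed to exist already):

```python
# 1000 samples with a batch size of 10 means 100 steps define one pass through the data.
steps_per_epoch = 1000 // 10
model.fit(training_generator, steps_per_epoch=steps_per_epoch, epochs=50)
```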
We love examples! Thank you so much!
Thanks.
Thank you so much for your crystal clear explanation
I’m happy you found it useful.
Great explanation in an easy way. Thanks.
Thanks, I’m glad it helped.
Hi,
Updates are performed after each batch is over. I just used one sample and gave different batch_sizes in model.fit; why does the value change every time? It should only be able to take one batch size if there is only one sample, shouldn't it?
Sorry, I don’t understand your question, can you elaborate please?
What a great explanation!
Never sent a reply to a tutorial, but cannot leave without saying Thanks Jason.
God bless you!
Thanks, I’m glad it helped!
Fantastic explanation !!!
Thanks.
Well Explained Thanks!!
Thanks, I’m glad it helped.
Hello there,
I am currently working with Word2Vec. In connection with epochs and batch size I still don't understand exactly what a sample is. Above you describe that a sample is a single row of data. In my program I first process my text file with a SentenceIterator so that I get one sentence per line, and then I use a tokenizer to get single words in these lines. Is a sample in Word2Vec a word from the dataset or is it a line (containing a sentence)? Thank you very much in advance 🙂
The samples/epoch/batch terminology does not map onto word2vec. Instead you just have a training dataset of text from which you learn statistics.
But with the program Word2Vec you also have the hyperparameters epochs, iterations, and batch size, which you can set… Don't you think that they also influence the results from Word2Vec?
As I understand it now, a batch contains one sentence. However, I'm surprised that the number of iterations doesn't change if I vary the number of epochs and batch sizes but don't define iterations concretely. Do you know how that works?
Not really, I recommend this tutorial:
https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
Very well explained in simple terms. Thanks.
Thanks.
It’s finally clear. Thank you
I’m happy to hear that.
You are super Dr,
Thank you so much for writing in such an easy-to-understand way. Also, try to add pictures, graphs, or schematic diagrams to represent your text. As seen here, you gave one example and it makes many things super clear. In some previous posts you added graphs as well…
Thanks again
Please keep continue
Best regards
Suraj
Thanks for the suggestion!
Hi Jason. After every epoch, the accuracy either improves or sometimes does not. For example, epoch 1 achieved an accuracy of 94% and epoch 2 achieved an accuracy of 95%. After the end of epoch 1 we get new weights (i.e. updated after the final batch of epoch 1). Are those new weights used at the beginning of epoch 2 to improve it from 94% to 95%? If yes, is that the reason some epochs get lower accuracy than the previous epoch, due to the generalization of the weights to the entire dataset? And is that why we get good accuracy after running so many epochs, due to better generalization?
Typically more training means better accuracy, but not always.
Sometimes it can be a good idea to stop training early, see this post on the topic:
https://machinelearningmastery.com/early-stopping-to-avoid-overtraining-neural-network-models/
Thanks! That was simple and easy to understand.
Thanks, I’m happy it helped!
Very well explained, thank you.
Thanks.
Well explained with easy to understand example. thank you
Thanks, I’m glad it helped.
Indeed, in the last example, the total number of mini-batches is 40,000, but this is true only if the batches are selected without shuffling the training data or selected with data shuffling but without repetition. Otherwise, if within one epoch the mini batches are constructed by selecting training data with repetition, we can have some points that appear more than once in one epoch (they appear in different mini batches in one epoch) and others only once. Therefore, the total number of mini-batches, in this case, may exceed 40,000.
Typically data is shuffled prior to each epoch.
Typically we do not select samples with replacement as it will bias the training.
You deserve a big thank-you letter for this explanation.
Thanks, I’m glad it helped.
thanks for this amazing blog post 🙂
If I have 1000 training samples and my batch size = 400, do I have to remove 200 samples from my training data? Should my training data always be a multiple of the batch size?
Thanks.
No, the samples will be shuffled before each epoch, then you will get 3 batches: 400, 400 and 200.
It is better to choose a batch size that divides the samples evenly, if possible, e.g. 100, 200, or 500 in your case.
Thank you so much! Such a nice explanation with an intuitive example in the end! Thank you!
Thanks, I’m glad it helped.
Thanks for your great article, and I have a question.
If I have the following settings and I am using the fit_generator function:
epochs =100
data=1000 images
batch = 10
step_per_epochs = 20
I know I should set step_per_epochs = (1000/10) = 100, but what if I set it to 20?
Do these settings mean that the model will be trained using only part of the training data (at each epoch it will use the same 200 images (batch * step_per_epochs)) and not use all 1000 images?
Or will it use the first 200 images in the dataset in the first epoch, then the following 200 images in the second epoch, and so on (dividing the 1000 images across every 5 epochs), so the model would be trained 20 times on the whole training dataset over the 100 epochs?
Thanks
Yes, only 200 images per epoch will be used.
Hello, how are you? Thank you very much for the explanation. I would like to know whether you know what Batch Accumulation, Random seed, and Validation Interval (in epochs) are.
Yes.
Batch accumulation is the error accumulated from the samples in one batch and used to update the weights.
Random seed is the starting point for the random number generator:
https://machinelearningmastery.com/introduction-to-random-number-generators-for-machine-learning/
What exactly do you mean by validation interval? What context? Perhaps you mean validation dataset:
https://machinelearningmastery.com/difference-test-validation-datasets/
Sir thank you so much for this excellent tutorial.
Can you tell me how to run the model on a similar test dataset after training the model?
Yes, you can use model.predict(), see examples here:
https://machinelearningmastery.com/make-predictions-scikit-learn/
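A minimal sketch, assuming a trained Keras model and a test array X_test prepared the same way as the training data:

```python
# Make predictions on the held-out test samples with the trained model.
yhat = model.predict(X_test)
print(yhat[:5])
```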
Great explanation, keep sharing your knowledge,
Thank you very much.
Thanks!
Hello Jason,
If I were to create my own custom batches say within the model.fit_generator() method.
Do we create new randomly sampled batches for each epoch or do we just create batches at __init__ and use them without any changes throughout the training?
What’s the recommended way?
P.S. If I randomly sample batches each epoch I see spikes in val_acc; not sure if it's because of that though!
Great question.
It is important to ensure that each batch is representative (within reason), and that each epoch of batches is broadly representative of the problem.
If not, you will push the weights all over the place or back and forth on each update and not generalize well.
Hello Jason,
Thank you for your response.
I also just confirmed that Keras separates the provided X into mini-batches only once, before entering the epoch loop.
Here is the link to code https://github.com/keras-team/keras/blob/f242c6421fe93468064441551cdab66e70f631d8/keras/engine/training_generator.py#L160
Yes.
Good Morning Jason,
A question came in my mind today.
What happens while training a neural network in mini-batches when the class labels are imbalanced? Are we supposed to stratify the batches?
Because it seems like my NN is only predicting the dominant class no matter what I do!
Great question. We get bad times!
Sometimes the experts would say to alternate classes in each batch. Sometimes stratify. It might be problem/model dependent. I’m thinking back to this book:
https://machinelearningmastery.com/neural-networks-tricks-of-the-trade-review/
Nevertheless, imbalanced data is a pain regardless of your update strategy. Oversampling the training set is a great solution.
Thanks, Jason.
I will surely take a look at the book.
By the way, I am actually in the ranking business. So I have very few 1st and 2nd rankers but a lot of 3rd and above, roughly (10%, 10%, 80%) respectively.
What I did is, I took a different perspective on the problem and converted my imbalanced multiclass dataset to an equalized binary dataset.
I converted.
Racing Car 1: 1st Rank
Racing Car 2: 2nd Rank
Racing Car 3: 3rd Rank
Racing Car 4: 4th Rank
to,
Racing Car1, Racing Car2 = 0
Racing Car2, Racing Car1 = 1
Racing Car1, Racing Car3 = 0
Racing Car3, Racing Car1 = 1
Racing Car1, Racing Car4 = 0
Racing Car4, Racing Car1 = 1
Racing Car2, Racing Car3 = 0
Racing Car3, Racing Car2 = 1
Racing Car2, Racing Car4 = 0
Racing Car4, Racing Car2 = 1
and so on… where Target is now the winning side!
Fascinating! Thanks for sharing.
very well explained, Jason. thanks.
Thanks!