How to Evaluate the Performance of PyTorch Models

Last Updated on March 22, 2023

Designing a deep learning model is sometimes an art. There are many decision points, and it is not easy to tell which choice is best. One way to arrive at a design is by trial and error, evaluating each candidate on real data. It is therefore important to have a systematic method for evaluating the performance of your neural networks and deep learning models. In fact, the same method can be used to compare any kind of machine learning model for a particular use.

In this post, you will discover a standard workflow to robustly evaluate model performance. The examples use PyTorch to build the models, but the method can also be applied to other kinds of models. After completing this post, you will know:

  • How to evaluate a PyTorch model using a validation dataset
  • How to evaluate a PyTorch model with k-fold cross-validation

Let’s get started.

How to evaluate the performance of PyTorch models
Photo by Kin Shing Lai. Some rights reserved.


This post is in four parts; they are:

  • Empirical Evaluation of Models
  • Data Splitting
  • Training a PyTorch Model with Validation
  • k-Fold Cross Validation

Empirical Evaluation of Models

When designing and configuring a deep learning model from scratch, there are many decisions to make. These include design decisions such as how many layers to use, how big each layer is, and what kinds of layers or activation functions to use. They also include the choice of loss function, optimization algorithm, number of training epochs, and how to interpret the model output. Luckily, you can sometimes copy the structure of other people’s networks, and sometimes you can make a choice using heuristics. To tell whether you made a good choice, the best way is to compare multiple alternatives by evaluating them empirically on actual data.

Deep learning is often used on problems with very large datasets, containing tens or hundreds of thousands of samples. This provides ample data for testing, but you still need a robust test strategy to estimate the performance of your model on unseen data. With that estimate, you have a metric for comparing different model configurations.


Data Splitting

If you have a dataset of tens of thousands of samples or even more, you don’t always need to give everything to your model for training. Doing so can lengthen the training time without improving the result. More is not always better.

When you have a large amount of data, you should take one portion of it as the training set that is fed into the model for training. Another portion is held back from training as the test set and is used only to evaluate a trained or partially trained model. This step is usually called a “train-test split.”

Let’s consider the Pima Indians Diabetes dataset. You can load the data using NumPy:
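A minimal sketch of the loading step, assuming the dataset is a headerless CSV with eight numeric input columns followed by a 0/1 class label. So that the snippet runs without the file, it parses three rows in that layout from an in-memory string. The rows and the filename mentioned in the comment are illustrative assumptions.

```python
import io

import numpy as np

# Three rows in the assumed layout: 8 input features, then the class label.
# With the real file you would pass its path instead of the StringIO object,
# e.g. np.loadtxt("pima-indians-diabetes.csv", delimiter=",")  # assumed filename
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

data = np.loadtxt(sample_csv, delimiter=",")
print(data.shape)  # (number of rows, 9)
```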

There are 768 data samples. It is not a lot but is enough to demonstrate the split. Let’s consider the first 66% as the training set and the remaining as the test set. The easiest way to do so is by slicing an array:
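The slicing can be sketched as follows. A random stand-in array with the same shape as the dataset (768 rows, 9 columns) is used here as an assumption so that the snippet is self-contained.

```python
import numpy as np

# Stand-in for the loaded dataset (assumption): 768 rows, 9 columns
rng = np.random.default_rng(0)
data = rng.random((768, 9))

# First 66% of the rows for training, the remainder for testing
n_train = int(len(data) * 0.66)
train = data[:n_train]
test = data[n_train:]

print(train.shape, test.shape)
```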

The choice of 66% is arbitrary, but you do not want the training set to be too small. Sometimes you may use a 70%-30% split. But if the dataset is huge, you may even use a 30%-70% split, provided that 30% of the data is large enough for training.

If you split the data in this way, you are assuming the dataset is already shuffled so that the training set and the test set are equally diverse. If the original dataset is sorted and you take the test set only from the end, you may find that all the test data belong to the same class or carry the same value in one of the input features. That’s not ideal.

Of course, you can call np.random.shuffle(data) before the split to avoid that. But many machine learning engineers use scikit-learn for this. See this example:
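A minimal sketch of both options, again using a random stand-in array of the dataset's shape as an assumption. The 66% training fraction mirrors the slicing example; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the loaded dataset (assumption): 768 rows, 9 columns
rng = np.random.default_rng(0)
data = rng.random((768, 9))

# Option 1: shuffle the rows in place (on a copy), then slice as before
shuffled = data.copy()
np.random.shuffle(shuffled)
train_a, test_a = shuffled[:506], shuffled[506:]

# Option 2: let scikit-learn shuffle and split in one call
train, test = train_test_split(data, train_size=0.66, shuffle=True)

print(train.shape, test.shape)
```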

But more commonly, the split is done after you separate the input features and output labels. Note that this function from scikit-learn works not only on NumPy arrays but also on PyTorch tensors:
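A sketch of that pattern, assuming the first eight columns are the input features and the last column is the label. The data is a random stand-in, and the 70% training fraction is one of the ratios mentioned earlier rather than a requirement.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split

# Stand-in dataset (assumption): 8 feature columns and one binary label column
rng = np.random.default_rng(0)
features = rng.random((768, 8))
labels = rng.integers(0, 2, size=768)

# Separate inputs and labels, and convert them to PyTorch tensors
X = torch.tensor(features, dtype=torch.float32)
y = torch.tensor(labels, dtype=torch.float32).reshape(-1, 1)

# Shuffle and split inputs and labels together so the rows stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, shuffle=True
)

print(len(X_train), len(X_test))
```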

Training a PyTorch Model with Validation

Let’s revisit the code for building and training a deep learning model on this dataset:
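A minimal sketch of such a training loop. The layer sizes, learning rate, epoch count, and batch size are assumptions chosen for illustration, and the data is a random stand-in with eight input features and a binary label.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Stand-in training data (assumption): 500 rows, 8 features, binary label
X_train = torch.tensor(rng.random((500, 8)), dtype=torch.float32)
y_train = torch.tensor(rng.integers(0, 2, size=(500, 1)), dtype=torch.float32)

# Assumed architecture: a small fully connected binary classifier
model = nn.Sequential(
    nn.Linear(8, 12), nn.ReLU(),
    nn.Linear(12, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)

n_epochs = 5
batch_size = 10
for epoch in range(n_epochs):
    model.train()
    for start in range(0, len(X_train), batch_size):
        # Take one batch from the training set
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        # Forward pass
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        # Backward pass and weight update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}: last batch loss {loss.item():.4f}")
```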

In this code, one batch is extracted from the training set in each iteration and sent to the model in the forward pass. Then you compute the gradient in the backward pass and update the weights.

While you used binary cross entropy as the loss metric in the training loop in this case, you may be more concerned with the prediction accuracy. Calculating accuracy is easy. You round the output (in the range of 0 to 1) to the nearest integer to get a binary value of 0 or 1. Then you compute the percentage of predictions that match the labels; this gives you the accuracy.
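The rounding and counting can be sketched on a handful of made-up values. The predictions and labels below are hypothetical and serve only to show the arithmetic.

```python
import torch

# Made-up sigmoid outputs and labels for one small batch (hypothetical values)
y_pred = torch.tensor([[0.91], [0.20], [0.65], [0.40]])
y_batch = torch.tensor([[1.0], [0.0], [0.0], [0.0]])

# Round to 0 or 1, compare with the labels, and average the matches
acc = (y_pred.round() == y_batch).float().mean()

print(f"Accuracy: {float(acc):.2f}")  # 3 of the 4 predictions match: 0.75
```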

But what is your prediction? It is y_pred above, which is the prediction by your current model on X_batch. Adding accuracy to the training loop becomes this:

However, X_batch and y_batch are used by the optimizer, which fine-tunes your model so that it can predict y_batch from X_batch. Now you are using accuracy to check whether y_pred matches y_batch. This is like cheating: if your model somehow memorizes the solution, it can simply report the memorized y_pred and get perfect accuracy without actually inferring it from X_batch.

Indeed, a deep learning model can be so complex that you cannot know whether it simply remembers the answer or is inferring it. Therefore, the best approach is to calculate accuracy not from X_batch or anything else in X_train but from something else: your test set. Let’s add an accuracy measurement after each epoch using X_test:
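A sketch of a training loop with both measurements: acc on the current batch inside the inner loop, and an accuracy on X_test at the end of each epoch. The data is a random stand-in whose label depends on the first feature, and the architecture and hyperparameters are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Stand-in data (assumption): the label is derived from the first feature
# so that there is a pattern for the model to learn
X = torch.tensor(rng.random((768, 8)), dtype=torch.float32)
y = (X[:, :1] > 0.5).float()
X_train, y_train = X[:537], y[:537]
X_test, y_test = X[537:], y[537:]

# Assumed architecture and hyperparameters
model = nn.Sequential(
    nn.Linear(8, 12), nn.ReLU(),
    nn.Linear(12, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

n_epochs, batch_size = 5, 10
for epoch in range(n_epochs):
    model.train()
    for start in range(0, len(X_train), batch_size):
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Progress metric only: measured on data the optimizer has just seen
        acc = (y_pred.round() == y_batch).float().mean()
    # Held-out evaluation at the end of the epoch, without tracking gradients
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
    test_acc = float((y_pred.round() == y_test).float().mean())
    print(f"End of epoch {epoch}: test accuracy {test_acc * 100:.1f}%")
```

Calling model.eval() and wrapping the evaluation in torch.no_grad() keeps the test pass from affecting training: no gradients are recorded, so the optimizer never learns from X_test.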

In this case, the acc in the inner for-loop is just a metric showing the progress. It is not much different from displaying the loss metric, except that it is not involved in the gradient descent algorithm. You expect the accuracy to improve as the loss metric improves.

In the outer for-loop, at the end of each epoch, you calculate the accuracy from X_test. The workflow is similar: You give the test set to the model and ask for its prediction, then count the number of results that match your test set labels. But this accuracy is the one you should care about. It should improve as the training progresses; if it stops improving or even deteriorates, you should stop the training because the model appears to have started overfitting. Overfitting is when the model starts to memorize the training set rather than learning to infer predictions from it. A sign of overfitting is that the accuracy on the training set keeps increasing while the accuracy on the test set decreases.

The following is the complete code to implement everything above, from data splitting to validation using the test set:

The code above will print the following:

k-Fold Cross Validation

In the above example, you calculated the accuracy from the test set. It is used as a score for the model as training progresses. You want to stop at the point where this score is at its maximum. In fact, by merely comparing the score from this test set, you know your model works best after epoch 21 and starts to overfit afterward. Is that right?

If you built two models of different designs, should you just compare these models’ accuracy on the same test set and claim one is better than another?

Actually, you can argue that the test set is not representative enough, even if you shuffled your dataset before extracting it. You may also argue that, by chance, one model fits this particular test set better but would not always be better. To make a stronger argument about which model is better, independent of the selection of the test set, you can try multiple test sets and average the accuracy.

This is what k-fold cross validation does. It is a process for deciding which design works better. It works by repeating the training process from scratch $k$ times, each with a different composition of the training and test sets. As a result, you will have $k$ models and $k$ accuracy scores, one from each model’s test set. You are interested not only in the average accuracy but also in the standard deviation. The standard deviation tells you whether the accuracy score is consistent or whether some test sets are particularly good or bad for a model.

Since k-fold cross validation trains the model from scratch several times, it is best to wrap the training loop in a function:
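A minimal sketch of such a function, here called model_train() to match the name used in this post. The signature, architecture, and default hyperparameters are assumptions. The function builds a fresh model on every call so that each run starts from scratch, and it returns the final accuracy on the test set.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm


def model_train(X_train, y_train, X_test, y_test, n_epochs=5, batch_size=10):
    """Train a fresh model and return its accuracy on the test set."""
    # Assumed architecture: created inside the function so that no weights
    # carry over from a previous call
    model = nn.Sequential(
        nn.Linear(8, 12), nn.ReLU(),
        nn.Linear(12, 8), nn.ReLU(),
        nn.Linear(8, 1), nn.Sigmoid(),
    )
    loss_fn = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(n_epochs):
        model.train()
        batch_starts = range(0, len(X_train), batch_size)
        # disable=True suppresses the progress bar output
        for start in tqdm.tqdm(batch_starts, disable=True):
            X_batch = X_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]
            loss = loss_fn(model(X_batch), y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Evaluate once at the end on the held-out data
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
    return float((y_pred.round() == y_test).float().mean())


# Example call on stand-in data (assumption: label derived from one feature)
torch.manual_seed(0)
rng = np.random.default_rng(0)
X = torch.tensor(rng.random((768, 8)), dtype=torch.float32)
y = (X[:, :1] > 0.5).float()
accuracy = model_train(X[:537], y[:537], X[537:], y[537:])
```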

The code above is deliberately not printing anything (with disable=True in tqdm) to keep the screen less cluttered.

Also from scikit-learn, you have a function for k-fold cross validation. You can make use of it to produce a robust estimate of model accuracy:
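A sketch of the cross validation loop using scikit-learn's StratifiedKFold. To stay self-contained, it repeats a compact version of a model_train() function; that function's signature, the architecture, the hyperparameters, and the random stand-in data are all assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import StratifiedKFold


def model_train(X_train, y_train, X_test, y_test, n_epochs=3, batch_size=10):
    """Train a fresh model (assumed architecture) and return test accuracy."""
    model = nn.Sequential(
        nn.Linear(8, 12), nn.ReLU(),
        nn.Linear(12, 8), nn.ReLU(),
        nn.Linear(8, 1), nn.Sigmoid(),
    )
    loss_fn = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(n_epochs):
        model.train()
        for start in range(0, len(X_train), batch_size):
            X_batch = X_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]
            loss = loss_fn(model(X_batch), y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
    return float((y_pred.round() == y_test).float().mean())


# Stand-in data (assumption): 768 rows, 8 features, binary label
torch.manual_seed(0)
rng = np.random.default_rng(0)
X = torch.tensor(rng.random((768, 8)), dtype=torch.float32)
y = (X[:, :1] > 0.5).float()

# Stratified 5-fold split: each fold keeps the class proportions of y
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kfold.split(X.numpy(), y.numpy().ravel()):
    # The indices select the training and test rows for this fold
    acc = model_train(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
    scores.append(acc)

print(f"Accuracy: {np.mean(scores) * 100:.2f}% (+/- {np.std(scores) * 100:.2f}%)")
```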

Running this prints:

In scikit-learn, there are multiple k-fold cross validation functions, and the one used here is stratified k-fold. It assumes y contains class labels and takes their values into account so that the splits have a balanced class representation.

The code above used $k=5$ or 5 splits. It means splitting the dataset into five equal portions, picking one of them as the test set and combining the rest into a training set. There are five ways of doing that, so the for-loop above will have five iterations. In each iteration, you call the model_train() function and obtain the accuracy score in return. Then you save it into a list, which will be used to calculate the mean and standard deviation at the end.

The kfold object returns the indices of each split. Hence, you do not need to run the train-test split in advance; instead, you use the indices provided to extract the training set and test set on the fly when you call the model_train() function.

The result above shows the model is moderately good, at 64% average accuracy. This score is stable since the standard deviation is 3%. It means that in most runs you can expect the accuracy to fall within roughly one standard deviation of the mean, from 61% to 67%. You may try to change the model above, such as adding or removing a layer, and see how much the mean and standard deviation change. You may also try to increase the number of epochs used in training and observe the result.
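To make the two summary statistics concrete, the mean and standard deviation over the $k$ fold accuracies can be written as below. The symbols $a_i$, $\bar{a}$, and $s$ are introduced here for illustration. The population form of the standard deviation is an assumption (it is the default of NumPy's np.std), since the post does not say which form was used.

```latex
\bar{a} = \frac{1}{k}\sum_{i=1}^{k} a_i,
\qquad
s = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(a_i - \bar{a}\right)^2}
```

With $\bar{a} = 0.64$ and $s = 0.03$, the interval $\bar{a} \pm s$ is $[0.61, 0.67]$, which is the 61% to 67% range quoted above.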

The mean and standard deviation from the k-fold cross validation are what you should use to benchmark a model design.

Tying it all together, below is the complete code for k-fold cross validation:


In this post, you discovered the importance of having a robust way to estimate the performance of your deep learning models on unseen data, and you learned how to do that. You saw:

  • How to split data into training and test sets using scikit-learn
  • How to do k-fold cross validation with the help of scikit-learn
  • How to modify the training loop in a PyTorch model to incorporate test set validation and cross validation
