Last Updated on March 22, 2023

Designing a deep learning model is sometimes an art. There are a lot of decision points, and it is not easy to tell which choice is best. One way to come up with a design is by trial and error, evaluating the result on real data. Therefore, it is important to have a scientific method to evaluate the performance of your neural network and deep learning models. In fact, the same method can be used to compare any kind of machine learning models for a particular use.

In this post, you will discover a workflow to robustly evaluate model performance. In the examples, we will use PyTorch to build our models, but the method can also be applied to other models. After completing this post, you will know:

- How to evaluate a PyTorch model using a validation dataset
- How to evaluate a PyTorch model with k-fold cross-validation

Let’s get started.

## Overview

This post is in four parts; they are:

- Empirical Evaluation of Models
- Data Splitting
- Training a PyTorch Model with Validation
- k-Fold Cross Validation

## Empirical Evaluation of Models

In designing and configuring a deep learning model from scratch, there are a lot of decisions to make. This includes design decisions such as how many layers to use in a deep learning model, how big each layer is, and what kind of layers or activation functions to use. It can also be the choice of the loss function, optimization algorithm, number of epochs to train, and the interpretation of the model output. Luckily, sometimes, you can copy the structure of other people’s networks. Sometimes, you can just make your choice using some heuristics. To tell if you made a good choice or not, the best way is to compare multiple alternatives by empirically evaluating them with actual data.

Deep learning is often used on problems that have very large datasets. That is tens of thousands or hundreds of thousands of data samples. This provides ample data for testing. But you need to have a robust test strategy to estimate the performance of your model on unseen data. Based on that, you can have a metric to compare among different model configurations.

## Data Splitting

If you have a dataset of tens of thousands of samples or even more, you don’t always need to give everything to your model for training. This will unnecessarily increase the complexity and lengthen the training time. More is not always better. You may not get the best result.

When you have a large amount of data, you should take a portion of it as the **training set** that is fed into the model for training. Another portion is kept as a **test set**: it is held back from training and used only to evaluate a trained or partially trained model. This step is usually called a “train-test split.”

Let’s consider the Pima Indians Diabetes dataset. You can load the data using NumPy:

```python
import numpy as np
data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
```

There are 768 data samples. It is not a lot but is enough to demonstrate the split. Let’s consider the first 66% as the training set and the remaining as the test set. The easiest way to do so is by slicing an array:

```python
# find the boundary at 66% of total samples
count = len(data)
n_train = int(count * 0.66)
# split the data at the boundary
train_data = data[:n_train]
test_data = data[n_train:]
```

The choice of 66% is arbitrary, but you do not want the training set to be too small. Sometimes you may use a 70%-30% split. But if the dataset is huge, you may even use a 30%-70% split if 30% of the data is large enough for training.

If you split the data in this way, you’re assuming the dataset is already shuffled so that the training set and the test set are equally diverse. If the original dataset is sorted and you take the test set only from the end, you may find that all the test data belong to the same class or carry the same value in one of the input features. That’s not ideal.
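To make the problem concrete, here is a minimal sketch on synthetic data (the array `toy_data` and its layout are assumptions for illustration, not part of the Pima dataset). It shows a sorted dataset producing a single-class test set, and one way to shuffle the rows before slicing:

```python
import numpy as np

# Synthetic stand-in for a dataset sorted by label:
# 8 rows of class 0 followed by 4 rows of class 1.
# Column 0 is a dummy feature, column 1 is the label.
labels = np.array([0] * 8 + [1] * 4)
toy_data = np.column_stack([np.arange(12), labels])

# Slicing without shuffling: the last 25% contains only class 1
n_train = int(len(toy_data) * 0.75)
test_sorted = toy_data[n_train:]
print(np.unique(test_sorted[:, 1]))

# Shuffling the row order first makes a single-class test set less likely
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability
shuffled = toy_data[rng.permutation(len(toy_data))]
train_shuffled, test_shuffled = shuffled[:n_train], shuffled[n_train:]
print(np.unique(test_shuffled[:, 1]))
```

Shuffling does not guarantee balanced classes in a small test set; it only removes the systematic bias caused by the original ordering.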

Of course, you can call `np.random.shuffle(data)` before the split to avoid that. But many machine learning engineers usually use scikit-learn for this. See this example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
train_data, test_data = train_test_split(data, test_size=0.33)
```

But more commonly, it is done after you separate the input feature and output labels. Note that this function from scikit-learn can work not only on NumPy arrays but also on PyTorch tensors:

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
```
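The split above is random, so each run gives a different partition. As a hedged variation that is not used in the original code, `train_test_split` also accepts `random_state` to make the split repeatable and `stratify` to keep the class ratio similar in both parts. The sketch below uses synthetic NumPy data (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 100 samples, 8 features, 30% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.33,
    stratify=y_demo,   # keep the class ratio similar in both parts
    random_state=42,   # fixed seed so the split is repeatable
)
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # fraction of positive labels in each part
```

Whether you want a fixed seed depends on your goal: it helps when comparing model designs on identical data, but it also ties your conclusion to one particular split.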

## Training a PyTorch Model with Validation

Let’s revisit the code for building and training a deep learning model on this dataset:

```python
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm

...

model = nn.Sequential(
    nn.Linear(8, 12),
    nn.ReLU(),
    nn.Linear(12, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)

# loss function and optimizer
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.0001)

n_epochs = 50    # number of epochs to run
batch_size = 10  # size of each batch
batches_per_epoch = len(X_train) // batch_size

for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            bar.set_postfix(
                loss=float(loss)
            )
```

In this code, one batch is extracted from the training set in each iteration and sent to the model in the forward pass. Then you compute the gradient in the backward pass and update the weights.

While, in this case, you used binary cross entropy as the loss metric in the training loop, you may be more concerned with the prediction accuracy. Calculating accuracy is easy. You round off the output (in the range of 0 to 1) to the nearest integer so you can get a binary value of 0 or 1. Then you compute the percentage of predictions that match the labels; this gives you the accuracy.
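The rounding-and-counting step can be sketched on a handful of made-up values (the tensors `probs` and `labels` below are hypothetical, standing in for sigmoid outputs and true labels):

```python
import torch

# Hypothetical sigmoid outputs and the matching true labels
probs = torch.tensor([[0.2], [0.7], [0.4], [0.9]])
labels = torch.tensor([[0.0], [1.0], [1.0], [1.0]])

predicted = probs.round()                        # 0.2 -> 0, 0.7 -> 1, 0.4 -> 0, 0.9 -> 1
accuracy = (predicted == labels).float().mean()  # fraction of rows where prediction equals label
print(float(accuracy))  # 3 of the 4 predictions match, so 0.75
```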

But what is your prediction? It is `y_pred` above, which is the prediction by your current model on `X_batch`. Adding accuracy to the training loop becomes this:

```python
for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress, with accuracy
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc)
            )
```

However, `X_batch` and `y_batch` are used by the optimizer, and the optimizer will fine-tune your model so that it can predict `y_batch` from `X_batch`. And now you’re using accuracy to check whether `y_pred` matches `y_batch`. It is like cheating because if your model somehow remembers the solution, it can just report `y_pred` to you and get perfect accuracy without actually inferring `y_pred` from `X_batch`.

Indeed, a deep learning model can be so convoluted that you cannot know if your model simply remembers the answer or is inferring the answer. Therefore, the best way is **not** to calculate accuracy from `X_batch` or anything from `X_train` but from something else: your test set. Let’s add an accuracy measurement **after** each epoch using `X_test`:

```python
for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc)
            )
    # evaluate model at end of epoch
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    acc = float(acc)
    print(f"End of {epoch}, accuracy {acc}")
```

In this case, the `acc` in the inner for-loop is just a metric showing the progress. It is not much different from displaying the loss metric, except that it is not involved in the gradient descent algorithm. And you expect the accuracy to improve as the loss metric also improves.

In the outer for-loop, at the end of each epoch, you calculate the accuracy from `X_test`. The workflow is similar: you give the test set to the model and ask for its prediction, then count the number of matched results with your test set labels. But this accuracy is the one you should care about. It should improve as the training progresses, but if you do not see it improve (i.e., accuracy increase) or it even deteriorates, you have to interrupt the training as the model seems to start overfitting. Overfitting is when the model starts to remember the training set rather than learning to infer the prediction from it. A sign of that is that the accuracy on the training set keeps increasing while the accuracy on the test set decreases.
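The evaluation at the end of each epoch does not need gradients. As a sketch of a common variation (not what the code in this post does), you could wrap it in a helper that switches the model to evaluation mode and disables gradient tracking. The function name `evaluate_accuracy` and the random stand-in data are assumptions for illustration:

```python
import torch
import torch.nn as nn

def evaluate_accuracy(model, X_eval, y_eval):
    """Hypothetical helper: accuracy of a sigmoid-output binary classifier on held-out data."""
    was_training = model.training
    model.eval()               # switch layers such as dropout to inference behavior
    with torch.no_grad():      # skip gradient bookkeeping during evaluation
        y_pred = model(X_eval)
        acc = (y_pred.round() == y_eval).float().mean()
    model.train(was_training)  # restore the previous mode before training continues
    return float(acc)

# Usage with random stand-in data shaped like the Pima inputs (8 features)
demo_model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
X_demo = torch.randn(20, 8)
y_demo = torch.randint(0, 2, (20, 1)).float()
demo_acc = evaluate_accuracy(demo_model, X_demo, y_demo)
print(demo_acc)
```

For the small network in this post, which has no dropout or batch normalization layers, the accuracy value should be the same either way; the benefit is mainly lower memory use and a habit that carries over to larger models.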

The following is the complete code to implement everything above, from data splitting to validation using the test set:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.model_selection import train_test_split

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

model = nn.Sequential(
    nn.Linear(8, 12),
    nn.ReLU(),
    nn.Linear(12, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)

# loss function and optimizer
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.0001)

n_epochs = 50    # number of epochs to run
batch_size = 10  # size of each batch
batches_per_epoch = len(X_train) // batch_size

for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc)
            )
    # evaluate model at end of epoch
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    acc = float(acc)
    print(f"End of {epoch}, accuracy {acc}")
```

The code above will print the following:

```
End of 0, accuracy 0.5787401795387268
End of 1, accuracy 0.6102362275123596
End of 2, accuracy 0.6220472455024719
End of 3, accuracy 0.6220472455024719
End of 4, accuracy 0.6299212574958801
End of 5, accuracy 0.6377952694892883
End of 6, accuracy 0.6496062874794006
End of 7, accuracy 0.6535432934761047
End of 8, accuracy 0.665354311466217
End of 9, accuracy 0.6614173054695129
End of 10, accuracy 0.665354311466217
End of 11, accuracy 0.665354311466217
End of 12, accuracy 0.665354311466217
End of 13, accuracy 0.665354311466217
End of 14, accuracy 0.665354311466217
End of 15, accuracy 0.6732283234596252
End of 16, accuracy 0.6771653294563293
End of 17, accuracy 0.6811023354530334
End of 18, accuracy 0.6850393414497375
End of 19, accuracy 0.6889764070510864
End of 20, accuracy 0.6850393414497375
End of 21, accuracy 0.6889764070510864
End of 22, accuracy 0.6889764070510864
End of 23, accuracy 0.6889764070510864
End of 24, accuracy 0.6889764070510864
End of 25, accuracy 0.6850393414497375
End of 26, accuracy 0.6811023354530334
End of 27, accuracy 0.6771653294563293
End of 28, accuracy 0.6771653294563293
End of 29, accuracy 0.6692913174629211
End of 30, accuracy 0.6732283234596252
End of 31, accuracy 0.6692913174629211
End of 32, accuracy 0.6692913174629211
End of 33, accuracy 0.6732283234596252
End of 34, accuracy 0.6771653294563293
End of 35, accuracy 0.6811023354530334
End of 36, accuracy 0.6811023354530334
End of 37, accuracy 0.6811023354530334
End of 38, accuracy 0.6811023354530334
End of 39, accuracy 0.6811023354530334
End of 40, accuracy 0.6811023354530334
End of 41, accuracy 0.6771653294563293
End of 42, accuracy 0.6771653294563293
End of 43, accuracy 0.6771653294563293
End of 44, accuracy 0.6771653294563293
End of 45, accuracy 0.6771653294563293
End of 46, accuracy 0.6771653294563293
End of 47, accuracy 0.6732283234596252
End of 48, accuracy 0.6732283234596252
End of 49, accuracy 0.6732283234596252
```
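The accuracy trace above rises to a peak and then drifts down. One hedged way to act on the advice to interrupt the training is a simple patience rule: stop once the test accuracy has not beaten its best value for a set number of epochs. The function `stopping_epoch` and the short accuracy list below are illustrative assumptions, not part of the original code:

```python
def stopping_epoch(acc_history, patience=5):
    """Return (stop_epoch, best_epoch) under a simple patience rule.

    Training stops at the first epoch that is `patience` epochs past the
    best accuracy seen so far without having improved on it.
    """
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(acc_history):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # never triggered: training ran to the last epoch
    return len(acc_history) - 1, best_epoch

# Made-up per-epoch test accuracies that peak and then decline
history = [0.58, 0.61, 0.65, 0.69, 0.68, 0.69, 0.68, 0.67, 0.67, 0.66]
print(stopping_epoch(history, patience=3))  # (6, 3): stop at epoch 6, best was epoch 3
```

In a real training loop you would apply the same check inside the outer for-loop and `break` when it triggers, usually keeping a copy of the model weights from the best epoch.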

## k-Fold Cross Validation

In the above example, you calculated the accuracy from the test set. It is used as a **score** for the model as you progressed in the training. You want to stop at the point where this score is at its maximum. In fact, by merely comparing the score from this test set, you know your model works best after epoch 21 and starts to overfit afterward. Is that right?

If you built two models of different designs, should you just compare these models’ accuracy on the same test set and claim one is better than another?

Actually, you can argue that the test set is not representative enough even after you have shuffled your dataset before extracting the test set. You may also argue that, by chance, one model fits better to this particular test set but not always better. To make a stronger argument on which model is better independent of the selection of the test set, you can try **multiple test sets** and average the accuracy.

This is what k-fold cross validation does. It is a process to decide which **design** works better. It works by repeating the training process from scratch $k$ times, each with a different composition of the training and test sets. Because of that, you will have $k$ models and $k$ accuracy scores from their respective test sets. You are interested not only in the average accuracy but also in the standard deviation. The standard deviation tells whether the accuracy score is consistent or if some test set is particularly good or bad for a model.
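To see how the $k$ different compositions come about, here is a minimal NumPy sketch of generating the index sets by hand. It is only an illustration under simple assumptions (a plain shuffled split, no stratification), and `kfold_indices` is a hypothetical helper name:

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # k nearly equal chunks
    for i in range(k):
        test_idx = folds[i]             # the i-th chunk is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in kfold_indices(n_samples=10, k=5):
    print(len(train_idx), len(test_idx))  # 8 training and 2 test indices per fold
```

Every sample lands in the test set exactly once across the $k$ rounds, which is why the averaged score depends less on any single lucky or unlucky split.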

Since k-fold cross validation trains the model from scratch several times, it is best to wrap the training loop in a function:

```python
def model_train(X_train, y_train, X_test, y_test):
    # create new model
    model = nn.Sequential(
        nn.Linear(8, 12),
        nn.ReLU(),
        nn.Linear(12, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid()
    )

    # loss function and optimizer
    loss_fn = nn.BCELoss()  # binary cross entropy
    optimizer = optim.Adam(model.parameters(), lr=0.0001)

    n_epochs = 25    # number of epochs to run
    batch_size = 10  # size of each batch
    batches_per_epoch = len(X_train) // batch_size

    for epoch in range(n_epochs):
        with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0, disable=True) as bar:
            bar.set_description(f"Epoch {epoch}")
            for i in bar:
                # take a batch
                start = i * batch_size
                X_batch = X_train[start:start+batch_size]
                y_batch = y_train[start:start+batch_size]
                # forward pass
                y_pred = model(X_batch)
                loss = loss_fn(y_pred, y_batch)
                # backward pass
                optimizer.zero_grad()
                loss.backward()
                # update weights
                optimizer.step()
                # print progress
                acc = (y_pred.round() == y_batch).float().mean()
                bar.set_postfix(
                    loss=float(loss),
                    acc=float(acc)
                )
    # evaluate accuracy at end of training
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    return float(acc)
```

The code above is deliberately not printing anything (with `disable=True` in `tqdm`) to keep the screen less cluttered.

Also from scikit-learn, you have a function for k-fold cross validation. You can make use of it to produce a robust estimate of model accuracy:

```python
from sklearn.model_selection import StratifiedKFold

# define 5-fold cross validation test harness
kfold = StratifiedKFold(n_splits=5, shuffle=True)
cv_scores = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    acc = model_train(X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    cv_scores.append(acc)

# evaluate the model
print("%.2f%% (+/- %.2f%%)" % (np.mean(cv_scores)*100, np.std(cv_scores)*100))
```

Running this prints:

```
Accuracy: 0.64
Accuracy: 0.67
Accuracy: 0.68
Accuracy: 0.63
Accuracy: 0.59
64.05% (+/- 3.30%)
```

In scikit-learn, there are multiple k-fold cross validation functions, and the one used here is stratified k-fold. It assumes the values in `y` are class labels and takes them into account so that each split has a balanced class representation.
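The balancing effect is easiest to see on synthetic labels. In the sketch below, `y_demo` and `X_demo` are made-up arrays with 20 positive and 80 negative samples, and the loop counts how many positives land in each test fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 20 positives and 80 negatives
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.zeros((100, 1))  # feature values play no role in how the folds are formed

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_counts = [
    int(y_demo[test_idx].sum()) for _, test_idx in skf.split(X_demo, y_demo)
]
print(fold_positive_counts)  # each test fold of 20 samples holds 4 of the 20 positives
```

A plain, unstratified split gives no such guarantee, which matters most when one class is rare.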

The code above used $k=5$ or 5 splits. It means splitting the dataset into five equal portions, picking one of them as the test set and combining the rest into a training set. There are five ways of doing that, so the for-loop above will have five iterations. In each iteration, you call the `model_train()` function and obtain the accuracy score in return. Then you save it into a list, which will be used to calculate the mean and standard deviation at the end.

The `kfold` object will return to you the **indices**. Hence you do not need to run the train-test split in advance but use the indices provided to extract the training set and test set on the fly when you call the `model_train()` function.

The result above shows the model is moderately good, at 64% average accuracy. And this score is stable since the standard deviation is at 3%. This means that most of the time, you expect the model accuracy to be 61% to 67%. You may try to change the model above, such as adding or removing a layer, and see how much change you have in the mean and standard deviation. You may also try to increase the number of epochs used in training and observe the result.
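The arithmetic behind the 61% to 67% range can be sketched from the five fold accuracies printed earlier. Because those are rounded to two decimals, the numbers below differ slightly from the 64.05% and 3.30% reported by the run itself. The comparison with `ddof=1` is an added illustration, not something the original code computes:

```python
import numpy as np

# The five fold accuracies printed above, rounded to two decimals
cv_scores = [0.64, 0.67, 0.68, 0.63, 0.59]

mean = np.mean(cv_scores)
std_population = np.std(cv_scores)      # divides by k, as in the code above
std_sample = np.std(cv_scores, ddof=1)  # divides by k - 1, so it is slightly larger

print("mean %.3f, std %.3f (population), %.3f (sample)" % (mean, std_population, std_sample))
print("one-std band: %.3f to %.3f" % (mean - std_population, mean + std_population))
```

With only five scores, the band is a rough guide rather than a precise interval, so small differences between two designs should be read with caution.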

The mean and standard deviation from the k-fold cross validation are what you should use to benchmark a model design.

Tying it all together, below is the complete code for k-fold cross validation:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.model_selection import StratifiedKFold

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

def model_train(X_train, y_train, X_test, y_test):
    # create new model
    model = nn.Sequential(
        nn.Linear(8, 12),
        nn.ReLU(),
        nn.Linear(12, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid()
    )

    # loss function and optimizer
    loss_fn = nn.BCELoss()  # binary cross entropy
    optimizer = optim.Adam(model.parameters(), lr=0.0001)

    n_epochs = 25    # number of epochs to run
    batch_size = 10  # size of each batch
    batches_per_epoch = len(X_train) // batch_size

    for epoch in range(n_epochs):
        with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0, disable=True) as bar:
            bar.set_description(f"Epoch {epoch}")
            for i in bar:
                # take a batch
                start = i * batch_size
                X_batch = X_train[start:start+batch_size]
                y_batch = y_train[start:start+batch_size]
                # forward pass
                y_pred = model(X_batch)
                loss = loss_fn(y_pred, y_batch)
                # backward pass
                optimizer.zero_grad()
                loss.backward()
                # update weights
                optimizer.step()
                # print progress
                acc = (y_pred.round() == y_batch).float().mean()
                bar.set_postfix(
                    loss=float(loss),
                    acc=float(acc)
                )
    # evaluate accuracy at end of training
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    return float(acc)

# define 5-fold cross validation test harness
kfold = StratifiedKFold(n_splits=5, shuffle=True)
cv_scores = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    acc = model_train(X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    cv_scores.append(acc)

# evaluate the model
print("%.2f%% (+/- %.2f%%)" % (np.mean(cv_scores)*100, np.std(cv_scores)*100))
```

## Summary

In this post, you discovered the importance of having a robust way to estimate the performance of your deep learning models on unseen data, and you learned how to do that. You saw:

- How to split data into training and test sets using scikit-learn
- How to do k-fold cross validation with the help of scikit-learn
- How to modify the training loop in a PyTorch model to incorporate test set validation and cross validation
