How to Evaluate the Performance of PyTorch Models

By Adrian Tam on April 8, 2023 in Deep Learning with PyTorch

Designing a deep learning model is sometimes an art. There are many decision points, and it is not easy to tell which choice is best. One way to arrive at a design is trial and error, evaluating each candidate on real data. Therefore, it is important to have a scientific method for evaluating the performance of your neural network and deep learning models. In fact, the same method can be used to compare any kind of machine learning model for a particular use.

In this post, you will discover the accepted workflow for robustly evaluating model performance. The examples use PyTorch to build the models, but the method also applies to other kinds of models. After completing this post, you will know:

How to evaluate a PyTorch model using a validation dataset
How to evaluate a PyTorch model with k-fold cross-validation

Let’s get started.

How to evaluate the performance of PyTorch models
Photo by Kin Shing Lai. Some rights reserved.

Overview

This post is in four parts; they are:

Empirical Evaluation of Models
Data Splitting
Training a PyTorch Model with Validation
k-Fold Cross Validation

Empirical Evaluation of Models

When designing and configuring a deep learning model from scratch, there are a lot of decisions to make. These include design decisions such as how many layers to use, how big each layer is, and what kind of layers or activation functions to use. They also include the choice of loss function, optimization algorithm, number of epochs to train, and the interpretation of the model output. Luckily, you can sometimes copy the structure of other people’s networks, and sometimes you can make your choice using heuristics. To tell whether you made a good choice, the best way is to compare multiple alternatives by empirically evaluating them with actual data.

Deep learning is often used on problems with very large datasets, that is, tens or hundreds of thousands of data samples. This provides ample data for testing. But you need a robust test strategy to estimate the performance of your model on unseen data. Based on that, you have a metric for comparing different model configurations.

Data Splitting

If you have a dataset of tens of thousands of samples or more, you don’t always need to give everything to your model for training. Doing so unnecessarily increases the complexity and lengthens the training time, and it may not give you the best result. More is not always better.

When you have a large amount of data, you should take a portion of it as the training set that is fed into the model for training. Another portion is held back from training as a test set and is used only to evaluate the trained or partially trained model. This step is usually called a “train-test split.”

Let’s consider the Pima Indians Diabetes dataset. You can load the data using NumPy:

import numpy as np
data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")

There are 768 data samples. It is not a lot but is enough to demonstrate the split. Let’s consider the first 66% as the training set and the remaining as the test set. The easiest way to do so is by slicing an array:

# find the boundary at 66% of total samples
count = len(data)
n_train = int(count * 0.66)
# split the data at the boundary
train_data = data[:n_train]
test_data = data[n_train:]

The choice of 66% is arbitrary, but you do not want the training set to be too small. A 70%-30% split is also common. If the dataset is huge, you may even use a 30%-70% split, provided that 30% of the data is enough for training.

If you split the data in this way, you are assuming the dataset is already shuffled so that the training set and the test set are equally diverse. If the original dataset is sorted and you take the test set only from the end, all the test data may belong to the same class or carry the same value in one of the input features. That’s not ideal.
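To see the risk concretely, here is a minimal sketch on a toy array that is sorted by label. The array, the seed, and the variable names are made up for illustration and are not part of the Pima Indians dataset. It compares a naive slice with a slice taken after permuting the rows using a seeded NumPy generator:

```python
import numpy as np

# Toy stand-in for a dataset sorted by label: 8 rows of class 0, then 4 of class 1.
# (Hypothetical data, not the Pima Indians file.)
labels = np.array([0] * 8 + [1] * 4)
data = np.column_stack([np.arange(12), labels])

n_train = int(len(data) * 0.66)          # 7 rows go to training

# Naive slice: the training part sees only class 0
naive_train, naive_test = data[:n_train], data[n_train:]
print("naive train labels:  ", naive_train[:, 1])
print("naive test labels:   ", naive_test[:, 1])

# Permute the row order with a seeded generator, then slice
rng = np.random.default_rng(seed=42)     # seed chosen arbitrarily
perm = rng.permutation(len(data))
shuffled = data[perm]
train_data, test_data = shuffled[:n_train], shuffled[n_train:]
print("shuffled test labels:", test_data[:, 1])
```

With the naive slice, a model would never see a positive example during training. Shuffling does not guarantee a perfectly balanced split, but it removes the dependence on the original row order.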

Of course, you can call np.random.shuffle(data) before the split to avoid that. But many machine learning engineers use scikit-learn for this. See this example:

import numpy as np
from sklearn.model_selection import train_test_split

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
train_data, test_data = train_test_split(data, test_size=0.33)

But more commonly, it is done after you separate the input feature and output labels. Note that this function from scikit-learn can work not only on NumPy arrays but also on PyTorch tensors:

import numpy as np
import torch
from sklearn.model_selection import train_test_split

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
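If you want the split to be repeatable and to keep the class ratio similar in both parts, train_test_split() also accepts the random_state and stratify arguments. The sketch below uses a synthetic NumPy array with the same shape as the dataset (768 rows, 8 features) rather than the real file, and the 35% positive rate is only an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the data: 768 rows, 8 features
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = (rng.random(768) < 0.35).astype(np.float32)   # roughly 35% positives (assumed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    random_state=42,   # fixed seed so the split is repeatable
    stratify=y,        # keep the class ratio similar in both parts
)
print(len(X_train), len(X_test))
print("positive rate:", y.mean(), y_train.mean(), y_test.mean())
```

With 768 samples and test_size=0.33, scikit-learn rounds the test set up to 254 samples and leaves 514 for training.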

Training a PyTorch Model with Validation

Let’s revisit the code for building and training a deep learning model on this dataset:

import torch
import torch.nn as nn
import torch.optim as optim
import tqdm

...

model = nn.Sequential(
    nn.Linear(8, 12),
    nn.ReLU(),
    nn.Linear(12, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)

# loss function and optimizer
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.0001)

n_epochs = 50    # number of epochs to run
batch_size = 10  # size of each batch
batches_per_epoch = len(X_train) // batch_size

for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            bar.set_postfix(
                loss=float(loss)
            )

In this code, one batch is extracted from the training set in each iteration and sent to the model in the forward pass. Then you compute the gradient in the backward pass and update the weights.
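One detail of this batching scheme is worth checking by hand: because batches_per_epoch uses floor division, any samples beyond the last full batch are skipped in every epoch. The sketch below works through the arithmetic, assuming 514 training samples (what a 33% test split of 768 samples leaves) and the batch size of 10 used above:

```python
n_samples = 514      # training rows left after a 33% test split of 768 samples
batch_size = 10

batches_per_epoch = n_samples // batch_size          # floor division
leftover = n_samples - batches_per_epoch * batch_size

# (start, end) row ranges actually visited in one epoch
ranges = [(i * batch_size, i * batch_size + batch_size)
          for i in range(batches_per_epoch)]

print(batches_per_epoch)       # 51
print(ranges[0], ranges[-1])   # (0, 10) (500, 510)
print(leftover)                # 4 rows are never used for training
```

For a dataset of this size the loss of four samples is minor. If it matters, one option is to round the number of batches up instead, since slicing past the end of a tensor simply returns a shorter final batch.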

While you used binary cross entropy as the loss metric in the training loop in this case, you may be more concerned with the prediction accuracy. Calculating accuracy is easy. You round the output (in the range of 0 to 1) to the nearest integer to get a binary value of 0 or 1. Then you count the percentage of predictions that match the labels; this gives you the accuracy.
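As a small worked example of this calculation, the sketch below uses five made-up sigmoid outputs and labels (they do not come from the model in this post). Four of the five rounded predictions match the labels, so the accuracy is 80%:

```python
import torch

# Hypothetical sigmoid outputs for five samples and their true labels
y_pred = torch.tensor([[0.10], [0.80], [0.65], [0.30], [0.95]])
y_true = torch.tensor([[0.0], [1.0], [0.0], [0.0], [1.0]])

y_class = y_pred.round()              # 0/1 predictions, i.e. a threshold at 0.5
matches = (y_class == y_true)         # boolean tensor, True where prediction is right
accuracy = matches.float().mean()     # fraction of correct predictions

print(y_class.flatten().tolist())     # [0.0, 1.0, 1.0, 0.0, 1.0]
print(int(matches.sum()), "of", len(y_true), "correct")
print(float(accuracy))
```

The third sample is the only miss: its output of 0.65 rounds to 1 while the label is 0.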

But what is your prediction? It is y_pred above, which is the prediction by your current model on X_batch. Adding accuracy to the training loop becomes this:

for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress, with accuracy
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc)
            )

However, X_batch and y_batch are used by the optimizer, which tunes your model so that it can predict y_batch from X_batch. Now you’re using accuracy to check whether y_pred matches y_batch. It is like cheating: if your model somehow remembers the solution, it can just report y_pred back to you and get perfect accuracy without actually inferring y_pred from X_batch.

Indeed, a deep learning model can be so convoluted that you cannot know whether it simply remembers the answer or is inferring it. Therefore, the best way is to calculate accuracy not from X_batch or anything else in X_train but from something else: your test set. Let’s add an accuracy measurement after each epoch using X_test:

for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc)
            )
    # evaluate model at end of epoch
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    acc = float(acc)
    print(f"End of {epoch}, accuracy {acc}")

In this case, the acc in the inner for-loop is just a metric showing the progress. It is not much different from displaying the loss metric, except that it is not involved in the gradient descent algorithm. You expect the accuracy to improve as the loss metric improves.

In the outer for-loop, at the end of each epoch, you calculate the accuracy from X_test. The workflow is similar: you give the test set to the model and ask for its prediction, then count how many results match your test set labels. This accuracy is the one you should care about. It should improve as training progresses; if it stops improving or even deteriorates, you should interrupt the training because the model seems to have started overfitting. Overfitting is when the model starts to remember the training set rather than learning to infer the prediction from it. A sign of overfitting is that the accuracy on the training set keeps increasing while the accuracy on the test set decreases.
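The idea of interrupting the training can be automated. The sketch below is one possible way to do it and is not part of the original training loop: EarlyStopper is a hypothetical helper that signals a stop once the test accuracy has failed to improve for a given number of consecutive epochs (the patience). The list of scores is made up to show the behavior:

```python
class EarlyStopper:
    """Hypothetical helper: signal a stop when the score has not
    improved for `patience` consecutive epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, score):
        """Record this epoch's score; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up test accuracies that rise and then decline
scores = [0.58, 0.61, 0.64, 0.66, 0.65, 0.66, 0.65, 0.64, 0.63, 0.62]
stopper = EarlyStopper(patience=3)
for epoch, acc in enumerate(scores):
    if stopper.update(epoch, acc):
        print(f"Stop at epoch {epoch}; best was epoch {stopper.best_epoch}")
        break
```

In a real training loop, you would call update() with the accuracy computed from X_test at the end of each epoch and break out of the outer for-loop when it returns True. A larger patience tolerates more noise in the score at the cost of training longer.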

The following is the complete code to implement everything above, from data splitting to validation using the test set:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.model_selection import train_test_split

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

model = nn.Sequential(
    nn.Linear(8, 12),
    nn.ReLU(),
    nn.Linear(12, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)

# loss function and optimizer
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.0001)

n_epochs = 50    # number of epochs to run
batch_size = 10  # size of each batch
batches_per_epoch = len(X_train) // batch_size

for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc)
            )
    # evaluate model at end of epoch
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    acc = float(acc)
    print(f"End of {epoch}, accuracy {acc}")

The code above will print the following:

End of 0, accuracy 0.5787401795387268
End of 1, accuracy 0.6102362275123596
End of 2, accuracy 0.6220472455024719
End of 3, accuracy 0.6220472455024719
End of 4, accuracy 0.6299212574958801
End of 5, accuracy 0.6377952694892883
End of 6, accuracy 0.6496062874794006
End of 7, accuracy 0.6535432934761047
End of 8, accuracy 0.665354311466217
End of 9, accuracy 0.6614173054695129
End of 10, accuracy 0.665354311466217
End of 11, accuracy 0.665354311466217
End of 12, accuracy 0.665354311466217
End of 13, accuracy 0.665354311466217
End of 14, accuracy 0.665354311466217
End of 15, accuracy 0.6732283234596252
End of 16, accuracy 0.6771653294563293
End of 17, accuracy 0.6811023354530334
End of 18, accuracy 0.6850393414497375
End of 19, accuracy 0.6889764070510864
End of 20, accuracy 0.6850393414497375
End of 21, accuracy 0.6889764070510864
End of 22, accuracy 0.6889764070510864
End of 23, accuracy 0.6889764070510864
End of 24, accuracy 0.6889764070510864
End of 25, accuracy 0.6850393414497375
End of 26, accuracy 0.6811023354530334
End of 27, accuracy 0.6771653294563293
End of 28, accuracy 0.6771653294563293
End of 29, accuracy 0.6692913174629211
End of 30, accuracy 0.6732283234596252
End of 31, accuracy 0.6692913174629211
End of 32, accuracy 0.6692913174629211
End of 33, accuracy 0.6732283234596252
End of 34, accuracy 0.6771653294563293
End of 35, accuracy 0.6811023354530334
End of 36, accuracy 0.6811023354530334
End of 37, accuracy 0.6811023354530334
End of 38, accuracy 0.6811023354530334
End of 39, accuracy 0.6811023354530334
End of 40, accuracy 0.6811023354530334
End of 41, accuracy 0.6771653294563293
End of 42, accuracy 0.6771653294563293
End of 43, accuracy 0.6771653294563293
End of 44, accuracy 0.6771653294563293
End of 45, accuracy 0.6771653294563293
End of 46, accuracy 0.6771653294563293
End of 47, accuracy 0.6732283234596252
End of 48, accuracy 0.6732283234596252
End of 49, accuracy 0.6732283234596252

k-Fold Cross Validation

In the above example, you calculated the accuracy from the test set. It is used as a score for the model as you progressed in the training. You want to stop at the point where this score is at its maximum. In fact, by merely comparing the score from this test set, you know your model works best after epoch 21 and starts to overfit afterward. Is that right?

If you built two models of different designs, should you just compare these models’ accuracy on the same test set and claim one is better than another?

Actually, you can argue that the test set is not representative enough even after you have shuffled your dataset before extracting the test set. You may also argue that, by chance, one model fits better to this particular test set but not always better. To make a stronger argument on which model is better independent of the selection of the test set, you can try multiple test sets and average the accuracy.

This is what k-fold cross validation does. It is a process for deciding which design works better. It works by repeating the training process from scratch $k$ times, each time with a different composition of the training and test sets. As a result, you will have $k$ models and $k$ accuracy scores from their respective test sets. You are interested not only in the average accuracy but also in the standard deviation. The standard deviation tells you whether the accuracy score is consistent or whether some test sets are particularly good or bad for a model.
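For reference, the two summary statistics can be written out as follows. Here $a_i$ is assumed to denote the accuracy measured on the test set of the $i$-th fold (the symbols are introduced only for this illustration). The standard deviation is given in its population form, dividing by $k$, which is what NumPy's np.std() computes by default:

```latex
\bar{a} = \frac{1}{k} \sum_{i=1}^{k} a_i ,
\qquad
\sigma = \sqrt{\frac{1}{k} \sum_{i=1}^{k} \left( a_i - \bar{a} \right)^2}
```

A small $\sigma$ relative to $\bar{a}$ suggests the score does not depend much on which samples ended up in the test set.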

Since k-fold cross validation trains the model from scratch several times, it is best to wrap the training loop in a function:

def model_train(X_train, y_train, X_test, y_test):
    # create new model
    model = nn.Sequential(
        nn.Linear(8, 12),
        nn.ReLU(),
        nn.Linear(12, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid()
    )

    # loss function and optimizer
    loss_fn = nn.BCELoss()  # binary cross entropy
    optimizer = optim.Adam(model.parameters(), lr=0.0001)

    n_epochs = 25    # number of epochs to run
    batch_size = 10  # size of each batch
    batches_per_epoch = len(X_train) // batch_size

    for epoch in range(n_epochs):
        with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0, disable=True) as bar:
            bar.set_description(f"Epoch {epoch}")
            for i in bar:
                # take a batch
                start = i * batch_size
                X_batch = X_train[start:start+batch_size]
                y_batch = y_train[start:start+batch_size]
                # forward pass
                y_pred = model(X_batch)
                loss = loss_fn(y_pred, y_batch)
                # backward pass
                optimizer.zero_grad()
                loss.backward()
                # update weights
                optimizer.step()
                # print progress
                acc = (y_pred.round() == y_batch).float().mean()
                bar.set_postfix(
                    loss=float(loss),
                    acc=float(acc)
                )
    # evaluate accuracy at end of training
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    return float(acc)

The code above is deliberately not printing anything (with disable=True in tqdm) to keep the screen less cluttered.
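The final evaluation step in model_train() runs the model on the test set with gradient tracking still enabled. That works, but evaluation does not need gradients. The sketch below shows one way the step could be written instead. The helper name evaluate_accuracy() and the random stand-in data are assumptions for illustration, not part of the original code:

```python
import torch
import torch.nn as nn

def evaluate_accuracy(model, X, y):
    """Hypothetical helper: accuracy of a binary classifier with sigmoid output."""
    model.eval()                   # put layers such as dropout into inference mode
    with torch.no_grad():          # gradients are not needed for evaluation
        y_pred = model(X)
    acc = (y_pred.round() == y).float().mean()
    model.train()                  # restore training mode for any further epochs
    return float(acc)

# Quick check with an untrained model and random stand-in data
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 12), nn.ReLU(), nn.Linear(12, 1), nn.Sigmoid())
X_demo = torch.randn(20, 8)
y_demo = torch.randint(0, 2, (20, 1)).float()
acc = evaluate_accuracy(model, X_demo, y_demo)
print(acc)
```

For the small model in this post, which has no dropout or batch normalization layers, the computed accuracy is the same either way. The benefit is lower memory use during evaluation and code that stays correct if such layers are added later.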

Also from scikit-learn, you have a function for k-fold cross validation. You can make use of it to produce a robust estimate of model accuracy:

from sklearn.model_selection import StratifiedKFold

# define 5-fold cross validation test harness
kfold = StratifiedKFold(n_splits=5, shuffle=True)
cv_scores = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    acc = model_train(X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    cv_scores.append(acc)
# evaluate the model
print("%.2f%% (+/- %.2f%%)" % (np.mean(cv_scores)*100, np.std(cv_scores)*100))

Running this prints:

Accuracy: 0.64
Accuracy: 0.67
Accuracy: 0.68
Accuracy: 0.63
Accuracy: 0.59
64.05% (+/- 3.30%)

In scikit-learn, there are multiple k-fold cross validation functions, and the one used here is stratified k-fold. It assumes y contains class labels and takes their values into account so that each split has a balanced class representation.

The code above used $k=5$, or 5 splits. It means splitting the dataset into five portions of roughly equal size, picking one of them as the test set, and combining the rest into a training set. There are five ways of doing that, so the for-loop above has five iterations. In each iteration, you call the model_train() function and obtain the accuracy score in return. You then save it into a list, which is used to calculate the mean and standard deviation at the end.
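To see what the five splits look like without training anything, the sketch below runs StratifiedKFold on synthetic labels. The labels, the 35% positive rate, and the seeds are made up for illustration. It prints the size of each training and test part and confirms that every sample is used as test data exactly once:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels standing in for the 768 class labels (values are made up)
rng = np.random.default_rng(0)
y = (rng.random(768) < 0.35).astype(int)
X = np.zeros((768, 8))               # feature values do not affect how folds are formed

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
seen = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X, y)):
    seen.extend(test_idx.tolist())
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} "
          f"positive rate in test={y[test_idx].mean():.3f}")

# Every sample lands in exactly one test fold
print(sorted(seen) == list(range(768)))
```

Since 768 is not divisible by 5, the test folds cannot all be exactly the same size, but they differ only by a sample or so, and the positive rate stays close to the overall rate in each fold.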

The kfold object returns indices to you. Hence you do not need to run the train-test split in advance; instead, you use the indices provided to extract the training set and test set on the fly when you call the model_train() function.

The result above shows the model is moderately good, at 64% average accuracy. And this score is stable since the standard deviation is at 3%. This means that most of the time, you expect the model accuracy to be 61% to 67%. You may try to change the model above, such as adding or removing a layer, and see how much change you have in the mean and standard deviation. You may also try to increase the number of epochs used in training and observe the result.
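The 61% to 67% range quoted above is simply the mean minus and plus one standard deviation. As a quick check, the sketch below recomputes both statistics from the five fold accuracies as printed. Because those printed values are already rounded to two decimals, the result differs slightly from the 64.05% (+/- 3.30%) reported by the full-precision run:

```python
import numpy as np

# Fold accuracies as printed above (already rounded to two decimals)
cv_scores = [0.64, 0.67, 0.68, 0.63, 0.59]

mean = np.mean(cv_scores)
std = np.std(cv_scores)            # population standard deviation (ddof=0)
print("%.2f%% (+/- %.2f%%)" % (mean * 100, std * 100))
print("range: %.0f%% to %.0f%%" % ((mean - std) * 100, (mean + std) * 100))
```

With only five scores, this range is a rough indication rather than a precise interval, which is another reason to compare model designs on both the mean and the spread.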

The mean and standard deviation from the k-fold cross validation are what you should use to benchmark a model design.

Tying it all together, below is the complete code for k-fold cross validation:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.model_selection import StratifiedKFold

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

def model_train(X_train, y_train, X_test, y_test):
    # create new model
    model = nn.Sequential(
        nn.Linear(8, 12),
        nn.ReLU(),
        nn.Linear(12, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid()
    )

    # loss function and optimizer
    loss_fn = nn.BCELoss()  # binary cross entropy
    optimizer = optim.Adam(model.parameters(), lr=0.0001)

    n_epochs = 25    # number of epochs to run
    batch_size = 10  # size of each batch
    batches_per_epoch = len(X_train) // batch_size

    for epoch in range(n_epochs):
        with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0, disable=True) as bar:
            bar.set_description(f"Epoch {epoch}")
            for i in bar:
                # take a batch
                start = i * batch_size
                X_batch = X_train[start:start+batch_size]
                y_batch = y_train[start:start+batch_size]
                # forward pass
                y_pred = model(X_batch)
                loss = loss_fn(y_pred, y_batch)
                # backward pass
                optimizer.zero_grad()
                loss.backward()
                # update weights
                optimizer.step()
                # print progress
                acc = (y_pred.round() == y_batch).float().mean()
                bar.set_postfix(
                    loss=float(loss),
                    acc=float(acc)
                )
    # evaluate accuracy at end of training
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    return float(acc)

# define 5-fold cross validation test harness
kfold = StratifiedKFold(n_splits=5, shuffle=True)
cv_scores = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    acc = model_train(X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    cv_scores.append(acc)
# evaluate the model
print("%.2f%% (+/- %.2f%%)" % (np.mean(cv_scores)*100, np.std(cv_scores)*100))

Summary

In this post, you discovered the importance of having a robust way to estimate the performance of your deep learning models on unseen data, and you learned how to do that. You saw:

How to split data into training and test sets using scikit-learn
How to do k-fold cross validation with the help of scikit-learn
How to modify the training loop in a PyTorch model to incorporate test set validation and cross validation
