Understand Model Behavior During Training by Visualizing Metrics

Last Updated on March 22, 2023

You can learn a lot about neural networks and deep learning models by observing their performance over time during training. For example, if you see the training accuracy getting worse as the epochs go by, you know you have an issue with the optimization. Your learning rate is probably too high. In this post, you will discover how you can review and visualize the performance of PyTorch models over time during training. After completing this post, you will know:

  • What metrics to collect during training
  • How to plot the metrics on training and validation datasets from training
  • How to interpret the plots to learn about the model and the training progress

Let’s get started.

This post is in two parts; they are:

  • Collecting Metrics from a Training Loop
  • Plotting the Training History

Collecting Metrics from a Training Loop

In deep learning, training a model with the gradient descent algorithm means taking a forward pass to compute the loss metric from the input using the model and a loss function, then a backward pass to compute the gradient from the loss metric, and then an update step that applies the gradient to the model parameters. While these are the basic steps you must take, you can do a bit more along the way to collect additional information.
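The three steps can be seen in isolation in a minimal sketch. It assumes a toy linear model and random data (both hypothetical, chosen only so the snippet runs on its own) and performs a single forward pass, backward pass, and parameter update:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: 16 samples, 3 features, one target each
X = torch.randn(16, 3)
y = torch.randn(16, 1)

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Forward pass: run the model and compute the loss
y_pred = model(X)
loss = loss_fn(y_pred, y)
loss_before = loss.item()

# Backward pass: compute the gradient of the loss w.r.t. the parameters
optimizer.zero_grad()
loss.backward()

# Update: apply the gradient to the model parameters
optimizer.step()

# Recompute the loss on the same data after one update
with torch.no_grad():
    loss_after = loss_fn(model(X), y).item()

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

With a small enough learning rate, the loss on the same data should be lower after the update than before it.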

If a model is training correctly, you should expect the loss metric to decrease, as the loss is the objective to optimize. The loss metric to use depends on the problem.

For regression problems, the closer the model’s prediction is to the actual value, the better. Therefore you want to keep track of the mean square error (MSE), or sometimes the root mean square error (RMSE), mean absolute error (MAE), or mean absolute percentage error (MAPE). Although it is not used as a loss metric, you may also be interested in the maximum error produced by your model.
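As an illustration, these regression metrics can all be derived from the same error tensor. The sketch below uses a handful of made-up predictions and targets (hypothetical values, not taken from any dataset):

```python
import torch

# Hypothetical targets and predictions
y_true = torch.tensor([2.0, 4.0, 5.0, 10.0])
y_pred = torch.tensor([2.5, 3.0, 5.0, 8.0])

error = y_pred - y_true

mse = torch.mean(error ** 2)                   # mean square error
rmse = torch.sqrt(mse)                         # root mean square error
mae = torch.mean(torch.abs(error))             # mean absolute error
mape = torch.mean(torch.abs(error / y_true))   # mean absolute percentage error (as a fraction)
max_error = torch.max(torch.abs(error))        # largest absolute error

print(f"MSE={mse.item():.4f} RMSE={rmse.item():.4f} MAE={mae.item():.4f} "
      f"MAPE={mape.item():.4f} max error={max_error.item():.4f}")
```

Note that MAPE is undefined when a target value is zero, so it only makes sense for targets that stay away from zero.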

For classification problems, the loss metric is usually cross entropy. But the value of cross entropy is not very intuitive. Therefore you may also want to keep track of the prediction accuracy, true positive rate, precision, recall, F1 score, and so on.
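For a binary classifier, these metrics all follow from the counts of true and false positives and negatives. The following is a minimal sketch with hypothetical labels and predictions; in practice the predictions would come from thresholding the model output:

```python
import torch

# Hypothetical binary labels and predicted labels
y_true = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = torch.tensor([1, 0, 0, 1, 0, 1, 1, 1])

tp = ((y_pred == 1) & (y_true == 1)).sum().item()  # true positives
tn = ((y_pred == 0) & (y_true == 0)).sum().item()  # true negatives
fp = ((y_pred == 1) & (y_true == 0)).sum().item()  # false positives
fn = ((y_pred == 0) & (y_true == 1)).sum().item()  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also known as the true positive rate
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```

The divisions assume at least one positive prediction and one positive label; a production version would need to guard against zero denominators.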

Collecting these metrics from a training loop is trivial. Let’s start with a basic regression example of deep learning using PyTorch with the California housing dataset:
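A minimal sketch of such a training loop is shown here. To keep it runnable without downloading anything, it assumes synthetic data with eight input features and one target as a stand-in for the California housing dataset; the network size, learning rate, and batch size are likewise illustrative choices:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Assumption: synthetic stand-in for the California housing data
# (8 numeric features, 1 continuous target)
X = torch.randn(1000, 8)
y = X @ torch.randn(8, 1) + 0.1 * torch.randn(1000, 1)

# Hold out the last 30% of the samples as a test set
X_train, y_train = X[:700], y[:700]
X_test, y_test = X[700:], y[700:]

model = nn.Sequential(
    nn.Linear(8, 24), nn.ReLU(),
    nn.Linear(24, 12), nn.ReLU(),
    nn.Linear(12, 1),
)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

n_epochs = 10
batch_size = 32
for epoch in range(n_epochs):
    model.train()
    for start in range(0, len(X_train), batch_size):
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        # Forward pass
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}: MSE of last batch = {loss.item():.4f}")
```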

This implementation is primitive, but at each step you obtain the loss as a tensor, which provides the gradient information the optimizer needs to improve the model. To follow the progress of the training, you can, of course, print this loss metric at every step. But you can also save the value so you can visualize it later. When you do that, be careful to save not the tensor but only its value. This is because the PyTorch tensor remembers how its value was computed so that automatic differentiation can be done. This additional data occupies memory, but you do not need it.

Hence you can modify the training loop to the following:
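One way to do this is sketched here, again assuming small synthetic data and an illustrative model. The only change that matters is the call to `loss.item()`, which returns a plain Python float that carries no computation graph:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Assumption: synthetic regression data standing in for the real dataset
X_train = torch.randn(256, 8)
y_train = X_train.sum(dim=1, keepdim=True)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

n_epochs = 5
batch_size = 32
train_mse_history = []  # one plain Python float per training step

for epoch in range(n_epochs):
    for start in range(0, len(X_train), batch_size):
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Save the value only, not the tensor with its autograd history
        train_mse_history.append(loss.item())

print(f"collected {len(train_mse_history)} loss values")
```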

In training a model, you should evaluate it with a test set that is kept separate from the training set. Usually this is done once per epoch, after all the training steps in that epoch. The test result can also be saved for visualization later. In fact, you can obtain multiple metrics from the test set if you want to. Hence you can add to the training loop as follows:
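A sketch of this addition, under the same assumptions as before (synthetic data, illustrative model and hyperparameters), evaluates the MSE and MAE on the held-out set at the end of every epoch:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Assumption: synthetic regression data, split 192/64 into train and test
X = torch.randn(256, 8)
y = X.sum(dim=1, keepdim=True)
X_train, y_train = X[:192], y[:192]
X_test, y_test = X[192:], y[192:]

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
mse_fn = nn.MSELoss()
mae_fn = nn.L1Loss()  # mean absolute error
optimizer = optim.Adam(model.parameters(), lr=0.01)

n_epochs = 5
batch_size = 32
train_mse_history = []  # one value per training step
test_mse_history = []   # one value per epoch
test_mae_history = []   # one value per epoch

for epoch in range(n_epochs):
    model.train()
    for start in range(0, len(X_train), batch_size):
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        loss = mse_fn(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_mse_history.append(loss.item())
    # Evaluate once per epoch on the whole test set
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
        test_mse_history.append(mse_fn(y_pred, y_test).item())
        test_mae_history.append(mae_fn(y_pred, y_test).item())
```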

You can define your own function to compute the metrics or use one already implemented in the PyTorch library. It is good practice to switch the model to evaluation mode when evaluating. It is also good practice to run the evaluation under the no_grad() context, in which you explicitly tell PyTorch that you do not intend to run automatic differentiation on the tensors.
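The effect of both practices can be checked with a small experiment. The sketch below assumes a toy model with a dropout layer (hypothetical, chosen because dropout behaves differently in training and evaluation modes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model with a layer that depends on the mode
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

# Evaluation mode disables dropout, so repeated calls give the same output
model.eval()
with torch.no_grad():
    out_a = model(x)
    out_b = model(x)

print(torch.equal(out_a, out_b))  # True: dropout is inactive
print(out_a.requires_grad)        # False: no graph was recorded

# Back in training mode and outside no_grad(), the graph is recorded again
model.train()
out_train = model(x)
print(out_train.requires_grad)    # True
```

Remember to switch back with `model.train()` before the next epoch, otherwise layers such as dropout stay disabled during training.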

However, there is a problem in the code above: The MSE from the training set is computed once per training step based on one batch, while the metrics from the test set are computed once per epoch based on the entire test set. They are not directly comparable. In fact, if you look at the MSE from the training steps, you will find it very noisy. A better way is to summarize the MSE values from the same epoch into one number (e.g., their mean) so you can compare it with the test set’s.

With this change, the complete code is as follows:
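A sketch of the revised loop, under the same assumptions (synthetic data, illustrative model), keeps the per-batch MSE in a temporary list and stores only its mean at the end of each epoch:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Assumption: synthetic regression data, split 192/64 into train and test
X = torch.randn(256, 8)
y = X.sum(dim=1, keepdim=True)
X_train, y_train = X[:192], y[:192]
X_test, y_test = X[192:], y[192:]

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
mse_fn = nn.MSELoss()
mae_fn = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

n_epochs = 10
batch_size = 32
train_mse_history = []  # now one value per epoch
test_mse_history = []
test_mae_history = []

for epoch in range(n_epochs):
    model.train()
    batch_mse = []  # per-step values for this epoch only
    for start in range(0, len(X_train), batch_size):
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        loss = mse_fn(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batch_mse.append(loss.item())
    # Summarize the epoch as the mean of its per-batch MSE values
    train_mse_history.append(sum(batch_mse) / len(batch_mse))
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
        test_mse_history.append(mse_fn(y_pred, y_test).item())
        test_mae_history.append(mae_fn(y_pred, y_test).item())
    print(f"Epoch {epoch}: train MSE={train_mse_history[-1]:.4f}, "
          f"test MSE={test_mse_history[-1]:.4f}, test MAE={test_mae_history[-1]:.4f}")
```

The simple mean treats every batch equally. If the last batch is smaller than the others, weighting each batch by its size would give the exact average over all samples seen in the epoch.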

Plotting the Training History

In the code above, you collected the metrics in Python lists, one value per epoch. Therefore, it is trivial to plot them as line graphs using matplotlib. Below is an example:
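A minimal plotting sketch follows. So that it runs on its own, it assumes short hypothetical history lists (placeholder values, not results from any real run) in place of the lists filled by the training loop, and it selects a non-interactive matplotlib backend:

```python
import math

import matplotlib
matplotlib.use("Agg")  # Assumption: headless run; remove this line to use an interactive backend
import matplotlib.pyplot as plt

# Hypothetical per-epoch histories (placeholder values, not real results)
train_mse_history = [1.60, 0.90, 0.62, 0.50, 0.44]
test_mse_history = [1.10, 0.80, 0.66, 0.60, 0.58]
test_mae_history = [0.80, 0.68, 0.61, 0.57, 0.56]

fig, ax = plt.subplots()
# Plot RMSE rather than MSE so it is on the same scale as MAE
ax.plot([math.sqrt(v) for v in train_mse_history], label="Train RMSE")
ax.plot([math.sqrt(v) for v in test_mse_history], label="Test RMSE")
ax.plot(test_mae_history, label="Test MAE")
ax.set_xlabel("Epoch")
ax.set_ylabel("Error")
ax.legend()
fig.tight_layout()
```

In an interactive session, call `plt.show()` to display the figure, or `fig.savefig(...)` to write it to a file.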

It plots, for example, the following:

Plots like this can tell you useful things about the training of the model, such as:

  • Its speed of convergence over epochs (slope)
  • Whether the model may have already converged (plateau of the line)
  • Whether the model may be over-learning the training data (inflection for validation line)

In a regression example like the above, the metrics MAE and MSE should both decrease if the model gets better. In a classification example, however, the accuracy metric should increase while the cross entropy loss should decrease as training progresses. This is what you should expect to see in the plot.

These curves should eventually flatten, meaning you cannot improve the model any further with the current dataset, model design, and algorithms. You want this to happen as soon as possible, which means your model converges quickly and your training is efficient. You also want the metric to flatten in a region of high accuracy or low loss, so your model is effective in prediction.

Another property to watch for in the plots is how much the metrics from training and validation differ. In the plot above, the training set’s RMSE is higher than the test set’s RMSE at the beginning, but very soon the curves cross and the test set’s RMSE is higher at the end. This is expected, as the model will eventually fit the training set better, but it is the test set that predicts how the model performs on future, unseen data.

You need to be careful when interpreting the curves or metrics at a microscopic scale. In the plot above, you see that the training set’s RMSE is extremely large compared to the test set’s at epoch 0. The real difference may not be that drastic: since you computed the training set’s RMSE by averaging the MSE of each step during the first epoch, your model probably did poorly in the first few steps but much better in the last few steps of the epoch. Averaging across all the steps may not give a fair comparison, as the MSE from the test set is based on the model after the last step.

Your model is overfitting if the training set’s metric is much better than the test set’s. This hints that you should stop training at an earlier epoch or that your model’s design needs some regularization, such as a dropout layer.
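The same reading can be done numerically on the collected histories. The sketch below assumes hypothetical per-epoch loss lists (placeholder values) and finds the epoch with the lowest test loss, as well as the gap between the test and training loss at each epoch:

```python
# Hypothetical per-epoch losses (placeholder values, not real results)
train_loss = [1.00, 0.70, 0.50, 0.38, 0.30, 0.24, 0.20]
test_loss = [1.05, 0.78, 0.62, 0.58, 0.60, 0.65, 0.72]

# Epoch at which the test loss was lowest: a candidate point to stop training
best_epoch = min(range(len(test_loss)), key=lambda i: test_loss[i])

# Gap between test and training loss; a widening gap suggests overfitting
gap = [te - tr for tr, te in zip(train_loss, test_loss)]

print(f"best epoch: {best_epoch}")
print(f"gap at best epoch: {gap[best_epoch]:.2f}, gap at last epoch: {gap[-1]:.2f}")
```

In these placeholder numbers the training loss keeps falling while the test loss starts rising after the best epoch, which is the pattern described above.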

In the plot above, although you collected the mean square error (MSE) for the regression problem, you plotted the root mean square error (RMSE) instead so you can compare it to the mean absolute error (MAE) on the same scale. You should probably collect the MAE of the training set as well. The two MAE curves should behave similarly to the RMSE curves.

Putting everything together, the following is the complete code:
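A complete sketch under the assumptions used throughout (synthetic data with eight features in place of the California housing dataset, an illustrative network and hyperparameters, and a non-interactive matplotlib backend) might look like the following. It also collects the MAE of the training set, as suggested above:

```python
import math

import matplotlib
matplotlib.use("Agg")  # Assumption: headless run; remove this line to use an interactive backend
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Assumption: synthetic stand-in for the California housing data
X = torch.randn(1000, 8)
y = X @ torch.randn(8, 1) + 0.1 * torch.randn(1000, 1)
X_train, y_train = X[:700], y[:700]
X_test, y_test = X[700:], y[700:]

model = nn.Sequential(
    nn.Linear(8, 24), nn.ReLU(),
    nn.Linear(24, 12), nn.ReLU(),
    nn.Linear(12, 1),
)
mse_fn = nn.MSELoss()
mae_fn = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

n_epochs = 20
batch_size = 32
history = {"train_mse": [], "train_mae": [], "test_mse": [], "test_mae": []}

for epoch in range(n_epochs):
    # Training steps: collect per-batch metrics as plain floats
    model.train()
    batch_mse, batch_mae = [], []
    for start in range(0, len(X_train), batch_size):
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        y_pred = model(X_batch)
        loss = mse_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batch_mse.append(loss.item())
        batch_mae.append(mae_fn(y_pred.detach(), y_batch).item())
    # One number per epoch for each training metric
    history["train_mse"].append(sum(batch_mse) / len(batch_mse))
    history["train_mae"].append(sum(batch_mae) / len(batch_mae))
    # Evaluation on the whole test set, once per epoch
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
        history["test_mse"].append(mse_fn(y_pred, y_test).item())
        history["test_mae"].append(mae_fn(y_pred, y_test).item())

# Plot RMSE and MAE for both sets on the same scale
fig, ax = plt.subplots()
ax.plot([math.sqrt(v) for v in history["train_mse"]], label="Train RMSE")
ax.plot([math.sqrt(v) for v in history["test_mse"]], label="Test RMSE")
ax.plot(history["train_mae"], label="Train MAE")
ax.plot(history["test_mae"], label="Test MAE")
ax.set_xlabel("Epoch")
ax.set_ylabel("Error")
ax.legend()
fig.tight_layout()
```

To use the real dataset instead, replace the synthetic tensors with your own feature and target tensors of matching shapes.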



In this post, you discovered the importance of collecting and reviewing metrics while training your deep learning models. You learned:

  • What metrics to look for during model training
  • How to compute and collect metrics in a PyTorch training loop
  • How to visualize the metrics from a training loop
  • How to interpret the metrics to infer details about the training process
