Last Updated on March 22, 2023

A popular demonstration of the capability of deep learning techniques is object recognition in image data. The “hello world” of object recognition for machine learning and deep learning is the MNIST dataset for handwritten digit recognition. In this post, you will discover how to develop a deep learning model to achieve near state-of-the-art performance on the MNIST handwritten digit recognition task in PyTorch. After completing this post, you will know:

- How to load the MNIST dataset using torchvision
- How to develop and evaluate a baseline neural network model for the MNIST problem
- How to implement and evaluate a simple Convolutional Neural Network for MNIST
- How to implement a state-of-the-art deep learning model for MNIST

Let’s get started.

## Overview

This post is divided into five parts; they are:

- The MNIST Handwritten Digit Recognition Problem
- Loading the MNIST Dataset in PyTorch
- Baseline Model with Multilayer Perceptrons
- Simple Convolutional Neural Network for MNIST
- LeNet5 for MNIST

## The MNIST Handwritten Digit Recognition Problem

The MNIST problem is a classic problem that can demonstrate the power of convolutional neural networks. The MNIST dataset was developed by Yann LeCun, Corinna Cortes, and Christopher Burges for evaluating machine learning models on the handwritten digit classification problem. The dataset was constructed from a number of scanned document datasets available from the National Institute of Standards and Technology (NIST). This is where the name for the dataset comes from, the Modified NIST or MNIST dataset.

Images of digits were taken from a variety of scanned documents, normalized in size, and centered. This makes it an excellent dataset for evaluating models, allowing the developer to focus on machine learning with minimal data cleaning or preparation required. Each image is a 28×28-pixel square (784 pixels total) in grayscale. A standard split of the dataset is used to evaluate and compare models, where 60,000 images are used to train a model, and a separate set of 10,000 images is used to test it.

The goal of this problem is to identify the digit in each image. There are ten digits (0 to 9), or ten classes, to predict. The state-of-the-art prediction accuracy is at the 99.8% level, achieved with large convolutional neural networks.


## Loading the MNIST Dataset in PyTorch

The `torchvision` library is a sister project of PyTorch that provides specialized functions for computer vision tasks. There is a function in `torchvision` that can download the MNIST dataset for use with PyTorch. The dataset is downloaded the first time this function is called and stored locally, so you don’t need to download it again in the future. Below is a short script that downloads the dataset and visualizes the first 16 images in the training subset.

```python
import matplotlib.pyplot as plt
import torchvision

train = torchvision.datasets.MNIST('./data', train=True, download=True)

fig, ax = plt.subplots(4, 4, sharex=True, sharey=True)
for i in range(4):
    for j in range(4):
        ax[i][j].imshow(train.data[4*i+j], cmap="gray")
plt.show()
```

## Baseline Model with Multilayer Perceptrons

Do you really need a complex model like a convolutional neural network to get good results with MNIST? You can get reasonable results using a very simple neural network model with a single hidden layer. In this section, you will create a simple multilayer perceptron model that reaches about 92% accuracy after ten epochs. You will use this as a baseline for comparison to more complex convolutional neural network models. First, let’s check what the data looks like:

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision

# Load MNIST data
train = torchvision.datasets.MNIST('data', train=True, download=True)
test = torchvision.datasets.MNIST('data', train=False, download=True)
print(train.data.shape, train.targets.shape)
print(test.data.shape, test.targets.shape)
```

You should see:

```
torch.Size([60000, 28, 28]) torch.Size([60000])
torch.Size([10000, 28, 28]) torch.Size([10000])
```

The training dataset is structured as a 3-dimensional array of instance, image height, and image width. For a multilayer perceptron model, you must reduce each image to a vector of pixels. In this case, the 28×28-pixel images become input vectors of 784 values. You can do this transform easily using the `reshape()` function.

The pixel values are grayscale between 0 and 255. It is almost always a good idea to scale input values when using neural network models. Because the scale is well known and well behaved, you can very quickly normalize the pixel values to the range 0 to 1 by dividing each value by the maximum of 255.

In the following, you reshape the dataset, convert it to floating point, and normalize it by dividing by 255.

```python
# each sample becomes a vector of values 0-1
X_train = train.data.reshape(-1, 784).float() / 255.0
y_train = train.targets
X_test = test.data.reshape(-1, 784).float() / 255.0
y_test = test.targets
```
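If you want to see what the reshape and scaling do without downloading anything, the following minimal sketch applies the same two steps to a made-up tensor. The random pixel values and the name `images` are illustrative assumptions, not part of MNIST.

```python
import torch

# Toy stand-in for MNIST: 3 "images" of 28x28 uint8 pixels (random values, not real digits)
images = torch.randint(0, 256, (3, 28, 28), dtype=torch.uint8)

# Flatten each image to a 784-vector, convert to float, and scale into the range 0 to 1
X = images.reshape(-1, 784).float() / 255.0

print(X.shape)   # torch.Size([3, 784])
print(X.dtype)   # torch.float32
print(float(X.min()) >= 0.0, float(X.max()) <= 1.0)
```

The `-1` in `reshape()` lets PyTorch infer the number of samples, so the same line works for both the training and test sets.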

The output targets `y_train` and `y_test` are labels in the form of integers from 0 to 9. This is a multiclass classification problem. You can convert these labels into one-hot encoding or keep them as integer labels, as in this case. You are going to use the cross entropy function to evaluate the model performance, and the PyTorch implementation of the cross entropy function can be applied to either one-hot encoded targets or integer-labeled targets.
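As a quick check of that claim, the sketch below feeds the same made-up logits to `nn.CrossEntropyLoss` twice, once with integer labels and once with one-hot float targets. The logits and labels are invented for illustration, and the one-hot form assumes a PyTorch version recent enough to accept class-probability targets (1.10 or later, as far as I know).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

loss_fn = nn.CrossEntropyLoss()

# Made-up logits for 4 samples and 10 classes
torch.manual_seed(0)
logits = torch.randn(4, 10)

# Integer class labels, the form in which MNIST provides them
labels = torch.tensor([3, 0, 9, 1])
loss_int = loss_fn(logits, labels)

# The same labels as one-hot float vectors (treated as class probabilities)
one_hot = F.one_hot(labels, num_classes=10).float()
loss_onehot = loss_fn(logits, one_hot)

# The two losses should agree up to floating point error
print(loss_int.item(), loss_onehot.item())
```

Integer labels are usually the more convenient choice because they take less memory and need no conversion step.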

You are now ready to create your simple neural network model. You will define your model in a PyTorch `Module` class.

```python
class Baseline(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 784)
        self.act1 = nn.ReLU()
        self.layer2 = nn.Linear(784, 10)

    def forward(self, x):
        x = self.act1(self.layer1(x))
        x = self.layer2(x)
        return x
```

The model is a simple neural network with one hidden layer that has the same number of neurons as there are inputs (784). A rectifier activation function is used for the neurons in the hidden layer. The outputs of this model are **logits**, meaning they are real numbers that can be transformed into probability-like values using a softmax function. You do not apply the softmax function explicitly because the cross entropy function will do that for you.
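To make the relationship between logits, softmax, and cross entropy concrete, here is a small sketch on made-up logits. It shows that softmax produces rows summing to 1, and that `nn.CrossEntropyLoss` applied to raw logits matches the negative log-likelihood computed on log-softmax output. The numbers are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 2 samples and 10 classes
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
labels = torch.tensor([0, 4])

# Softmax turns each row of logits into non-negative values that sum to 1
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))

# Cross entropy on raw logits equals negative log-likelihood on log-softmax output
loss_direct = nn.CrossEntropyLoss()(logits, labels)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(loss_direct.item(), loss_manual.item())
```

Because softmax preserves the ordering of the logits, `torch.argmax()` gives the same predicted class whether you apply it to the logits or to the probabilities.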

You will use the stochastic gradient descent algorithm (with learning rate set to 0.01) to optimize this model. The training loop is as follows:

```python
model = Baseline()

optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
loader = torch.utils.data.DataLoader(list(zip(X_train, y_train)), shuffle=True, batch_size=100)

n_epochs = 10
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    y_pred = model(X_test)
    acc = (torch.argmax(y_pred, 1) == y_test).float().mean()
    print("Epoch %d: model accuracy %.2f%%" % (epoch, acc*100))
```

The MNIST dataset is small. This example should complete in about a minute, with output like that shown below. This simple network reaches about 92% accuracy.

```
Epoch 0: model accuracy 84.11%
Epoch 1: model accuracy 87.53%
Epoch 2: model accuracy 89.01%
Epoch 3: model accuracy 89.76%
Epoch 4: model accuracy 90.29%
Epoch 5: model accuracy 90.69%
Epoch 6: model accuracy 91.10%
Epoch 7: model accuracy 91.48%
Epoch 8: model accuracy 91.74%
Epoch 9: model accuracy 91.96%
```

Below is the complete code for the above multilayer perceptron classification on the MNIST dataset.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision

# Load MNIST data
train = torchvision.datasets.MNIST('data', train=True, download=True)
test = torchvision.datasets.MNIST('data', train=False, download=True)

# each sample becomes a vector of values 0-1
X_train = train.data.reshape(-1, 784).float() / 255.0
y_train = train.targets
X_test = test.data.reshape(-1, 784).float() / 255.0
y_test = test.targets

class Baseline(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 784)
        self.act1 = nn.ReLU()
        self.layer2 = nn.Linear(784, 10)

    def forward(self, x):
        x = self.act1(self.layer1(x))
        x = self.layer2(x)
        return x

model = Baseline()

optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
loader = torch.utils.data.DataLoader(list(zip(X_train, y_train)), shuffle=True, batch_size=100)

n_epochs = 10
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    y_pred = model(X_test)
    acc = (torch.argmax(y_pred, 1) == y_test).float().mean()
    print("Epoch %d: model accuracy %.2f%%" % (epoch, acc*100))
```

## Simple Convolutional Neural Network for MNIST

Now that you have seen how to use a multilayer perceptron model to classify the MNIST dataset, let’s move on to a convolutional neural network model. In this section, you will create a simple CNN for MNIST that demonstrates how to use all the aspects of a modern CNN implementation, including convolutional layers, pooling layers, and dropout layers.

In PyTorch, convolutional layers are supposed to work on images. Tensors for images should hold the pixel values with the dimensions (sample, channel, height, width), but when you load images using libraries such as PIL, the pixels are usually presented as an array of dimensions (height, width, channel). The conversion to the proper tensor format can be done using a transform from the `torchvision` library.

```python
...
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0,), (128,)),
])
train = torchvision.datasets.MNIST('data', train=True, download=True, transform=transform)
test = torchvision.datasets.MNIST('data', train=False, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(train, shuffle=True, batch_size=100)
testloader = torch.utils.data.DataLoader(test, shuffle=True, batch_size=100)
```

You need to use `DataLoader` because the transform is applied when you read the data from the `DataLoader`.
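The layout change described above can be sketched in plain PyTorch. This is only an approximation of what a transform such as `ToTensor()` does for an 8-bit image, not torchvision’s actual implementation, and the random image and variable names are assumptions made for illustration.

```python
import torch

# A made-up grayscale image in (height, width, channel) layout with 0-255 pixel values
hwc = torch.randint(0, 256, (28, 28, 1), dtype=torch.uint8)

# Move the channel axis to the front and scale the values into the range 0 to 1
chw = hwc.permute(2, 0, 1).float() / 255.0
print(chw.shape)    # torch.Size([1, 28, 28])

# Stacking several such images adds the leading "sample" dimension,
# which is the job a DataLoader does when it assembles a batch
batch = torch.stack([chw, chw, chw])
print(batch.shape)  # torch.Size([3, 1, 28, 28])
```

The final shape, (sample, channel, height, width), is the layout that `nn.Conv2d` expects for batched input.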

Next, define your neural network model. Convolutional neural networks are more complex than standard multilayer perceptrons, so you will start with a simple structure that uses all the elements needed for state-of-the-art results. The network architecture is summarized below.

- The first hidden layer is a convolutional layer, `nn.Conv2d()`. The layer turns a grayscale image into 10 feature maps, with a filter size of 5×5 and a ReLU activation function. This is the input layer that expects images with the structure outlined above.
- Next is a pooling layer that takes the max, `nn.MaxPool2d()`. It is configured with a pool size of 2×2 with stride 1. It takes the maximum in a 2×2 pixel patch per channel and assigns the value to the output pixel. The result is a 27×27-pixel feature map per channel.
- The next layer is a regularization layer using dropout, `nn.Dropout()`. It is configured to randomly exclude 20% of neurons in the layer in order to reduce overfitting.
- Next is a layer that converts the 2D matrix data to a vector, using `nn.Flatten`. There are 10 channels from its input and each channel’s feature map has size 27×27. This layer allows the output to be processed by standard, fully connected layers.
- Next is a fully connected layer with 128 neurons. A ReLU activation function is used.
- Finally, the output layer has ten neurons for the ten classes. You can transform the output into probability-like predictions by applying a softmax function to it.

This model is trained using cross entropy loss and the Adam optimization algorithm. It is implemented as follows:

```python
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 10, kernel_size=5, stride=1, padding=2)
        self.relu1 = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=1)
        self.dropout = nn.Dropout(0.2)
        self.flat = nn.Flatten()
        self.fc = nn.Linear(27*27*10, 128)
        self.relu2 = nn.ReLU()
        self.output = nn.Linear(128, 10)

    def forward(self, x):
        x = self.relu1(self.conv(x))
        x = self.pool(x)
        x = self.dropout(x)
        x = self.relu2(self.fc(self.flat(x)))
        x = self.output(x)
        return x

model = CNN()

optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

n_epochs = 10
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in trainloader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    acc = 0
    count = 0
    for X_batch, y_batch in testloader:
        y_pred = model(X_batch)
        acc += (torch.argmax(y_pred, 1) == y_batch).float().sum()
        count += len(y_batch)
    acc = acc / count
    print("Epoch %d: model accuracy %.2f%%" % (epoch, acc*100))
```

Running the above takes a few minutes and produces the following:

```
Epoch 0: model accuracy 81.74%
Epoch 1: model accuracy 85.38%
Epoch 2: model accuracy 86.37%
Epoch 3: model accuracy 87.75%
Epoch 4: model accuracy 88.00%
Epoch 5: model accuracy 88.17%
Epoch 6: model accuracy 88.81%
Epoch 7: model accuracy 88.34%
Epoch 8: model accuracy 88.86%
Epoch 9: model accuracy 88.75%
```

This is not the best result, but it demonstrates how a convolutional layer works.

Below is the complete code for using the simple convolutional network.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision

# Load MNIST data
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0,), (128,)),
])
train = torchvision.datasets.MNIST('data', train=True, download=True, transform=transform)
test = torchvision.datasets.MNIST('data', train=False, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(train, shuffle=True, batch_size=100)
testloader = torch.utils.data.DataLoader(test, shuffle=True, batch_size=100)

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 10, kernel_size=5, stride=1, padding=2)
        self.relu1 = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=1)
        self.dropout = nn.Dropout(0.2)
        self.flat = nn.Flatten()
        self.fc = nn.Linear(27*27*10, 128)
        self.relu2 = nn.ReLU()
        self.output = nn.Linear(128, 10)

    def forward(self, x):
        x = self.relu1(self.conv(x))
        x = self.pool(x)
        x = self.dropout(x)
        x = self.relu2(self.fc(self.flat(x)))
        x = self.output(x)
        return x

model = CNN()

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

n_epochs = 10
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in trainloader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    acc = 0
    count = 0
    for X_batch, y_batch in testloader:
        y_pred = model(X_batch)
        acc += (torch.argmax(y_pred, 1) == y_batch).float().sum()
        count += len(y_batch)
    acc = acc / count
    print("Epoch %d: model accuracy %.2f%%" % (epoch, acc*100))
```

## LeNet5 for MNIST

The previous model has only one convolutional layer. Of course, you can add more to make a deeper model. One of the earliest demonstrations of the effectiveness of convolutional layers in neural networks is the “LeNet5” model. This model was developed to solve the MNIST classification problem. It has three convolutional layers and two fully connected layers, making up the five trainable layers that give the model its name.

At the time it was developed, using the hyperbolic tangent function as the activation was common. Hence it is used here. This model is implemented as follows:

```python
class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2)
        self.act1 = nn.Tanh()
        self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2)

        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0)
        self.act2 = nn.Tanh()
        self.pool2 = nn.AvgPool2d(kernel_size=2, stride=2)

        self.conv3 = nn.Conv2d(16, 120, kernel_size=5, stride=1, padding=0)
        self.act3 = nn.Tanh()

        self.flat = nn.Flatten()
        self.fc1 = nn.Linear(1*1*120, 84)
        self.act4 = nn.Tanh()
        self.fc2 = nn.Linear(84, 10)

    def forward(self, x):
        # input 1x28x28, output 6x28x28
        x = self.act1(self.conv1(x))
        # input 6x28x28, output 6x14x14
        x = self.pool1(x)
        # input 6x14x14, output 16x10x10
        x = self.act2(self.conv2(x))
        # input 16x10x10, output 16x5x5
        x = self.pool2(x)
        # input 16x5x5, output 120x1x1
        x = self.act3(self.conv3(x))
        # input 120x1x1, output 84
        x = self.act4(self.fc1(self.flat(x)))
        # input 84, output 10
        x = self.fc2(x)
        return x
```

Compared to the previous model, LeNet5 does not have a dropout layer (because dropout was invented several years after LeNet5) and uses average pooling instead of max pooling (i.e., for a patch of 2×2 pixels, it takes the average of the pixel values instead of the maximum). But the most notable characteristic of the LeNet5 model is that it uses strides and padding to reduce the image size from 28×28 pixels down to 1×1 pixel while increasing the number of channels from one (grayscale) to 120.

Padding means adding pixels of value 0 at the border of the image to make it a bit larger. Without padding, the output of a convolutional layer will be smaller than its input. The stride parameter controls how far the filter moves to produce the next pixel in the output. Usually it is 1, so that the output is not downsampled. If it is larger than 1, the output is a **downsampling** of the input. Hence you see that in the LeNet5 model, stride 2 is used in the pooling layers to turn, for example, a 28×28-pixel image into 14×14.
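The effect of kernel size, stride, and padding on the image size can be worked out with simple arithmetic. The sketch below uses the usual output size formula for convolution and pooling layers (assuming no dilation and rounding down) and the hypothetical helper `output_size()` to trace the LeNet5 layers from the code above.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28
trace = [size]
# (kernel, stride, padding) for conv1, pool1, conv2, pool2, conv3 in the LeNet5 code above
for kernel, stride, padding in [(5, 1, 2), (2, 2, 0), (5, 1, 0), (2, 2, 0), (5, 1, 0)]:
    size = output_size(size, kernel, stride, padding)
    trace.append(size)

print(trace)  # [28, 28, 14, 10, 5, 1]
```

The trace matches the shape comments in the `forward()` method: only the first convolution is padded enough to keep 28×28, and every later layer shrinks the image until a single pixel per channel remains.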

Training this model is the same as training the previous convolutional network model, as follows:

```python
...
model = LeNet5()

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

n_epochs = 10
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in trainloader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    acc = 0
    count = 0
    for X_batch, y_batch in testloader:
        y_pred = model(X_batch)
        acc += (torch.argmax(y_pred, 1) == y_batch).float().sum()
        count += len(y_batch)
    acc = acc / count
    print("Epoch %d: model accuracy %.2f%%" % (epoch, acc*100))
```

Running this you may see:

```
Epoch 0: model accuracy 89.46%
Epoch 1: model accuracy 93.14%
Epoch 2: model accuracy 94.69%
Epoch 3: model accuracy 95.84%
Epoch 4: model accuracy 96.43%
Epoch 5: model accuracy 96.99%
Epoch 6: model accuracy 97.14%
Epoch 7: model accuracy 97.66%
Epoch 8: model accuracy 98.05%
Epoch 9: model accuracy 98.22%
```

Here, we achieved accuracy beyond 98%.

The following is the complete code.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision

# Load MNIST data
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0,), (128,)),
])
train = torchvision.datasets.MNIST('data', train=True, download=True, transform=transform)
test = torchvision.datasets.MNIST('data', train=False, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(train, shuffle=True, batch_size=100)
testloader = torch.utils.data.DataLoader(test, shuffle=True, batch_size=100)

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2)
        self.act1 = nn.Tanh()
        self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2)

        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0)
        self.act2 = nn.Tanh()
        self.pool2 = nn.AvgPool2d(kernel_size=2, stride=2)

        self.conv3 = nn.Conv2d(16, 120, kernel_size=5, stride=1, padding=0)
        self.act3 = nn.Tanh()

        self.flat = nn.Flatten()
        self.fc1 = nn.Linear(1*1*120, 84)
        self.act4 = nn.Tanh()
        self.fc2 = nn.Linear(84, 10)

    def forward(self, x):
        # input 1x28x28, output 6x28x28
        x = self.act1(self.conv1(x))
        # input 6x28x28, output 6x14x14
        x = self.pool1(x)
        # input 6x14x14, output 16x10x10
        x = self.act2(self.conv2(x))
        # input 16x10x10, output 16x5x5
        x = self.pool2(x)
        # input 16x5x5, output 120x1x1
        x = self.act3(self.conv3(x))
        # input 120x1x1, output 84
        x = self.act4(self.fc1(self.flat(x)))
        # input 84, output 10
        x = self.fc2(x)
        return x

model = LeNet5()

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

n_epochs = 10
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in trainloader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    acc = 0
    count = 0
    for X_batch, y_batch in testloader:
        y_pred = model(X_batch)
        acc += (torch.argmax(y_pred, 1) == y_batch).float().sum()
        count += len(y_batch)
    acc = acc / count
    print("Epoch %d: model accuracy %.2f%%" % (epoch, acc*100))
```

## Resources on MNIST

The MNIST dataset is very well studied. Below are some additional resources you might want to look into.

- Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits.
- Rodrigo Benenson. What is the class of this image? Classification datasets results, 2016.
- Digit Recognizer: Learn computer vision fundamentals with the famous MNIST data. Kaggle.
- Hubert Eichner. Neural Net for Handwritten Digit Recognition in JavaScript.

## Summary

In this post, you discovered the MNIST handwritten digit recognition problem and deep learning models developed in Python using the PyTorch library that are capable of achieving excellent results. Working through this post, you learned:

- How to load the MNIST dataset in PyTorch with torchvision
- How to convert the MNIST dataset into PyTorch tensors for consumption by a convolutional neural network
- How to use PyTorch to create convolutional neural network models for MNIST
- How to implement the LeNet5 model for MNIST classification
