Understanding the Design of a Convolutional Neural Network

Last Updated on August 6, 2022

Convolutional neural networks have proven successful in computer vision applications. Many network architectures have been proposed, and they are neither magical nor hard to understand.

In this tutorial, you will make sense of the operation of convolutional layers and their role in a larger convolutional neural network.

After finishing this tutorial, you will learn:

  • How convolutional layers extract features from an image
  • How different convolutional layers can stack up to build a neural network

Let’s get started.

Understanding the design of a convolutional neural network
Photo by Kin Shing Lai. Some rights reserved.


This article is divided into three sections; they are:

  • An Example Network
  • Showing the Feature Maps
  • Effect of the Convolutional Layers

An Example Network

The following is a program to do image classification on the CIFAR-10 dataset:
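The original code listing is not reproduced here, but a minimal sketch consistent with the description (32×32 RGB inputs, 10 integer-labeled classes, convolutional layers interleaved with dropout and pooling, then fully connected layers) could look like the following. The exact filter counts, dropout rates, and training settings are assumptions, not necessarily the article's original values:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model():
    # Convolutional blocks with dropout and pooling, then dense layers
    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Dropout(0.25),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(512, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def train(model, epochs=25):
    # Download CIFAR-10 and train; labels are integers 0 to 9
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model.fit(x_train, y_train, epochs=epochs,
              validation_data=(x_test, y_test))

model = build_model()
# train(model)  # uncomment to download the dataset and train the network
```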

This network should achieve around 70% classification accuracy. The images are 32×32 pixels in RGB color, and they fall into 10 different classes, with labels as integers from 0 to 9.

You can print the network using Keras’s summary() function:
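The call itself is a one-liner; the model below is only a minimal stand-in for the article's network, so the names and shapes in its summary are illustrative:

```python
from tensorflow.keras import layers, models

# A minimal stand-in for the article's network, for illustration only
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

model.summary()  # prints each layer's name, output shape, and parameter count
```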

For this network, the following will be printed to the screen:

A network for image classification typically comprises convolutional layers at an early stage, with dropout and pooling layers interleaved. Then, at a later stage, the output from the convolutional layers is flattened and processed by some fully connected layers.

Showing the Feature Maps

In the above network, there are two convolutional layers (Conv2D). The first layer is defined as follows:
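The original definition is not shown here, but a layer matching the description would be something like the following (the activation function is an assumption):

```python
from tensorflow.keras import layers

# First convolutional layer: 3x3 kernel, 32 output feature maps
layer = layers.Conv2D(32, (3, 3), activation="relu")
layer.build((None, 32, 32, 3))  # input: 32x32 pixels, 3 RGB channels
print(layer.kernel.shape)  # kernel shape is (3, 3, 3, 32)
```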

This means the convolutional layer will have a 3×3 kernel and apply it to an input image of 32×32 pixels with three channels (the RGB colors). Therefore, the output of this layer will have 32 channels.

In order to make sense of the convolutional layer, you can check out its kernel. The variable model holds the network, and you can find the kernel of the first convolutional layer with the following:
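The original snippet is omitted here; assuming the first layer of `model` is the convolutional layer, the kernel can be read like this (a stand-in model is built inline so the sketch is self-contained):

```python
from tensorflow.keras import layers, models

# Stand-in model whose first layer is the convolutional layer of interest
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
])

# The kernel of the first convolutional layer, as a NumPy array
kernel = model.layers[0].get_weights()[0]
print(model.layers[0].name, kernel.shape)
```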

This prints:

You can tell that model.layers[0] is the correct layer by comparing the name conv2d from the above output to the output of model.summary(). This layer has a kernel of the shape (3, 3, 3, 32), which are the height, width, input channels, and output feature maps, respectively.

Assume the kernel is a NumPy array k. A convolutional layer will take its kernel k[:, :, 0, n] (a 3×3 array) and apply it to the first channel of the image, then apply k[:, :, 1, n] to the second channel, and so on. Afterward, the results of the convolution on all the channels are added up to become feature map n of the output, where n, in this case, runs from 0 to 31 for the 32 output feature maps.
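This per-channel convolve-and-sum can be sketched in plain NumPy (ignoring the bias and activation, and using stride 1 with no padding):

```python
import numpy as np

def feature_map(image, k, n):
    """Feature map n: correlate each input channel with its own 3x3 slice
    of the kernel, then sum over the channels (stride 1, no padding)."""
    H, W, C = image.shape
    out = np.zeros((H - 2, W - 2))
    for c in range(C):                     # one 3x3 filter per input channel
        for i in range(H - 2):
            for j in range(W - 2):
                out[i, j] += np.sum(image[i:i+3, j:j+3, c] * k[:, :, c, n])
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # a 32x32 "RGB" image
k = rng.random((3, 3, 3, 32))              # kernel of shape (3, 3, 3, 32)
fmap = feature_map(image, k, 0)
print(fmap.shape)  # (30, 30): valid convolution shrinks each side by 2
```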

In Keras, you can extract the output of each layer using an extractor model. In the following, you will create a batch with one input image and send it to the network. Then look at the feature maps of the first convolutional layer:
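A sketch of that extractor pattern, using a random array in place of a real CIFAR-10 image and a stand-in model (plotting the maps with matplotlib's imshow is left out):

```python
import numpy as np
from tensorflow.keras import Model, layers, models

# Stand-in network with two convolutional layers
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
])

# Extractor model: same input, but returns every layer's output
extractor = Model(inputs=model.inputs,
                  outputs=[layer.output for layer in model.layers])

# A batch with a single (random) 32x32 RGB image
batch = np.random.random((1, 32, 32, 3)).astype("float32")
features = extractor(batch)

# Feature maps of the first convolutional layer: 32 channels of 30x30 pixels
print(features[0].shape)
```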

The above code will print the feature maps like the following:

This corresponds to the following input image:

You can see why they are called feature maps: they highlight certain features of the input image. A feature is identified using a small window (in this case, a 3×3-pixel filter). The input image has three color channels; each channel has a different filter applied, and their results are combined into an output feature map.

You can similarly display the feature map from the output of the second convolutional layer as follows:
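With the same extractor idea, the second convolutional layer's output is simply a different layer's output tensor; a self-contained stand-in sketch:

```python
import numpy as np
from tensorflow.keras import Model, layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
])

# Pick out the second Conv2D layer and build an extractor for its output
conv_layers = [l for l in model.layers if isinstance(l, layers.Conv2D)]
extractor = Model(inputs=model.inputs, outputs=conv_layers[1].output)

batch = np.random.random((1, 32, 32, 3)).astype("float32")
maps = extractor(batch)
print(maps.shape)  # fewer pixels per map after the pooling layer
```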

This shows the following:

From the above, you can see that the features extracted are more abstract and less recognizable.

Effect of the Convolutional Layers

The most important hyperparameter of a convolutional layer is the size of the filter. It is usually square, and you can think of it as a window, or receptive field, onto the input image. Therefore, the higher the resolution of the image, the larger a filter you can expect to use.

On the other hand, a filter that is too large will blur detailed features because all pixels in the receptive field are combined into one pixel of the output feature map. There is therefore a trade-off in choosing an appropriate filter size.

Stacking two convolutional layers (without any other layers in between) yields a receptive field equivalent to that of a single convolutional layer with a larger filter. But a typical design nowadays is two layers with small filters stacked together rather than one layer with a larger filter, as there are fewer parameters to train.
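The parameter saving is easy to check with a little arithmetic. Assuming C input and output channels and ignoring biases, two stacked 3×3 layers cost 2·(3·3·C·C) weights, while one 5×5 layer covering the same receptive field costs 5·5·C·C:

```python
C = 64  # hypothetical number of input and output channels

two_3x3 = 2 * (3 * 3 * C * C)   # two stacked 3x3 layers: 18*C*C weights
one_5x5 = 5 * 5 * C * C         # one 5x5 layer:          25*C*C weights
print(two_3x3, one_5x5)  # 73728 102400
```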

The exception is a convolutional layer with a 1×1 filter. This is usually found as the beginning layer of a network. The purpose of such a layer is to combine the input channels rather than to transform the pixels spatially. Conceptually, this can convert a color image into grayscale, but usually you use multiple such combinations to create more input channels for the network than merely RGB.
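A 1×1 convolution mixes channels pixel by pixel; for example, reducing three RGB channels to a single channel is roughly a learned grayscale conversion (a random input array is used here for illustration):

```python
import numpy as np
from tensorflow.keras import layers

# 1x1 convolution: combines the input channels at each pixel independently
conv1x1 = layers.Conv2D(1, (1, 1))

image = np.random.random((1, 32, 32, 3)).astype("float32")
out = conv1x1(image)
print(out.shape)  # spatial size unchanged, channels reduced from 3 to 1
```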

Also, note that the above network uses Conv2D for a 2D filter. There is also a Conv3D layer for a 3D filter. The difference is whether you apply a filter separately to each channel or feature map, or consider the input feature maps stacked up as a 3D array and apply a single filter to transform them altogether. Usually the former is used, as it is more reasonable to assume no particular order in which the feature maps should be stacked.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.




Summary

In this post, you have seen how to visualize the feature maps of a convolutional neural network and how the network extracts features from an image.

Specifically, you learned:

  • The structure of a typical convolutional neural network
  • What is the effect of the filter size on a convolutional layer
  • What is the effect of stacking convolutional layers in a network
