Convolutional Neural Networks are a powerful artificial neural network technique.

These networks preserve the spatial structure of the problem and were developed for object recognition tasks such as handwritten digit recognition. They are popular because people are achieving state-of-the-art results on difficult computer vision and natural language processing tasks.

In this post you will discover Convolutional Neural Networks for deep learning, also called ConvNets or CNNs. After completing this crash course you will know:

- The building blocks used in CNNs, such as convolutional layers and pooling layers.
- How the building blocks fit together with a short worked example.
- Best practices for configuring CNNs on your own object recognition tasks.
- References for state-of-the-art networks applied to complex machine learning problems.

Let’s get started.

## The Case for Convolutional Neural Networks

Given a dataset of grayscale images with a standardized size of 32×32 pixels each, every neuron in the first hidden layer of a traditional feedforward neural network would require 1024 input weights (plus one bias).

This is fair enough, but the flattening of the image matrix of pixels to a long vector of pixel values loses all of the spatial structure in the image. Unless all of the images are perfectly resized, the neural network will have great difficulty with the problem.
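To make the flattening concrete, here is a minimal sketch. It assumes row-major flattening, and the helper name `flat_index` is purely illustrative. It shows the per-neuron weight count and how two vertically adjacent pixels end up far apart in the flattened vector.

```python
WIDTH, HEIGHT = 32, 32  # image size used in the example above


def flat_index(row, col, width=WIDTH):
    """Position of pixel (row, col) after row-major flattening (an assumption)."""
    return row * width + col


n_inputs = WIDTH * HEIGHT            # 1024 pixel values per image
weights_per_neuron = n_inputs + 1    # 1024 input weights plus one bias

# Two pixels that touch vertically in the image are 32 positions apart
# in the flattened vector, so their neighborhood relationship is hidden.
gap = flat_index(11, 5) - flat_index(10, 5)

print(n_inputs, weights_per_neuron, gap)  # 1024 1025 32
```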

Convolutional Neural Networks expect and preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data. Features are learned and used across the whole image, allowing the objects in the images to be shifted or translated in the scene and still be detectable by the network.

This is why the network is so useful for object recognition in photographs, picking out digits, faces, objects and so on with varying orientation.

In summary, below are some benefits of using convolutional neural networks:

- They use fewer parameters (weights) to learn than a fully connected network.
- They are designed to be invariant to object position and distortion in the scene.
- They automatically learn and generalize features from the input domain.

## Building Blocks of Convolutional Neural Networks

There are three types of layers in a Convolutional Neural Network:

- Convolutional Layers.
- Pooling Layers.
- Fully-Connected Layers.

### 1. Convolutional Layers

Convolutional layers are comprised of filters and feature maps.

#### Filters

The filters are the “neurons” of the layer. They have input weights and output a value. The input size is a fixed square called a patch or a receptive field.

If the convolutional layer is an input layer, then the input patch will be pixel values. If the layer is deeper in the network architecture, then it will take input from a feature map from the previous layer.

#### Feature Maps

The feature map is the output of one filter applied to the previous layer.

A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron and the output is collected in the feature map. You can see that if the receptive field is moved one pixel from activation to activation, then the field will overlap with the previous activation by (field width – 1) input values.
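The sliding of a filter can be sketched in plain Python. This is an illustration only: the function name `convolve2d_valid`, the toy 4×4 input and the hand-picked 2×2 weights are all assumptions, the bias and activation function are left out, and the filter is applied without flipping (as deep learning libraries typically do).

```python
def convolve2d_valid(image, kernel, stride=1):
    """Slide a square `kernel` over `image` (lists of lists), no padding."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r0, c0 = i * stride, j * stride  # top-left corner of the patch
            # Weighted sum of the patch: one activation of the "neuron".
            total = sum(
                image[r0 + a][c0 + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 input
kernel = [[1, 0], [0, 1]]                                  # toy 2x2 filter

fmap = convolve2d_valid(image, kernel)
print(fmap)  # [[5, 7, 9], [13, 15, 17], [21, 23, 25]]
```

With a stride of 1, each 2-wide patch shares one column with the previous patch, which matches the (field width - 1) overlap described above.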

#### Stride and Zero Padding

The distance that the filter is moved across the input from the previous layer for each activation is referred to as the stride.

If the size of the previous layer is not cleanly divisible by the size of the filter’s receptive field and the size of the stride, then it is possible for the receptive field to attempt to read off the edge of the input feature map. In this case, techniques like zero padding can be used to invent mock inputs for the receptive field to read.
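The interaction between input size, receptive field, stride and padding can be checked with a small helper. This sketch uses the standard output-size formula, `(input + 2 * padding - field) / stride + 1`, and assumes the same padding is added to both sides. The function name `conv_output_size` is illustrative and does not come from any library.

```python
def conv_output_size(input_size, field_size, stride=1, padding=0):
    """Number of filter positions along one dimension.

    Raises ValueError when the receptive field would read off the edge,
    which is the situation zero padding is meant to fix.
    """
    span = input_size + 2 * padding - field_size
    if span < 0:
        raise ValueError("receptive field is larger than the padded input")
    if span % stride != 0:
        raise ValueError("receptive field would read off the edge of the input")
    return span // stride + 1


no_padding = conv_output_size(32, 5)                        # 28
same_size = conv_output_size(32, 5, padding=2)              # 32
padded_fit = conv_output_size(32, 4, stride=3, padding=1)   # 11

print(no_padding, same_size, padded_fit)  # 28 32 11
```

Without the padding, the last call (`conv_output_size(32, 4, stride=3)`) raises an error, because a 4-wide field moved 3 inputs at a time does not land exactly on the edge of a 32-wide input.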

### 2. Pooling Layers

The pooling layers down-sample the previous layer’s feature map.

Pooling layers follow a sequence of one or more convolutional layers and are intended to consolidate the features learned and expressed in the previous layer’s feature map. As such, pooling may be considered a technique to compress or generalize feature representations and generally reduce the overfitting of the training data by the model.

They too have a receptive field, often much smaller than that of the convolutional layer. Also, the stride, or number of inputs that the receptive field is moved for each activation, is often equal to the size of the receptive field to avoid any overlap.

Pooling layers are often very simple, taking the average or the maximum of the input values in each receptive field in order to create their own feature map.
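Both kinds of pooling can be sketched with one small function. The name `pool2d`, the helper `mean` and the toy 4×4 feature map below are assumptions made for illustration.

```python
def pool2d(feature_map, size=2, stride=2, op=max):
    """Down-sample `feature_map` by applying `op` to each size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[r + a][c + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(op(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

max_pooled = pool2d(fmap)            # [[4, 2], [2, 8]]
avg_pooled = pool2d(fmap, op=mean)   # [[2.5, 1.0], [1.25, 6.5]]
print(max_pooled, avg_pooled)
```

Because the stride equals the window size, the windows do not overlap and the 4×4 map shrinks to 2×2.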

### 3. Fully Connected Layers

Fully connected layers are the normal flat feedforward neural network layers.

These layers may have a non-linear activation function or a softmax activation in order to output probabilities of class predictions.
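As a rough sketch of how a softmax activation turns raw output scores into class probabilities, consider the following. The three scores are made up for illustration, and subtracting the maximum score is a common numerical-stability trick rather than something the post prescribes.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print([round(p, 3) for p in probs])
```

The largest score receives the largest probability, and the outputs can be read directly as class predictions.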

Fully connected layers are used at the end of the network after feature extraction and consolidation have been performed by the convolutional and pooling layers. They are used to create final non-linear combinations of features and for making predictions by the network.

## Worked Example of a Convolutional Neural Network

You now know about convolutional, pooling and fully connected layers. Let’s make this more concrete by working through how these three layers may be connected together.

### 1. Image Input Data

Let’s assume we have a dataset of grayscale images. Each image has the same size of 32 pixels wide and 32 pixels high, and pixel values are between 0 and 255, i.e. a matrix of 32x32x1 or 1024 pixel values.

Image input data is expressed as a 3-dimensional matrix of width * height * channels. If we were using color images in our example, we would have 3 channels for the red, green and blue pixel values, e.g. 32x32x3.

### 2. Convolutional Layer

We define a convolutional layer with 10 filters and a receptive field 5 pixels wide and 5 pixels high and a stride length of 1.

Because each filter can only get input from (i.e. “see”) 5×5 (25) pixels at a time, we can calculate that each will require 25 + 1 input weights (plus 1 for the bias input).

Dragging the 5×5 receptive field across the input image data with a stride width of 1 will result in a feature map of 28×28 output values or 784 distinct activations per image.

We have 10 filters, so that is 10 different 28×28 feature maps or 7,840 outputs that will be created for one image.

Finally, we know we have 26 inputs per filter, 10 filters and 28×28 output values to calculate per filter. Therefore, we have a total of 26x10x28x28 or 203,840 “connections” in our convolutional layer, if we want to phrase it using traditional neural network nomenclature.
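The arithmetic for this layer can be reproduced in a few lines. The variable names are illustrative. The last line is an addition to the numbers in the text: because each filter reuses the same weights at every position, the count of distinct weights to learn is much smaller than the count of connections.

```python
input_size = 32   # image width and height in pixels
field = 5         # receptive field width and height
stride = 1
n_filters = 10

weights_per_filter = field * field + 1             # 25 weights + 1 bias = 26
map_size = (input_size - field) // stride + 1      # 28
activations_per_map = map_size * map_size          # 784
total_outputs = n_filters * activations_per_map    # 7,840
connections = weights_per_filter * total_outputs   # 203,840

# Weights are shared across positions, so only these need to be learned.
learnable_weights = weights_per_filter * n_filters  # 260

print(map_size, activations_per_map, total_outputs, connections, learnable_weights)
```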

Convolutional layers also make use of a nonlinear transfer function as part of activation and the rectifier activation function is the popular default to use.

### 3. Pooling Layer

We define a pooling layer with a receptive field with a width of 2 inputs and a height of 2 inputs. We also use a stride of 2 to ensure that there is no overlap.

This results in feature maps that are half the width and half the height of the input feature maps: the 10 different 28×28 feature maps given as input become 10 different 14×14 feature maps as output.

We will use a max() operation for each receptive field so that the activation is the maximum input value.

### 4. Fully Connected Layer

Finally, we can flatten out the square feature maps into a traditional flat fully connected layer.

We can define the fully connected layer with 200 hidden neurons, each with 10x14x14 input connections, or 1960 + 1 weights per neuron. That is a total of 392,200 connections and weights to learn in this layer.
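The same kind of check works for the fully connected layer. This sketch starts from the 10 feature maps of 28×28 produced by the convolutional layer and the 2×2 pooling described above; the variable names are illustrative.

```python
n_maps = 10       # feature maps produced by the convolutional layer
map_size = 28     # width and height of each map before pooling
pool = 2          # pooling receptive field and stride
n_hidden = 200    # neurons in the fully connected layer

pooled_size = map_size // pool                          # 14
inputs_per_neuron = n_maps * pooled_size * pooled_size  # 1,960
weights_per_neuron = inputs_per_neuron + 1              # 1,961 with the bias
total_weights = n_hidden * weights_per_neuron           # 392,200

print(pooled_size, inputs_per_neuron, weights_per_neuron, total_weights)
```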

We can use a sigmoid or softmax transfer function to output probabilities of class values directly.

## Convolutional Neural Networks Best Practices

Now that we know about the building blocks for a convolutional neural network and how the layers hang together, we can review some best practices to consider when applying them.

- **Input Receptive Field Dimensions**: The default is 2D for images, but could be 1D such as for words in a sentence or 3D for video that adds a time dimension.
- **Receptive Field Size**: The patch should be as small as possible, but large enough to “see” features in the input data. It is common to use 3×3 on small images and 5×5 or 7×7 and more on larger image sizes.
- **Stride Width**: Use the default stride of 1. It is easy to understand and you don’t need padding to handle the receptive field falling off the edge of your images. This could be increased to 2 or larger for larger images.
- **Number of Filters**: Filters are the feature detectors. Generally fewer filters are used at the input layer and increasingly more filters are used at deeper layers.
- **Padding**: The mock inputs read beyond the edge of the input data are set to zero, which is why this is called zero padding. This is useful when you cannot or do not want to standardize input image sizes or when you want to use receptive field and stride sizes that do not neatly divide up the input image size.
- **Pooling**: Pooling is a destructive or generalization process to reduce overfitting. The receptive field is almost always set to 2×2 with a stride of 2 to discard 75% of the activations from the output of the previous layer.
- **Data Preparation**: Consider standardizing input data, both the dimensions of the images and pixel values.
- **Pattern Architecture**: It is common to pattern the layers in your network architecture. This might be one, two or some number of convolutional layers followed by a pooling layer. This structure can then be repeated one or more times. Finally, fully connected layers are often only used at the output end and may be stacked one, two or more deep.
- **Dropout**: CNNs have a habit of overfitting, even with pooling layers. Dropout should be used, such as between fully connected layers and perhaps after pooling layers.
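To see how a repeated pattern of convolutional and pooling layers changes the data, the sketch below traces the shape through a hypothetical stack. The layer tuples, the filter counts (16 then 32) and the helper `trace_shapes` are assumptions made for illustration; convolutions use a stride of 1 with no padding and pooling uses a stride equal to its size.

```python
def trace_shapes(input_shape, layers):
    """Return the (height, width, channels) shape after each layer.

    `layers` holds ("conv", n_filters, field_size) or ("pool", size) tuples.
    """
    h, w, c = input_shape
    shapes = [(h, w, c)]
    for layer in layers:
        if layer[0] == "conv":
            _, n_filters, field = layer
            # Stride 1, no padding: each dimension shrinks by field - 1.
            h, w, c = h - field + 1, w - field + 1, n_filters
        elif layer[0] == "pool":
            _, size = layer
            # Non-overlapping windows: each dimension is divided by size.
            h, w = h // size, w // size
        else:
            raise ValueError("unknown layer type: %r" % (layer[0],))
        shapes.append((h, w, c))
    return shapes


# Two convolutional layers followed by a pooling layer, repeated twice,
# with more filters in the deeper block.
layers = [
    ("conv", 16, 3), ("conv", 16, 3), ("pool", 2),
    ("conv", 32, 3), ("conv", 32, 3), ("pool", 2),
]

shapes = trace_shapes((32, 32, 1), layers)
for shape in shapes:
    print(shape)
# Final shape is (5, 5, 32), i.e. 800 values to flatten for the
# fully connected layers.
```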

Do you know about some more best practices for using CNNs?

Let me know in the comments.

## Further Reading on Convolutional Neural Networks

You have only scratched the surface on convolutional neural networks. The field is moving very fast and new and interesting architectures and techniques are being discussed and used all the time.

If you are looking for a deeper understanding of the technique, take a look at LeCun et al.’s seminal paper titled “Gradient-Based Learning Applied to Document Recognition” [PDF]. In it they introduce LeNet applied to handwritten digit recognition and carefully explain the layers and how the network is connected.

There are a lot of tutorials and discussions of CNNs around the web. A few choice examples are listed below. Personally, I find the explanatory pictures in the posts useful only after understanding how the network hangs together. Many of the explanations are confusing and defer to LeCun’s paper if in doubt.

- Convolutional Networks in DeepLearning4J
- Convolutional Networks model in the Stanford CS231n course
- Convolutional Networks and Applications in Vision [PDF]
- Chapter 6 in Michael Nielsen’s open Deep Learning book
- VGG Convolutional Neural Networks Practical from Oxford
- Understanding Convolutional Neural Networks for NLP by Denny Britz

## Summary

In this post you discovered convolutional neural networks. You learned about:

- Why CNNs are needed to preserve spatial structure in your input data and the benefits they provide.
- The building blocks of CNNs including convolutional, pooling and fully connected layers.
- How the layers in a CNN hang together.
- Best practices when applying CNNs to your own problems.

Do you have any questions about convolutional neural networks or about this post? Ask your questions in the comments and I will do my best to answer.

Can CNN be used for non-2D problem such as a general regression problem? If so, could you give an example code?

Perhaps ana.

You may be able to use a CNN in front of an LSTM to learn the spatial structure of sequence data.

There is a related example of sequence classification here that you could use as a starting point:

http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

Hi, I need some advice. Do you think LeCun’s paper “Gradient-Based Learning Applied to Document Recognition” is worth reading to gain knowledge of image recognition tasks such as hand gestures? Or is it only suitable for document recognition? It is a 46-page paper, so I first want to make sure I am reading the right paper. Thank you!

Generally, I would recommend reading on material directly related to your problem, even reproducing existing work to help you get a “working result” as fast as possible, before you attempt to improve upon it.

I see, thank you mister!

Hi Jason, I was just wondering why did you choose 200 neurons for the fully-connected layer? Does the number of neurons here correspond to the number of classes you’re looking to classify? Thanks!

Hi Amy, great question.

I chose the number of neurons after some trial and error. There are no good theories for configuring neural networks, it’s the “art” or empirical part of the discipline that causes a lot of difficulty.

Ah, I see. So I guess it’s a matter of gaining practice building these networks.

Just to confirm, is it correct that the output layer should still correspond to the number of classes we’re looking to classify?

Yes correct Amy on both counts.

Hi Jason,

Can you suggest how can I take 2 channel output from CNN? I mean output should be 2*64*64 where 64*64 is the dimensions of the image and 2 are the number of channels. I am new to cnn so please help me in this. I can make h*w 2 dim output but for only 1 channel I do not understand how can I do it for 2 channels.

This tutorial might provide a good starting point that you can adapt from 3 channels to 2:

http://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/

Thanks for the great explanation.

I am new and trying to build CNN to classify 4 patterns from 32×32 binary images. I have 3000 samples for each pattern. I used keras in python.

My model has:

1. 64 filters 3×3, max pool 2×2
2. 128 filters 3×3, max pool 2×2
3. 256 filters 3×3, max pool 2×2
4. Fully connected layer with 256 outputs
5. Output layer with 4 outputs

But I get underfitting, with training accuracy around 65% and validation accuracy 85%.

Can you tell me what I can do to improve my model?

Thank you for your time.

Moyo

See this post full of ideas on how to get better performance:

http://machinelearningmastery.com/improve-deep-learning-performance/