Crash Course in Convolutional Neural Networks for Machine Learning

Convolutional neural networks are a powerful artificial neural network technique.

These networks preserve the spatial structure of the problem and were developed for object recognition tasks such as handwritten digit recognition. They are popular because people can achieve state-of-the-art results on challenging computer vision and natural language processing tasks.

In this post, you will discover convolutional neural networks for deep learning, also called ConvNets or CNNs. After completing this crash course, you will know:

  • The building blocks used in CNNs, such as convolutional layers and pooling layers
  • How the building blocks fit together with a short worked example
  • Best practices for configuring CNNs on your object recognition tasks
  • References for state-of-the-art networks applied to complex machine learning problems

Let’s get started.

Crash course in convolutional neural networks for machine learning
Photo by Bryan Ledgard, some rights reserved.

The Case for Convolutional Neural Networks

Given a dataset of grayscale images with the standardized size of 32×32 pixels each, a traditional feed-forward neural network would require 1,024 input weights (plus one bias) for every neuron in its first hidden layer.

This is fair enough, but the flattening of the image matrix of pixels to a long vector of pixel values loses all the spatial structure in the image. Unless all the images are perfectly resized, the neural network will have great difficulty with the problem.

Convolutional neural networks expect and preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data. Features are learned and used across the whole image, allowing the objects in the images to be shifted or translated in the scene but still detectable by the network.

This is why the network is so useful for object recognition in photographs, picking out digits, faces, objects, and so on wherever they appear in the scene.

In summary, below are some benefits of using convolutional neural networks:

  • They use fewer parameters (weights) to learn than a fully connected network
  • They are designed to be invariant to object position and distortion in the scene
  • They automatically learn and generalize features from the input domain
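
To make the first benefit concrete, the minimal sketch below compares the number of weights in a single layer applied to the 32×32 grayscale images discussed above. The layer sizes (ten neurons, ten 5×5 filters) are illustrative assumptions chosen for the comparison, not a recommendation.

```python
# Weights needed by one layer on a 32x32 grayscale image.
# The layer sizes used below are illustrative assumptions.

def dense_weights(n_inputs, n_neurons):
    """Each fully connected neuron has one weight per input plus one bias."""
    return (n_inputs + 1) * n_neurons

def conv_weights(field_size, n_channels, n_filters):
    """Each filter reuses one small set of weights (plus a bias) across the image."""
    return (field_size * field_size * n_channels + 1) * n_filters

n_pixels = 32 * 32
dense_total = dense_weights(n_pixels, 10)   # 10 fully connected neurons
conv_total = conv_weights(5, 1, 10)         # 10 filters of 5x5 on 1 channel

print(dense_total)  # 10250
print(conv_total)   # 260
```

The saving comes from weight sharing: a filter's 26 weights are reused at every position in the image rather than learned separately for each pixel.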

Building Blocks of Convolutional Neural Networks

There are three types of layers in a convolutional neural network:

  1. Convolutional Layers
  2. Pooling Layers
  3. Fully Connected Layers

1. Convolutional Layers

Convolutional layers are composed of filters and feature maps.

Filters

The filters are the “neurons” of the layer. They take weighted inputs and output a value. The input size is a fixed square called a patch or a receptive field.

If the convolutional layer is an input layer, then the input patch will be the pixel values. If deeper in the network architecture, then the convolutional layer will take input from a feature map from the previous layer.

Feature Maps

The feature map is the output of one filter applied to the previous layer.

A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in the activation of the neuron, and the output is collected in the feature map. You can see that if the receptive field is moved one pixel from activation to activation, then the field will overlap with the previous activation by (field width – 1) input values.
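
The sliding of a filter across its input can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single-channel input, a stride of 1, no padding, no bias, and no activation function. The filter weights are hand-picked here, whereas in a real network they are learned during training. Like most deep learning libraries, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` one pixel at a time (stride 1, no padding)."""
    k = len(kernel)                      # square receptive field, k x k
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # Weighted sum of the patch currently under the receptive field.
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map

# A 4x4 input with a vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A hand-picked 2x2 filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
for r in feature_map:
    print(r)
# [0, 2, 0]
# [0, 2, 0]
# [0, 2, 0]
```

The 4×4 input yields a 3×3 feature map, and the filter activates only where the edge sits, regardless of which row it is applied to.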

Stride and Zero Padding

The distance that a filter is moved across the input from the previous layer for each activation is referred to as the stride.

If the receptive field and the stride do not fit the previous layer evenly (that is, if the input size minus the receptive field size is not cleanly divisible by the stride), then it is possible for the receptive field to attempt to read off the edge of the input feature map. In this case, techniques like zero padding can be used to invent mock inputs for the receptive field to read.
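
The relationship between input size, receptive field, stride, and padding can be written as output = (input + 2 × padding − field) / stride + 1 along each dimension. The helper below is a small sketch of that formula. The function name is hypothetical, and raising an error when the field does not fit evenly is a choice made here for illustration; libraries commonly round down or pad instead.

```python
def conv_output_size(input_size, field_size, stride=1, padding=0):
    """Number of positions a receptive field can take along one dimension."""
    span = input_size + 2 * padding - field_size
    if span < 0:
        raise ValueError("receptive field is larger than the padded input")
    if span % stride != 0:
        # The field would read off the edge; more padding or another stride is needed.
        raise ValueError("receptive field and stride do not fit the input evenly")
    return span // stride + 1

print(conv_output_size(32, 5))              # 28
print(conv_output_size(32, 5, padding=2))   # 32
print(conv_output_size(28, 2, stride=2))    # 14
```

With two pixels of zero padding on each side, a 5×5 field at stride 1 produces an output the same size as its 32×32 input.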

2. Pooling Layers

The pooling layers down-sample the previous layer’s feature map.

Pooling layers follow a sequence of one or more convolutional layers and are intended to consolidate the features learned and expressed in the previous layer’s feature map. As such, pooling may be considered a technique to compress or generalize feature representations and generally reduce the overfitting of the training data by the model.

They, too, have a receptive field, often much smaller than that of the convolutional layer. Also, the stride or number of inputs that the receptive field is moved for each activation is often equal to the size of the receptive field to avoid any overlap.

Pooling layers are often very simple, taking the average or the maximum of the input values in each receptive field to create their own feature maps.
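
As a minimal sketch of the idea, the function below applies max or average pooling to a single 2D feature map. The function names and the sample values are made up for illustration.

```python
def pool2d(feature_map, size=2, stride=2, op=max):
    """Down-sample a 2D feature map by applying `op` to each size x size window."""
    pooled = []
    for row in range(0, len(feature_map) - size + 1, stride):
        pooled_row = []
        for col in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[row + i][col + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(op(window))
        pooled.append(pooled_row)
    return pooled

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 8, 4],
    [2, 0, 6, 2],
]
print(pool2d(feature_map))            # [[7, 2], [2, 8]]
print(pool2d(feature_map, op=mean))   # [[4.0, 1.0], [1.0, 5.0]]
```

With a 2×2 window and a stride of 2, the windows do not overlap and the 4×4 map shrinks to 2×2.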

3. Fully Connected Layers

Fully connected layers are the normal flat feed-forward neural network layer.

These layers may have a nonlinear activation function or a softmax activation in order to output probabilities of class predictions.

Fully connected layers are used at the end of the network after feature extraction and consolidation have been performed by the convolutional and pooling layers. They are used to create final nonlinear combinations of features and for making predictions by the network.
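
The sketch below shows one fully connected layer followed by a softmax, which turns the layer's raw outputs into class probabilities. The feature values, weights, and biases are hypothetical numbers picked only to make the example runnable.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: every output neuron sees every input."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical flattened features and weights for a 3-class output layer.
features = [0.5, -1.0, 2.0]
weights = [[0.2, 0.1, 0.4], [-0.3, 0.8, 0.1], [0.5, -0.2, -0.1]]
biases = [0.0, 0.1, -0.1]

probabilities = softmax(dense(features, weights, biases))
print(probabilities)
```

The class with the largest raw score receives the largest probability, and the probabilities across all classes sum to 1.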

Worked Example of a Convolutional Neural Network

You now know about convolutional, pooling, and fully connected layers. Let’s make this more concrete by working through how these three layers may be connected together.

1. Image Input Data

Let’s assume you have a dataset of grayscale images. Each image has the same size of 32 pixels wide and 32 pixels high, and pixel values are between 0 and 255, i.e., a matrix of 32×32×1 or 1,024 pixel values.

Image input data is expressed as a 3-dimensional matrix of width * height * channels. If you were using color images in the example, you would have three channels for the red, green, and blue pixel values, e.g., 32x32x3.

2. Convolutional Layer

Define a convolutional layer with ten filters, a receptive field 5 pixels wide and 5 pixels high, and a stride length of 1.

Because each filter can only get input from (i.e., “see”) 5×5 or 25 pixels at a time, you can calculate that each will require 25 + 1 input weights (plus 1 for the bias input).

Dragging the 5×5 receptive field across the input image data with a stride width of 1 will result in a feature map of 28×28 output values or 784 distinct activations per image.

You have ten filters, so ten different 28×28 feature maps or 7,840 outputs will be created for one image.

Finally, you know you have 26 inputs per filter, ten filters, and 28×28 output values to calculate per filter. Therefore, you have a total of 26x10x28x28 or 203,840 “connections” in your convolutional layer if you want to phrase it using traditional neural network nomenclature.
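
The arithmetic for this convolutional layer can be checked with a short script. The variable names are chosen here for readability. The last line is an addition to the counts in the text: because each filter's weights are shared across all positions, the number of weights actually learned is far smaller than the number of connections.

```python
image_size = 32      # 32x32 grayscale input
field_size = 5       # 5x5 receptive field
stride = 1
n_filters = 10

weights_per_filter = field_size * field_size + 1        # 25 weights + 1 bias
map_size = (image_size - field_size) // stride + 1      # positions per dimension
activations_per_map = map_size * map_size
total_outputs = n_filters * activations_per_map
connections = weights_per_filter * n_filters * activations_per_map
learned_weights = weights_per_filter * n_filters        # shared across positions

print(map_size, activations_per_map, total_outputs)  # 28 784 7840
print(connections)                                   # 203840
print(learned_weights)                               # 260
```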

Convolutional layers also make use of a nonlinear transfer function as part of the activation, and the rectifier activation function is the popular default to use.

3. Pooling Layer

You can define a pooling layer with a receptive field with a width of 2 inputs and a height of 2 inputs. You can also use a stride of 2 to ensure that there is no overlap.

This results in feature maps that are one-half the width and height of the input feature maps (one-quarter of the activations), from ten different 28×28 feature maps as input to ten different 14×14 feature maps as output.

You will use a max() operation for each receptive field so that the activation is the maximum input value.

4. Fully Connected Layer

Finally, you can flatten out the square feature maps into a traditional flat, fully connected layer.

You can define the fully connected layer with 200 hidden neurons, each with 10x14x14 input connections, or 1960 + 1 weights per neuron. That is a total of 392,200 connections and weights to learn in this layer.

You can use a sigmoid or softmax transfer function to output probabilities of class values directly.
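
To see how the pieces connect, the sketch below follows the shape of the data through the worked example, layer by layer, and counts the weights in the fully connected layer. The function name and its defaults are assumptions that mirror the numbers used above.

```python
def trace_shapes(image_size=32, channels=1, n_filters=10, field=5,
                 pool=2, hidden=200):
    """Follow the worked example layer by layer, returning (name, shape) pairs."""
    shapes = [("input", (image_size, image_size, channels))]
    conv = image_size - field + 1                 # stride 1, no padding
    shapes.append(("conv", (conv, conv, n_filters)))
    pooled = conv // pool                         # pool x pool window, stride = pool
    shapes.append(("pool", (pooled, pooled, n_filters)))
    flat = pooled * pooled * n_filters            # flatten the maps into one vector
    shapes.append(("flatten", (flat,)))
    shapes.append(("dense", (hidden,)))
    return shapes, (flat + 1) * hidden            # dense weights including biases

shapes, dense_weight_count = trace_shapes()
for name, shape in shapes:
    print(name, shape)
# input (32, 32, 1)
# conv (28, 28, 10)
# pool (14, 14, 10)
# flatten (1960,)
# dense (200,)
print(dense_weight_count)  # 392200
```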

Convolutional Neural Networks Best Practices

Now that you know about the building blocks for a convolutional neural network and how the layers hang together, you can review some best practices to consider when applying them.

  • Input Receptive Field Dimensions: The default is 2D for images but could be 1D for words in a sentence or 3D for a video that adds a time dimension.
  • Receptive Field Size: The patch should be as small as possible but large enough to “see” features in the input data. It is common to use 3×3 on small images and 5×5 or 7×7 and more on larger image sizes.
  • Stride Width: Use the default stride of 1. It is easy to understand, and you don’t need padding to handle the receptive field falling off the edge of your images. This could be increased to 2 or larger for larger images.
  • Number of Filters: Filters are the feature detectors. Generally, fewer filters are used at the input layer, and increasingly more filters are used at deeper layers.
  • Padding: Padded values are set to zero (called zero padding) wherever the receptive field reads beyond the edge of the input. This is useful when you cannot or do not want to standardize input image sizes or when you want to use receptive field and stride sizes that do not neatly divide up the input image size.
  • Pooling: Pooling is a destructive or generalization process to reduce overfitting. The receptive field is almost always set to 2×2 with a stride of 2 to discard 75% of the activations from the output of the previous layer.
  • Data Preparation: Consider standardizing input data, both the dimensions of the images and pixel values.
  • Pattern Architecture: It is common to pattern the layers in your network architecture. This might be one, two, or some number of convolutional layers followed by a pooling layer. This structure can then be repeated one or more times. Finally, fully connected layers are often only used at the output end and may be stacked one, two, or more deep.
  • Dropout: CNNs have a habit of overfitting, even with pooling layers. Dropout should be used, such as between fully connected layers and perhaps after pooling layers.
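
As one small illustration of the data preparation point, the sketch below rescales pixel values from the 0 to 255 range to the 0 to 1 range, and then standardizes them to zero mean and unit standard deviation. The function names and the sample pixel row are hypothetical, and which of the two schemes works better depends on the problem.

```python
def rescale(pixels):
    """Map 0-255 pixel intensities to the range 0.0-1.0."""
    return [p / 255.0 for p in pixels]

def standardize(values):
    """Shift and scale values to zero mean and unit standard deviation.

    Assumes the values are not all identical (the standard deviation is non-zero).
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

pixels = [0, 51, 102, 153, 204, 255]   # a hypothetical row of pixel values
scaled = rescale(pixels)
print(scaled)   # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

standardized = standardize(scaled)
print(standardized)
```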

Do you know about some more best practices for using CNNs?

Let me know in the comments.

Further Reading on Convolutional Neural Networks

You have only scratched the surface on convolutional neural networks. The field is moving very fast, and new and interesting architectures and techniques are being discussed and used all the time.

If you are looking for a deeper understanding of the technique, take a look at LeCun et al.’s seminal paper titled “Gradient-Based Learning Applied to Document Recognition” [PDF]. In it, they introduce LeNet applied to handwritten digit recognition and carefully explain the layers and how the network is connected.

There are a lot of tutorials and discussions of CNNs around the web. A few choice examples are listed below. Personally, I find the explanatory pictures in the posts useful only after understanding how the network hangs together. Many of the explanations are confusing and defer you to LeCun’s paper if in doubt.

Summary

In this post, you discovered convolutional neural networks. You learned about:

  • Why CNNs are needed to preserve spatial structure in your input data and the benefits they provide
  • The building blocks of a CNN, including convolutional, pooling, and fully connected layers
  • How the layers in a CNN hang together
  • Best practices when applying a CNN to your own problems

Do you have any questions about convolutional neural networks or this post? Ask your questions in the comments, and I will do my best to answer.

75 Responses to Crash Course in Convolutional Neural Networks for Machine Learning

  1. Avatar
    Anonymous June 25, 2016 at 9:48 pm #

    Thanks for your useful post. I think there is a problem with downloading your pdf. Could you please check it out. Thanks

    • Avatar
      Jason Brownlee June 26, 2016 at 6:03 am #

      You can download the Deep Learning mini course by clicking on the “Download button”, entering your email address, clicking the “Download Now” button, then checking your email where the link to your PDF will be available.

      I just double-checked the process and everything seems to work perfectly. Perhaps try again or try on a different computer or browser?

  2. Avatar
    ana lu November 6, 2016 at 12:36 pm #

    Can CNN be used for non-2D problem such as a general regression problem? If so, could you give an example code?

  3. Avatar
    ML704 January 23, 2017 at 3:18 pm #

    Hi, I need some advice. Do you think LeCun’s paper “Gradient-Based Learning Applied to Document Recognition” is worth reading to get a knowledge for Image recognition such as hand gestures? Or is it only suitable for Document Recognition. It is 46-page paper so I first want to make sure if I am reading a right paper. Thank you!

    • Avatar
      Jason Brownlee January 24, 2017 at 10:59 am #

      Generally, I would recommend reading on material directly related to your problem, even reproducing existing work to help you get a “working result” as fast as possible, before you attempt to improve upon it.

      • Avatar
        ML704 January 28, 2017 at 2:44 pm #

        I see, thank you mister!

  4. Avatar
    Amy February 1, 2017 at 3:34 am #

    Hi Jason, I was just wondering why did you choose 200 neurons for the fully-connected layer? Does the number of neurons here correspond to the number of classes you’re looking to classify? Thanks!

    • Avatar
      Jason Brownlee February 1, 2017 at 10:52 am #

      Hi Amy, great question.

      I chose the number of neurons after some trial and error. There are no good theories for configuring neural networks, it’s the “art” or empirical part of the discipline that causes a lot of difficulty.

      • Avatar
        Amy February 2, 2017 at 3:46 am #

        Ah, I see. So I guess it’s a matter of gaining practice building these networks.

        Just to confirm, is it correct that the output layer should still correspond to the number of classes we’re looking to classify?

      • Avatar
        Rui February 23, 2021 at 6:36 pm #

        Hi Jason, I saw your answer to Amy, but I am still confused. So you have only one hidden layer with 200 neurons? May I ask what problem you are solving here? Is it an image classification problem? From the description it seems like you have 200 categories. However, if that is the case, then it should be fixed? You also mentioned you chose it after trial and error, which makes it feel like there is another layer after the fully connected layer? Please help, thanks 🙂

  5. Avatar
    cv007 March 9, 2017 at 4:05 am #

    Hi Jason,
    Can you suggest how can I take 2 channel output from CNN? I mean output should be 2*64*64 where 64*64 is the dimensions of the image and 2 are the number of channels. I am new to cnn so please help me in this. I can make h*w 2 dim output but for only 1 channel I do not understand how can I do it for 2 channels.

  6. Avatar
    Moyo March 11, 2017 at 4:54 am #

    Thanks for the great explanation.
    I am new and trying to build CNN to classify 4 patterns from 32×32 binary images. I have 3000 samples for each pattern. I used keras in python.
    My model has.
    1.64 filters 3×3 max pool 2×2
    2.128 filters 3×3 max pool 2×2
    3. 256 filters 3×3 max pool 2×2
    4. Fully connected layer with 256 output
    5. Output layer with 4 outputs.

    But I get underfitting, with training accuracy around 65% and validation accuracy 85%.

    Can you tell me what can i do to improve my model.

    Thank you for your time.

    Moyo

  7. Avatar
    Animesh Mohanty April 5, 2017 at 2:57 pm #

    sir, can you please post a tutorial to build a RBF neural network?

  8. Avatar
    chunhui July 6, 2017 at 10:25 pm #

    Hi Jason,if there are many convolutional layers and pooling layers in a CNN, and every convolutional layer has some feature maps, i was wondering how the convolutional layer connect with other convolutional layers, i mean feature map in this layer fully connect with all maps in last layers, or just connect with some maps .

    • Avatar
      Jason Brownlee July 9, 2017 at 10:31 am #

      Convolutional layers connect to pool layers which connect to Conv layers, and so on down the line.

  9. Avatar
    Yeman Brhane Hagos August 12, 2017 at 7:16 am #

    Your posts are amazing. Thank you so much for sharing.

  10. Avatar
    Prabir Sinha September 14, 2017 at 4:17 pm #

    Hi Jason ,

    I am trying to build a convolution neural network with TF . But it seems the demo program posted on http://www.tensorflow.org is having some compilation issue . Can you provide a simple TF code for building CNN which I can take it as reference.

    Regards
    Prabir

  11. Avatar
    Ricky November 9, 2017 at 10:25 pm #

    Hi Jason!

    I am new to CNN, just wondering while doing backprop; weights in penultimate layer and filter matrix are the only things that are getting optimised?

    • Avatar
      Jason Brownlee November 10, 2017 at 10:35 am #

      We would perform backprop across all layers. Perhaps I don’t follow your question?

  12. Avatar
    Manik November 20, 2017 at 3:47 am #

    Sir , I am a big fan of you. I have purchased one of your book related to machine learning algorithms. The way you explained is nice and pretty awesome….now I am ready to purchase another book of you about deep learning…I am totally in a vague state for its implementation on a system ,whether to buy a laptop or desktop. Could you please give ur suggestion , for deep learning implementation on a laptop or desktop ????
    Thank you sir …..

    • Avatar
      Jason Brownlee November 20, 2017 at 10:20 am #

      Thanks again!

      Generally, I don’t think it matters. You can do development locally, then run large models on AWS. That is exactly what I do.

  13. Avatar
    Manik November 20, 2017 at 5:39 pm #

    OK sir ….

  14. Avatar
    Manik November 20, 2017 at 5:42 pm #

    Is there any book written by you for signal processing using python ????

  15. Avatar
    Manik November 20, 2017 at 5:45 pm #

    Please suggest a good book for features extraction of images using transforms in signal processing….thank you sir

    • Avatar
      Jason Brownlee November 22, 2017 at 10:38 am #

      Sorry, I don’t have material on that topic.

  16. Avatar
    Manik November 23, 2017 at 2:49 am #

    Sir ,can we use evolutionary algorithms on image
    classifications , thank you….

    • Avatar
      Jason Brownlee November 23, 2017 at 10:34 am #

      Perhaps you could use evolutionary algorithms to learn a classifier, but you must still choose the form of the classifier.

  17. Avatar
    Flor December 13, 2017 at 5:01 am #

    Thank you for your reply.
    What does it mean to preserve the spatial structure in your data?

    • Avatar
      Jason Brownlee December 13, 2017 at 5:46 am #

      You’re welcome.

      Spatial structure means the relationships or “structure” in the data that has meaning for the model.

  18. Avatar
    Alex February 25, 2018 at 7:47 am #

    Thanks for a great tutorial!
    I’ve been trying to solve a problem for some time with no luck up to now. There’s a text file with all spaces removed from. So the task is to restore them.
    My strategy is to predict the next letter based on the input and insert it into the string if the model predicts the space. I trained a bunch of RNN models on a normal text set, the best results showed Bidirectional LSTM. But still the results are terrible (besides spaces around commas, capitals and some short patterns). So maybe my whole approach is wrong? I found out that convolutional networks can be used on a character level for some interesting tasks in NLP. Is it worth a try or should I improve RNN model, or maybe use something else. What approach would you suggest for such a task?
    Thank you for your time.

  19. Avatar
    Leonardo May 16, 2018 at 11:49 am #

    Excuse.me, I have a doubt. It is possible to use CNN for non image dataset, especially with student data. For example with attributes such as average grade, year of enter to the university, level of incomes, marital status, etc. I have also used traditional machine learning techniques, but I am wanting to try deep learning techniques. Can you help me please. Thank for all.

    • Avatar
      Jason Brownlee May 17, 2018 at 6:23 am #

      Yes, CNNs are great for text data and other input data that has a spatial relationship.

      • Avatar
        Leonardo May 17, 2018 at 7:28 am #

        How can I use if the data doesn’t have a spatial relationship?
        Could I use that stil?

    • Avatar
      Fatima June 20, 2022 at 8:50 pm #

      Do you used CNN for student dataset ? Did it work properly for you ?

  20. Avatar
    Suresh Veera Venkata Veera Venkata Grandhi July 10, 2018 at 2:24 am #

    Sir, Why we are using the Activation function at Convolution step? Could you please tell me the significance of it?

    • Avatar
      Jason Brownlee July 10, 2018 at 6:50 am #

      What do you mean exactly? Can you give an example?

  21. Avatar
    mrinal verma July 14, 2018 at 5:16 pm #

    you tutorial cleared all my doubts. thanks a lot.

  22. Avatar
    delaman September 30, 2018 at 5:20 am #

    hi Jason Brownlee, can you please recommend some research papers i can easily depreciate to practice for conv. neural networks pls.

    • Avatar
      Jason Brownlee September 30, 2018 at 6:08 am #

      You can search for papers here: scholar.google.com

  23. Avatar
    Abid Rizvi April 18, 2019 at 6:51 pm #

    Dear Sir,
    Under this section:
    Further Reading on Convolutional Neural Networks\

    Convolutional Networks model in the Stanford CS231n course
    https://cs231n.github.io/convolutional-networks/%20https://cs231n.github.io/

    This link shows Page not found. For your review and fixation, please.

  24. Avatar
    Joe July 18, 2020 at 6:57 pm #

    Thank you so much for another great tutorial.

    Why do you recommend to increase the number of filters as we go deeper into the CNN model? My first intuition is that the number of filters need to be decreased in order for the net to capture more of the larger picture and less of the details.

    • Avatar
      Jason Brownlee July 19, 2020 at 6:28 am #

      Good question, it is a pattern of model configuration that has repeatedly shown better performance.

      E.g. the deeper we go we get more opportunity to interpret the extracted features in different ways.

      The references in this tutorial may also help:
      https://machinelearningmastery.com/introduction-to-the-imagenet-large-scale-visual-recognition-challenge-ilsvrc/

      • Avatar
        Joe July 20, 2020 at 12:52 am #

        Thank you so much Jason you’re a legend.
        so the answer is:

        (a) that experimentally the architecture of ascending number of filters has shown better results.

        (b) that we may explain the observations by the number of interpretations – the deeper we go, the number of possibilities increases because the patterns that the computer can capture are more sophisticated since it moves from seeing very simple features like dots and lines to seeing complex structures like eyes and faces.

  25. Avatar
    Waleed July 25, 2020 at 9:57 am #

    Hello: Is CNN the best text classification and if not the best what is the best algorithm?

  26. Avatar
    Waleed July 26, 2020 at 10:43 am #

    I am currently doing research as a master’s thesis on identifying speech that promotes hatred in social networking sites … Do you have a specific idea on this subject to work on (identifying a specific problem in order to work on it) ….. Thank you very much in advance

  27. Avatar
    Jan August 13, 2020 at 11:07 pm #

    Hello Jason. Thank you for amazing work! My question is related to convolutional filters. I understand the traditional way of using 2D convolution in image processing, where each kernel (with filter map) was designed up front to “do”something. For instance 1/9 * [1 1 1; 1 1 1; 1 1 1] would be 3×3 kernel serving as blurring filter and so on. What I do not understand is, how the filter maps are created during learning process. Would you mind to share some information about that? Or some helpful link? Thank you, best regards! Jan

  28. Avatar
    Stephane November 22, 2021 at 6:06 pm #

    Hi,

    If I have
    – an input of x20x20x1
    – a Conv2D with filters = 16, kernel_size = 3, activation=’relu’, padding=’same’
    – a MaxPooling2D(pool_size = (2, 2), padding=’same’)

    My output here will be a x10x10x16, as there is 16 filters

    then I add another layer of
    – a Conv2D with filters = 8, kernel_size = 3, activation=’relu’, padding=’same’
    – a MaxPooling2D(pool_size = (2, 2), padding=’same’)

    My output here will be a x5x5x8, as there is 8 filters

    What’s happening between the two layers ? What became of the x16 ?

    My understanding is that the last number of the shape, x16 then x8 are information on the number of layer, but a layer is a more complex object.
    The x16 is in fact x16 x1 (the 4th dimension). And the x8 is x8x16x1 (8 filters applied to the previous 16 filters).

    Is that correct ? Or am I completely wrong ?

    • Avatar
      Adrian Tam November 23, 2021 at 1:30 pm #

      Input of (20,20,1) applying one Conv2D with kernel size 3, stride 1, and same padding will be (20,20,1); so if you apply 16 Conv2D the output will be (20,20,16). Then applying MaxPooling with pool size (2,2) will make it into (10,10,16). We call the 16 here the number of channels.

  29. Avatar
    Stephane November 23, 2021 at 2:03 pm #

    So my question is between the output of the 1st MaxPooling2D that is (10,10,16), and the 2nd Conv2D with filters = 8, how the number of channel (16) are transformed to become the 8 ?
    Or what is a channel ?

  30. Avatar
    yerma August 28, 2022 at 8:09 am #

    Hi Jason, I lost you on this part:

    “Dragging the 5×5 receptive field across the input image data with a stride width of 1 will result in a feature map of 28×28 output values or 784 distinct activations per image.”

    Where did the number “28” come from? Did you just do : (field dimensions)^2 + stride*dimensions = 5^2 + 1*3 = 28 ?
    what is the formula here?

  31. Avatar
    Mujtaba August 1, 2023 at 8:58 pm #

    Can CNN work with 3D geometry data or is there any other NN structure good for 3D classification? Do you have any blog regarding this or maybe you know if some work is done?

  32. Avatar
    Alex November 15, 2023 at 4:34 am #

    I use conv1d as a variant of simple/fast attention layer

    y = Input(x) -> Dense(x)
    Input(x) -> Conv(x,x) -> Add(y) -> result

    Try it 🙂

    • Avatar
      James Carmichael November 15, 2023 at 11:18 am #

      Thank you for your recommendation Alex! We appreciate the input!
