Artificial neural networks have two main hyperparameters that control the architecture or topology of the network: the number of layers and the number of nodes in each hidden layer.

You must specify values for these parameters when configuring your network.

The most reliable way to configure these hyperparameters for your specific predictive modeling problem is via systematic experimentation with a robust test harness.

This can be a tough pill to swallow for beginners to the field of machine learning, looking for an analytical way to calculate the optimal number of layers and nodes, or easy rules of thumb to follow.

In this post, you will discover the roles of layers and nodes and how to approach the configuration of a multilayer perceptron neural network for your predictive modeling problem.

After reading this post, you will know:

- The difference between single-layer and multiple-layer perceptron networks.
- The value of having one or more hidden layers in a network.
- Five approaches for configuring the number of layers and nodes in a network.

Let’s get started.

## Overview

This post is divided into four sections; they are:

- The Multilayer Perceptron
- How to Count Layers?
- Why Have Multiple Layers?
- How Many Layers and Nodes to Use?

## The Multilayer Perceptron

A node, also called a neuron or Perceptron, is a computational unit that has one or more weighted input connections, a transfer function that combines the inputs in some way, and an output connection.
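As a rough illustration of that description, here is a minimal sketch of a single node in plain Python. The names `sigmoid` and `node_output` are made up for this example, and the bias term and the choice of a logistic sigmoid are assumptions; the description above only requires weighted inputs, a transfer function, and an output.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, bias, activation=sigmoid):
    """Combine weighted inputs (plus an assumed bias) and apply the transfer function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Two inputs whose weighted sum is exactly zero, so the sigmoid returns 0.5
out = node_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(out)  # 0.5
```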

Nodes are then organized into layers to comprise a network.

A single-layer artificial neural network, also called a single-layer perceptron, has a single layer of nodes, as its name suggests. Each node in the single layer connects directly to an input variable and contributes to an output variable.

> Single-layer networks have just one layer of active units. Inputs connect directly to the outputs through a single layer of weights. The outputs do not interact, so a network with N outputs can be treated as N separate single-output networks.

— Page 15, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

A single-layer network can be extended to a multiple-layer network, referred to as a Multilayer Perceptron. A Multilayer Perceptron, or MLP for short, is an artificial neural network with more than a single layer.

It has an input layer that connects to the input variables, one or more hidden layers, and an output layer that produces the output variables.

> The standard multilayer perceptron (MLP) is a cascade of single-layer perceptrons. There is a layer of input nodes, a layer of output nodes, and one or more intermediate layers. The interior layers are sometimes called “hidden layers” because they are not directly observable from the systems inputs and outputs.

— Page 31, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

We can summarize the types of layers in an MLP as follows:

- **Input Layer**: Input variables, sometimes called the visible layer.
- **Hidden Layers**: Layers of nodes between the input and output layers. There may be one or more of these layers.
- **Output Layer**: A layer of nodes that produce the output variables.

Finally, there are terms used to describe the shape and capability of a neural network; for example:

- **Size**: The number of nodes in the model.
- **Width**: The number of nodes in a specific layer.
- **Depth**: The number of layers in a neural network.
- **Capacity**: The type or structure of functions that can be learned by a network configuration. Sometimes called “*representational capacity*”.
- **Architecture**: The specific arrangement of the layers and nodes in the network.

## How to Count Layers?

Traditionally, there is some disagreement about how to count the number of layers.

The disagreement centers around whether or not the input layer is counted. There is an argument to suggest it should not be counted because the inputs are not active; they are simply the input variables. We will use this convention; this is also the convention recommended in the book “*Neural Smithing*”.

Therefore, an MLP that has an input layer, one hidden layer, and one output layer is a 2-layer MLP.

The structure of an MLP can be summarized using a simple notation.

This convenient notation summarizes both the number of layers and the number of nodes in each layer. The number of nodes in each layer is specified as an integer, in order from the input layer to the output layer, with the size of each layer separated by a forward-slash character (“/”).

For example, a network with two variables in the input layer, one hidden layer with eight nodes, and an output layer with one node would be described using the notation: 2/8/1.
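To make the notation concrete, here is a small sketch that parses a string such as `2/8/1` and reports the depth and layer widths, counting layers without the input layer as described above. The helper names `parse_mlp_notation` and `describe_mlp` are hypothetical, and the parameter count assumes a fully connected network with one bias per hidden and output node, which the notation itself does not specify.

```python
def parse_mlp_notation(notation):
    """Split a string such as '2/8/1' into a list of layer widths."""
    widths = [int(part) for part in notation.split("/")]
    if len(widths) < 2 or any(w < 1 for w in widths):
        raise ValueError(f"invalid MLP notation: {notation!r}")
    return widths

def describe_mlp(notation):
    """Summarize an MLP from its slash-separated notation."""
    widths = parse_mlp_notation(notation)
    # One weight per connection between consecutive layers (assumes full connectivity)
    weights = sum(a * b for a, b in zip(widths, widths[1:]))
    # One bias per hidden and output node (an assumption, not part of the notation)
    biases = sum(widths[1:])
    return {
        "inputs": widths[0],
        "hidden_widths": widths[1:-1],
        "outputs": widths[-1],
        "depth": len(widths) - 1,  # the input layer is not counted
        "parameters": weights + biases,
    }

summary = describe_mlp("2/8/1")
print(summary["depth"])       # 2, i.e. a 2-layer MLP
print(summary["parameters"])  # 33 = (2*8 + 8*1) weights + (8 + 1) biases
```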

I recommend using this notation when describing the layers and their size for a Multilayer Perceptron neural network.

## Why Have Multiple Layers?

Before we look at how many layers to specify, it is important to think about why we would want to have multiple layers.

A single-layer neural network can only be used to represent linearly separable functions. This means very simple problems where, say, the two classes in a classification problem can be neatly separated by a line. If your problem is relatively simple, perhaps a single layer network would be sufficient.

Most problems that we are interested in solving are not linearly separable.

A Multilayer Perceptron can be used to represent convex regions. This means that in effect, they can learn to draw shapes around examples in some high-dimensional space that can separate and classify them, overcoming the limitation of linear separability.
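The classic XOR problem is a handy way to see this limitation. The sketch below is illustrative only: it uses a simple step activation and hand-picked weights (both assumptions for this example, not a training procedure). It brute-forces a coarse grid of weights to show that no single threshold unit on that grid reproduces XOR, then shows that a 2/2/1 network with one hidden layer does.

```python
from itertools import product

def step(x):
    """Threshold activation: fire (1) only for strictly positive input."""
    return 1 if x > 0 else 0

# XOR truth table: inputs and the expected class
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def single_unit(x, w1, w2, b):
    """A single-layer network: one threshold unit connected to both inputs."""
    return step(w1 * x[0] + w2 * x[1] + b)

def mlp_2_2_1(x):
    """A 2/2/1 network with hand-picked weights: OR and AND units feed the output."""
    h_or = step(x[0] + x[1] - 0.5)
    h_and = step(x[0] + x[1] - 1.5)
    return step(h_or - h_and - 0.5)  # "OR but not AND" is XOR

# Try every weight/bias combination on a coarse grid from -4.0 to 4.0
grid = [i / 2 for i in range(-8, 9)]
single_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(single_unit(x, w1, w2, b) == y for x, y in XOR)
]

print(len(single_solutions))           # 0: no single unit on the grid solves XOR
print([mlp_2_2_1(x) for x, _ in XOR])  # [0, 1, 1, 0]
```

The grid search is not a proof, but it matches the known result that XOR is not linearly separable.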

In fact, there is a theoretical finding by Lippmann in the 1987 paper “An introduction to computing with neural nets” that shows that an MLP with two hidden layers is sufficient for creating classification regions of any desired shape. This is instructive, although it should be noted that no indication of how many nodes to use in each layer or how to learn the weights is given.

A further theoretical result, with proof, shows that MLPs are universal approximators: with a single hidden layer, an MLP can approximate any function that we require.

> Specifically, the universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.

— Page 198, Deep Learning, 2016.
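For readers who prefer symbols, one common way to write the idea is given below. This is a sketch of the narrower, classic setting (a continuous function $f$ on a compact set $K \subset \mathbb{R}^n$ and a sigmoidal activation $\sigma$), not the more general Borel measurable statement quoted above. The symbols $N$, $v_i$, $w_i$, and $b_i$ are notation introduced here for the number of hidden units, output weights, input weights, and biases.

```latex
\forall \varepsilon > 0 \;\; \exists\, N \in \mathbb{N},\;
\{v_i, w_i, b_i\}_{i=1}^{N} : \quad
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
```

Note that the statement only says suitable values exist; it says nothing about how large $N$ must be or how to find the weights.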

This is an often-cited theoretical finding and there is a ton of literature on it. In practice, we again have no idea how many nodes to use in the single hidden layer for a given problem nor how to learn or set their weights effectively. Further, many counterexamples have been presented of functions that cannot directly be learned via a single one-hidden-layer MLP or require an infinite number of nodes.

Even for those functions that can be learned via a sufficiently large one-hidden-layer MLP, it can be more efficient to learn them with two (or more) hidden layers.

> Since a single sufficiently large hidden layer is adequate for approximation of most functions, why would anyone ever use more? One reason hangs on the words “sufficiently large”. Although a single hidden layer is optimal for some functions, there are others for which a single-hidden-layer-solution is very inefficient compared to solutions with more layers.

— Page 38, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

## How Many Layers and Nodes to Use?

With the preamble of MLPs out of the way, let’s get down to your real question.

How many layers should you use in your Multilayer Perceptron and how many nodes per layer?

In this section, we will enumerate five approaches to solving this problem.

### 1) Experimentation

In general, when I’m asked how many layers and nodes to use for an MLP, I often reply:

> I don’t know. Use systematic experimentation to discover what works best for your specific dataset.

I still stand by this answer.

In general, you cannot analytically calculate the number of layers or the number of nodes to use per layer in an artificial neural network to address a specific real-world predictive modeling problem.

The number of layers and the number of nodes in each layer are model hyperparameters that you must specify.

You are likely to be the first person to attempt to address your specific problem with a neural network. No one has solved it before you. Therefore, no one can tell you the answer of how to configure the network.

You must discover the answer using a robust test harness and controlled experiments.

Regardless of the heuristics you might encounter, all answers will come back to the need for careful experimentation to see what works best for your specific dataset.
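One possible shape for such a test harness is repeated k-fold cross-validation, sketched below with only the standard library. Everything here is an assumption for illustration: the helper names `k_fold_indices` and `evaluate_config`, the default of 5 folds and 3 repeats, and especially `dummy_fit_and_score`, which is a placeholder you would replace with code that really fits a network on the training rows and scores it on the test rows.

```python
import random
import statistics

def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def evaluate_config(config, n_samples, fit_and_score, k=5, repeats=3, seed=1):
    """Repeated k-fold evaluation; returns the mean and standard deviation of the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        folds = k_fold_indices(n_samples, k, rng)
        for i, test_idx in enumerate(folds):
            # Train on every fold except the one held out for testing
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(config, train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)

def dummy_fit_and_score(config, train_idx, test_idx):
    """Placeholder only: swap in real model training and evaluation here."""
    return len(train_idx) / (len(train_idx) + len(test_idx))

mean_score, std_score = evaluate_config("2/8/1", 100, dummy_fit_and_score)
print(mean_score, std_score)
```

Reporting the spread as well as the mean helps you judge whether a difference between two configurations is real or just noise.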

### 2) Intuition

The network can be configured via intuition.

For example, you may have an intuition that a deep network is required to address a specific predictive modeling problem.

A deep model provides a hierarchy of layers that build up increasing levels of abstraction from the space of the input variables to the output variables.

Given an understanding of the problem domain, we may believe that a deep hierarchical model is required to sufficiently solve the prediction problem. In which case, we may choose a network configuration that has many layers of depth.

> Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation.

— Page 201, Deep Learning, 2016.

This intuition can come from experience with the domain, experience with modeling problems with neural networks, or some mixture of the two.

In my experience, intuitions are often invalidated via experiments.

### 3) Go For Depth

In their important textbook on deep learning, Goodfellow, Bengio, and Courville highlight that empirically, on problems of interest, deep neural networks appear to perform better.

Specifically, they frame the choice to use deep neural networks as a statistical argument for cases where depth may be intuitively beneficial.

> Empirically, greater depth does seem to result in better generalization for a wide variety of tasks. […] This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.

— Page 201, Deep Learning, 2016.

We may use this argument to suggest that using deep networks, those with many layers, may be a heuristic approach to configuring networks for challenging predictive modeling problems.

This is similar to the advice for starting with Random Forest and Stochastic Gradient Boosting on a predictive modeling problem with tabular data to quickly get an idea of an upper-bound on model skill prior to testing other methods.

### 4) Borrow Ideas

A simple, but perhaps time-consuming, approach is to leverage findings reported in the literature.

Find research papers that describe the use of MLPs on instances of prediction problems similar in some way to your problem. Note the configuration of the networks used in those papers and use them as a starting point for the configurations to test on your problem.

Transferability of model hyperparameters that result in skillful models from one problem to another is a challenging open problem and the reason why model hyperparameter configuration is more art than science.

Nevertheless, the number of layers and nodes used on related problems is a good starting point for testing ideas.

### 5) Search

Design an automated search to test different network configurations.

You can seed the search with ideas from literature and intuition.

Some popular search strategies include:

- **Random**: Try random configurations of layers and nodes per layer.
- **Grid**: Try a systematic search across the number of layers and nodes per layer.
- **Heuristic**: Try a directed search across configurations such as a genetic algorithm or Bayesian optimization.
- **Exhaustive**: Try all combinations of layers and the number of nodes; it might be feasible for small networks and datasets.
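The random and grid strategies can be sketched in a few lines. In this illustrative example a configuration is simply a list of hidden-layer widths, and the grid uses the same width for every hidden layer to keep it small; both are simplifying assumptions. The function `toy_score` is a made-up stand-in so the sketch runs on its own, and in practice you would replace it with the cross-validated skill of a model fit with that configuration.

```python
import random
from itertools import product

def grid_configs(layer_options, node_options):
    """Every combination of hidden-layer count and a shared width per layer."""
    return [[nodes] * layers for layers, nodes in product(layer_options, node_options)]

def random_configs(n, max_layers, max_nodes, seed=1):
    """Sample n configurations with a random depth and a random width per layer."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n):
        layers = rng.randint(1, max_layers)
        configs.append([rng.randint(1, max_nodes) for _ in range(layers)])
    return configs

def search(configs, score_fn):
    """Score every configuration and return the (score, config) pair with the top score."""
    scored = [(score_fn(config), config) for config in configs]
    return max(scored, key=lambda pair: pair[0])

def toy_score(config):
    """Stand-in objective (not a real model): prefers about 16 hidden nodes in total."""
    return -abs(sum(config) - 16)

grid = grid_configs(layer_options=[1, 2, 3], node_options=[4, 8, 16])
print(len(grid))  # 9 configurations
print(search(grid, toy_score))
print(search(random_configs(20, max_layers=3, max_nodes=32), toy_score))
```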

This can be challenging with large models, large datasets and combinations of the two. Some ideas to reduce or manage the computational burden include:

- Fit models on a smaller subset of the training dataset to speed up the search.
- Aggressively bound the size of the search space.
- Parallelize the search across multiple server instances (e.g. use Amazon EC2 service).
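Two of these ideas, fitting on a subset and evaluating configurations concurrently, can be sketched as follows. This is a rough sketch under stated assumptions: `score_config` is a placeholder rather than real model training, and a local thread pool stands in for the pattern of farming out work. Threads will not speed up CPU-bound pure Python code, so a real search would more likely use separate processes or machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def subsample(rows, fraction, seed=1):
    """Draw a random subset of the training rows to make each fit cheaper."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

def score_config(config, rows):
    """Placeholder objective: replace with fitting and scoring a real model."""
    return -abs(sum(config) - len(rows) // 10)

def parallel_search(configs, rows, max_workers=4):
    """Score all configurations concurrently and rank them from best to worst."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda config: score_config(config, rows), configs))
    return sorted(zip(scores, configs), reverse=True)

rows = list(range(1000))            # stand-in for a training dataset
small = subsample(rows, fraction=0.1)
ranked = parallel_search([[4], [8, 8], [10]], small)
print(len(small))   # 100
print(ranked[0])    # (0, [10])
```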

I recommend being systematic if time and resources permit.

### More

I have seen countless heuristics of how to estimate the number of layers and either the total number of neurons or the number of neurons per layer.

I do not want to enumerate them; I’m skeptical that they add practical value beyond the special cases on which they are demonstrated.

If this area is interesting to you, perhaps start with “*Section 4.4 Capacity versus Size*” in the book “Neural Smithing”. It summarizes a ton of findings in this area. The book dates from 1999, so there are nearly 20 more years of ideas to wade through in this area if you’re up for it.

Also, see some of the discussions linked in the *Further Reading* section (below).

Did I miss your favorite method for configuring a neural network? Or do you know a good reference on the topic?

Let me know in the comments below.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

- An introduction to computing with neural nets, 1987.

### Books

- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Deep Learning, 2016.

### Articles

- Artificial neural network on Wikipedia
- Universal approximation theorem on Wikipedia
- How many hidden layers should I use?, comp.ai.neural-nets FAQ

### Discussions

- How to choose the number of hidden layers and nodes in a feedforward neural network?
- Number of nodes in hidden layers of neural network
- multi-layer perceptron (MLP) architecture: criteria for choosing number of hidden layers and size of the hidden layer?
- In deep learning, how do I select the optimal number of layers and neurons?

## Summary

In this post, you discovered the role of layers and nodes and how to configure a multilayer perceptron neural network.

Specifically, you learned:

- The difference between single-layer and multiple-layer perceptron networks.
- The value of having one or more hidden layers in a network.
- Five approaches for configuring the number of layers and nodes in a network.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Thanks for the blog post.

There is indeed a large body of recent research that aims to answer this question automatically, dubbed (neural) architecture search. Here is a list of papers that I maintain:

https://www.automl.org/automl/literature-on-neural-architecture-search/

Thanks.

Great! Thanks for the blog post 🙂 There is also an interesting post here which tries to address the same question.

https://towardsdatascience.com/beginners-ask-how-many-hidden-layers-neurons-to-use-in-artificial-neural-networks-51466afa0d3e

Thanks for sharing.

This is a great blog with nice information.

Thanks.

Hi, Very nice summary. Thank you very much! I’m a deep learning researcher working in an inter-disciplinary team in Univ Edi. May I ask about the template you used to create this site? It looks quite professional and great!

Thanks, you can learn more about the software I use for the site here:

https://machinelearningmastery.com/faq/single-faq/what-software-do-you-use-to-run-your-website

Thank you for such a great post, but I have a question;

Let’s say our image size is 64*64*3 so what would be the number of nodes in our input layer?

I would recommend using a CNN and perhaps try 32 filters?

Sorry for the ambiguity in my question. Suppose I’m using a CNN and I have a picture of size 64*64*3; what would be the number of nodes in my input layer?

It would be input_shape=(64,64,3) if using channels last format.

Thanks

Will the number of neurons in the hidden layer mentioned here work for RNN/LSTM as well?

Perhaps test and compare results with different configurations?

Hi Jason,

There is practically no way to know ahead of time how many layers or nodes you will need for a certain neural network learning task. I have a solution: a neural network, called ALNfitDeep, which can automatically *grow* during training to fit the problem. Software to do this is at https://github.com/Bill-Armstrong/ . There is a new executable release available, which you can get by clicking where it says “4 releases” near the top of the main page. You can forget the source code for now. Use the Help button. Since the release is new, I would appreciate any feedback on problems you encounter. From my point of view, neural nets which can’t learn automatically based on the problem are a total waste of time. Also having to use a lot of valuable data for validation is a waste. My nets measure the noise variance and then train on all of the data not used in testing. My nets can grow to tens of thousands of nodes, yet the execution of the learned function remains very fast (because very little has to actually be computed for a given input). The secret is that all computation of linear functions is in the first layer. Instead of a one-input squashing function, there are two-input non-linearities: max, and min. People have to have the courage to try it. I will help.

Thanks for the note.

I have played with “growing” and “pruning” nets since the late 1990s, and I remain skeptical.

A sensitivity analysis of model capacity vs skill is reliable and repeatable for me.

Hi Jason,

I have 2 questions.

1. What if I want to predict financial time series (e.g. Forex, stock price) and I have decided to use an MLP with 2 hidden layers? I have also decided to use 4 neurons for each of the hidden layers. How exactly do I split up my dataset, including the input? I’m assuming that the 4 neurons will be Open, Close, High and Low values.

Will my input values be majority of my dataset in total? and then for the first hidden layer, a subset of the dataset including, so some of the ‘High values for one of the neurons’, some of the low values for the second neuron and so on, and then the same in the other hidden layer?

2. Do you have a Python script example for iterating through different layer configurations?

Thank you very much!

If you have 4 classes as output, then the data must be prepared to match this expectation with a one hot encoding:

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

Data would not be split based on the number of nodes, I’m not sure I follow, sorry.

Here’s an example of a model for multi-class classification:

https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

Hi, is it fair to say that the number of layers and nodes is related to the number of training inputs?

If more inputs, more nodes?

Thanks

Not really. They are unrelated.

There is an interrelation between the number of layers (nodes per layer). Please have a look into our paper https://arxiv.org/abs/1902.02771. Thank you.

Thanks for sharing.

Thank you, Jason, for the post.

Is there a rule of thumb for the number of units when you want to increase the number of hidden layers? Let’s say, for example, that your model has decent performance with 1 hidden layer and 30 units. Would choosing 2 hidden layers mean you should decrease the number of units in each of these layers, or could you even increase it?

Not really, sorry.

Test and use a robust test harness so that the results are reliable.

Let’s say we want to differentiate between clear and blurred images. Can a CNN be trained to do that, and how would you go about it?

Yes, perhaps a classification problem with a binary prediction (blur vs no-blur).

MLP? Or MLFFN? Is there a way to use the simpler perceptron update algorithm, without derivatives or backprop and without separating the layers with a non-linear activation, that does not have the whole thing collapse into the equivalent of a single linear layer, as Minsky pointed out way back?

There may be, I don’t have material on it, sorry.

We moved away from the simple Perceptron because backprop on an MLP works really well in general.