Last Updated on August 6, 2019
Artificial neural networks have two main hyperparameters that control the architecture or topology of the network: the number of layers and the number of nodes in each hidden layer.
You must specify values for these parameters when configuring your network.
The most reliable way to configure these hyperparameters for your specific predictive modeling problem is via systematic experimentation with a robust test harness.
This can be a tough pill to swallow for beginners to the field of machine learning who are looking for an analytical way to calculate the optimal number of layers and nodes, or for easy rules of thumb to follow.
In this post, you will discover the roles of layers and nodes and how to approach the configuration of a multilayer perceptron neural network for your predictive modeling problem.
After reading this post, you will know:
- The difference between single-layer and multiple-layer perceptron networks.
- The value of having one or more than one hidden layer in a network.
- Five approaches for configuring the number of layers and nodes in a network.
Let’s get started.
This post is divided into four sections; they are:
- The Multilayer Perceptron
- How to Count Layers?
- Why Have Multiple Layers?
- How Many Layers and Nodes to Use?
The Multilayer Perceptron
A node, also called a neuron or Perceptron, is a computational unit that has one or more weighted input connections, a transfer function that combines the inputs in some way, and an output connection.
Nodes are then organized into layers to comprise a network.
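As a minimal sketch of the description above, the snippet below computes the output of one node and then of a small layer of nodes. The bias term, the choice of the logistic sigmoid as the transfer function, and all of the numbers are assumptions made for illustration; they are not taken from a specific library.

```python
import math

def sigmoid(x):
    """Logistic sigmoid transfer function; squashes x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, bias, transfer=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through the transfer function."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(activation)

def layer_output(inputs, layer_weights, layer_biases, transfer=sigmoid):
    """A layer is a list of nodes that all receive the same inputs."""
    return [node_output(inputs, w, b, transfer)
            for w, b in zip(layer_weights, layer_biases)]

# Illustrative values only.
x = [0.5, -1.0]
print(node_output(x, weights=[0.0, 0.0], bias=0.0))
print(layer_output(x, [[1.0, 0.5], [-1.0, 2.0]], [0.0, 0.1]))
```

With all weights and the bias set to zero, the node simply returns the sigmoid of zero, which is a handy sanity check before plugging in real weights.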
A single-layer artificial neural network, also called a single-layer network, has a single layer of nodes, as its name suggests. Each node in the single layer connects directly to an input variable and contributes to an output variable.
Single-layer networks have just one layer of active units. Inputs connect directly to the outputs through a single layer of weights. The outputs do not interact, so a network with N outputs can be treated as N separate single-output networks.
— Page 15, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
A single-layer network can be extended to a multiple-layer network, referred to as a Multilayer Perceptron. A Multilayer Perceptron, or MLP for short, is an artificial neural network with more than a single layer.
It has an input layer that connects to the input variables, one or more hidden layers, and an output layer that produces the output variables.
The standard multilayer perceptron (MLP) is a cascade of single-layer perceptrons. There is a layer of input nodes, a layer of output nodes, and one or more intermediate layers. The interior layers are sometimes called “hidden layers” because they are not directly observable from the systems inputs and outputs.
— Page 31, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
We can summarize the types of layers in an MLP as follows:
- Input Layer: Input variables, sometimes called the visible layer.
- Hidden Layers: Layers of nodes between the input and output layers. There may be one or more of these layers.
- Output Layer: A layer of nodes that produce the output variables.
Finally, there are terms used to describe the shape and capability of a neural network; for example:
- Size: The number of nodes in the model.
- Width: The number of nodes in a specific layer.
- Depth: The number of layers in a neural network.
- Capacity: The type or structure of functions that can be learned by a network configuration. Sometimes called “representational capacity”.
- Architecture: The specific arrangement of the layers and nodes in the network.
How to Count Layers?
Traditionally, there is some disagreement about how to count the number of layers.
The disagreement centers around whether or not the input layer is counted. There is an argument to suggest it should not be counted because the inputs are not active; they are simply the input variables. We will use this convention; this is also the convention recommended in the book “Neural Smithing”.
Therefore, an MLP that has an input layer, one hidden layer, and one output layer is a 2-layer MLP.
The structure of an MLP can be summarized using a simple notation.
This convenient notation summarizes both the number of layers and the number of nodes in each layer. The number of nodes in each layer is specified as an integer, in order from the input layer to the output layer, with the size of each layer separated by a forward-slash character (“/”).
For example, a network with two variables in the input layer, one hidden layer with eight nodes, and an output layer with one node would be described using the notation: 2/8/1.
I recommend using this notation when describing the layers and their size for a Multilayer Perceptron neural network.
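To make the notation concrete, here is a small sketch that parses a string such as 2/8/1 and reports the layer count using the convention above (the input layer is not counted). The function names are hypothetical, and the weight and bias counts assume fully connected layers with one bias per non-input node, which is my assumption rather than part of the notation.

```python
def parse_mlp_notation(notation):
    """Split a string such as '2/8/1' into a list of layer sizes."""
    sizes = [int(part) for part in notation.split("/")]
    if len(sizes) < 2 or any(size < 1 for size in sizes):
        raise ValueError("expected at least an input and an output size, all positive")
    return sizes

def describe_mlp(notation):
    """Summarize an MLP described in input/hidden/.../output notation."""
    sizes = parse_mlp_notation(notation)
    return {
        "inputs": sizes[0],
        "hidden_layers": sizes[1:-1],
        "outputs": sizes[-1],
        # The input layer is not counted, so a 2/8/1 network has depth 2.
        "depth": len(sizes) - 1,
        # Assumes every node connects to every node in the next layer.
        "weights": sum(a * b for a, b in zip(sizes[:-1], sizes[1:])),
        # Assumes one bias per hidden and output node.
        "biases": sum(sizes[1:]),
    }

print(describe_mlp("2/8/1"))
```

For 2/8/1 this gives a depth of 2, with 2*8 + 8*1 = 24 weights and 8 + 1 = 9 biases under the stated assumptions.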
Why Have Multiple Layers?
Before we look at how many layers to specify, it is important to think about why we would want to have multiple layers.
A single-layer neural network can only be used to represent linearly separable functions. This means very simple problems where, say, the two classes in a classification problem can be neatly separated by a line. If your problem is relatively simple, perhaps a single-layer network would be sufficient.
Most problems that we are interested in solving are not linearly separable.
A Multilayer Perceptron can be used to represent convex regions. This means that, in effect, it can learn to draw shapes around examples in some high-dimensional space that can separate and classify them, overcoming the limitation of linear separability.
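The classic illustration of this limitation is XOR. The sketch below trains a single node with the perceptron learning rule on AND (linearly separable) and on XOR (not linearly separable), then shows a 2/2/1 network with hand-set weights that represents XOR. The step transfer function, the bias term, and the specific weights are assumptions chosen for the demonstration; this is not how you would train an MLP in practice.

```python
def step(x):
    """Threshold transfer function used by the classic perceptron."""
    return 1 if x > 0 else 0

def train_perceptron(data, epochs=100):
    """Perceptron learning rule for one node with two inputs and a bias."""
    w1, w2, b = 0, 0, 0
    for _ in range(epochs):
        for (x1, x2), target in data:
            error = target - step(w1 * x1 + w2 * x2 + b)
            w1 += error * x1
            w2 += error * x2
            b += error
    return lambda x: step(w1 * x[0] + w2 * x[1] + b)

def two_layer_xor(x):
    """Hand-set 2/2/1 network: hidden nodes compute OR and AND."""
    h_or = step(x[0] + x[1] - 0.5)
    h_and = step(x[0] + x[1] - 1.5)
    # Output fires when OR is on and AND is off, which is XOR.
    return step(h_or - h_and - 0.5)

def accuracy(predict, data):
    """Fraction of examples that are classified correctly."""
    return sum(predict(x) == t for x, t in data) / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("single node on AND:", accuracy(train_perceptron(AND), AND))
print("single node on XOR:", accuracy(train_perceptron(XOR), XOR))
print("2/2/1 network on XOR:", accuracy(two_layer_xor, XOR))
```

No amount of extra training lets the single node classify all four XOR examples, because no single line separates the two classes; adding one hidden layer removes the obstacle.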
In fact, there is a theoretical finding by Lippmann in the 1987 paper “An introduction to computing with neural nets” that shows that an MLP with two hidden layers is sufficient for creating classification regions of any desired shape. This is instructive, although it should be noted that no indication of how many nodes to use in each layer or how to learn the weights is given.
A further theoretical finding and proof has shown that MLPs are universal approximators: with one hidden layer, an MLP can approximate any function that we require.
Specifically, the universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.
— Page 198, Deep Learning, 2016.
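One common way to write the one-hidden-layer case is sketched below. It is stated for a continuous target function f on a compact set K, which is a narrower setting than the Borel measurable statement quoted above. The symbols N (number of hidden nodes), v_i, w_i, b_i (weights and biases), σ (the squashing function), and ε (the error tolerance) are introduced here for illustration.

```latex
% For every \varepsilon > 0 there exist N and parameters v_i, w_i, b_i such that
F(x) = \sum_{i=1}^{N} v_i \, \sigma\!\left( w_i^{\top} x + b_i \right),
\qquad
\sup_{x \in K} \left| F(x) - f(x) \right| < \varepsilon
```

Note that the statement only says suitable values exist; it says nothing about how large N must be or how to find the weights.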
This is an often-cited theoretical finding and there is a ton of literature on it. In practice, we again have no idea how many nodes to use in the single hidden layer for a given problem nor how to learn or set their weights effectively. Further, many counterexamples have been presented of functions that cannot directly be learned via a single one-hidden-layer MLP or require an infinite number of nodes.
Even for those functions that can be learned via a sufficiently large one-hidden-layer MLP, it can be more efficient to learn them with two (or more) hidden layers.
Since a single sufficiently large hidden layer is adequate for approximation of most functions, why would anyone ever use more? One reason hangs on the words “sufficiently large”. Although a single hidden layer is optimal for some functions, there are others for which a single-hidden-layer-solution is very inefficient compared to solutions with more layers.
— Page 38, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
How Many Layers and Nodes to Use?
With the preamble of MLPs out of the way, let’s get down to your real question.
How many layers should you use in your Multilayer Perceptron and how many nodes per layer?
In this section, we will enumerate five approaches to solving this problem.
1) Experimentation
In general, when I’m asked how many layers and nodes to use for an MLP, I often reply:
I don’t know. Use systematic experimentation to discover what works best for your specific dataset.
I still stand by this answer.
In general, you cannot analytically calculate the number of layers or the number of nodes to use per layer in an artificial neural network to address a specific real-world predictive modeling problem.
The number of layers and the number of nodes in each layer are model hyperparameters that you must specify.
You are likely to be the first person to attempt to address your specific problem with a neural network. No one has solved it before you. Therefore, no one can tell you the answer of how to configure the network.
You must discover the answer using a robust test harness and controlled experiments.
Regardless of the heuristics you might encounter, all answers will come back to the need for careful experimentation to see what works best for your specific dataset.
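As a rough sketch of what such a test harness might look like, the code below scores one network configuration with repeated k-fold cross-validation and reports the mean and standard deviation of the scores. The names `k_fold_indices`, `evaluate_config`, and `fit_and_score` are hypothetical. In a real experiment, `fit_and_score` would fit an MLP with the given configuration on the training rows and return its skill on the held-out rows; here it is a placeholder so that the harness itself can be run.

```python
import random
import statistics

def k_fold_indices(n_rows, k, seed):
    """Shuffle the row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def evaluate_config(config, fit_and_score, n_rows, k=5, repeats=3):
    """Repeated k-fold evaluation; returns (mean, standard deviation) of scores."""
    scores = []
    for repeat in range(repeats):
        # A different seed per repeat gives a different split of the data.
        folds = k_fold_indices(n_rows, k, seed=repeat)
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for j, fold in enumerate(folds)
                         if j != held_out for i in fold]
            scores.append(fit_and_score(config, train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)

def placeholder_fit_and_score(config, train_idx, test_idx):
    """Stand-in only: returns the held-out fraction, not a real model score."""
    return len(test_idx) / (len(train_idx) + len(test_idx))

mean_score, std_score = evaluate_config((8,), placeholder_fit_and_score, n_rows=100)
print(mean_score, std_score)
```

Reporting the spread as well as the mean matters because neural networks are stochastic; two configurations whose mean scores differ by less than the spread are not meaningfully different.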
2) Intuition
The network can be configured via intuition.
For example, you may have an intuition that a deep network is required to address a specific predictive modeling problem.
A deep model provides a hierarchy of layers that build up increasing levels of abstraction from the space of the input variables to the output variables.
Given an understanding of the problem domain, we may believe that a deep hierarchical model is required to sufficiently solve the prediction problem. In which case, we may choose a network configuration that has many layers of depth.
Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation.
— Page 201, Deep Learning, 2016.
This intuition can come from experience with the domain, experience with modeling problems with neural networks, or some mixture of the two.
In my experience, intuitions are often invalidated via experiments.
3) Go For Depth
In their important textbook on deep learning, Goodfellow, Bengio, and Courville highlight that empirically, on problems of interest, deep neural networks appear to perform better.
Specifically, they state the choice of using deep neural networks as a statistical argument in cases where depth may be intuitively beneficial.
Empirically, greater depth does seem to result in better generalization for a wide variety of tasks. […] This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.
— Page 201, Deep Learning, 2016.
We may use this argument to suggest that using deep networks, those with many layers, may be a heuristic approach to configuring networks for challenging predictive modeling problems.
This is similar to the advice for starting with Random Forest and Stochastic Gradient Boosting on a predictive modeling problem with tabular data to quickly get an idea of an upper-bound on model skill prior to testing other methods.
4) Borrow Ideas
A simple, but perhaps time-consuming, approach is to leverage findings reported in the literature.
Find research papers that describe the use of MLPs on instances of prediction problems similar in some way to your problem. Note the configuration of the networks used in those papers and use them as a starting point for the configurations to test on your problem.
Transferability of model hyperparameters that result in skillful models from one problem to another is a challenging open problem and the reason why model hyperparameter configuration is more art than science.
Nevertheless, the network layers and number of nodes used on related problems are a good starting point for testing ideas.
5) Search
Design an automated search to test different network configurations.
You can seed the search with ideas from literature and intuition.
Some popular search strategies include:
- Random: Try random configurations of layers and nodes per layer.
- Grid: Try a systematic search across the number of layers and nodes per layer.
- Heuristic: Try a directed search across configurations such as a genetic algorithm or Bayesian optimization.
- Exhaustive: Try all combinations of layers and the number of nodes; it might be feasible for small networks and datasets.
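The grid and random strategies above can be sketched in a few lines. Each configuration is represented as a tuple of hidden layer widths, for example (8, 8) for two hidden layers of eight nodes. The function names and the `score` callable are hypothetical; in practice `score` would be the estimated skill of the configuration from your test harness.

```python
import itertools
import random

def grid_configs(layer_counts, node_counts):
    """Grid: every layer count paired with every width (same width per layer)."""
    return [tuple([nodes] * layers)
            for layers, nodes in itertools.product(layer_counts, node_counts)]

def random_configs(n_configs, max_layers, max_nodes, seed=1):
    """Random: sample a depth, then sample a width for each hidden layer."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n_configs):
        layers = rng.randint(1, max_layers)
        configs.append(tuple(rng.randint(1, max_nodes) for _ in range(layers)))
    return configs

def best_config(configs, score):
    """Return the configuration with the highest score."""
    return max(configs, key=score)

print(grid_configs([1, 2], [4, 8]))
print(random_configs(5, max_layers=3, max_nodes=16))
```

The grid here keeps the same width in every hidden layer to bound the search space; allowing a different width per layer makes the grid grow very quickly, which is one reason random search is often a practical alternative.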
This can be challenging with large models, large datasets and combinations of the two. Some ideas to reduce or manage the computational burden include:
- Fit models on a smaller subset of the training dataset to speed up the search.
- Aggressively bound the size of the search space.
- Parallelize the search across multiple server instances (e.g. use Amazon EC2 service).
I recommend being systematic if time and resources permit.
I have seen countless heuristics for estimating the number of layers and either the total number of neurons or the number of neurons per layer.
I do not want to enumerate them; I’m skeptical that they add practical value beyond the special cases on which they are demonstrated.
If this area is interesting to you, perhaps start with “Section 4.4 Capacity versus Size” in the book “Neural Smithing”. It summarizes a ton of findings in this area. The book is dated from 1999, so there are another nearly 20 years of ideas to wade through in this area if you’re up for it.
Also, see some of the discussions linked in the Further Reading section (below).
Did I miss your favorite method for configuring a neural network? Or do you know a good reference on the topic?
Let me know in the comments below.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Deep Learning, 2016.
- Artificial neural network on Wikipedia
- Universal approximation theorem on Wikipedia
- How many hidden layers should I use?, comp.ai.neural-nets FAQ
- How to choose the number of hidden layers and nodes in a feedforward neural network?
- Number of nodes in hidden layers of neural network
- multi-layer perceptron (MLP) architecture: criteria for choosing number of hidden layers and size of the hidden layer?
- In deep learning, how do I select the optimal number of layers and neurons?
Summary
In this post, you discovered the role of layers and nodes and how to configure a multilayer perceptron neural network.
Specifically, you learned:
- The difference between single-layer and multiple-layer perceptron networks.
- The value of having one or more than one hidden layer in a network.
- Five approaches for configuring the number of layers and nodes in a network.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Thanks for the blog post.
There is indeed a large amount of recent research aimed at answering this question automatically, dubbed (neural) architecture search. Here is a list of papers which I maintain:
Hello Mr. Lindauer,
In regards to the concept of “counting the input layer or not”, how about this as a thought:
I would have thought the process of “application initialization” and “connecting data sources” would be considered the “first process”, so why wouldn’t the initialization of the model be considered?
Is it because the input layer isn’t active consistently? Or the input layer
One of the things I’ve seen a lot of in researching neural networks is that they always describe the first layer as the “input layer”, which sort of makes MLP2 a little misleading?
Trying to figure out the pros and cons!
Not looking to take you back through the argument process again! Sorry
Great! Thanks for the blog post 🙂 There is also an interesting post here which tries to address the same question.
Thanks for sharing.
This is a great blog and nice information
Hi, Very nice summary. Thank you very much! I’m a deep learning researcher working in an inter-disciplinary team in Univ Edi. May I ask about the template you used to create this site? It looks quite professional and great!
Thanks, you can learn more about the software I use for the site here:
Thank you for such a great post, but I have a question;
Let’s say our image size is 64*64*3 so what would be the number of nodes in our input layer?
I would recommend using a CNN and perhaps try 32 filters?
Sorry for ambiguity in my question….Suppose I’m using a CNN and I have a picture of size 64*64*3 and my question is what would be the number of nodes in my input layer?
It would be input_shape=(64,64,3) if using channels last format.
Will the number of neurons in the hidden layer mentioned here work for RNN/LSTM as well?
Perhaps test and compare results with different configurations?
There is practically no way to know ahead of time how many layers or nodes you will need for a certain neural network learning task. I have a solution: a neural network, called ALNfitDeep, which can automatically *grow* during training to fit the problem. Software to do this is at https://github.com/Bill-Armstrong/ . There is a new executable release available, which you can get by clicking where it says “4 releases” near the top of the main page. You can forget the source code for now. Use the Help button. Since the release is new, I would appreciate any feedback on problems you encounter. From my point of view, neural nets which can’t learn automatically based on the problem are a total waste of time. Also having to use a lot of valuable data for validation is a waste. My nets measure the noise variance and then train on all of the data not used in testing. My nets can grow to tens of thousands of nodes, yet the execution of the learned function remains very fast (because very little has to actually be computed for a given input). The secret is that all computation of linear functions is in the first layer. Instead of a one-input squashing function, there are two-input non-linearities: max, and min. People have to have the courage to try it. I will help.
Thanks for the note.
I have played with “growing” and “pruning” nets since the late 1990s, I remain skeptical.
A sensitivity analysis of model capacity vs skill is reliable and repeatable for me.
Totally agree with the statement “waste of time”! Creating NNs is no big deal, but creating an actually useful one is almost “mission impossible”. You have to have a big “horde” of employees doing one task only: finding the right shape of NN by testing it against the dataset manually. That is how underdeveloped those software solutions are.
I have 2 questions.
1. What if I want to predict financial time series (e.g. Forex, stock price) and I have decided to use an MLP with 2 hidden layers. I have also decided to use 4 neurons for each of the hidden layers. How exactly do I split up my data set including the input? I’m assuming that the 4 neurons will be Open, Close, High and Low values.
Will my input values be the majority of my dataset in total? And then for the first hidden layer, a subset of the dataset, so some of the High values for one of the neurons, some of the Low values for the second neuron and so on, and then the same in the other hidden layer?
2. Do you have a Python script example for iterating through different example layers?
Thank you very much!
If you have 4 classes as output, then the data must be prepared to match this expectation with a one hot encoding:
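As a minimal sketch of a one hot encoding in plain Python, each class label is mapped to a binary vector with a single 1, so four classes give vectors of length four (one per output node). The function name and the labels "a" to "d" are made up for illustration; libraries such as scikit-learn and Keras provide their own utilities for this.

```python
def one_hot_encode(labels):
    """Map each class label to a binary vector containing a single 1."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    encoded = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        encoded.append(row)
    return classes, encoded

# Hypothetical labels for a 4-class problem.
classes, encoded = one_hot_encode(["a", "c", "b", "d", "a"])
print(classes)
for row in encoded:
    print(row)
```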
Data would not be split based on the number of nodes, I’m not sure I follow, sorry.
Here’s an example of a model for multi-class classification:
Hi, may I say that the number of layers and nodes is related to the number of training inputs?
If there are more inputs, are there more nodes?
Not really. They are unrelated.
Thanks again for your wonderful tutorials.
Rephrasing the question by “James”: Does having more training samples require an increase in the number of neurons?
I have designed a neural network model based on the “input shapes” which is also advised by many empirical rules of thumb. My question is, if I have datasets of 100 or 100k samples (each is representative enough), shall I leave the model shapes fixed? or grow it (perhaps linearly) with the increase of the samples?
Because I’ve noticed some complexities arise in the dataset as it grows.
It may or it may not. We cannot know for sure for a given dataset and model combination.
There is an interrelation between the number of layers (nodes per layer). Please have a look into our paper https://arxiv.org/abs/1902.02771. Thank you.
Thanks for sharing.
Thank you, Jason, for the post.
Is there a rule of thumb for the number of units when you want to increase the number of hidden layers? Let’s say, for example, that your model has decent performance with 1 hidden layer and 30 units; would choosing 2 hidden layers mean you would decrease the number of units for each of these layers, or could you even increase it?
Not really, sorry.
Test and use a robust test harness so that the results are reliable.
Let’s say we want to differentiate between clear and blurred images. Can a CNN train a model to do that, and how do you go about it?
Yes, perhaps a classification problem with a binary prediction (blur vs no-blur).
MLP? Or MLFFN? Is there a way to use the simpler perceptron update algorithm, without using derivatives or backprop and without separating the layers with a non-linear activation, without having the whole thing collapse into the equivalent of a single linear layer, as Minsky pointed out way back?
There may be, I don’t have material on it, sorry.
We moved away from the simple Perceptron because backprop on an MLP works really well in general.
Hi Jason. I have a network with 4 layers (4 hidden layers), each of which has 32 nodes. I use the Adam optimizer and leaky ReLU. I want to know what the name of my network is. Is it a simple MLP? Can I call it a deep network?
It is an MLP, a deep MLP if you like.
Hello to you, Jason,
I have a forecasting project using machine learning to predict agricultural crops.
I need to build an algorithm to predict agricultural crops based on field size, local climate, season and soil chemical components (such as mineral salts, phosphorus ions, potassium and nitrates, moisture and gases in the air) at the input. And the output will be a list of optimized crops generated. Which method will I use? Any link to guide me will be useful to me.
That sounds like a fun project!
I recommend following this process as a first step:
Thanks so much!
I will give you feedback.
I will make this prototype to present it in Panama City in a competition. I have been selected as a finalist.
I wouldn’t want us to see this project as a fun project! If you can redirect it better to make it look interesting, it would be great!
“A single-layer neural network can only be used to represent linearly separable functions.” I think this statement is wrong. I understand that “A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly.”, do you mean single-layer with only one single neuron ?
No, a network with a single layer of nodes.
Oh, yes, I checked it and everything is correct. I should add “A feedforward network with a single **hidden** layer”, which would be the universal approximation theorem. But a single-layer neural network has no hidden layers at all, so it can’t do anything more than linear separation; the simplest example would be that it can’t compute XOR.
Hi Jason, I am working on neural machine translation, and one of my examiners asked me how many input layers, hidden layers and output layers are in my experiment. I had nothing to answer. There are 7050 parallel sentences used; the longest sentence length is 25 for the input and 20 for the output.
I used 100 dimensions. Would you help me? The vocabulary (unique words in the input) is 12700.
Perhaps test different configurations and discover what works best for your model and dataset?
The reference to Deep Learning and the universal approximation theorem is incorrect–while the above reference states “p.198”, it’s actually on p.192 of the 2016 edition.
Thanks for the info regarding hidden layer structure selection.
I wanted to ask if you were familiar with using metaheuristics to train a network and whether different training strategies need differing model structures. For example, if you were using the Iris data set with 5 hidden neurons (one layer) when training with backpropagation, do you think it would be appropriate to use the same number of hidden layers and neurons if you were to train using PSO or SA?
In other words, does the training technique influence the number of hidden neurons or layers?
Backprop remains the most efficient training algorithm, regardless of choice of architecture.
Metaheuristics could be useful in finding the architecture to train though. I have seen many automl and NAS (network architecture search) algorithms that use an evolutionary algorithm at their core.
It’s funny because there’s a lot of resources on using metaheuristics to find optimal network hyperparameters or hidden structure, not a whole lot on training. Part of my current interest is in that area and I can see why BP is generally preferred for training. I’ve implemented a GA trained NN in lieu of BP and while it seems to converge nicely, it sure is slow (I’m talking 50x slower for equivalent networks). I’ve still yet to implement a PSO-NN but it’s still interesting to think about. A bioinspired network trained by a bioinspired metaheuristic has a nice ring to it.
Here’s my question: if we have a summation function that takes the sum of the weighted inputs and forwards it to the activation function, how do we count the layers? For example, take a single-layer perceptron with 2 inputs and 2 weights, where the question specifically mentions that we have a summation function and an activation function. Do we count both summation and activation as 1 layer or 2 layers?
Typically you count hidden layers only.
1. Construct a dataset having 4 inputs against two input variables. You also have to assume a target output against each input.
2. Construct a topology for a neural network having at least 5 neurons (the number of hidden layers and number of neurons in each layer will be of your own choice).
3. Assume initial weights of your own choice and run a complete iteration (for all four inputs).
I have to submit this assignment. Please, can anyone help?
Perhaps start here:
Perhaps contact your teacher directly, after all, you have already paid for their help.
What is the minimum number of layers in deep learning?
The minimum number of layers would be 0 hidden layers, e.g. connect inputs/visible layer directly to the output layer.
Hello. Thanks for your good tutorial. I’m working on breast cancer detection using deep learning. I’m a beginner but have studied a lot of different articles, and I can’t improve my CNN performance. What should I do?
You can discover 100s of tutorials on how to improve neural nets on this blog, perhaps start here:
Should hidden layers have same number of neurons? If yes, why?
A link towards resources is OK! Thanks.
No, you can have any number of nodes in each layer.
See the “Further Reading” section for resources.
Hi Jason, the content is great. I have a question: I want to use a 2-output regression model with an input size of 5. How many hidden layers and nodes do I need to use?
There is no standard way to configure the model, use some trial and error and discover what works best for your dataset.
Yes the training loss was 2.99 and Validation loss =4.7 it was not decreasing further. I have used 2 hidden layers 4 neurons each (1st hidden layer = Relu, 2nd hidden layer=Exponential). 4 input nodes each are normalised to (0,1) and 2 output nodes. Any suggestions and modifications of network so that both the losses can come below 1.5 or so. Thanks in advance
Yes, the suggestions here will get you started:
Two simpleminded questions:
Is updating of neuron weights done locally by impulses that propagate backwards from outcome success, or by a separate process running alongside the neural net?
Is the flow of neuron output signals between layers necessarily one-way? If not, what can we say about the desirability and configuration of such feedback loop connections?
Yes, from output layer based on error, then on back through the net to input layer.
Flow is forward for inference back for error correction/weight updates.
Other net types can have loops and/or internal and things get harder for training, e.g. RNNs like LSTMs using back prop through time:
Hi Dr. Jason, I’m working with MLP and LSTM deep learning algorithms. To tune the best structure for these algorithms, I started by tuning the number of hidden neurons in each hidden layer. I selected three hidden layers to start with, then kept the number of neurons that worked best for my goal (high specificity), then tuned the number of hidden layers from 3 to 8, kept the number of hidden layers that worked best for my goal, and continued with the other hyperparameters.
Is this way of choosing the number of hidden neurons and then the number of hidden layers correct?
and do you have papers that support this flow of choosing this way?
Ideally we would optimize all aspects of the model at once, but it is very computationally expensive.
Instead, in practice we often have to optimize one thing at a time.
Hi, I have a question: how many nodes can the output layer have? Is just 1 node necessary, or can I have more?
If you are predicting one value, then it must have one node. Predicting multiple values, then multiple nodes.
I really appreciate the information, it was very clear and understandable. I would like to cite your work in one of my projects. How should I do that?
Thanks, see this:
Thank you very much for your work!! In your book, we have many examples where we have # of neurons in the first hidden layer equal to # of neurons in the input layer, which is the number of input features. Is this a common practice? Is there any good reason for that?
Hi Evan…This is not required. I would investigate optimization techniques to select the number of neurons in each layer:
Can anybody guide me to a tutorial related to multi-variant optimization in deep learning with MATLAB?
Hi AJ…While we focus on Python in our material, you may find the following of interest: