# Ensemble Learning Methods for Deep Learning Neural Networks

Last Updated on August 6, 2019

#### How to Improve Performance By Combining Predictions From Multiple Models.

Deep learning neural networks are nonlinear methods.

They offer increased flexibility and can scale in proportion to the amount of training data available. A downside of this flexibility is that they learn via a stochastic training algorithm which means that they are sensitive to the specifics of the training data and may find a different set of weights each time they are trained, which in turn produce different predictions.

Generally, this is referred to as neural networks having a high variance and it can be frustrating when trying to develop a final model to use for making predictions.

A successful approach to reducing the variance of neural network models is to train multiple models instead of a single model and to combine the predictions from these models. This is called ensemble learning and not only reduces the variance of predictions but also can result in predictions that are better than any single model.

In this post, you will discover methods for deep learning neural networks to reduce variance and improve prediction performance.

After reading this post, you will know:

• Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
• Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
• Techniques for ensemble learning can be grouped by the element that is varied, such as training data, the model, and how predictions are combined.

Kick-start your project with my new book Better Deep Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Ensemble Methods to Reduce Variance and Improve Performance of Deep Learning Neural Networks
Photo by University of San Francisco’s Performing Arts, some rights reserved.

## Overview

This tutorial is divided into four parts; they are:

1. High Variance of Neural Network Models
2. Reduce Variance Using an Ensemble of Models
3. How to Ensemble Neural Network Models
4. Summary of Ensemble Techniques

## High Variance of Neural Network Models

Training deep neural networks can be very computationally expensive.

Very deep networks trained on millions of examples may take days, weeks, and sometimes months to train.

Google’s baseline model […] was a deep convolutional neural network […] that had been trained for about six months using asynchronous stochastic gradient descent on a large number of cores.

After the investment of so much time and resources, there is no guarantee that the final model will have low generalization error, performing well on examples not seen during training.

… train many different candidate networks and then to select the best, […] and to discard the rest. There are two disadvantages with such an approach. First, all of the effort involved in training the remaining networks is wasted. Second, […] the network which had best performance on the validation set might not be the one with the best performance on new test data.

— Pages 364-365, Neural Networks for Pattern Recognition, 1995.

Neural network models are a nonlinear method. This means that they can learn complex nonlinear relationships in the data. A downside of this flexibility is that they are sensitive to initial conditions, both in terms of the initial random weights and in terms of the statistical noise in the training dataset.

This stochastic nature of the learning algorithm means that each time a neural network model is trained, it may learn a slightly (or dramatically) different version of the mapping function from inputs to outputs, that in turn will have different performance on the training and holdout datasets.

As such, we can think of a neural network as a method that has a low bias and high variance. Even when trained on large datasets to satisfy the high variance, having any variance in a final model that is intended to be used to make predictions can be frustrating.

### Want Better Results with Deep Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Reduce Variance Using an Ensemble of Models

A solution to the high variance of neural networks is to train multiple models and combine their predictions.

The idea is to combine the predictions from multiple good but different models.

A good model has skill, meaning that its predictions are better than random chance. Importantly, the models must be good in different ways; they must make different prediction errors.

The reason that model averaging works is that different models will usually not make all the same errors on the test set.

— Page 256, Deep Learning, 2016.

Combining the predictions from multiple neural networks adds a bias that in turn counters the variance of a single trained neural network model. The results are predictions that are less sensitive to the specifics of the training data, choice of training scheme, and the serendipity of a single training run.

In addition to reducing the variance in the prediction, the ensemble can also result in better predictions than any single best model.

… the performance of a committee can be better than the performance of the best single network used in isolation.

— Page 365, Neural Networks for Pattern Recognition, 1995.

This approach belongs to a general class of methods called “ensemble learning” that describes methods that attempt to make the best use of the predictions from multiple models prepared for the same problem.

Generally, ensemble learning involves training more than one network on the same dataset, then using each of the trained models to make a prediction before combining the predictions in some way to make a final outcome or prediction.

In fact, ensembling of models is a standard approach in applied machine learning to ensure that the most stable and best possible prediction is made.

For example, Alex Krizhevsky, et al. in their famous 2012 paper titled “Imagenet classification with deep convolutional neural networks” that introduced very deep convolutional neural networks for photo classification (i.e. AlexNet) used model averaging across multiple well-performing CNN models to achieve state-of-the-art results at the time. Performance of one model was compared to ensemble predictions averaged over two, five, and seven different models.

Averaging the predictions of five similar CNNs gives an error rate of 16.4%. […] Averaging the predictions of two CNNs that were pre-trained […] with the aforementioned five CNNs gives an error rate of 15.3%.

Ensembling is also the approach used by winners in machine learning competitions.

Another powerful technique for obtaining the best possible results on a task is model ensembling. […] If you look at machine-learning competitions, in particular on Kaggle, you’ll see that the winners use very large ensembles of models that inevitably beat any single model, no matter how good.

— Page 264, Deep Learning With Python, 2017.

## How to Ensemble Neural Network Models

Perhaps the oldest and still most commonly used ensembling approach for neural networks is called a “committee of networks.”

A collection of networks with the same configuration and different initial random weights is trained on the same dataset. Each model is then used to make a prediction and the actual prediction is calculated as the average of the predictions.

The number of models in the ensemble is often kept small both because of the computational expense in training models and because of the diminishing returns in performance from adding more ensemble members. Ensembles may be as small as three, five, or 10 trained models.

The field of ensemble learning is well studied and there are many variations on this simple theme.

It can be helpful to think of varying each of the three major elements of the ensemble method; for example:

• Training Data: Vary the choice of data used to train each model in the ensemble.
• Ensemble Models: Vary the choice of the models used in the ensemble.
• Combinations: Vary the choice of the way that outcomes from ensemble members are combined.

Let’s take a closer look at each element in turn.

### Varying Training Data

The data used to train each member of the ensemble can be varied.

The simplest approach would be to use k-fold cross-validation to estimate the generalization error of the chosen model configuration. In this procedure, k different models are trained on k different subsets of the training data. These k models can then be saved and used as members of an ensemble.

Another popular approach involves resampling the training dataset with replacement, then training a network using the resampled dataset. The resampling procedure means that the composition of each training dataset is different with the possibility of duplicated examples allowing the model trained on the dataset to have a slightly different expectation of the density of the samples, and in turn different generalization error.

This approach is called bootstrap aggregation, or bagging for short, and was designed for use with unpruned decision trees that have high variance and low bias. Typically a large number of decision trees are used, such as hundreds or thousands, given that they are fast to prepare.

… a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. […] Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set.

— Pages 216-317, An Introduction to Statistical Learning with Applications in R, 2013.

An equivalent approach might be to use a smaller subset of the training dataset without regularization to allow faster training and some overfitting.

The desire for slightly under-optimized models applies to the selection of ensemble members more generally.

… the members of the committee should not individually be chosen to have optimal trade-off between bias and variance, but should have relatively smaller bias, since the extra variance can be removed by averaging.

— Page 366, Neural Networks for Pattern Recognition, 1995.

Other approaches may involve selecting a random subspace of the input space to allocate to each model, such as a subset of the hyper-volume in the input space or a subset of input features.

#### Ensemble Tutorials

For examples of deep learning ensembles that vary training data see:

### Varying Models

Training the same under-constrained model on the same data with different initial conditions will result in different models given the difficulty of the problem, and the stochastic nature of the learning algorithm.

This is because the optimization problem that the network is trying to solve is so challenging that there are many “good” and “different” solutions to map inputs to outputs.

Most neural network algorithms achieve sub-optimal performance specifically due to the existence of an overwhelming number of sub-optimal local minima. If we take a set of neural networks which have converged to local minima and apply averaging we can construct an improved estimate. One way to understand this fact is to consider that, in general, networks which have fallen into different local minima will perform poorly in different regions of feature space and thus their error terms will not be strongly correlated.

This may result in a reduced variance, but may not dramatically improve generalization error. The errors made by the models may still be too highly correlated because the models all have learned similar mapping functions.

An alternative approach might be to vary the configuration of each ensemble model, such as using networks with different capacity (e.g. number of layers or nodes) or models trained under different conditions (e.g. learning rate or regularization).

The result may be an ensemble of models that have learned a more heterogeneous collection of mapping functions and in turn have a lower correlation in their predictions and prediction errors.

Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.

— Pages 257-258, Deep Learning, 2016.

Such an ensemble of differently configured models can be achieved through the normal process of developing the network and tuning its hyperparameters. Each model could be saved during this process and a subset of better models chosen to comprise the ensemble.

Slightly inferiorly trained networks are a free by-product of most tuning algorithms; it is desirable to use such extra copies even when their performance is significantly worse than the best performance found. Better performance yet can be achieved through careful planning for an ensemble classification by using the best available parameters and training different copies on different subsets of the available database.

Neural Network Ensembles, 1990.

In cases where a single model may take weeks or months to train, another alternative may be to periodically save the best model during the training process, called snapshot or checkpoint models, then select ensemble members among the saved models. This provides the benefits of having multiple models trained on the same data, although collected during a single training run.

Snapshot Ensembling produces an ensemble of accurate and diverse models from a single training process. At the heart of Snapshot Ensembling is an optimization process which visits several local minima before converging to a final solution. We take model snapshots at these various minima, and average their predictions at test time.

A variation on the Snapshot ensemble is to save models from a range of epochs, perhaps identified by reviewing learning curves of model performance on the train and validation datasets during training. Ensembles from such contiguous sequences of models are referred to as horizontal ensembles.

First, networks trained for a relatively stable range of epoch are selected. The predictions of the probability of each label are produced by standard classifiers [over] the selected epoch[s], and then averaged.

A further enhancement of the snapshot ensemble is to systematically vary the optimization procedure during training to force different solutions (i.e. sets of weights), the best of which can be saved to checkpoints. This might involve injecting an oscillating amount of noise over training epochs or oscillating the learning rate during training epochs. A variation of this approach called Stochastic Gradient Descent with Warm Restarts (SGDR) demonstrated faster learning and state-of-the-art results for standard photo classification tasks.

Our SGDR simulates warm restarts by scheduling the learning rate to achieve competitive results […] roughly two to four times faster. We also achieved new state-of-the-art results with SGDR, mainly by using even wider [models] and ensembles of snapshots from SGDR’s trajectory.

A benefit of very deep neural networks is that the intermediate hidden layers provide a learned representation of the low-resolution input data. The hidden layers can output their internal representations directly, and the output from one or more hidden layers from one very deep network can be used as input to a new classification model. This is perhaps most effective when the deep model is trained using an autoencoder model. This type of ensemble is referred to as a vertical ensemble.

This method ensembles a series of classifiers whose inputs are the representation of intermediate layers. A lower error rate is expected because these features seem diverse.

#### Ensemble Tutorials

For examples of deep learning ensembles that vary models see:

### Varying Combinations

The simplest way to combine the predictions is to calculate the average of the predictions from the ensemble members.

This can be improved slightly by weighting the predictions from each model, where the weights are optimized using a hold-out validation dataset. This provides a weighted average ensemble that is sometimes called model blending.

… we might expect that some members of the committee will typically make better predictions than other members. We would therefore expect to be able to reduce the error still further if we give greater weight to some committee members than to others. Thus, we consider a generalized committee prediction given by a weighted combination of the predictions of the members …

— Page 367, Neural Networks for Pattern Recognition, 1995.

One further step in complexity involves using a new model to learn how to best combine the predictions from each ensemble member.

The model could be a simple linear model (e.g. much like the weighted average), but could be a sophisticated nonlinear method that also considers the specific input sample in addition to the predictions provided by each member. This general approach of learning a new model is called model stacking, or stacked generalization.

Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. […] When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question.

Stacked generalization, 1992.

There are more sophisticated methods for stacking models, such as boosting where ensemble members are added one at a time in order to correct the mistakes of prior models. The added complexity means this approach is less often used with large neural network models.

Another combination that is a little bit different is to combine the weights of multiple neural networks with the same structure. The weights of multiple networks can be averaged, to hopefully result in a new single model that has better overall performance than any original model. This approach is called model weight averaging.

… suggests it is promising to average these points in weight space, and use a network with these averaged weights, instead of forming an ensemble by averaging the outputs of networks in model space

#### Ensemble Tutorials

For examples of deep learning ensembles that vary combinations see:

## Summary of Ensemble Techniques

In summary, we can list some of the more common and interesting ensemble methods for neural networks organized by each element of the method that can be varied, as follows:

There is no single best ensemble method; perhaps experiment with a few approaches or let the constraints of your project guide you.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

## Summary

In this post, you discovered ensemble methods for deep learning neural networks to reduce variance and improve prediction performance.

Specifically, you learned:

• Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
• Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
• Techniques for ensemble learning can be grouped by the element that is varied, such as training data, the model, and how predictions are combined.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

## Develop Better Deep Learning Models Today!

#### Train Faster, Reduce Overftting, and Ensembles

...with just a few lines of python code

Discover how in my new Ebook:
Better Deep Learning

It provides self-study tutorials on topics like:
weight decay, batch normalization, dropout, model stacking and much more...

#### Bring better deep learning to your projects!

Skip the Academics. Just Results.

### 45 Responses to Ensemble Learning Methods for Deep Learning Neural Networks

1. Charles Pierre December 19, 2018 at 1:19 pm #

I truly enjoy your books. I am thinking about purchasing this one.

2. Wonbin February 8, 2019 at 12:24 am #

Thank you so much for this amazing post! I’m wishing one of ensembles for deep learning could help the project I’m working on.

3. Tarig Ballal March 22, 2019 at 10:04 am #

Thank you Jason for the great article, very insightful and very well-presented. I like your style, especially the quotes.

• Jason Brownlee March 22, 2019 at 2:31 pm #

Thanks, I’m grateful for your feedback and I’m glad it helped!

4. July April 14, 2019 at 4:27 pm #

Great summing up of ensemble neural network. Thanks for your work.

5. Nitin Pasumarthy April 25, 2019 at 12:57 pm #

Nice post Jason!
Dropouts can also thought of varying models ensemble technique, right? Also could you compare each of these techniques from your experience, where they worked well and where they did not?

6. Jess November 21, 2019 at 10:22 am #

great article. Will you call an LSTM (long short term memory) with more than one layer an ensemble approach?

• Jason Brownlee November 21, 2019 at 1:28 pm #

No. It is one model with many layers.

• Tommy Decito May 28, 2020 at 1:04 am #

Why would you say so ? Isn’t it a single layer with retained weights and output at different timesteps.

• Jason Brownlee May 28, 2020 at 6:18 am #

It is a single model.

Ensemble implies multiple models.

7. inamullah February 5, 2020 at 7:55 pm #

Nice post Jason sir.how can we create parameters diversity in CNN ensemble model?

• Jason Brownlee February 6, 2020 at 8:22 am #

One approach is to fit the models on different subsets of the training data, so-called bagging or random subspaces.

• inamullah February 19, 2020 at 4:28 am #

can we apply the different technique (weight initialization) approach for parameters diversity to create different models.?

8. Ariel_Arias March 17, 2020 at 6:31 am #

Thanks for your post. It’s very informative

9. Saif Ul Islam April 5, 2020 at 4:37 pm #

This article is really useful and insightful!

I don’t really understand what the first picture has to do with anything at all, though. 😛

10. Kelvin June 3, 2020 at 7:52 pm #

Hi, I read your post and I want to be cleared on some few things. Now I wrote 4 networks with same configuration with different initial weights separately, how do I do the averaging of the outputs of the predictions?
An example will be extremely appreciated. Thanks

• Jason Brownlee June 4, 2020 at 6:17 am #

You can use a simple average of their predictions, shown here:
https://machinelearningmastery.com/model-averaging-ensemble-for-deep-learning-neural-networks/

• Kelvin June 4, 2020 at 6:13 pm #

Yeah I have seen the tutorial, but in my case I have to use 4 different weight initializers for each model created(number of model is 4).

This is what i did:

I just created 4 different models with the same configuration, but with different weight initializers. And for the prediction, I added the 4 models and divided by 4.

Is this right?

11. Yogeeshwari June 14, 2020 at 8:57 pm #

Sir, can models differing in dropout rates can be ensembled?

12. Yogeeshwari June 15, 2020 at 2:22 pm #

Thank you!

13. Kelvin June 15, 2020 at 3:17 pm #

Thanks for this article.

However, I have a question. When doing a regression problem, and I employed ensemble learning is it okay for the variance error to be smaller than the mean error?

14. M July 1, 2020 at 4:30 am #

How much longer will it take to train an ensemble network as compared to a single CNN

• Jason Brownlee July 1, 2020 at 5:56 am #

If you fit 5 versions of your final model, then 5 times as long.

15. catherin July 22, 2020 at 9:31 am #

Really useful! Thank you for this post!

16. Brad October 3, 2020 at 2:00 am #

Hi Jason, I am wondering if you have ever tried ensembling the results from multiple ensembles and if it would boost performance. I am specifically referring to building multiple stacking models from the same member predictions (say 10 models from the 10-fold CV example, trained on the initial data) and then ensembling the predictions of those stacked models together with an additional stacking model. Do you know if nesting these techniques together would lead to increased performance? Is this something that is generally not done?

• Jason Brownlee October 3, 2020 at 6:10 am #

Yes. It typically offers a minor lift or no lift. Really depends on the dataset and models.

Try it and see on your data and models. Experiments are cheap and easy.

17. Arjun Roy January 25, 2021 at 1:16 pm #

Your articles are always nice and informative. A true inspiration…

18. Israa December 25, 2021 at 3:23 am #

Thank you so much Jason this is just amazingly explained!!

I would really appreciate getting some help from you, I have two separate neural models that were trained on different datasets and for different tasks, is it possible to use ensembling in this case? if not is there anyway I can train the second model on the first one’s dataset without making drastic changes to this model?
thank you in advance!

19. Manash Sarma March 13, 2022 at 12:28 pm #

Jason, is it possible to ensemble deep learning model with non deep learning model based on like SVM, RF etc. ? If there is any example people have done it ?