How to Develop a Weighted Average Ensemble With Python

Last Updated on May 8, 2021

Weighted average ensembles assume that some models in the ensemble have more skill than others and give them a larger contribution when making predictions.

The weighted average or weighted sum ensemble is an extension of the voting ensemble, which assumes all models are equally skillful and make the same proportional contribution to the predictions made by the ensemble.

Each model is assigned a fixed weight that is multiplied by the prediction made by the model and used in the sum or average prediction calculation. The challenge of this type of ensemble is how to calculate, assign, or search for model weights that result in performance better than that of any contributing model and of an ensemble that uses equal model weights.

In this tutorial, you will discover how to develop Weighted Average Ensembles for classification and regression.

After completing this tutorial, you will know:

  • Weighted Average Ensembles are an extension to voting ensembles where model votes are proportional to model performance.
  • How to develop weighted average ensembles using the voting ensemble from scikit-learn.
  • How to evaluate the Weighted Average Ensembles for classification and regression and confirm the models are skillful.

Let’s get started.

  • Updated May/2021: Fixed definition of weighted average.
Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Weighted Average Ensemble
  2. Develop a Weighted Average Ensemble
  3. Weighted Average Ensemble for Classification
  4. Weighted Average Ensemble for Regression

Weighted Average Ensemble

Weighted average or weighted sum ensemble is an ensemble machine learning approach that combines the predictions from multiple models, where the contribution of each model is weighted proportionally to its capability or skill.

The weighted average ensemble is related to the voting ensemble.

Voting ensembles are composed of multiple machine learning models where the predictions from each model are averaged directly. For regression, this involves calculating the arithmetic mean of the predictions made by ensemble members. For classification, this may involve calculating the statistical mode (most common class label) or similar voting scheme or summing the probabilities predicted for each class and selecting the class with the largest summed probability.

A limitation of the voting ensemble technique is that it assumes that all models in the ensemble are equally effective. This may not be the case as some models may be better than others, especially if different machine learning algorithms are used to train each ensemble member.

An alternative to voting is to assume that ensemble members are not all equally capable and instead some models are better than others and should be given more votes or a greater say when making a prediction. This provides the motivation for the weighted sum or weighted average ensemble method.

In regression, an average prediction is calculated using the arithmetic mean, that is, the sum of the predictions divided by the number of predictions made. For example, if an ensemble had three ensemble members, the predictions may be:

  • Model 1: 97.2
  • Model 2: 100.0
  • Model 3: 95.8

The mean prediction would be calculated as follows:

  • yhat = (97.2 + 100.0 + 95.8) / 3
  • yhat = 293 / 3
  • yhat = 97.667

A weighted average prediction involves first assigning a fixed weight coefficient to each ensemble member. This could be a floating-point value between 0 and 1, representing a percentage of the weight. It could also be an integer starting at 1, representing the number of votes to give each model.

For example, we may have the fixed weights of 0.84, 0.87, and 0.75 for the three ensemble members. These weights can be used to calculate the weighted average by multiplying each prediction by the model’s weight to give a weighted sum, then dividing the value by the sum of the weights. For example:

  • yhat = ((97.2 * 0.84) + (100.0 * 0.87) + (95.8 * 0.75)) / (0.84 + 0.87 + 0.75)
  • yhat = (81.648 + 87 + 71.85) / (0.84 + 0.87 + 0.75)
  • yhat = 240.498 / 2.46
  • yhat = 97.763

We can see that as long as the predictions share the same scale, and the weights share the same scale and are maximizing (meaning that larger weights are better), the weighted sum is a sensible value. Dividing by the sum of the weights then returns the result to the scale of the predictions, so with non-negative weights the weighted average lies between the smallest and largest member predictions.
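The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the predictions and weights from the worked example; the variable names are illustrative.

```python
# predictions from the three ensemble members in the worked example
predictions = [97.2, 100.0, 95.8]
# fixed weight assigned to each member in the worked example
weights = [0.84, 0.87, 0.75]

# unweighted arithmetic mean: sum of predictions divided by their count
mean_yhat = sum(predictions) / len(predictions)

# weighted sum of the predictions, then normalized by the sum of the weights
weighted_sum = sum(p * w for p, w in zip(predictions, weights))
weighted_yhat = weighted_sum / sum(weights)

print('Mean: %.3f' % mean_yhat)
print('Weighted average: %.3f' % weighted_yhat)
```

Both values fall between the smallest (95.8) and largest (100.0) member prediction, which is a quick sanity check for any weighted average with non-negative weights.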

This same approach can be used to calculate the weighted sum of votes for each crisp class label or the weighted sum of probabilities for each class label on a classification problem.
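To make the classification case concrete, the sketch below applies the same weights to one example. The class probabilities are made-up values for illustration only, and NumPy is used for the arithmetic.

```python
import numpy as np

# hypothetical class probabilities for ONE example:
# one row per model, one column per class (values are made up)
probs = np.array([[0.60, 0.40],
                  [0.30, 0.70],
                  [0.55, 0.45]])
# model weights reused from the regression example above
weights = np.array([0.84, 0.87, 0.75])

# weighted sum of probabilities per class, normalized by the sum of weights
weighted_probs = np.average(probs, axis=0, weights=weights)
soft_label = int(np.argmax(weighted_probs))

# weighted sum of crisp votes: each model votes for its most likely class
# and the vote counts as much as the model's weight
votes = np.argmax(probs, axis=1)
weighted_votes = np.bincount(votes, weights=weights, minlength=probs.shape[1])
hard_label = int(np.argmax(weighted_votes))

print('Weighted probabilities:', weighted_probs, '-> class', soft_label)
print('Weighted votes:', weighted_votes, '-> class', hard_label)
```

With these made-up numbers the two schemes disagree: the weighted probabilities favor class 1, while the weighted crisp votes favor class 0. This shows that the choice between combining labels and combining probabilities can change the prediction.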

The challenging aspect of using a weighted average ensemble is how to choose the relative weighting for each ensemble member.

There are many approaches that can be used. For example, the weights may be chosen based on the skill of each model, such as the classification accuracy or negative error, where large weights mean a better-performing model. Performance may be calculated on the dataset used for training or a holdout dataset, the latter of which may be more relevant.

The scores of each model can be used directly or converted into a different value, such as the relative ranking for each model. Another approach might be to use a search algorithm to test different combinations of weights.

Now that we are familiar with the weighted average ensemble method, let’s look at how to develop and evaluate them.

Want to Get Started With Ensemble Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Develop a Weighted Average Ensemble

In this section, we will develop, evaluate, and use weighted average or weighted sum ensemble models.

We can implement weighted average ensembles manually, although this is not required as we can use the voting ensemble in the scikit-learn library to achieve the desired effect. Specifically, the VotingRegressor and VotingClassifier classes can be used for regression and classification respectively and both provide a “weights” argument that specifies the relative contribution of each ensemble member when making a prediction.

A list of base-models is provided via the “estimators” argument. This is a Python list where each element in the list is a tuple with the name of the model and the configured model instance. Each model in the list must have a unique name.

For example, we can define a weighted average ensemble for classification with two ensemble members as follows:
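A minimal sketch of such a definition is shown here. The choice of logistic regression and a support vector machine as members, and the weight values 0.7 and 0.9, are illustrative assumptions rather than tuned settings.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# each member is a (name, model) tuple and each name must be unique
models = [('lr', LogisticRegression()), ('svm', SVC())]

# one weight per member, in the same order as the models (illustrative values)
ensemble = VotingClassifier(estimators=models, weights=[0.7, 0.9])
print(ensemble)
```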

Additionally, the voting ensemble for classification provides the “voting” argument that supports both hard voting (‘hard‘) for combining crisp class labels and soft voting (‘soft‘) for combining class probabilities when calculating the weighted sum for prediction; for example:
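The sketch below shows one way soft voting with weights might be configured and used. The small dataset, the member models, and the weights are assumptions chosen only to keep the example short; both members are models that can predict class probabilities, which soft voting requires.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# small synthetic dataset, used only to exercise the API
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# both members support predict_proba(), as soft voting requires
models = [('lr', LogisticRegression()), ('bayes', GaussianNB())]

# voting='soft' combines weighted class probabilities (weights are illustrative)
ensemble = VotingClassifier(estimators=models, voting='soft', weights=[0.7, 0.9])
ensemble.fit(X, y)

# weighted average of the member probabilities for the first three rows
probs = ensemble.predict_proba(X[:3])
print(probs)
```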

Soft voting is generally preferred if the contributing models support predicting class probabilities, as it often results in better performance. The same holds for the weighted sum of predicted probabilities.

Now that we are familiar with how to use the voting ensemble API to develop weighted average ensembles, let’s look at some worked examples.

Weighted Average Ensemble for Classification

In this section, we will look at using a weighted average ensemble for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 10,000 examples and 20 input features.

The complete example is listed below.
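A sketch of the dataset creation follows. The number of examples and features come from the description above; the n_informative, n_redundant, and random_state values are assumptions made so the example is reproducible.

```python
from sklearn.datasets import make_classification

# 10,000 examples and 20 input features, as described in the text;
# n_informative, n_redundant and random_state are assumed values
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)

# summarize the shape of the input and output components
print(X.shape, y.shape)
```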

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a Weighted Average Ensemble algorithm on this dataset.

First, we will split the dataset into train and test sets with a 50-50 split. We will then split the full training set into a subset for training the models and a subset for validation.
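The two splits might look like the sketch below. The 50-50 train/test split follows the text; the 33 percent validation fraction and the random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic dataset (n_informative, n_redundant, random_state are assumed)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)

# 50-50 split into a full training set and a holdout test set
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1)

# split the full training set into train and validation subsets
# (the 0.33 validation fraction is an assumption)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.33, random_state=1)

print(X_train.shape, X_val.shape, X_test.shape)
```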

Next, we will define a function to create a list of models to use in the ensemble. In this case, we will use a diverse collection of classification models, including logistic regression, a decision tree, and naive Bayes.
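One possible form of this function is sketched below. The function name get_models() and the short model names are hypothetical, and each model uses its default configuration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# hypothetical helper: return a list of (name, model) tuples for the ensemble
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('bayes', GaussianNB()))
    return models

models = get_models()
print([name for name, _ in models])
```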

Next, we need to weigh each ensemble member.

In this case, we will use the performance of each ensemble model on a validation dataset, held back from the training data, as the relative weighting of the model when making predictions. Performance will be calculated using classification accuracy as the fraction of correct predictions between 0 and 1, with larger values meaning a better model, and in turn, more contribution to the prediction.

Each ensemble model will first be fit on the training set, then evaluated on the validation set. The accuracy on the validation set will be used as the model weighting.

The evaluate_models() function below implements this, returning the performance of each model.
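A sketch of how evaluate_models() could be written and called is given below. The function signature, the dataset parameters, and the split sizes are assumptions; only the idea of fitting on the training subset and scoring accuracy on the validation subset comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# fit each model on the training subset and score it on the validation subset
def evaluate_models(models, X_train, X_val, y_train, y_val):
    scores = list()
    for name, model in models:
        model.fit(X_train, y_train)
        yhat = model.predict(X_val)
        # accuracy is a fraction between 0 and 1; larger is better
        scores.append(accuracy_score(y_val, yhat))
    return scores

# dataset and splits (parameter values are assumed)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.33, random_state=1)

models = [('lr', LogisticRegression()),
          ('cart', DecisionTreeClassifier()),
          ('bayes', GaussianNB())]
scores = evaluate_models(models, X_train, X_val, y_train, y_val)
print(scores)
```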

We can then call this function to get the scores and use them as a weighting for the ensemble.

We can then fit the ensemble on the full training dataset and evaluate it on the holdout test set.

Tying this together, the complete example is listed below.
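The steps above could be combined as in the following sketch. It is one possible arrangement under stated assumptions: the helper names, dataset parameters, split sizes, and the use of soft voting are choices made here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

# hypothetical helper: the ensemble members as (name, model) tuples
def get_models():
    return [('lr', LogisticRegression()),
            ('cart', DecisionTreeClassifier()),
            ('bayes', GaussianNB())]

# hypothetical helper: validation accuracy of each member
def evaluate_models(models, X_train, X_val, y_train, y_val):
    scores = list()
    for name, model in models:
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_val, model.predict(X_val)))
    return scores

# dataset and splits (parameter values are assumed)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.33, random_state=1)

# score each member on the validation set and use the scores as weights
models = get_models()
scores = evaluate_models(models, X_train, X_val, y_train, y_val)
print(scores)

# weighted soft-voting ensemble, fit on the full training set
ensemble = VotingClassifier(estimators=models, voting='soft', weights=scores)
ensemble.fit(X_train_full, y_train_full)

# evaluate on the holdout test set
score = accuracy_score(y_test, ensemble.predict(X_test))
print('Weighted Avg Accuracy: %.3f' % (score * 100))
```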

Running the example first evaluates each standalone model and reports the accuracy scores that will be used as model weights. Finally, the weighted average ensemble is fit and evaluated on the test set, and its performance is reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the weighted average ensemble achieved a classification accuracy of about 90.960 percent.

Our expectation is that the ensemble will perform better than any of the contributing ensemble members. The problem is that the accuracy scores used as weightings cannot be directly compared to the performance of the ensemble, because the members were fit on a subset of the training data and evaluated on the validation set, whereas the ensemble was fit on the full training set and evaluated on the test dataset.

We can update the example and add an evaluation of each standalone model for comparison.

We also expect the weighted average ensemble to perform better than an equally weighted voting ensemble.

This can also be checked by explicitly evaluating the voting ensemble.

Tying this together, the complete example is listed below.
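A sketch of this comparison is shown below, assuming the same hypothetical helpers, dataset parameters, and split sizes as earlier. Each standalone model and the equal-weight ensemble are fit on the full training set and scored on the test set so that all numbers are comparable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

# hypothetical helper: the ensemble members as (name, model) tuples
def get_models():
    return [('lr', LogisticRegression()),
            ('cart', DecisionTreeClassifier()),
            ('bayes', GaussianNB())]

# hypothetical helper: fit on one set, report accuracy on another
def evaluate_models(models, X_fit, X_eval, y_fit, y_eval):
    scores = list()
    for name, model in models:
        model.fit(X_fit, y_fit)
        scores.append(accuracy_score(y_eval, model.predict(X_eval)))
    return scores

# dataset and splits (parameter values are assumed)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.33, random_state=1)

# weighted average ensemble, weighted by validation accuracy
models = get_models()
weights = evaluate_models(models, X_train, X_val, y_train, y_val)
ensemble = VotingClassifier(estimators=models, voting='soft', weights=weights)
ensemble.fit(X_train_full, y_train_full)
weighted_acc = accuracy_score(y_test, ensemble.predict(X_test))
print('Weighted Avg Accuracy: %.3f' % (weighted_acc * 100))

# each standalone model, fit on the full training set, scored on the test set
member_scores = evaluate_models(models, X_train_full, X_test,
                                y_train_full, y_test)
for (name, _), s in zip(models, member_scores):
    print('>%s: %.3f' % (name, s * 100))

# equal-weight voting ensemble for comparison
voting = VotingClassifier(estimators=models, voting='soft')
voting.fit(X_train_full, y_train_full)
voting_acc = accuracy_score(y_test, voting.predict(X_test))
print('Voting Accuracy: %.3f' % (voting_acc * 100))
```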

Running the example first prepares and evaluates the weighted average ensemble as before, then reports the performance of each contributing model evaluated in isolation, and finally the voting ensemble that uses an equal weighting for the contributing models.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the weighted average ensemble performs better than any contributing ensemble member.

We can also see that an equal weighting ensemble (voting) achieved an accuracy of about 90.620 percent, which is less than the weighted ensemble, which achieved a slightly higher accuracy of about 90.760 percent.

Next, let’s take a look at how to develop and evaluate a weighted average ensemble for regression.

Weighted Average Ensemble for Regression

In this section, we will look at using a weighted average ensemble for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.
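A sketch of the dataset creation follows. The number of examples and features come from the description above; the n_informative, noise, and random_state values are assumptions.

```python
from sklearn.datasets import make_regression

# 1,000 examples and 20 input features, as described in the text;
# n_informative, noise and random_state are assumed values
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10,
                       noise=0.3, random_state=7)

# summarize the shape of the input and output components
print(X.shape, y.shape)
```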

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a Weighted Average Ensemble model on this dataset.

First, we can split the dataset into train and test sets, then further split the training set into train and validation sets so that we can estimate the performance of each contributing model.

We can define the list of models to use in the ensemble. In this case, we will use k-nearest neighbors, decision tree, and support vector regression.

Next, we can update the evaluate_models() function to calculate the mean absolute error (MAE) for each ensemble member on a hold out validation dataset.

We will use the negative MAE scores as weights, where values closer to zero (smaller errors) indicate a better-performing model.
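The updated function might look like the sketch below. The signature, dataset parameters, and split sizes are assumptions; the members are the three regression models named above with default settings.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# fit each model on the training subset and report negative MAE on validation
def evaluate_models(models, X_train, X_val, y_train, y_val):
    scores = list()
    for name, model in models:
        model.fit(X_train, y_train)
        yhat = model.predict(X_val)
        mae = mean_absolute_error(y_val, yhat)
        # negate so that larger (closer to zero) means better
        scores.append(-mae)
    return scores

# dataset and splits (parameter values are assumed)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10,
                       noise=0.3, random_state=7)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.33, random_state=1)

models = [('knn', KNeighborsRegressor()),
          ('cart', DecisionTreeRegressor()),
          ('svm', SVR())]
scores = evaluate_models(models, X_train, X_val, y_train, y_val)
print(scores)
```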

We can then call this function to get the scores and use them to define the weighted average ensemble for regression.

We can then fit the ensemble on the entire training dataset and evaluate the performance on the holdout test dataset.

We expect the ensemble to perform better than any contributing ensemble member, and this can be checked directly by evaluating each member model on the full train and test sets independently.

Finally, we also expect the weighted average ensemble to perform better than the same ensemble with an equal weighting. This too can be confirmed.

Tying this together, the complete example of evaluating a weighted average ensemble for regression is listed below.
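One way the full regression example could be assembled is sketched below. The helper names, dataset parameters, and split sizes are assumptions. Note that the weights passed to the ensemble are all negative, exactly as described above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor

# hypothetical helper: the ensemble members as (name, model) tuples
def get_models():
    return [('knn', KNeighborsRegressor()),
            ('cart', DecisionTreeRegressor()),
            ('svm', SVR())]

# hypothetical helper: fit on one set, report negative MAE on another
def evaluate_models(models, X_fit, X_eval, y_fit, y_eval):
    scores = list()
    for name, model in models:
        model.fit(X_fit, y_fit)
        scores.append(-mean_absolute_error(y_eval, model.predict(X_eval)))
    return scores

# dataset and splits (parameter values are assumed)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10,
                       noise=0.3, random_state=7)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.33, random_state=1)

# negative MAE on the validation set, used directly as the weights
models = get_models()
scores = evaluate_models(models, X_train, X_val, y_train, y_val)
print(scores)

# weighted average ensemble, fit on the full training set
ensemble = VotingRegressor(estimators=models, weights=scores)
ensemble.fit(X_train_full, y_train_full)
weighted_mae = mean_absolute_error(y_test, ensemble.predict(X_test))
print('Weighted Avg MAE: %.3f' % weighted_mae)

# each standalone model, fit on the full training set, scored on the test set
member_scores = evaluate_models(models, X_train_full, X_test,
                                y_train_full, y_test)
for (name, _), s in zip(models, member_scores):
    print('>%s: %.3f' % (name, s))

# equal-weight voting ensemble for comparison
voting = VotingRegressor(estimators=models)
voting.fit(X_train_full, y_train_full)
voting_mae = mean_absolute_error(y_test, voting.predict(X_test))
print('Voting MAE: %.3f' % voting_mae)
```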

Running the example first reports the negative MAE of each ensemble member that will be used as scores, followed by the performance of the weighted average ensemble. Finally, the performance of each independent model is reported along with the performance of an ensemble with equal weight.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the weighted average ensemble achieved a mean absolute error of about 105.158, which is worse (larger error) than the standalone kNN model that achieved an error of about 100.169. We can also see that the voting ensemble that assumes an equal weight for each model also performs better than the weighted average ensemble with an error of about 102.706.

The worse-than-expected performance of the weighted average ensemble might be related to how the models were weighted. One possible factor is that the weights are all negative: because the weighted sum is divided by the sum of the weights, the signs cancel, and the model with the largest error ends up with the largest share of the prediction.

An alternate strategy for weighting is to use a ranking to indicate the number of votes that each ensemble member has in the weighted average.

For example, the worst-performing model has 1 vote, the second-worst has 2 votes, and the best model has 3 votes, in the case of three ensemble members.

This can be achieved using the argsort() numpy function.

The argsort function returns the indexes that would sort an array. So, if we had the array [300, 100, 200], the index of the smallest value is 1, the index of the next largest value is 2, and the index of the largest value is 0.

Therefore, the argsort of [300, 100, 200] is [1, 2, 0].

We can then argsort the result of the argsort to give a ranking of the data in the original array. To see how, an argsort of [1, 2, 0] would indicate that index 2 is the smallest value, followed by index 0 and ending with index 1.

Therefore, the argsort of [1, 2, 0] is [2, 0, 1]. Put another way, the argsort of the argsort of [300, 100, 200] is [2, 0, 1], which is the relative ranking of each value in the array if values were sorted in ascending order. That is:

  • 300: Has rank 2
  • 100: Has rank 0
  • 200: Has rank 1

We can make this clear with a small example, listed below.
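A minimal sketch using the array from the text above; the variable names are illustrative.

```python
from numpy import array, argsort

# the raw data from the worked example
data = array([300, 100, 200])
print(data)

# indexes that would sort the data in ascending order
result = argsort(data)
print(result)  # [1 2 0]

# argsort of the argsort gives the rank of each original value
ranks = argsort(result)
print(ranks)  # [2 0 1]
```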

Running the example first reports the raw data, then the argsort of the raw data and the argsort of the argsort of the raw data.

The results match our manual calculation.

We can use the argsort of the argsort of the model scores to calculate a relative ranking of each ensemble member. If negative mean absolute errors are sorted in ascending order, then the best model has the largest value (the negative error closest to zero) and, in turn, the highest rank. The worst-performing model has the smallest value (the negative error furthest from zero) and, in turn, the lowest rank.

Again, we can confirm this with a worked example.
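A small sketch with made-up negative MAE scores. The values -10 and -100 match the discussion that follows; the third value, -80, is an assumption added to complete the example.

```python
from numpy import array, argsort

# hypothetical negative MAE scores for three models (-80 is an assumed value)
scores = array([-10, -100, -80])
print(scores)

# indexes that would sort the scores in ascending order
result = argsort(scores)
print(result)  # [1 2 0]

# rank of each score: 0 is the worst model, 2 is the best
ranks = argsort(result)
print(ranks)  # [2 0 1]
```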

Running the example, we can see that the first model has the best score (-10) and the second model has the worst score (-100).

The argsort of the argsort of the scores shows that the best model gets the highest rank (most votes) with a value of 2 and the worst model gets the lowest rank (least votes) with a value of 0.

In practice, we don’t want any model to have zero votes because it would be excluded from the ensemble. Therefore, we can add 1 to all rankings.

After calculating the scores, we can calculate the argsort of the argsort of the model scores to give the rankings. We then use the model rankings as the model weights for the weighted average ensemble.
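The conversion from scores to weights might be sketched as follows. The score values are made up for illustration, and the model names are hypothetical.

```python
from numpy import argsort
from sklearn.ensemble import VotingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# hypothetical negative MAE scores, one per model (illustrative values only)
scores = [-101.0, -141.5, -171.7]

# rank 1 is the worst model and rank 3 is the best;
# adding 1 ensures that no model receives a weight of zero
ranking = 1 + argsort(argsort(scores))
print(ranking)  # [3 2 1]

# use the rankings as the model weights
models = [('knn', KNeighborsRegressor()),
          ('cart', DecisionTreeRegressor()),
          ('svm', SVR())]
ensemble = VotingRegressor(estimators=models, weights=ranking.tolist())
```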

Tying this together, the complete example of a weighted average ensemble for regression with model ranking used as model weights is listed below.
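A sketch of how the ranking-based version could be put together is shown below. As before, the helper names, dataset parameters, and split sizes are assumptions, so the numbers it prints will not match the results discussed in the text.

```python
from numpy import argsort
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor

# hypothetical helper: the ensemble members as (name, model) tuples
def get_models():
    return [('knn', KNeighborsRegressor()),
            ('cart', DecisionTreeRegressor()),
            ('svm', SVR())]

# hypothetical helper: fit on one set, report negative MAE on another
def evaluate_models(models, X_fit, X_eval, y_fit, y_eval):
    scores = list()
    for name, model in models:
        model.fit(X_fit, y_fit)
        scores.append(-mean_absolute_error(y_eval, model.predict(X_eval)))
    return scores

# dataset and splits (parameter values are assumed)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10,
                       noise=0.3, random_state=7)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.33, random_state=1)

# score each member, then convert the scores to ranks 1 (worst) to 3 (best)
models = get_models()
scores = evaluate_models(models, X_train, X_val, y_train, y_val)
print(scores)
ranking = 1 + argsort(argsort(scores))
print(ranking)

# weighted average ensemble using the ranks as weights
ensemble = VotingRegressor(estimators=models, weights=ranking.tolist())
ensemble.fit(X_train_full, y_train_full)
weighted_mae = mean_absolute_error(y_test, ensemble.predict(X_test))
print('Weighted Avg MAE: %.3f' % weighted_mae)

# each standalone model and the equal-weight ensemble for comparison
member_scores = evaluate_models(models, X_train_full, X_test,
                                y_train_full, y_test)
for (name, _), s in zip(models, member_scores):
    print('>%s: %.3f' % (name, s))
voting = VotingRegressor(estimators=models)
voting.fit(X_train_full, y_train_full)
voting_mae = mean_absolute_error(y_test, voting.predict(X_test))
print('Voting MAE: %.3f' % voting_mae)
```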

Running the example first scores each model, then converts the scores into rankings. The weighted average ensemble using ranking is then evaluated and compared to the performance of each standalone model and the ensemble with equally weighted models.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the ranking was performed as expected: the best-performing member, kNN, with an MAE of about 101, is assigned a rank of 3, and the other models are ranked accordingly. We can see that the weighted average ensemble achieved an MAE of about 96.692, which is better than any individual model and the unweighted voting ensemble.

This highlights the importance of exploring alternative approaches for selecting model weights in the ensemble.

Summary

In this tutorial, you discovered how to develop Weighted Average Ensembles for classification and regression.

Specifically, you learned:

  • Weighted Average Ensembles are an extension to voting ensembles where model votes are proportional to model performance.
  • How to develop weighted average ensembles using the voting ensemble from scikit-learn.
  • How to evaluate the Weighted Average Ensembles for classification and regression and confirm the models are skillful.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


8 Responses to How to Develop a Weighted Average Ensemble With Python

  1. Damien May 7, 2021 at 7:53 am #

    In your example of weighted averages, you should divide by the sum of the weights. That is, this example
    yhat = ((97.2 * 0.84) + (100.0 * 0.87) + (95.8 * 0.75)) / 3
    yhat = (81.648 + 87 + 71.85) / 3
    yhat = 240.498 / 3
    yhat = 80.166

    Should be
    yhat = ((97.2 * 0.84) + (100.0 * 0.87) + (95.8 * 0.75)) / (0.84 + 0.87 + 0.75)
    yhat = (81.648 + 87 + 71.85) / 2.46
    yhat = 240.498 / 2.46
    yhat = 97.763

    The (incorrect) weighted average you posted is clearly wrong, as the weighted average should lie between the minimum and maximum datum being averaged (the value 80.166 lies outside of the interval [95.8, 100])

    • Jason Brownlee May 8, 2021 at 6:23 am #

      Ah yes, normalized by the sum of the weights. Thanks, fixed.

  2. Kingsley Udeh May 7, 2021 at 8:12 pm #

    Hi Dr. Jason,

    This is by all standards, a well-written article. Thank you for the time and effort that is put into the work.

    Why is the performance of each contributing model or member in VotingRegressor estimated with a negative MAE metric? Why can’t we use scores between 0 and 1 to obtain weight for each model? I’m confused why Weighted Avg MAE: 96.692 in the VotingRegressor Ranking approach would outperform all ensemble member scores which are negative. I thought the smaller the MAE the better the performance of the model.

    Secondly, is it possible to add a neural network model to the ensemble? I tried doing this using KerasRegressor() method, but one of the errors I had was the ‘KerasRegressor’ object has no attribute ‘model’ when I tried to use the estimate to make a prediction on the holdout dataset.

    • Jason Brownlee May 8, 2021 at 6:34 am #

      Thanks.

      The use of negative MAE is by design, from the post:

      We will use the negative MAE scores as a weight where large error values closer to zero indicate a better performing model.

      Yes, you can use a neural net. Perhaps try the sklearn MLPRegressor.

  3. Varsha May 7, 2021 at 8:21 pm #

    Hello Sir,
    Will it be possible to use the weighting mechanism given here with superlearner?

  4. Rasoul May 7, 2021 at 8:43 pm #

    In the section “Weighted Average Ensemble”, the correct mathematical weighted average formula tells us,

    yhat = ((97.2 * 0.84) + (100.0 * 0.87) + (95.8 * 0.75)) / (0.84 + 0.87 + 0.75)
    yhat = 97.76
