Ensemble methods involve combining the predictions from multiple models.
The combination of the predictions is a central part of the ensemble method and depends heavily on the types of models that contribute to the ensemble and the type of prediction problem that is being modeled, such as a classification or regression.
Nevertheless, there are standard techniques for combining predictions that are easy to implement and often result in good predictive performance.
In this post, you will discover common techniques for combining predictions for ensemble learning.
After reading this post, you will know:
- Combining predictions from contributing models is a key property of an ensemble model.
- Voting techniques are most commonly used when combining predictions for classification.
- Statistical techniques are most commonly used when combining predictions for regression.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Combining Predictions for Ensemble Learning
- Combining Classification Predictions
  - Combining Predicted Class Labels
  - Combining Predicted Class Probabilities
- Combining Regression Predictions
Combining Predictions for Ensemble Learning
A key part of an ensemble learning method involves combining the predictions from multiple models.
It is through the combination of the predictions that the benefit of the ensemble learning method is achieved, namely better predictive performance. As such, there are many ways that predictions can be combined, so much so that it is an entire field of study.
After generating a set of base learners, rather than trying to find the best single learner, ensemble methods resort to combination to achieve a strong generalization ability, where the combination method plays a crucial role.
— Page 67, Ensemble Methods, 2012.
Standard ensemble machine learning algorithms do prescribe how to combine predictions; nevertheless, it is important to consider the topic in isolation for a number of reasons, such as:
- Interpreting the predictions made by standard ensemble algorithms.
- Manually specifying a custom prediction combination method for an algorithm.
- Developing your own ensemble methods.
Ensemble learning methods are typically not very complex and developing your own ensemble method or specifying the manner in which predictions are combined is relatively easy and common practice.
The way that predictions are combined depends on the models that are making predictions and the type of prediction problem.
The strategy used in this step depends, in part, on the type of classifiers used as ensemble members. For example, some classifiers, such as support vector machines, provide only discrete-valued label outputs.
— Page 6, Ensemble Machine Learning, 2012.
For example, the form of the predictions made by the models will match the type of prediction problem, such as numbers for regression and class labels for classification. Additionally, for a classification task, some model types may only be able to predict a class label or only a class probability distribution, whereas others may support both.
We will use this division of prediction type based on problem type as the basis for exploring the common techniques used to combine predictions from contributing models in an ensemble.
In the next section, we will take a look at how to combine predictions for classification predictive modeling tasks.
Combining Classification Predictions
Classification refers to predictive modeling problems that involve predicting a class label given an input.
The prediction made by a model may be a crisp class label directly or may be a probability that an example belongs to each class, referred to as the probability of class membership.
Performance on a classification problem is often measured using accuracy or a related count or ratio of correct predictions. Predicted probabilities may be converted to crisp class labels by selecting a cut-off threshold, or evaluated directly using specialized metrics such as cross-entropy.
We will review combining predictions for classification separately for both class labels and probabilities.
Combining Predicted Class Labels
A predicted class label is often mapped to something meaningful to the problem domain.
For example, a model may predict a color such as “red” or “green”. Internally though, the model predicts a numerical representation for the class label such as 0 for “red”, 1 for “green”, and 2 for “blue” for our color classification example.
Methods for combining class labels are perhaps easier to consider if we work with the integer encoded class labels directly.
Perhaps the simplest, most common, and often most effective approach is to combine the predictions by voting.
Voting is the most popular and fundamental combination method for nominal outputs.
— Page 71, Ensemble Methods, 2012.
Voting generally involves each model that makes a prediction assigning a vote for the class that was predicted. The votes are tallied and an outcome is then chosen using the votes or tallies in some way.
There are many types of voting, so let’s look at the four most common:
- Plurality Voting.
- Majority Voting.
- Unanimous Voting.
- Weighted Voting.
Simple voting, called plurality voting, selects the class label with the most votes.
If two or more classes have the same number of votes, then the tie is broken arbitrarily, although in a consistent manner, such as sorting the class labels that have a tie and selecting the first, instead of selecting one randomly. This is important so that the same model with the same data always makes the same prediction.
Because ties are possible, it is common to use an odd number of ensemble members, which rules out ties in two-class problems. With three or more classes, ties can still occur, so a consistent tie-breaking rule is still needed.
From a statistical perspective, this is called the mode or the most common value from the collection of predictions.
For example, consider the predictions made by three models for a three-class color prediction problem:
- Model 1 predicts “green” or 1.
- Model 2 predicts “green” or 1.
- Model 3 predicts “red” or 0.
The votes are, therefore:
- Red Votes: 1
- Green Votes: 2
- Blue Votes: 0
The prediction would be “green” given it has the most votes.
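To make the tally above concrete, here is a minimal sketch of plurality voting on integer encoded labels. The function name plurality_vote is our own for illustration, and breaking ties by taking the smallest label is just one consistent rule among many, not a prescribed standard.

```python
from collections import Counter

def plurality_vote(labels):
    """Return the label with the most votes; ties go to the smallest label."""
    counts = Counter(labels)
    top = max(counts.values())
    # choose among tied labels deterministically so repeated runs agree
    return min(label for label, count in counts.items() if count == top)

# integer encoded predictions: 0 = red, 1 = green, 2 = blue
predictions = [1, 1, 0]
print(plurality_vote(predictions))  # 1, which maps back to "green"
```

If every model predicted a different class, such as [2, 0, 1], this sketch would return 0 simply because it is the smallest of the tied labels.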
Majority voting selects the class label that has more than half the votes. If no class has more than half the votes, then a “no prediction” is made. Interestingly, majority voting can be proven to be an optimal method for combining classifiers, if they are independent.
If the classifier outputs are independent, then it can be shown that majority voting is the optimal combination rule.
— Page 1, Ensemble Machine Learning, 2012.
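A rough sketch of majority voting follows. The name majority_vote is hypothetical, and returning None is our own way of representing the "no prediction" outcome described above.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label with more than half the votes, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    # a strict majority needs more than half of all votes cast
    return label if count > len(labels) / 2 else None

print(majority_vote([1, 1, 0]))  # 1: two of three votes is more than half
print(majority_vote([0, 1, 2]))  # None: no class has more than half
```

Note that a plurality winner is not always a majority winner: with four models voting [0, 0, 1, 2], class 0 has the most votes but only half of them, so no prediction is made.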
Unanimous voting is related to majority voting in that instead of requiring more than half the votes, the method requires all models to predict the same value; otherwise, no prediction is made.
Weighted voting weights the prediction made by each model in some way. One example would be to weight predictions based on the average performance of the model, such as classification accuracy.
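Unanimous voting can be sketched in a few lines. As before, the function name unanimous_vote and the use of None for "no prediction" are assumptions made for this illustration.

```python
def unanimous_vote(labels):
    """Return the shared label if every model agrees, otherwise None."""
    labels = list(labels)
    if not labels:
        return None
    first = labels[0]
    # every vote must match the first one
    return first if all(label == first for label in labels) else None

print(unanimous_vote([1, 1, 1]))  # 1: all models agree
print(unanimous_vote([1, 1, 0]))  # None: one model disagrees
```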
The weight of each classifier can be set proportional to its accuracy performance on a validation set.
— Page 67, Pattern Classification Using Ensemble Methods, 2010.
Assigning weights to classifiers can become a project in and of itself and could involve using an optimization algorithm and a holdout dataset, a linear model, or even another machine learning model entirely.
So, how do we assign the weights? If we knew, a priori, which classifiers would work better, we would only use those classifiers. In the absence of such information, a plausible and commonly used strategy is to use the performance of a classifier on a separate validation (or even training) dataset, as an estimate of that classifier’s generalization performance.
— Page 8, Ensemble Machine Learning, 2012.
The idea of weighted voting is that some classifiers are more likely to be accurate than others and we should reward them by giving them a larger share of the votes.
If we have reason to believe that some of the classifiers are more likely to be correct than others, weighting the decisions of those classifiers more heavily can further improve the overall performance compared to that of plurality voting.
— Page 7, Ensemble Machine Learning, 2012.
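As a simple illustration of weighted voting, the sketch below adds each model's weight to the class it predicted and selects the class with the largest total. The weights are made-up validation accuracies, and the function name weighted_vote is our own.

```python
from collections import defaultdict

def weighted_vote(labels, weights):
    """Sum the weight behind each label and return the label with the largest total."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    top = max(totals.values())
    # break ties consistently by taking the smallest label
    return min(label for label, total in totals.items() if total == top)

# integer encoded predictions: 0 = red, 1 = green, 2 = blue
predictions = [0, 1, 1]
# hypothetical validation accuracy for each model
weights = [0.9, 0.4, 0.4]
print(weighted_vote(predictions, weights))  # 0: red has 0.9 versus 0.8 for green
```

In this example, plurality voting would have chosen "green" with two votes, but the single more accurate model outweighs the two weaker ones.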
Combining Predicted Class Probabilities
Probabilities summarize the likelihood of an event as a numerical value between 0.0 and 1.0.
When predicted for class membership, a probability is assigned to each class, and together the values sum to 1.0; for example, a model may predict:
- Red: 0.75
- Green: 0.10
- Blue: 0.15
We can see that class “red” has the highest probability, making it the most likely outcome predicted by the model, and that the distribution of probabilities across the classes (0.75 + 0.10 + 0.15) sums to 1.0.
The way that the probabilities are combined depends on the outcome that is required.
For example, if probabilities are required, then the independent predicted probabilities can be combined directly.
Perhaps the simplest approach for combining probabilities is to sum the probabilities for each class and pass the predicted values through a softmax function. This ensures that the scores are appropriately normalized, meaning the probabilities across the class labels sum to 1.0.
… such outputs – upon proper normalization (such as softmax normalization […]) – can be interpreted as the degree of support given to that class
— Page 8, Ensemble Machine Learning, 2012.
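The sum-then-softmax approach can be sketched as follows. The probabilities for the second and third models are invented for this example, and the helper names softmax and sum_then_softmax are our own.

```python
from math import exp

def softmax(scores):
    """Normalize scores so they are positive and sum to 1.0."""
    largest = max(scores)
    # subtracting the largest score avoids overflow without changing the result
    exps = [exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def sum_then_softmax(probabilities):
    """Sum the per-class probabilities across models, then normalize with softmax."""
    summed = [sum(column) for column in zip(*probabilities)]
    return softmax(summed)

# one row per model, columns are red, green, blue (rows 2 and 3 are hypothetical)
probabilities = [
    [0.75, 0.10, 0.15],
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
]
combined = sum_then_softmax(probabilities)
print(combined)
print(sum(combined))
```

The softmax output sums to 1.0 and preserves the ranking of the classes, although it changes the relative sizes of the scores. Dividing the summed probabilities by the number of models also yields values that sum to 1.0.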
More commonly we wish to predict a class label from predicted probabilities.
The most common approach is to use voting, where the predicted probabilities represent the vote made by each model for each class. Votes are then summed and a voting method from the previous section can be used, such as selecting the label with the largest summed probabilities or the largest mean probability.
- Vote Using Mean Probabilities
- Vote Using Sum Probabilities
- Vote Using Weighted Sum Probabilities
Generally, this approach of treating probabilities as votes for choosing a class label is referred to as soft voting.
If all the individual classifiers are treated equally, the simple soft voting method generates the combined output by simply averaging all the individual outputs …
— Page 76, Ensemble Methods, 2012.
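Here is a minimal sketch of soft voting that covers both the mean and the weighted sum variants. The function name soft_vote, the example probabilities for the second and third models, and the weights are all assumptions for illustration.

```python
def soft_vote(probabilities, weights=None):
    """Combine per-model class probabilities and return (label, combined scores)."""
    n_models = len(probabilities)
    if weights is None:
        # equal weights turn the weighted sum into a simple mean
        weights = [1.0 / n_models] * n_models
    combined = [
        sum(weight * prob for weight, prob in zip(weights, column))
        for column in zip(*probabilities)
    ]
    # index() returns the first maximum, so ties are broken consistently
    return combined.index(max(combined)), combined

# one row per model, columns are red, green, blue (rows 2 and 3 are hypothetical)
probabilities = [
    [0.75, 0.10, 0.15],
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
]
label, scores = soft_vote(probabilities)
print(label)  # 0, i.e. "red" has the largest mean probability

# hypothetical weights that favor the third model
label_weighted, _ = soft_vote(probabilities, weights=[0.1, 0.1, 0.8])
print(label_weighted)  # 1, i.e. "green"
```

Selecting by the largest sum or the largest mean gives the same label, because the mean is just the sum divided by the number of models.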
Combining Regression Predictions
Regression refers to predictive modeling problems that involve predicting a numeric value given an input.
Performance on a regression problem is often measured using an average error, such as mean absolute error or root mean squared error.
Combining numerical predictions often involves using simple statistical methods; for example:
- Mean Predicted Value
- Median Predicted Value
Both give the central tendency of the distribution of predictions.
Averaging is the most popular and fundamental combination method for numeric outputs.
— Page 68, Ensemble Methods, 2012.
The mean, also called the average, is the normalized sum of the predictions. The Mean Predicted Value is more appropriate when the distribution of predictions is Gaussian or nearly Gaussian.
For example, the mean is calculated as the sum of predicted values divided by the total number of predictions. If three models predicted the following prices:
- Model 1: 99.00
- Model 2: 101.00
- Model 3: 98.00
The mean prediction would be calculated as:
- Mean Prediction = (99.00 + 101.00 + 98.00) / 3
- Mean Prediction = 298.00 / 3
- Mean Prediction = 99.33
Owing to its simplicity and effectiveness, simple averaging is among the most popularly used methods and represents the first choice in many real applications.
— Page 69, Ensemble Methods, 2012.
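The worked calculation above can be reproduced with the mean() function from Python's standard statistics module, using the same three predicted prices.

```python
from statistics import mean

# predicted prices from the three models
predictions = [99.00, 101.00, 98.00]
mean_prediction = mean(predictions)
print(round(mean_prediction, 2))  # 99.33
```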
The median is the middle value when all predictions are ordered and is also referred to as the 50th percentile. The Median Predicted Value is more appropriate to use when the distribution of predictions is not known or does not follow a Gaussian probability distribution.
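The sketch below computes the median with the standard statistics module. The fourth prediction of 250.00 is a made-up outlier, added only to show how the median and the mean can diverge when the predictions are not Gaussian.

```python
from statistics import mean, median

# predicted prices from three models
predictions = [99.00, 101.00, 98.00]
print(median(predictions))  # 99.0

# add a hypothetical fourth model that makes an extreme prediction
with_outlier = predictions + [250.00]
print(median(with_outlier))  # 100.0
print(mean(with_outlier))    # 137.0
```

The single extreme value pulls the mean far from the other predictions, while the median barely moves.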
Depending on the nature of the prediction problem, a conservative prediction may be desired, such as the maximum or the minimum. Additionally, the distribution can be summarized to give a measure of uncertainty, such as reporting three values for each prediction:
- Minimum Predicted Value
- Median Predicted Value
- Maximum Predicted Value
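Reporting the three values listed above might look like the following sketch. The function name summarize_predictions is hypothetical.

```python
from statistics import median

def summarize_predictions(predictions):
    """Return the (minimum, median, maximum) of a collection of predictions."""
    return min(predictions), median(predictions), max(predictions)

# predicted prices from three models
predictions = [99.00, 101.00, 98.00]
low, middle, high = summarize_predictions(predictions)
print(low, middle, high)  # 98.0 99.0 101.0
```

The gap between the minimum and the maximum gives a rough sense of how much the ensemble members disagree.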
As with classification, the predictions made by each model can be weighted by expected model performance or some other value, and the weighted mean of the predictions can be reported.
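A weighted mean can be sketched as below. The weights are invented for illustration and the function name weighted_mean is our own; dividing by the sum of the weights means they do not need to sum to 1.0 in advance.

```python
def weighted_mean(predictions, weights):
    """Return the mean of the predictions, weighted by the given weights."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(pred * weight for pred, weight in zip(predictions, weights))
    return weighted_sum / total_weight

# predicted prices from three models
predictions = [99.00, 101.00, 98.00]
# hypothetical weights, e.g. derived from each model's validation performance
weights = [0.5, 0.3, 0.2]
print(weighted_mean(predictions, weights))  # approximately 99.4
```

With equal weights, this reduces to the simple mean.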
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Pattern Classification Using Ensemble Methods, 2010.
- Ensemble Methods, 2012.
- Ensemble Machine Learning, 2012.
- Ensemble Methods in Data Mining, 2010.
Summary
In this post, you discovered common techniques for combining predictions for ensemble learning.
Specifically, you learned:
- Combining predictions from contributing models is a key property of an ensemble model.
- Voting techniques are most commonly used when combining predictions for classification.
- Statistical techniques are most commonly used when combining predictions for regression.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.