Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

In this post you will discover the AdaBoost Ensemble method for machine learning. After reading this post, you will know:

- What the boosting ensemble method is and generally how it works.
- How to learn to boost decision trees using the AdaBoost algorithm.
- How to make predictions using the learned AdaBoost model.
- How to best prepare your data for use with the AdaBoost algorithm

This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems. If you have any questions, leave a comment and I will do my best to answer.

Let’s get started.

## Boosting Ensemble Method

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting.

Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

## Get your FREE Algorithms Mind Map

I've created a handy mind map of 60+ algorithms organized by type.

Download it, print it and use it.

Also get exclusive access to the machine learning algorithms email mini-course.

## Learning An AdaBoost Model From Data

AdaBoost is best used to boost the performance of decision trees on binary classification problems.

AdaBoost was originally called AdaBoost.M1 by the authors of the technique Freund and Schapire. More recently it may be referred to as discrete AdaBoost because it is used for classification rather than regression.

AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners. These are models that achieve accuracy just above random chance on a classification problem.

The most suited and therefore most common algorithm used with AdaBoost are decision trees with one level. Because these trees are so short and only contain one decision for classification, they are often called decision stumps.

Each instance in the training dataset is weighted. The initial weight is set to:

weight(xi) = 1/n

Where xi is the i’th training instance and n is the number of training instances.

## How To Train One Model

A weak classifier (decision stump) is prepared on the training data using the weighted samples. Only binary (two-class) classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class value.

The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:

error = (correct – N) / N

Where error is the misclassification rate, correct are the number of training instance predicted correctly by the model and N is the total number of training instances. For example, if the model predicted 78 of 100 training instances correctly the error or misclassification rate would be (78-100)/100 or 0.22.

This is modified to use the weighting of the training instances:

error = sum(w(i) * terror(i)) / sum(w)

Which is the weighted sum of the misclassification rate, where w is the weight for training instance i and terror is the prediction error for training instance i which is 1 if misclassified and 0 if correctly classified.

For example, if we had 3 training instances with the weights 0.01, 0.5 and 0.2. The predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, then the terrors would be 0, 1, and 0. The misclassification rate would be calculated as:

error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)

or

error = 0.704

A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:

stage = ln((1-error) / error)

Where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight or contribution to the final prediction.

The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.

For example, the weight of one training instance (w) is updated using:

w = w * exp(stage * terror)

Where w is the weight for a specific training instance, exp() is the numerical constant e or Euler’s number raised to a power, stage is the misclassification rate for the weak classifier and terror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:

terror = 0 if(y == p), otherwise 1

Where y is the output variable for the training instance and p is the prediction from the weak learner.

This has the effect of not changing the weight if the training instance was classified correctly and making the weight slightly larger if the weak learner misclassified the instance.

## AdaBoost Ensemble

Weak models are added sequentially, trained using the weighted training data.

The process continues until a pre-set number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.

Once completed, you are left with a pool of weak learners each with a stage value.

## Making Predictions with AdaBoost

Predictions are made by calculating the weighted average of the weak classifiers.

For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted values are weighted by each weak learners stage value. The prediction for the ensemble model is taken as a the sum of the weighted predictions. If the sum is positive, then the first class is predicted, if negative the second class is predicted.

For example, 5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks like the model will predict a value of 1.0 or the first class. These same 5 weak classifiers may have the stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an output of -0.8, which would be an ensemble prediction of -1.0 or the second class.

## Data Preparation for AdaBoost

This section lists some heuristics for best preparing your data for AdaBoost.

**Quality Data**: Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of a high-quality.**Outliers**: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset.**Noisy Data**: Noisy data, specifically noise in the output variable can be problematic. If possible, attempt to isolate and clean these from your training dataset.

## Further Reading

Below are some machine learning texts that describe AdaBoost from a machine learning perspective.

- An Introduction to Statistical Learning: with Applications in R, page 321
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 10
- Applied Predictive Modeling, pages 203 amd 389

Below are some seminal and good overview research articles on the method that may be useful if you are looking to dive deeper into the theoretical underpinnings of the method:

- A decision-theoretic generalization of on-line learning and an application to boosting, 1995
- Improved Boosting Algorithms Using Confidence-rated Predictions, 1999
- Explaining Adaboost, Chapter from Empirical Inference, 2013
- A Short Introduction to Boosting, 1999

## Summary

In this post you discovered the Boosting ensemble method for machine learning. You learned about:

- Boosting and how it is a general technique that keeps adding weak learners to correct classification errors.
- AdaBoost as the first successful boosting algorithm for binary classification problems.
- Learning the AdaBoost model by weighting training instances and the weak learners themselves.
- Predicting with AdaBoost by weighting predictions from weak learners.
- Where to look for more theoretical background on the AdaBoost algorithm.

If you have any questions about this post or the Boosting or the AdaBoost algorithm ask in the comments and I will do my best to answer.

## Frustrated With Machine Learning Math?

#### See How Algorithms Work in Minutes

...with just arithmetic and simple examples

Discover how in my new Ebook: Master Machine Learning Algorithms

It covers **explanations** and **examples** of **10 top algorithms**, including:*Linear Regression*, *k-Nearest Neighbors*, *Support Vector Machines* and much more...

#### Finally, Pull Back the Curtain on

Machine Learning Algorithms

Skip the Academics. Just Results.

Thank You! The article was really helpful to understand the AdaBoost Algorithm.

In the article, you said, “AdaBoost is best used to boost the performance of decision trees on binary classification problems.” what does that mean?

Can’t this algorithm be used in non-binary classification problems like “Fruit Recognition System” where training set contains feature matrix and associated name of different fruit. I need to know this for my research project.

AdaBoost was designed for binary classification (two output classes) and makes use of decision trees.

If you think the algorithm can be used on your problem, give it a shot and see if it works.

So, choosing of machine learning algorithm is heuristic? That, I should keep on implementing different algorithms and choose the best one that fits into my problem and gives the best accuracy ?

Choosing the best algorithm and even the best representation of your problem is problem specific. You must discover the best combination empirically through experimentation.

This article is very helpful !

I have some questions about adaboost. First, every weak learner or classifier in adaboost is decision tree based, can other algorithms like KNN or SVM be the basic components of the ensemble learning? My second question is, how adaboost algorithm deal with large dynamic sequential dataset such as global weather data or sensor dataset?

Thank you very much!

Hi Jessica. I guess other methods could be used, but traditionally the weak learner is one-level decision trees (decision stumps). A neural net might work if it was “weak”, as in only had one layer and maybe very few neurons (perhaps just one). It might be an interesting experiment, and I bet there is some literature on this if you’d like to check http://scholar.google.com

I am not familiar with adaboost being used on time series. It may be possible, but I have not seen it. I expect large changes to the method may be required.

Shouldn’t the error formula be

error = (N – correct) / N ?

Or otherwise you would get a negative misclassification error rate.

Obviously 😉

Hi Jason, Thank you for the detailed clarification on adaboost algorithm. I have a question on this. How training weights are being used in this adaboost algorithm (meaning does adaboost algorithm repeat observations basis weights while building model or is it being used in some different way)?

Hi Prashant, the weights are used during the construction of subsequent decision stumps.

Hi Jason,

We would have a bunch of weights and models in the end. How is the final prediction decided for the example?

We use the weighted models to make a prediction.

Thanks for this Jason. It’s very helpful.

You’re welcome, I’m glad to hear it.

Hi Jason,

I was wondering if there any rule for alphas (learner weights) in H(x) = sign(Sum(alpha*h_i(x)) for h_i = 1…n. For example, the sum of alphas must be equal to 1, the value of each alpha can’t be > 1, etc.

No that I recall offhand. If you discover something in the literature, let me know.

Thank you. Will do.

Given the stage = ln((1-error) / error), the stage can be negative value if the error > 0.5. Since the stage will be used as the weight of the model, can it be a negative value?

I guess I found my answer and share it here:

The stage value can be negative when the error rate > 0.5. In this case, the weight of the weak classifier become nagative, which means we flip its predication so that what the weak classifier says true become false and vice versa. In term of the misclassified data points from the weak classifier, since we flip the sign of the weak classifier, those data points are actually classified correctly. As compensation, we reduce their weight by w*exp(negative value)….

Thanks Yi.