Last Updated on August 15, 2020
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.
In this post you will discover the AdaBoost Ensemble method for machine learning. After reading this post, you will know:
- What the boosting ensemble method is and generally how it works.
- How to learn to boost decision trees using the AdaBoost algorithm.
- How to make predictions using the learned AdaBoost model.
- How to best prepare your data for use with the AdaBoost algorithm
This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems. If you have any questions, leave a comment and I will do my best to answer.
Kick-start your project with my new book Master Machine Learning Algorithms, including step-by-step tutorials and the Excel Spreadsheet files for all examples.
Let’s get started.
Boosting Ensemble Method
Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.
This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting.
Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
Get your FREE Algorithms Mind Map
I've created a handy mind map of 60+ algorithms organized by type.
Download it, print it and use it.
Also get exclusive access to the machine learning algorithms email mini-course.
Learning An AdaBoost Model From Data
AdaBoost is best used to boost the performance of decision trees on binary classification problems.
AdaBoost was originally called AdaBoost.M1 by the authors of the technique Freund and Schapire. More recently it may be referred to as discrete AdaBoost because it is used for classification rather than regression.
AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners. These are models that achieve accuracy just above random chance on a classification problem.
The most suited and therefore most common algorithm used with AdaBoost are decision trees with one level. Because these trees are so short and only contain one decision for classification, they are often called decision stumps.
Each instance in the training dataset is weighted. The initial weight is set to:
weight(xi) = 1/n
Where xi is the i’th training instance and n is the number of training instances.
How To Train One Model
A weak classifier (decision stump) is prepared on the training data using the weighted samples. Only binary (two-class) classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class value.
The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:
error = (correct – N) / N
Where error is the misclassification rate, correct are the number of training instance predicted correctly by the model and N is the total number of training instances. For example, if the model predicted 78 of 100 training instances correctly the error or misclassification rate would be (78-100)/100 or 0.22.
This is modified to use the weighting of the training instances:
error = sum(w(i) * terror(i)) / sum(w)
Which is the weighted sum of the misclassification rate, where w is the weight for training instance i and terror is the prediction error for training instance i which is 1 if misclassified and 0 if correctly classified.
For example, if we had 3 training instances with the weights 0.01, 0.5 and 0.2. The predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, then the terrors would be 0, 1, and 0. The misclassification rate would be calculated as:
error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)
error = 0.704
A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:
stage = ln((1-error) / error)
Where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight or contribution to the final prediction.
The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.
For example, the weight of one training instance (w) is updated using:
w = w * exp(stage * terror)
Where w is the weight for a specific training instance, exp() is the numerical constant e or Euler’s number raised to a power, stage is the misclassification rate for the weak classifier and terror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:
terror = 0 if(y == p), otherwise 1
Where y is the output variable for the training instance and p is the prediction from the weak learner.
This has the effect of not changing the weight if the training instance was classified correctly and making the weight slightly larger if the weak learner misclassified the instance.
Weak models are added sequentially, trained using the weighted training data.
The process continues until a pre-set number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.
Once completed, you are left with a pool of weak learners each with a stage value.
Making Predictions with AdaBoost
Predictions are made by calculating the weighted average of the weak classifiers.
For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted values are weighted by each weak learners stage value. The prediction for the ensemble model is taken as a the sum of the weighted predictions. If the sum is positive, then the first class is predicted, if negative the second class is predicted.
For example, 5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks like the model will predict a value of 1.0 or the first class. These same 5 weak classifiers may have the stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an output of -0.8, which would be an ensemble prediction of -1.0 or the second class.
Data Preparation for AdaBoost
This section lists some heuristics for best preparing your data for AdaBoost.
- Quality Data: Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of a high-quality.
- Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset.
- Noisy Data: Noisy data, specifically noise in the output variable can be problematic. If possible, attempt to isolate and clean these from your training dataset.
Below are some machine learning texts that describe AdaBoost from a machine learning perspective.
- An Introduction to Statistical Learning: with Applications in R, page 321
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 10
- Applied Predictive Modeling, pages 203 amd 389
Below are some seminal and good overview research articles on the method that may be useful if you are looking to dive deeper into the theoretical underpinnings of the method:
- A decision-theoretic generalization of on-line learning and an application to boosting, 1995
- Improved Boosting Algorithms Using Confidence-rated Predictions, 1999
- Explaining Adaboost, Chapter from Empirical Inference, 2013
- A Short Introduction to Boosting, 1999
In this post you discovered the Boosting ensemble method for machine learning. You learned about:
- Boosting and how it is a general technique that keeps adding weak learners to correct classification errors.
- AdaBoost as the first successful boosting algorithm for binary classification problems.
- Learning the AdaBoost model by weighting training instances and the weak learners themselves.
- Predicting with AdaBoost by weighting predictions from weak learners.
- Where to look for more theoretical background on the AdaBoost algorithm.
If you have any questions about this post or the Boosting or the AdaBoost algorithm ask in the comments and I will do my best to answer.
Thank You! The article was really helpful to understand the AdaBoost Algorithm.
In the article, you said, “AdaBoost is best used to boost the performance of decision trees on binary classification problems.” what does that mean?
Can’t this algorithm be used in non-binary classification problems like “Fruit Recognition System” where training set contains feature matrix and associated name of different fruit. I need to know this for my research project.
AdaBoost was designed for binary classification (two output classes) and makes use of decision trees.
If you think the algorithm can be used on your problem, give it a shot and see if it works.
So, choosing of machine learning algorithm is heuristic? That, I should keep on implementing different algorithms and choose the best one that fits into my problem and gives the best accuracy ?
Choosing the best algorithm and even the best representation of your problem is problem specific. You must discover the best combination empirically through experimentation.
This article is very helpful !
I have some questions about adaboost. First, every weak learner or classifier in adaboost is decision tree based, can other algorithms like KNN or SVM be the basic components of the ensemble learning? My second question is, how adaboost algorithm deal with large dynamic sequential dataset such as global weather data or sensor dataset?
Thank you very much!
Hi Jessica. I guess other methods could be used, but traditionally the weak learner is one-level decision trees (decision stumps). A neural net might work if it was “weak”, as in only had one layer and maybe very few neurons (perhaps just one). It might be an interesting experiment, and I bet there is some literature on this if you’d like to check http://scholar.google.com
I am not familiar with adaboost being used on time series. It may be possible, but I have not seen it. I expect large changes to the method may be required.
hi JESSICA according to your question about the ADABOOST algorithm that is coincidently one of y questions about this algorithm ,and due to the fact you mentionned datasets on global weather data or sensor datasets such as channel of deferent radiometers and meteosats can i ask you are you doing research in the field of meteorology or precipitation and climate estimation from meteosat data ? because thats the field of my research study now and i have the same question about the possibility of use of other weak classifiers such as SVM ANN or instead of DECISION STUMPS , ? , perhaps mr jason BROWNLEE can help us with
Decision stumps are preferred in adaboost because they are weak learners.
You could use other weak learners in theory, such as knn with a small sample or small k value.
How the stage values are decided while prediction of a new instance, does it uses the training stage values?
Shouldn’t the error formula be
error = (N – correct) / N ?
Or otherwise you would get a negative misclassification error rate.
Hi Jason, Thank you for the detailed clarification on adaboost algorithm. I have a question on this. How training weights are being used in this adaboost algorithm (meaning does adaboost algorithm repeat observations basis weights while building model or is it being used in some different way)?
Hi Prashant, the weights are used during the construction of subsequent decision stumps.
We would have a bunch of weights and models in the end. How is the final prediction decided for the example?
We use the weighted models to make a prediction.
Thanks for this Jason. It’s very helpful.
You’re welcome, I’m glad to hear it.
I was wondering if there any rule for alphas (learner weights) in H(x) = sign(Sum(alpha*h_i(x)) for h_i = 1…n. For example, the sum of alphas must be equal to 1, the value of each alpha can’t be > 1, etc.
No that I recall offhand. If you discover something in the literature, let me know.
Thank you. Will do.
Given the stage = ln((1-error) / error), the stage can be negative value if the error > 0.5. Since the stage will be used as the weight of the model, can it be a negative value?
I guess I found my answer and share it here:
The stage value can be negative when the error rate > 0.5. In this case, the weight of the weak classifier become nagative, which means we flip its predication so that what the weak classifier says true become false and vice versa. In term of the misclassified data points from the weak classifier, since we flip the sign of the weak classifier, those data points are actually classified correctly. As compensation, we reduce their weight by w*exp(negative value)….
At first you said :-
Each instance in the training dataset is weighted. weight(xi) = 1/n
So every instance will have same weight , right ?
then in example , you said :-
“For example, if we had 3 training instances with the weights 0.01, 0.5 and 0.2.”
How come above 3 instances have different weights and not same weights from (1/n formula = .33,.33,.33)
As Jason mentions in the article, the weights get updated based on w = w * exp(stage * terror) before the next classifier is applied.
This article is really helpful! I searched a lot for an introduction to Adaboost. This is the best!
Glad to hear it.
Thanks for the perfect description. I have a question! Is Adaboost or generally boosting methods for improving the accuracy of weak learners from on type? I mean, if we have different types of learners that are not doing accurately, can we use boosting for them and should? Basically the question is that in this case, should we apply boosting on each learning type individually or we should consider all the weak learner (even from different types) together and apply one boosting ?
It was designed for decision trees (decision stumps).
You can try on other models if you like.
If the weak classifiers do not have much consensus about wrong predictions you can.
For example, you have 3 classifiers c1, c2, c3 and all of them classifies a sample wrongly. So there is no way for you to make a correct classification.
On the other hand, if at least one of them makes a correct classification for a sample, you might correctly classify the sample.
I think that’s the main reason people want weak classifiers. If you use let’s say 3 strong classifiers there is more chance that there are samples which are wrongly classified in all 3 classifiers.
That is a nice explanation, thanks. How exactly do the weights affect the tree learning algorithm? Do they feed directly into the entropy calculations for information gain? I assume they change the way the probabilities are calculated, but I can’t find a good explanation of how.
Exactly, the weights influence the splitting criteria used to select the split point of subsequent weak learners. It’s a way of weighting the model away from “easy” parts of the problem and toward “harder” parts, as defined by the parameter values for split points.
When I try to practice Adaboost, I found I am still not clear about “how to build a new weak classifier by weighted training data”. You mentioned weights are used in computing errors, but let me say we use ID3 or CART, they are not error driven. How do we use weighted training data to build a new ID3/CART?
(P.S. I found myself visit this place more and more often. Many thanks for very helpful information.)
Unlike CART and ID#, boosting will create multiple trees consecutively. Each tree is built on a weighted form of the dataset to ensure that the tree focuses on areas where the previous tree/trees performed poorly. The subsequent trees try to correct the errors of the prior trees.
I hope that helps.
Not really ~~~
question 1: How do we build CART stump or ID# stump (in adaboost)? Or they are not option?
question 2: I see some examples use very simple formula like (x5.5) as a stumps, they do use weight to measure error, but is this the only thing we can use in adaboost?
You can build a stump with one split point. The weight can be used to influence the choice of split point.
Yes, this is the Adaboost algorithm.
I do have a worked example in a spreadsheet in my “master machine learning algorithms” book.
Many thanks Jason!
I was thinking a typical adaboost shall use some “advanced” algs as stumps. For example if I try to use adaboost in kaggle. Your guidence saves me a lot time.
is there algorithm in ensemble can work in similarity?
You could create an ensemble of similarity algorithms, sure.
is there algorithm in ensemble techniques work as similarity for predict based on specific rules?
I don’t follow, can you give an example?
Attribute 1 Attribute 2 Attribute 3 Class label
Class1 4 5 5
Class 2 2 3 1
Class 3 4 5 5
Class 4 1 3 5
Class5 2 3 1
Class 6 4 5 5
now class label the value of it 1 or 0
1 if the vale of the attributes is the same, 0 if no one have same value
ex. class 1 and class 3 and class 6 are same
class label value is 1
can i use ensemble techniques to get this results ?
Sorry, I don’t follow, perhaps can you explain your question a different way?
Hi Jason, shouldn’t the terror be -1 or 1? Since in the update rule, (W = W * exp(stage * y_i * h_t)). if the y_i is -1, and h_t is -1 (thus correct classification), their multiplication would yield +1, while an incorrect classification would lead to -1?
This article seems to suggest the same (https://towardsdatascience.com/boosting-algorithm-adaboost-b6737a9ee60c), when it says: “If a misclassified case is from a positive weighted classifier, the “exp” term in the numerator would be always larger than 1 (y*f is always -1, theta_m is positive).”
In manual calculation, setting terror to 0 for correct classification instead of 1 would lower the total exp/euler’s value, while setting terror to 1 for incorrect classification instead of -1 would conversely raise the total euler’s value, which is not what we want. Am I missing something here?
hey Jason, have you an example for AdaBoost M1?, and thank you for your article in my research for AdaBoost this is my favorite 🙂
Thanks. I don’t recall sorry.
I am competing in challenge for binary classification problem. Data is 250k for training. I need to produce probabilities. I find every time I run probabilities change. That makes it problematic as I cannot have results that I reproduce. Any thoughts
Yes, this is a feature of machine learning algorithms. Learn more about this here:
Here is one approach to effectively evaluate a stochastic model:
Can u suggest how we can implement adaboost on SVR predicted model?
ValueError Traceback (most recent call last)
—> 11 er_tree = generic_clf(y_train, X_train, y_test, X_test, clf_tree)
in generic_clf(y_train, X_train_scaled, y_test, X_test_scaled, clf)
—> 11 clf.fit(X_train_scaled,y_train)
12 pred_train = clf.predict(X_train_scaled)
13 pred_test = clf.predict(X_test_scaled)
~/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
–> 790 X_idx_sorted=X_idx_sorted)
791 return self
~/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
139 if is_classification:
–> 140 check_classification_targets(y)
141 y = np.copy(y)
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
170 if y_type not in [‘binary’, ‘multiclass’, ‘multiclass-multioutput’,
171 ‘multilabel-indicator’, ‘multilabel-sequences’]:
–> 172 raise ValueError(“Unknown label type: %r” % y_type)
ValueError: Unknown label type: ‘continuous’
my code showing error like this……..
I’m sorry to hear that.
What code are you trying to run?
Not off hand, sorry, some careful development would be required. Also, I don’t think SVR would be a good fit as a weak model in adaboost.
It is very helpful Jason for my masters thesis write up. Thanks for your contribution.
Glad to hear that!
sir please provide code for adaboost and random forest ensemble classfiiers
I believe I do, try searching the blog.
Hi Jason. I’m back to this question after almost 2 years. I was wondering, how do you ‘select’ an instance to be part of the training set using their weights (probabilities). Do you sort the instances based on their weights and then only take those with higher weight as part of your next training set (for the next classifier) ?.
Ok. I verified if the answer I had was correct. I used a simple technique for weighted random number generator without replacement.
How do you decide to split the training set when you select examples with high probability. Do you make the selected sample from the training set be of the same size as the training set by putting back to the training set each example that has a higher probability? or you only take a subset of the training set (e.g. take only the top 70% examples for training) then classify the whole training set afterwards to get the training error and update their weights ?.
It really depends on the problem.
Your goal is to have a train and test set that are representative of the broader dataset.
Just to update my question:
The algorithm states that the instances in the training set are weighted each time they are correctly classified or misclassified to ensure that subsequent learners focus on instances that are hard to classify. My question is, how many of those hard to classify instances from the training set are selected to train a weak classifier. Do you take 70%/50%/30%, etc. of those hard to classify instances and use them to train the weak learner in the next iteration? or do you use the entire weighted training set. If the answer is the latter, how does the weak learner utilize instance weights so that it can focus on those that have a bigger weight value ?.
It is all based on the training dataset.
Hi Jason should not there be a 0.5 multiplied to the stage value? I find the same formula with 0.5 multiplied with the log value
Perhaps it would be a bad idea to mix and match formula. Pick one approach.
Thanks a lot for the explanation with such simplicity.
I have a question,
1) In terms of the stability of the predictions of the model, how would you compare AdaBoost with XGBoost and random forest, I understand these are different implementations of decision tree, please correct me if I’m wrong.
They are all ensemble methods and use decision trees.
Comparison depends on the data you have available. Perhaps repeated cross validation would be a good start?
Thanks, will try that.
Hi Jason, thanks a lot for sharing the knowledge!
A question about stage value and making predictions. Since stage = ln((1-error) / error), and in reality we could get a predictor that has error > 0.5 and hence stage < 0. In this case, the predictor is not only "weak" but also "bad" (because it is even worse than a random guess if the training dataset is balanced.) Should we include this predictor in the prediction vote and assign it a negative stage to ask it contributing negatively to the final score(imagine we have a "horrible" predictor with error = 1, then it is perfect in the sense that we can make opposite predictions based on it), or just simply discard it because it does not belong to the group of predictors "slightly better than random guess"?
Yes, include them all, you don’t really want to introduce a bias into the process.
Hi, how can you make adaboost from scratch using python?
I recommend using the sklearn library:
How is it used for detecting frauds specially in credit card transactions
Start with a strong definition of your problem:
Does the imbalanced data affect boosting algorithm performance?
It may. You must choose your metrics for model evaluation carefully.
Hi, thank you for the explanation. I want to ask to determine the e_t we us D or Dt?
D is training data and Dt is sample data based on weight.
Sorry, I don’t know what you are referring to?
When we calculate the error rate, we use whole training data or sample data that we use to train our model? And also what happen when the weighted sum of predictions equals to 0?
Yes, the whole training dataset.
What happen when the weighted sum of predictions equals to 0?
Hi, I have a question. AdaBoost algorithm with any type of weak classifiers or combination of some weak classifiers, eventually reach to zero training error?
Probably, but it will have overfit the training dat.
All models will have some random error that we cannot (and do not want to) capture.
What do you really mean by more weight while training a new model?
I think more weight means the probability of choosing that instance is high while sampling for training the model.
if above is correct then is it really make any difference if you give the same instances many times to model while training.
As in a weighting of the input, a coefficient applied to the input to increase or decrease the importance in the decision making.
Even after 4 years, there still isn’t another article that is as simple and clear as this.
Algorithms don’t change much 🙂
Hi I am unable to understand how the weights on the correctly classified data are reduced? The terror is zero for them so the exponential term becomes and updated w = w. What am I missing?
Perhaps the resources in the further reading section will provide a helpful alternate description for you?
There is something that’s been bugging me for a while. Is it possible for the algorithm to fall into some kind of loop in which one classifier misclassifies some points, and the next one corrects these points, but now missclasifies the ones that were previously well classified? Or
Perhaps. This might be an interesting aspect of the algorithm to investigate.
Hi Jason , plz tell what is the use of misclassification rate?
It gives you an idea of how well (or not) your model is performing.
As i am doing research related to the air sea interaction.
If we don’t know the relation between the two or more parameters, can I used adaboost technique to find the teleconnection between the two parameter, like how the two parameters are related to each other?
SOM will be useful for such scenario.
Perhaps prototype a model and compare the results to other methods.
What do these abbreviations HLSTA and HLSPA mean?
Sorry, I have not seen them before.
In the stage formulae, ln((1 – error) / error) if the calculation inside ln() comes negative, it gives the nan as result. Is that correct? Can we use abs() like this → ln(abs((1 – error) / error)) ?
Thanks for an amazing article!
Correct. Error in this case should be less than 1 always. If not, your function can prevent NaN but it also means your function is not monotonic to the error. This is usually not what we wanted.
For error calculation you’ve mentioned error = (correct-N)/N but this quantity will be negative in most cases. Shouldn’t error be = (N-correct)/N
Thank you for your feedback! The absolute value of the error should be considered.
Is this the same as logitboost?
Hi Greg…They are very similar: