Boosting and AdaBoost for Machine Learning

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

In this post you will discover the AdaBoost Ensemble method for machine learning. After reading this post, you will know:

  • What the boosting ensemble method is and generally how it works.
  • How to learn to boost decision trees using the AdaBoost algorithm.
  • How to make predictions using the learned AdaBoost model.
  • How to best prepare your data for use with the AdaBoost algorithm

This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems. If you have any questions, leave a comment and I will do my best to answer.

Let’s get started.

Boosting and AdaBoost for Machine Learning

Boosting and AdaBoost for Machine Learning
Photo by KatieThebeau, some rights reserved.

Boosting Ensemble Method

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting.

Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

Get your FREE Algorithms Mind Map

Machine Learning Algorithms Mind Map

Sample of the handy machine learning algorithms mind map.

I've created a handy mind map of 60+ algorithms organized by type.

Download it, print it and use it. 

Download For Free


Also get exclusive access to the machine learning algorithms email mini-course.

 

 

Learning An AdaBoost Model From Data

AdaBoost is best used to boost the performance of decision trees on binary classification problems.

AdaBoost was originally called AdaBoost.M1 by the authors of the technique Freund and Schapire. More recently it may be referred to as discrete AdaBoost because it is used for classification rather than regression.

AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners. These are models that achieve accuracy just above random chance on a classification problem.

The most suited and therefore most common algorithm used with AdaBoost are decision trees with one level. Because these trees are so short and only contain one decision for classification, they are often called decision stumps.

Each instance in the training dataset is weighted. The initial weight is set to:

weight(xi) = 1/n

Where xi is the i’th training instance and n is the number of training instances.

How To Train One Model

A weak classifier (decision stump) is prepared on the training data using the weighted samples. Only binary (two-class) classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class value.

The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:

error = (correct – N) / N

Where error is the misclassification rate, correct are the number of training instance predicted correctly by the model and N is the total number of training instances. For example, if the model predicted 78 of 100 training instances correctly the error or misclassification rate would be (78-100)/100 or 0.22.

This is modified to use the weighting of the training instances:

error = sum(w(i) * terror(i)) / sum(w)

Which is the weighted sum of the misclassification rate, where w is the weight for training instance i and terror is the prediction error for training instance i which is 1 if misclassified and 0 if correctly classified.

For example, if we had 3 training instances with the weights 0.01, 0.5 and 0.2. The predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, then the terrors would be 0, 1, and 0. The misclassification rate would be calculated as:

error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)

or

error = 0.704

A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:

stage = ln((1-error) / error)

Where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight or contribution to the final prediction.

The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.

For example, the weight of one training instance (w) is updated using:

w = w * exp(stage * terror)

Where w is the weight for a specific training instance, exp() is the numerical constant e or Euler’s number raised to a power, stage is the misclassification rate for the weak classifier and terror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:

terror = 0 if(y == p), otherwise 1

Where y is the output variable for the training instance and p is the prediction from the weak learner.

This has the effect of not changing the weight if the training instance was classified correctly and making the weight slightly larger if the weak learner misclassified the instance.

AdaBoost Ensemble

Weak models are added sequentially, trained using the weighted training data.

The process continues until a pre-set number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.

Once completed, you are left with a pool of weak learners each with a stage value.

Making Predictions with AdaBoost

Predictions are made by calculating the weighted average of the weak classifiers.

For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted values are weighted by each weak learners stage value. The prediction for the ensemble model is taken as a the sum of the weighted predictions. If the sum is positive, then the first class is predicted, if negative the second class is predicted.

For example, 5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks like the model will predict a value of 1.0 or the first class. These same 5 weak classifiers may have the stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an output of -0.8, which would be an ensemble prediction of -1.0 or the second class.

Data Preparation for AdaBoost

This section lists some heuristics for best preparing your data for AdaBoost.

  • Quality Data: Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of a high-quality.
  • Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset.
  • Noisy Data: Noisy data, specifically noise in the output variable can be problematic. If possible, attempt to isolate and clean these from your training dataset.

Further Reading

Below are some machine learning texts that describe AdaBoost from a machine learning perspective.

Below are some seminal and good overview research articles on the method that may be useful if you are looking to dive deeper into the theoretical underpinnings of the method:

Summary

In this post you discovered the Boosting ensemble method for machine learning. You learned about:

  • Boosting and how it is a general technique that keeps adding weak learners to correct classification errors.
  • AdaBoost as the first successful boosting algorithm for binary classification problems.
  • Learning the AdaBoost model by weighting training instances and the weak learners themselves.
  • Predicting with AdaBoost by weighting predictions from weak learners.
  • Where to look for more theoretical background on the AdaBoost algorithm.

If you have any questions about this post or the Boosting or the AdaBoost algorithm ask in the comments and I will do my best to answer.


Frustrated With Machine Learning Math?

Mater Machine Learning Algorithms

See How Algorithms Work in Minutes

…with just arithmetic and simple examples

Discover how in my new Ebook: Master Machine Learning Algorithms

It covers explanations and examples of 10 top algorithms, like:
Linear Regression, k-Nearest Neighbors, Support Vector Machines and much more…

Finally, Pull Back the Curtain on
Machine Learning Algorithms

Skip the Academics. Just Results.

Click to learn more.


64 Responses to Boosting and AdaBoost for Machine Learning

  1. Sagar Giri July 26, 2016 at 5:49 am #

    Thank You! The article was really helpful to understand the AdaBoost Algorithm.
    In the article, you said, “AdaBoost is best used to boost the performance of decision trees on binary classification problems.” what does that mean?

    Can’t this algorithm be used in non-binary classification problems like “Fruit Recognition System” where training set contains feature matrix and associated name of different fruit. I need to know this for my research project.

    • Jason Brownlee July 26, 2016 at 6:00 am #

      AdaBoost was designed for binary classification (two output classes) and makes use of decision trees.

      If you think the algorithm can be used on your problem, give it a shot and see if it works.

      • Sagar Giri July 26, 2016 at 6:31 am #

        So, choosing of machine learning algorithm is heuristic? That, I should keep on implementing different algorithms and choose the best one that fits into my problem and gives the best accuracy ?

        • Jason Brownlee July 26, 2016 at 8:03 am #

          Choosing the best algorithm and even the best representation of your problem is problem specific. You must discover the best combination empirically through experimentation.

  2. Jessica August 23, 2016 at 2:22 am #

    This article is very helpful !
    I have some questions about adaboost. First, every weak learner or classifier in adaboost is decision tree based, can other algorithms like KNN or SVM be the basic components of the ensemble learning? My second question is, how adaboost algorithm deal with large dynamic sequential dataset such as global weather data or sensor dataset?

    Thank you very much!

    • Jason Brownlee August 23, 2016 at 5:53 am #

      Hi Jessica. I guess other methods could be used, but traditionally the weak learner is one-level decision trees (decision stumps). A neural net might work if it was “weak”, as in only had one layer and maybe very few neurons (perhaps just one). It might be an interesting experiment, and I bet there is some literature on this if you’d like to check http://scholar.google.com

      I am not familiar with adaboost being used on time series. It may be possible, but I have not seen it. I expect large changes to the method may be required.

    • amar March 5, 2018 at 11:24 pm #

      hi JESSICA according to your question about the ADABOOST algorithm that is coincidently one of y questions about this algorithm ,and due to the fact you mentionned datasets on global weather data or sensor datasets such as channel of deferent radiometers and meteosats can i ask you are you doing research in the field of meteorology or precipitation and climate estimation from meteosat data ? because thats the field of my research study now and i have the same question about the possibility of use of other weak classifiers such as SVM ANN or instead of DECISION STUMPS , ? , perhaps mr jason BROWNLEE can help us with

      • Jason Brownlee March 6, 2018 at 6:14 am #

        Decision stumps are preferred in adaboost because they are weak learners.

        You could use other weak learners in theory, such as knn with a small sample or small k value.

  3. Shouldn’t the error formula be

    error = (N – correct) / N ?

    Or otherwise you would get a negative misclassification error rate.

    • Niels January 27, 2017 at 8:42 pm #

      Obviously 😉

  4. Prashant Nemade November 2, 2016 at 4:51 pm #

    Hi Jason, Thank you for the detailed clarification on adaboost algorithm. I have a question on this. How training weights are being used in this adaboost algorithm (meaning does adaboost algorithm repeat observations basis weights while building model or is it being used in some different way)?

    • Jason Brownlee November 3, 2016 at 7:55 am #

      Hi Prashant, the weights are used during the construction of subsequent decision stumps.

  5. Shreejit April 3, 2017 at 10:05 am #

    Hi Jason,

    We would have a bunch of weights and models in the end. How is the final prediction decided for the example?

  6. Ntate Ndaba April 10, 2017 at 10:26 pm #

    Thanks for this Jason. It’s very helpful.

  7. Ntate Ndaba April 13, 2017 at 12:41 am #

    Hi Jason,

    I was wondering if there any rule for alphas (learner weights) in H(x) = sign(Sum(alpha*h_i(x)) for h_i = 1…n. For example, the sum of alphas must be equal to 1, the value of each alpha can’t be > 1, etc.

    • Jason Brownlee April 13, 2017 at 10:02 am #

      No that I recall offhand. If you discover something in the literature, let me know.

      • Ntate Ndaba April 13, 2017 at 6:54 pm #

        Thank you. Will do.

  8. Yi June 22, 2017 at 12:28 am #

    Given the stage = ln((1-error) / error), the stage can be negative value if the error > 0.5. Since the stage will be used as the weight of the model, can it be a negative value?

    • Yi June 22, 2017 at 10:31 am #

      I guess I found my answer and share it here:

      The stage value can be negative when the error rate > 0.5. In this case, the weight of the weak classifier become nagative, which means we flip its predication so that what the weak classifier says true become false and vice versa. In term of the misclassified data points from the weak classifier, since we flip the sign of the weak classifier, those data points are actually classified correctly. As compensation, we reduce their weight by w*exp(negative value)….

  9. Lalit June 29, 2017 at 1:20 am #

    Hi,
    At first you said :-
    Each instance in the training dataset is weighted. weight(xi) = 1/n
    So every instance will have same weight , right ?

    then in example , you said :-
    “For example, if we had 3 training instances with the weights 0.01, 0.5 and 0.2.”

    How come above 3 instances have different weights and not same weights from (1/n formula = .33,.33,.33)

    • Ali July 28, 2017 at 1:22 am #

      As Jason mentions in the article, the weights get updated based on w = w * exp(stage * terror) before the next classifier is applied.

  10. Julia June 29, 2017 at 6:25 am #

    This article is really helpful! I searched a lot for an introduction to Adaboost. This is the best!

  11. Hossein Sayadi August 4, 2017 at 2:24 am #

    Hi Jason,
    Thanks for the perfect description. I have a question! Is Adaboost or generally boosting methods for improving the accuracy of weak learners from on type? I mean, if we have different types of learners that are not doing accurately, can we use boosting for them and should? Basically the question is that in this case, should we apply boosting on each learning type individually or we should consider all the weak learner (even from different types) together and apply one boosting ?

    Thanks,

    • Jason Brownlee August 4, 2017 at 7:01 am #

      It was designed for decision trees (decision stumps).

      You can try on other models if you like.

  12. Kevin October 30, 2017 at 8:12 am #

    Hi Jason,
    That is a nice explanation, thanks. How exactly do the weights affect the tree learning algorithm? Do they feed directly into the entropy calculations for information gain? I assume they change the way the probabilities are calculated, but I can’t find a good explanation of how.

    • Jason Brownlee October 30, 2017 at 3:48 pm #

      Exactly, the weights influence the splitting criteria used to select the split point of subsequent weak learners. It’s a way of weighting the model away from “easy” parts of the problem and toward “harder” parts, as defined by the parameter values for split points.

  13. hobart November 1, 2017 at 5:09 pm #

    When I try to practice Adaboost, I found I am still not clear about “how to build a new weak classifier by weighted training data”. You mentioned weights are used in computing errors, but let me say we use ID3 or CART, they are not error driven. How do we use weighted training data to build a new ID3/CART?

    Thanks!
    (P.S. I found myself visit this place more and more often. Many thanks for very helpful information.)

    • Jason Brownlee November 2, 2017 at 5:07 am #

      Unlike CART and ID#, boosting will create multiple trees consecutively. Each tree is built on a weighted form of the dataset to ensure that the tree focuses on areas where the previous tree/trees performed poorly. The subsequent trees try to correct the errors of the prior trees.

      I hope that helps.

      • Hobart November 2, 2017 at 3:28 pm #

        Not really ~~~
        question 1: How do we build CART stump or ID# stump (in adaboost)? Or they are not option?
        question 2: I see some examples use very simple formula like (x5.5) as a stumps, they do use weight to measure error, but is this the only thing we can use in adaboost?

        • Jason Brownlee November 2, 2017 at 3:59 pm #

          You can build a stump with one split point. The weight can be used to influence the choice of split point.

          Yes, this is the Adaboost algorithm.

          I do have a worked example in a spreadsheet in my “master machine learning algorithms” book.

          • Hobart November 2, 2017 at 4:25 pm #

            Many thanks Jason!
            I was thinking a typical adaboost shall use some “advanced” algs as stumps. For example if I try to use adaboost in kaggle. Your guidence saves me a lot time.

  14. San December 30, 2017 at 1:59 am #

    Hello,
    is there algorithm in ensemble can work in similarity?

    • Jason Brownlee December 30, 2017 at 5:25 am #

      You could create an ensemble of similarity algorithms, sure.

  15. Mohammad December 30, 2017 at 2:42 am #

    Hello,
    is there algorithm in ensemble techniques work as similarity for predict based on specific rules?

  16. Mohammad December 30, 2017 at 9:37 am #

    Attribute 1 Attribute 2 Attribute 3 Class label
    Class1 4 5 5
    Class 2 2 3 1
    Class 3 4 5 5
    Class 4 1 3 5
    Class5 2 3 1
    Class 6 4 5 5

    now class label the value of it 1 or 0
    1 if the vale of the attributes is the same, 0 if no one have same value
    ex. class 1 and class 3 and class 6 are same
    class label value is 1
    can i use ensemble techniques to get this results ?

    • Jason Brownlee December 31, 2017 at 5:20 am #

      Sorry, I don’t follow, perhaps can you explain your question a different way?

  17. Haebi April 6, 2018 at 12:26 pm #

    Hi Jason, shouldn’t the terror be -1 or 1? Since in the update rule, (W = W * exp(stage * y_i * h_t)). if the y_i is -1, and h_t is -1 (thus correct classification), their multiplication would yield +1, while an incorrect classification would lead to -1?

    This article seems to suggest the same (https://towardsdatascience.com/boosting-algorithm-adaboost-b6737a9ee60c), when it says: “If a misclassified case is from a positive weighted classifier, the “exp” term in the numerator would be always larger than 1 (y*f is always -1, theta_m is positive).”

    In manual calculation, setting terror to 0 for correct classification instead of 1 would lower the total exp/euler’s value, while setting terror to 1 for incorrect classification instead of -1 would conversely raise the total euler’s value, which is not what we want. Am I missing something here?

  18. Brenda April 8, 2018 at 10:53 am #

    hey Jason, have you an example for AdaBoost M1?, and thank you for your article in my research for AdaBoost this is my favorite 🙂

  19. Arun Athavale April 11, 2018 at 9:05 am #

    I am competing in challenge for binary classification problem. Data is 250k for training. I need to produce probabilities. I find every time I run probabilities change. That makes it problematic as I cannot have results that I reproduce. Any thoughts

  20. reys April 16, 2018 at 5:40 pm #

    Can u suggest how we can implement adaboost on SVR predicted model?

    • reys April 16, 2018 at 5:42 pm #

      ValueError Traceback (most recent call last)
      in ()
      9 print(X_test)
      10 #print(y_test)
      —> 11 er_tree = generic_clf(y_train, X_train, y_test, X_test, clf_tree)

      in generic_clf(y_train, X_train_scaled, y_test, X_test_scaled, clf)
      9 print(pred_train)
      10 print(pred_test)
      —> 11 clf.fit(X_train_scaled,y_train)
      12 pred_train = clf.predict(X_train_scaled)
      13 pred_test = clf.predict(X_test_scaled)

      ~/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
      788 sample_weight=sample_weight,
      789 check_input=check_input,
      –> 790 X_idx_sorted=X_idx_sorted)
      791 return self
      792

      ~/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
      138
      139 if is_classification:
      –> 140 check_classification_targets(y)
      141 y = np.copy(y)
      142

      ~/anaconda3/lib/python3.6/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
      170 if y_type not in [‘binary’, ‘multiclass’, ‘multiclass-multioutput’,
      171 ‘multilabel-indicator’, ‘multilabel-sequences’]:
      –> 172 raise ValueError(“Unknown label type: %r” % y_type)
      173
      174

      ValueError: Unknown label type: ‘continuous’

      my code showing error like this……..

      • Jason Brownlee April 17, 2018 at 5:55 am #

        I’m sorry to hear that.

        What code are you trying to run?

    • Jason Brownlee April 17, 2018 at 5:54 am #

      Not off hand, sorry, some careful development would be required. Also, I don’t think SVR would be a good fit as a weak model in adaboost.

  21. Tonoy Biswas June 2, 2018 at 5:05 am #

    It is very helpful Jason for my masters thesis write up. Thanks for your contribution.

  22. sana July 30, 2018 at 4:47 am #

    sir please provide code for adaboost and random forest ensemble classfiiers

  23. Ntate Ndaba August 28, 2018 at 6:06 am #

    Hi Jason. I’m back to this question after almost 2 years. I was wondering, how do you ‘select’ an instance to be part of the training set using their weights (probabilities). Do you sort the instances based on their weights and then only take those with higher weight as part of your next training set (for the next classifier) ?.

    • Ntate Ndaba August 28, 2018 at 7:39 am #

      Ok. I verified if the answer I had was correct. I used a simple technique for weighted random number generator without replacement.

  24. Kamohelo August 29, 2018 at 12:54 am #

    Hi Jason

    How do you decide to split the training set when you select examples with high probability. Do you make the selected sample from the training set be of the same size as the training set by putting back to the training set each example that has a higher probability? or you only take a subset of the training set (e.g. take only the top 70% examples for training) then classify the whole training set afterwards to get the training error and update their weights ?.

    • Jason Brownlee August 29, 2018 at 8:12 am #

      It really depends on the problem.

      Your goal is to have a train and test set that are representative of the broader dataset.

  25. Kamohelo August 29, 2018 at 1:47 am #

    Just to update my question:

    The algorithm states that the instances in the training set are weighted each time they are correctly classified or misclassified to ensure that subsequent learners focus on instances that are hard to classify. My question is, how many of those hard to classify instances from the training set are selected to train a weak classifier. Do you take 70%/50%/30%, etc. of those hard to classify instances and use them to train the weak learner in the next iteration? or do you use the entire weighted training set. If the answer is the latter, how does the weak learner utilize instance weights so that it can focus on those that have a bigger weight value ?.

  26. Sree September 11, 2018 at 5:29 am #

    Hi Jason should not there be a 0.5 multiplied to the stage value? I find the same formula with 0.5 multiplied with the log value

    • Jason Brownlee September 11, 2018 at 6:33 am #

      Perhaps it would be a bad idea to mix and match formula. Pick one approach.

Leave a Reply