Strong Learners vs. Weak Learners in Ensemble Learning

It is common to describe ensemble learning techniques in terms of weak and strong learners.

For example, we may desire to construct a strong learner from the predictions of many weak learners. In fact, this is the explicit goal of the boosting class of ensemble learning algorithms.

Although we may describe models as weak or strong generally, the terms have a specific formal definition and are used as the basis for an important finding from the field of computational learning theory.

In this tutorial, you will discover weak and strong learners and their relationship with ensemble learning.

After completing this tutorial, you will know:

  • Weak learners are models that perform slightly better than random guessing.
  • Strong learners are models that have arbitrarily good accuracy.
  • Weak and strong learners are tools from computational learning theory and provide the basis for the development of the boosting class of ensemble methods.

Kick-start your project with my new book Ensemble Learning Algorithms With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Strong Learners vs. Weak Learners for Ensemble Learning

Strong Learners vs. Weak Learners for Ensemble Learning
Photo by G. Lamar, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Weak Learners
  2. Strong Learners
  3. Weak vs. Strong Learners and Boosting

Weak Learners

A weak classifier is a model for binary classification that performs slightly better than random guessing.

A weak learner produces a classifier which is only slightly more accurate than random classification.

— Page 21, Pattern Classification Using Ensemble Methods, 2010.

This means that the model will make predictions that are known to have some skill, e.g. making the capabilities of the model weak, although not so weak that the model has no skill, e.g. performs worse than random.

  • Weak Classifier: Formally, a classifier that achieves slightly better than 50 percent accuracy.

A weak classifier is sometimes called a “weak learner” or “base learner” and the concept can be generalized beyond binary classification.

Although the concept of a weak learner is well understood in the context of binary classification, it can be taken colloquially to mean any model that performs slightly better than a naive prediction method. In this sense, it is a useful tool for thinking about the capability of classifiers and the composition of ensembles.

  • Weak Learner: Colloquially, a model that performs slightly better than a naive model.

More formally, the notion has been generalized to multi-class classification and has a different meaning beyond better than 50 percent accuracy.

For binary classification, it is well known that the exact requirement for weak learners is to be better than random guess. […] Notice that requiring base learners to be better than random guess is too weak for multi-class problems, yet requiring better than 50% accuracy is too stringent.

— Page 46, Ensemble Methods, 2012.

It is based on formal computational learning theory that proposes a class of learning methods that possess weakly learnability, meaning that they perform better than random guessing. Weak learnability is proposed as a simplification of the more desirable strong learnability, where a learnable achieved arbitrary good classification accuracy.

A weaker model of learnability, called weak learnability, drops the requirement that the learner be able to achieve arbitrarily high accuracy; a weak learning algorithm needs only output an hypothesis that performs slightly better (by an inverse polynomial) than random guessing.

The Strength of Weak Learnability, 1990.

It is a useful concept as it is often used to describe the capabilities of contributing members of ensemble learning algorithms. For example, sometimes members of a bootstrap aggregation are referred to as weak learners as opposed to strong, at least in the colloquial meaning of the term.

More specifically, weak learners are the basis for the boosting class of ensemble learning algorithms.

The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners.

— Page 23, Ensemble Methods, 2012.

The most commonly used type of weak learning model is the decision tree. This is because the weakness of the tree can be controlled by the depth of the tree during construction.

The weakest decision tree consists of a single node that makes a decision on one input variable and outputs a binary prediction, for a binary classification task. This is generally referred to as a “decision stump.”

Here the weak classifier is just a “stump”: a two terminal-node classification tree.

— Page 339, The Elements of Statistical Learning, 2016.

It is used as a weak learner so often that decision stump and weak learner are practically synonyms.

  • Decision Stump: A decision tree with a single node operating on one input variable, the output of which makes a prediction directly.

Nevertheless, other models can also be configured to be weak learners.

Because boosting requires a weak learner, almost any technique with tuning parameters can be made into a weak learner. Trees, as it turns out, make an excellent base learner for boosting …

— Page 205, Applied Predictive Modeling, 2013.

Although not formally known as weak learners, we can consider the following as candidate weak learning models:

  • k-Nearest Neighbors, with k=1 operating on one or a subset of input variables.
  • Multi-Layer Perceptron, with a single node operating on one or a subset of input variables.
  • Naive Bayes, operating on a single input variable.

Now that we are familiar with a weak learner, let’s take a closer look at strong learners.

Want to Get Started With Ensemble Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Strong Learners

A strong classifier is a model for binary classification that performs with arbitrary performance, much better than random guessing.

A class of concepts is learnable (or strongly learnable) if there exists a polynomial-time algorithm that achieves low error with high confidence for all concepts in the class.

The Strength of Weak Learnability, 1990.

This is sometimes interpreted to mean perfect skill on a training or holdout dataset, although more likely refers to a “good” or “usefully skillful” model.

  • Strong Classifier: Formally, a classifier that achieves arbitrarily good accuracy.

We seek strong classifiers for predictive modeling problems. It is the goal of the modeling project to develop a strong classifier that makes mostly correct predictions with high confidence.

Again, although the concept of a strong classifier is well understood for binary classification, it can be generalized to other problem types and we can interpret the concept less formally as a well-performing model, perhaps near-optimal.

  • Strong Learner: Colloquially, a model that performs very well compared to a naive model.

We are attempting to develop a strong model when we fit a machine learning model directly on a dataset. For example, we might consider the following algorithms as techniques for fitting a strong model in the colloquial sense, where the hyperparameters of each method are tuned for the target problem:

  • Logistic Regression.
  • Support Vector Machine.
  • k-Nearest Neighbors.

And many more methods listed in the previous section or with which you may be familiar.

Strong learning is what we seek, and we can contrast their capability with weak learners, although we can also construct strong learners from weak learners.

Weak vs. Strong Learners and Boosting

We have established that weak learners perform slightly better than random, and that strong learners are good or even near-optimal and it is the latter that we seek for a predictive modeling project.

In computational learning theory, specifically PAC learning, the formal classes of weak and strong learnability were defined with the open question as to whether the two were equivalent or not.

The proof presented here is constructive; an explicit method is described for directly converting a weak learning algorithm into one that achieves arbitrary accuracy. The construction uses filtering to modify the distribution of examples in such a way as to force the weak learning algorithm to focus on the harder-to-learn parts of the distribution.

The Strength of Weak Learnability, 1990.

Later, it was discovered that they are indeed equivalent. More so that a strong learner can be constructed from many weak learners, formally defined. This provided the basis for the boosting class of ensemble learning methods.

The main result is a proof of the perhaps surprising equivalence of strong and weak learnability.

The Strength of Weak Learnability, 1990.

Although this theoretical finding was made, it still took years before the first viable boosting methods were developed, implementing the procedure.

Most notably Adaptive Boosting, referred to as AdaBoost, was the first successful boosting method, later leading to a large number of methods, culminating today in highly successful techniques such as gradient boosting and implementations such as Extreme Gradient Boosting (XGBoost).

Ensembles of weak learners was mostly studied in the machine learning community. In this thread, researchers often work on weak learners and try to design powerful algorithms to boost the performance from weak to strong. This thread of work has led to the birth of famous ensemble methods such as AdaBoost, Bagging, etc., and theoretical understanding on why and how weak learners can be boosted to strong ones.

— Page 16, Ensemble Methods, 2012.

Generally, the goal of boosting ensembles is to develop a large number of weak learners for a predictive learning problem, then best combine them in order to achieve a strong learner. This is good goal as weak learners are easy to prepare but not desirable, and strong learners are hard to prepare and highly desirable.

Since strong learners are desirable yet difficult to get, while weak learners are easy to obtain in real practice, this result opens a promising direction of generating strong learners by ensemble methods.

— Pages 16-17, Ensemble Methods, 2012.

  • Weak Learner: Easy to prepare, but not desirable due to their low skill.
  • Strong Learner: Hard to prepare, but desirable because of their high skill.

The procedure that was found to achieve this is to sequentially develop weak learners and add them to the ensemble, where each weak learner is trained in a way to pay more attention to parts of the problem domain that prior models got wrong. Although all boosting techniques follow this general procedure with specific differences and optimizations, the notion of weak and strong learners is a useful concept more generally for machine learning and ensemble learning.

For example, we have already seen how we can describe the goal of a predictive model is to develop a strong model. It is common practice to evaluate the performance of a model against a baseline or naive model, such as random predictions for binary classification. A weak learner is very much like the naive model, although slightly skillful and using a minimum of information from the problem domain, as opposed to completely naive.

Consider that although we do not technically construct weak learners in bootstrap aggregation (bagging), meaning the members are not decision stumps, we do aim to create weaker decision trees to comprise the ensemble. This is often achieved by fitting the trees on sampled subsets of the data and not pruning the trees, allowing them to overfit the training data slightly.

For classification we can understand the bagging effect in terms of a consensus of independent weak learners

— Page 286, The Elements of Statistical Learning, 2016.

Both changes are made to seek less correlated trees but have the effect of training weaker, but perhaps not weak, models to comprise the ensemble.

  • Bagging: explicitly trains weaker (but not weak) learners.

Consider stacked generalization (stacking) that trains a model to best combine the predictions from multiple different models fit on the same training dataset. Each contributing level-0 model is in effect a strong learner, and the meta level-1 model seeks to make a stronger model by combining the predictions from the strong models.

  • Stacking: explicitly combines the predictions from strong learners.

Mixture of experts (MoE) operates in a similar way, training multiple strong models (the experts) that are combined into hopefully stronger models via a meta-model, the gating network, and combing method.

Mixture-of-experts can also be seen as a classifier selection algorithm, where individual classifiers are trained to become experts in some portion of the feature space. In this setting, individual classifiers are indeed trained to become experts, and hence are usually not weak classifiers

— Page 16, Ensemble Machine Learning, 2012.

This highlights that although weak and strong learnability and learners are an important theoretical finding and basis for boosting, that the more generalized ideas of these classifiers are useful tools for designing and selecting ensemble methods.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Books

Articles

Summary

In this tutorial, you discovered weak and strong learners and their relationship with ensemble learning.

Specifically, you learned:

  • Weak learners are models that perform slightly better than random guessing.
  • Strong learners are models that have arbitrarily good accuracy.
  • Weak and strong learners are tools from computational learning theory and provide the basis for the development of the boosting class of ensemble methods.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Modern Ensemble Learning!

Ensemble Learning Algorithms With Python

Improve Your Predictions in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Ensemble Learning Algorithms With Python

It provides self-study tutorials with full working code on:
Stacking, Voting, Boosting, Bagging, Blending, Super Learner, and much more...

Bring Modern Ensemble Learning Techniques to
Your Machine Learning Projects


See What's Inside

4 Responses to Strong Learners vs. Weak Learners in Ensemble Learning

  1. Hannah November 11, 2021 at 10:21 am #

    Great article! Thank you 🙂

  2. Abao Jiang June 13, 2024 at 1:57 pm #

    Hi Jason,

    Thanks for sharing this high-quality article. I have some problems about the definitions of weak and strong learners.
    Assume we’re now solving a binary-classification problem. And, the random guessing has an error rate of 50% on the training set. As stated in the article, weak learners are those performing slightly better than random guessing. Hence, I assume they tend to underfit the data. On the other hand, strong learners can obtain arbitrarily low error rate, so overfitting usually happens.
    For bagging, the ultimate goal is to reduce the variance. For example, each decision tree can be trained to have a training loss near zero. And, the random forest can be simply thought of as a combined version of decision trees with bagging. Can I say that each base model in the random forest is a strong learner because it has a low bias? Thanks!

    • James Carmichael June 14, 2024 at 6:45 am #

      Hi Abao…Your understanding of weak and strong learners, as well as the concepts of bagging and random forests, touches on several important aspects of machine learning. Let me clarify these concepts further:

      ### Weak Learners vs. Strong Learners
      – **Weak Learners**:
      – A weak learner is a model that performs slightly better than random guessing. For binary classification, this means achieving an error rate that is better than 50%.
      – Weak learners tend to have high bias and might underfit the data. They are simple models that are not very powerful individually.
      – Examples of weak learners include decision stumps (single-level decision trees) or simple linear classifiers.

      – **Strong Learners**:
      – A strong learner is a model that can achieve a very low error rate on the training data, potentially approaching zero.
      – Strong learners can be complex models that are capable of capturing intricate patterns in the data, leading to low bias but possibly high variance, which can result in overfitting.
      – Examples of strong learners include deep neural networks or fully grown decision trees.

      ### Bagging and Random Forests
      – **Bagging (Bootstrap Aggregating)**:
      – The goal of bagging is to reduce variance and improve the stability and accuracy of machine learning algorithms. It involves training multiple models (typically the same type) on different random subsets of the training data and then averaging their predictions (for regression) or taking a majority vote (for classification).
      – Bagging works well with high-variance, low-bias models, such as fully grown decision trees, which are strong learners when considered individually.

      – **Random Forests**:
      – A random forest is an ensemble of decision trees trained using bagging, but with an additional layer of randomness. In each split of a decision tree, only a random subset of features is considered. This decorrelates the trees and further reduces variance.
      – Each individual tree in a random forest can indeed be considered a strong learner because it is grown to its full depth and can achieve a low training error (low bias). However, the random forest as a whole balances this by averaging over many trees, thereby reducing variance and mitigating the risk of overfitting.

      ### Clarifying Your Assumptions
      – **Underfitting and Weak Learners**:
      – Weak learners tend to underfit the data because they are simple models with high bias. Their performance is only slightly better than random guessing.

      – **Overfitting and Strong Learners**:
      – Strong learners can potentially overfit the data because they are powerful models capable of capturing complex patterns in the training data, leading to low bias but high variance.

      – **Base Models in Random Forests**:
      – It is correct to say that each base model (decision tree) in a random forest is a strong learner in the sense that it has low bias because it can achieve a low error on the training data. However, the use of bagging and random feature selection helps to reduce the overall variance of the ensemble, making the random forest a robust and accurate model.

      ### Conclusion
      Your understanding is mostly correct, with the following key points to remember:
      – Weak learners are simple models with high bias and typically underfit the data.
      – Strong learners are complex models with low bias and can overfit the data.
      – Random forests use strong learners (fully grown decision trees) and combine them using bagging and random feature selection to reduce variance and improve generalization.

      By combining these strong learners, the random forest leverages their low bias while mitigating the high variance through ensemble techniques, resulting in a model that performs well on unseen data.

      • Abao Jiang June 14, 2024 at 11:39 am #

        Hi James,

        Thanks for your quick reply. Recently, I observed that bagging was described as follows in ESL,

        > For classification we can understand the bagging effect in terms of a consensus of independent weak learners.
        – Page 286, The Elements of Statistical Learning.

        This was the main statement making me confused. May I ask why the authors of ESL use “weak” learners here, which seems to be contradictory to our discussion above?

Leave a Reply

Machine Learning Mastery is part of Guiding Tech Media, a leading digital media publisher focused on helping people figure out technology. Visit our corporate website to learn more about our mission and team.