The post Why Use Ensemble Learning? appeared first on Machine Learning Mastery.

]]>Ensembles are predictive models that combine predictions from two or more other models.

Ensemble learning methods are popular and the go-to technique when the best performance on a predictive modeling project is the most important outcome.

Nevertheless, they are not always the most appropriate technique to use and beginners the field of applied machine learning have the expectation that ensembles or a specific ensemble method are always the best method to use.

Ensembles offer two specific benefits on a predictive modeling project, and it is important to know what these benefits are and how to measure them to ensure that using an ensemble is the right decision on your project.

In this tutorial, you will discover the benefits of using ensemble methods for machine learning.

After reading this tutorial, you will know:

- A minimum benefit of using ensembles is to reduce the spread in the average skill of a predictive model.
- A key benefit of using ensembles is to improve the average prediction performance over any contributing member in the ensemble.
- The mechanism for improved performance with ensembles is often the reduction in the variance component of prediction errors made by the contributing models.

Let’s get started.

This tutorial is divided into four parts; they are:

- Ensemble Learning
- Use Ensembles to Improve Robustness
- Bias, Variance, and Ensembles
- Use Ensembles to Improve Performance

An ensemble is a machine learning model that combines the predictions from two or more models.

The models that contribute to the ensemble, referred to as ensemble members, may be the same type or different types and may or may not be trained on the same training data.

The predictions made by the ensemble members may be combined using statistics, such as the mode or mean, or by more sophisticated methods that learn how much to trust each member and under what conditions.

The study of ensemble methods really picked up in the 1990s, and that decade was when papers on the most popular and widely used methods were published, such as core bagging, boosting, and stacking methods.

In the late 2000s, adoption of ensembles picked up due in part to their huge success in machine learning competitions, such as the Netflix prize and later competitions on Kaggle.

Over the last couple of decades, multiple classifier systems, also called ensemble systems have enjoyed growing attention within the computational intelligence and machine learning community.

— Page 1, Ensemble Machine Learning, 2012.

Ensemble methods greatly increase computational cost and complexity. This increase comes from the expertise and time required to train and maintain multiple models rather than a single model. This forces the question:

**Why should we consider using an ensemble?**

There are two main reasons to use an ensemble over a single model, and they are related; they are:

**Performance:**An ensemble can make better predictions and achieve better performance than any single contributing model.**Robustness**: An ensemble reduces the spread or dispersion of the predictions and model performance.

Ensembles are used to achieve better predictive performance on a predictive modeling problem than a single predictive model. The way this is achieved can be understood as the model reducing the variance component of the prediction error by adding bias (i.e. in the context of the bias-variance trade-off).

Originally developed to reduce the variance—thereby improving the accuracy—of an automated decision-making system …

— Page 1, Ensemble Machine Learning, 2012.

There is another important and less discussed benefit of ensemble methods is improved robustness or reliability in the average performance of a model.

These are both important concerns on a machine learning project and sometimes we may prefer one or both properties from a model.

Let’s take a closer look at these two properties in order to better understand the benefits of using ensemble learning on a project.

On a predictive modeling project, we often evaluate multiple models or modeling pipelines and choose one that performs well or best as our final model.

The algorithm or pipeline is then fit on all available data and used to make predictions on new data.

We have an idea of how well the model will perform on average from our test harness, typically estimated using repeated k-fold cross-validation as a gold standard. The problem is, average performance might not be sufficient.

An average accuracy or error of a model is a summary of the expected performance, when in fact, some models performed better and some models performed worse on different subsets of the data.

The standard deviation is the average difference between an observation and the mean and summarizes the dispersion or spread of data. For an accuracy or error measure for a model, it can give you an idea of the spread of the model’s behavior.

Looking at the minimum and maximum model performance scores will give you an idea of the worst and best performance you might expect from the model, and this might not be acceptable for your application.

The simplest ensemble is to fit the model multiple times on the training datasets and combine the predictions using a summary statistic, such as the mean for regression or the mode for classification. Importantly, each model needs to be slightly different due to the stochastic learning algorithm, difference in the composition of the training dataset, or differences in the model itself.

This will reduce the spread in the predictions made by the model. The mean performance will probably be about the same, although the worst- and best-case performance will be brought closer to the mean performance.

In effect, it smooths out the expected performance of the model.

We can refer to this as the “*robustness*” in the expected performance of the model and is a minimum benefit of using an ensemble method.

An ensemble may or may not improve modeling performance over any single contributing member, discussed more further, but at minimum, it should reduce the spread in the average performance of the model.

For more on this topic, see the tutorial:

Machine learning models for classification and regression learn a mapping function from inputs to outputs.

This mapping is learned from examples from the problem domain, the training dataset, and is evaluated on data not used during training, the test dataset.

The errors made by a machine learning model are often described in terms of two properties: the **bias** and the **variance**.

The bias is a measure of how close the model can capture the mapping function between inputs and outputs. It captures the rigidity of the model: the strength of the assumption the model has about the functional form of the mapping between inputs and outputs.

The variance of the model is the amount the performance of the model changes when it is fit on different training data. It captures the impact of the specifics of the data has on the model.

Variance refers to the amount by which [the model] would change if we estimated it using a different training data set.

— Page 34, An Introduction to Statistical Learning with Applications in R, 2014.

The bias and the variance of a model’s performance are connected.

Ideally, we would prefer a model with low bias and low variance, although in practice, this is very challenging. In fact, this could be described as the goal of applied machine learning for a given predictive modeling problem.

Reducing the bias can often easily be achieved by increasing the variance. Conversely, reducing the variance can easily be achieved by increasing the bias.

This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance […] or a method with very low variance but high bias …

— Page 36, An Introduction to Statistical Learning with Applications in R, 2014.

Some models naturally have a high bias or a high variance, which can be often relaxed or increased using hyperparameters that change the learning behavior of the algorithm.

Ensembles provide a way to reduce the variance of the predictions; that is the amount of error in the predictions made that can be attributed to “*variance*.”

This is not always the case, but when it is, this reduction in variance, in turn, leads to improved predictive performance.

Empirical and theoretical evidence show that some ensemble techniques (such as bagging) act as a variance reduction mechanism, i.e., they reduce the variance component of the error. Moreover, empirical results suggest that other ensemble techniques (such as AdaBoost) reduce both the bias and the variance parts of the error.

— Page 39, Pattern Classification Using Ensemble Methods, 2010.

Using ensembles to reduce the variance properties of prediction errors leads to the key benefit of using ensembles in the first place: to improve predictive performance.

Reducing the variance element of the prediction error improves predictive performance.

We explicitly use ensemble learning to seek better predictive performance, such as lower error on regression or high accuracy for classification.

… there is a way to improve model accuracy that is easier and more powerful than judicious algorithm selection: one can gather models into ensembles.

— Page 2, Ensemble Methods in Data Mining, 2010.

This is the **primary use of ensemble learning methods** and the benefit demonstrated through the use of ensembles by the majority of winners of machine learning competitions, such as the Netflix prize and competitions on Kaggle.

In the Netflix Prize, a contest ran for two years in which the first team to submit a model improving on Netflix’s internal recommendation system by 10% would win $1,000,000. […] the final edge was obtained by weighing contributions from the models of up to 30 competitors.

— Page 8, Ensemble Methods in Data Mining, 2010.

This benefit has also been demonstrated with academic competitions, such as top solutions for the famous ImageNet dataset in computer vision.

An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task.

— Deep Residual Learning for Image Recognition, 2015.

When used in this way, an ensemble should only be adopted if it performs better on average than any contributing member of the ensemble. If this is not the case, then the contributing member that performs better should be used instead.

Consider the distribution of expected scores calculated by a model on a test harness, such as repeated k-fold cross-validation, as we did above when considering the “*robustness*” offered by an ensemble. An ensemble that reduces the variance in the error, in effect, will shift the distribution rather than simply shrink the spread of the distribution.

This can result in a better average performance as compared to any single model.

This is not always the case, and having this expectation is a common mistake made by beginners.

It is possible, and even common, for the performance of an ensemble to perform no better than the best-performing member of the ensemble. This can happen if the ensemble has one top-performing model and the other members do not offer any benefit or the ensemble is not able to harness their contribution effectively.

It is also possible for an ensemble to perform worse than the best-performing member of the ensemble. This too is common any typically involves one top-performing model whose predictions are made worse by one or more poor-performing other models and the ensemble is not able to harness their contributions effectively.

As such, it is important to test a suite of ensemble methods and tune their behavior, just as we do for any individual machine learning model.

This section provides more resources on the topic if you are looking to go deeper.

- How to Reduce Variance in a Final Machine Learning Model
- How to Develop a Horizontal Voting Deep Learning Ensemble to Reduce Variance

- Pattern Classification Using Ensemble Methods, 2010.
- Ensemble Methods, 2012.
- Ensemble Machine Learning, 2012.
- Ensemble Methods in Data Mining, 2010.

In this post, you discovered the benefits of using ensemble methods for machine learning.

Specifically, you learned:

- A minimum benefit of using ensembles is to reduce the spread in the average skill of a predictive model.
- A key benefit of using ensembles is to improve the average prediction performance over any contributing member in the ensemble.
- The mechanism for improved performance with ensembles is often the reduction in the variance component of prediction errors made by the contributing models.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Why Use Ensemble Learning? appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Ensemble Learning appeared first on Machine Learning Mastery.

]]>This includes choosing a book to read based on reviews, choosing a course of action based on the advice of multiple medical doctors, and determining guilt.

Often, decision making by a group of individuals results in a better outcome than a decision made by any one member of the group. This is generally referred to as the wisdom of the crowd.

We can achieve a similar result by combining the predictions of multiple machine learning models for regression and classification predictive modeling problems. This is referred to generally as ensemble machine learning, or simply **ensemble learning**.

In this post, you will discover a gentle introduction to ensemble learning.

After reading this post, you will know:

- Many decisions we make involve the opinions or votes of other people.
- The ability of groups of people to make better decisions than individuals is called the wisdom of the crowd.
- Ensemble machine learning involves combining predictions from multiple skillful models.

Let’s get started.

This tutorial is divided into three parts; they are:

- Making Important Decisions
- Wisdom of Crowds
- Ensemble Machine Learning

Consider important decisions you make in your life.

For example:

- What book to purchase and read next.
- What university to attend.

Candidate books are those that sound interesting, but the book we purchase might have the most favorable reviews. Candidate universities are those that offer the courses we’re interested in, but we might choose one based on the feedback from friends and acquaintances that have first-hand experience.

We might trust the reviews and star ratings because each individual that contributed a review was (hopefully) unaffiliated with the book and independent of the other people leaving a review. When this is not the case, trust in the outcome is questionable and trust in the system is shaken, which is why Amazon works hard to delete fake reviews for books.

Also, consider important decisions we make more personally.

For example, medical treatment for an illness.

We take advice from an expert, but we seek a second, third, and even more opinions to confirm we are taking the best course of action.

The advice from the second and third opinion may or may not match the first opinion, but we weigh it heavily because it is provided dispassionately, objectively, and independently. If the doctors colluded on their opinion, then we would feel like the process of seeking a second and third opinion has failed.

… whenever we are faced with making a decision that has some important consequence, we often seek the opinions of different “experts” to help us make that decision …

— Page 2, Ensemble Machine Learning, 2012.

Finally, consider decisions we make as a society.

For example:

- Who should represent a geographical area in a government.
- Whether someone is guilty of a crime.

The democratic election of representatives is based (in some form) on the independent votes of citizens.

Making decisions based on the input of multiple people or experts has been a common practice in human civilization and serves as the foundation of a democratic society.

— Page v, Ensemble Methods, 2012.

An individual’s guilt of a serious crime may be determined by a jury of independent peers, often sequestered to enforce the independence of their interpretation. Cases may also be appealed at multiple levels, providing second, third, and more opinions on the outcome.

The judicial system in many countries, whether based on a jury of peers or a panel of judges, is also based on ensemble-based decision making.

— Pages 1-2, Ensemble Machine Learning, 2012.

These are all examples of an outcome arrived at through the combination of lower-level opinions, votes, or decisions.

… ensemble-based decision making is nothing new to us; as humans, we use such systems in our daily lives so often that it is perhaps second nature to us.

— Page 1, Ensemble Machine Learning, 2012.

In each case, we can see that there are properties of the lower-level decisions that are critical for the outcome to be useful, such as a belief in their independence and that each has some validity on their own.

This approach to decision making is so common, it has a name.

This approach to decision making when using humans that make the lower-level decisions is often referred to as the “wisdom of the crowd.”

It refers to the case where the opinion calculated from the aggregate of a group of people is often more accurate, useful, or correct than the opinion of any individual in the group.

A famous case of this from more than 100 years ago, and often cited, is that of a contest at a fair in Plymouth, England to estimate the weight of an ox. Individuals made their guess and the person whose guess was closest to the actual weight won the meat.

The statistician Francis Galton collected all of the guesses afterward and calculated the average of the guesses.

… he added all the contestants’ estimates, and calculated the mean of the group’s guesses. That number represented, you could say, the collective wisdom of the Plymouth crowd. If the crowd were a single person, that was how much it would have guessed the ox weighed.

— Page xiii, The Wisdom of Crowds, 2004.

He found that the mean of the guesses made by the contestants was very close to the actual weight. That is, taking the average value of all the numerical weights from the 800 participants was an accurate way of determining the true weight.

The crowd had guessed that the ox, after it had been slaughtered and dressed, would weigh 1,197 pounds. After it had been slaughtered and dressed, the ox weighed 1,198 pounds. In other words, the crowd’s judgment was essentially perfect.

— Page xiii, The Wisdom of Crowds, 2004.

This example is given at the beginning of James Surowiecki’s 2004 book titled “The Wisdom of Crowds” that explores the ability of groups of humans to make decisions and predictions that are often better than the members of the group.

This intelligence, or what I’ll call “the wisdom of crowds,” is at work in the world in many different guises.

— Page xiv, The Wisdom of Crowds, 2004.

The book motivates the preference to average the guesses, votes, and opinions of groups of people when making some important decisions instead of searching for and consulting a single expert.

… we feel the need to “chase the expert.” The argument of this book is that chasing the expert is a mistake, and a costly one at that. We should stop hunting and ask the crowd (which, of course, includes the geniuses as well as everyone else) instead. Chances are, it knows.

— Page xv, The Wisdom of Crowds, 2004.

The book goes on to highlight a number properties of any system that makes decisions based on groups of people, summarized nicely in Lior Rokach’s 2010 book titled “Pattern Classification Using Ensemble Methods” (page 22), as:

**Diversity of opinion**: Each member should have private information even if it is just an eccentric interpretation of the known facts.**Independence**: Members’ opinions are not determined by the opinions of those around them.**Decentralization**: Members are able to specialize and draw conclusions based on local knowledge.**Aggregation**: Some mechanism exists for turning private judgments into a collective decision.

As a decision-making system, the approach is not always the most effective (e.g. stock market bubbles, fads, etc.), but can be effective in a range of different domains where the outcomes are important.

We can use this approach to decision making in applied machine learning.

Applied machine learning often involves fitting and evaluating models on a dataset.

Given that we cannot know which model will perform best on the dataset beforehand, this may involve a lot of trial and error until we find a model that performs well or best for our project.

This is akin to making a decision using a single expert. Perhaps the best expert we can find.

A complementary approach is to prepare multiple different models, then combine their predictions. This is called an ensemble machine learning model, or simply an ensemble, and the process of finding a well-performing ensemble model is referred to as “**ensemble learning**“.

Ensemble methodology imitates our second nature to seek several opinions before making a crucial decision.

— Page vii, Pattern Classification Using Ensemble Methods, 2010.

This is akin to making a decision using the opinions from multiple experts.

The most common type of ensemble involves training multiple versions of the same machine learning model in a way that ensures that each ensemble member is different (e.g. decision trees fit on different subsamples of the training dataset), then combining the predictions using averaging or voting.

A less common, although just as effective, approach involves training different algorithms on the same data (e.g. a decision tree, a support vector machine, and a neural network) and combining their predictions.

Like combining the opinions of humans in a crowd, the effectiveness of the ensemble relies on each model having some skill (better than random) and some independence from the other models. This latter point is often interpreted as meaning that the model is skillful in a different way from other models in the ensemble.

The hope is that the ensemble results in a better performing model than any contributing member.

The core principle is to weigh several individual pattern classifiers, and combine them in order to reach a classification that is better than the one obtained by each of them separately.

— Page vii, Pattern Classification Using Ensemble Methods, 2010.

At worst, the ensemble limits the worst case of predictions by reducing the variance of the predictions. Model performance can vary with the training data (and the stochastic nature of the learning algorithm in some cases), resulting in better or worse performance for any specific model.

… the goal of ensemble systems is to create several classifiers with relatively fixed (or similar) bias and then combining their outputs, say by averaging, to reduce the variance.

— Page 2, Ensemble Machine Learning, 2012.

An ensemble can smooth this out and ensure that predictions made are closer to the average performance of contributing members. Further, reducing the variance in predictions often results in a lift in the skill of the ensemble. This comes at the added computational cost of fitting and maintaining multiple models instead of a single model.

Although ensemble predictions will have a lower variance, they are not guaranteed to have better performance than any single contributing member.

… researchers in the computational intelligence and machine learning community have studied schemes that share such a joint decision procedure. These schemes are generally referred to as ensemble learning, which is known to reduce the classifiers’ variance and improve the decision system’s robustness and accuracy.

— Page v, Ensemble Methods, 2012.

Sometimes, the best performing model, e.g. the best expert, is sufficiently superior compared to other models that combining its predictions with other models can result in worse performance.

As such, selecting models, even ensemble models, still requires carefully controlled experiments on a robust test harness.

This section provides more resources on the topic if you are looking to go deeper.

- The Wisdom of Crowds, 2004.
- Pattern Classification Using Ensemble Methods, 2010.
- Ensemble Methods, 2012.
- Ensemble Machine Learning, 2012.

- Ensemble learning, Wikipedia.
- Ensemble learning, Scholarpedia.
- Wisdom of the crowd, Wikipedia.
- The Wisdom of Crowds, Wikipedia.

In this post, you discovered a gentle introduction to ensemble learning.

Specifically, you learned:

- Many decisions we make involve the opinions or votes of other people.
- The ability of groups of people to make better decisions than individuals is called the wisdom of the crowd.
- Ensemble machine learning involves combining predictions from multiple skillful models.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Ensemble Learning appeared first on Machine Learning Mastery.

]]>The post 6 Books on Ensemble Learning appeared first on Machine Learning Mastery.

]]>The effect can be both improved predictive performance and lower variance of the predictions made by the model.

Ensemble methods are covered in most textbooks on machine learning; nevertheless, there are books dedicated to the topic.

In this post, you will discover the top books on the topic of ensemble machine learning.

After reading this post, you will know:

- Books on ensemble learning, including their table of contents and where to learn more about them.
- Sections and chapters on ensemble learning in the most popular and common machine learning textbooks.
- Book recommendations for machine learning practitioners interested in ensemble learning.

Let’s get started.

The books dedicated to the topic of ensemble learning that we will cover are as follows:

- Supervised and Unsupervised Ensemble Methods and their Applications, 2008.
- Pattern Classification Using Ensemble Methods, 2010.
- Ensemble Learning, 2019.
- Ensemble Methods in Data Mining, 2010.
- Ensemble Methods, 2012.
- Ensemble Machine Learning, 2012.

There are also some books from Packt, but I won’t be reviewing them; they are:

- Hands-On Ensemble Learning with R, 2018.
- Hands-On Ensemble Learning with Python, 2019.
- Ensemble Machine Learning Cookbook, 2019.

**Did I miss a book on ensemble learning?**

Let me know in the comments below.

**Have you read any of these books on ensemble learning?**

What did you think? Let me know in the comments.

Let’s take a closer look at these books, including their author, table of contents, and where to learn more.

The full title of this book is “Supervised and Unsupervised Ensemble Methods and their Applications” and it was edited by Oleg Okun and Giorgio Valentini and published in 2008.

This book is a collection of academic papers by a range of different authors on the topic of applications of ensemble learning.

The book includes nine chapters divided into two parts, assembling contributions to the applications of supervised and unsupervised ensembles.

— Page VIII, Supervised and Unsupervised Ensemble Methods and their Applications, 2008.

- Part I: Ensembles of Clustering Methods and Their Applications
- Chapter 01: Cluster Ensemble Methods: From Single Clusterings to Combined Solutions
- Chapter 02: Random Subspace Ensembles for Clustering Categorical Data
- Chapter 03: Ensemble Clustering with a Fuzzy Approach
- Chapter 04: Collaborative Multi-Strategical Clustering for OBject-Oriented Image Analysis

- Part II: Ensembles of Classification Methods and Their Applications
- Chapter 05: Intrusion Detection in Computer Systems Using Multiple Classifier Systems
- Chapter 06: Ensembles of Nearest Neighbors for Gene Expression Based Cancer Classification
- Chapter 07: Multivariate Time Series Classification via Stacking of Univariate Classifiers
- Chapter 08: Gradient Boosting GARCH and Neural Networks for Time Series Prediction
- Chapter 09: Cascading with VDM and Binary Decision Trees for Nominal Data

I generally would not recommend this book to machine learning practitioners unless one of the applications covered by the book is directly related to your current project.

You can learn more about this book here:

The full title of this book is “Pattern Classification Using Ensemble Methods” and it was written by Lior Rokach and published in 2010.

This book provides a technical introduction to the topic of ensemble machine learning written for students and academics.

Throughout the book, special emphasis was put on the extensive use of illustrative examples. Accordingly, in addition to ensemble theory, the reader is also provided with an abundance of artificial as well as real-world applications from a wide range of fields. The data referred to in this book, as well as most of the Java implementations of the presented algorithms, can be obtained via the Web.

— Page viii, Pattern Classification Using Ensemble Methods, 2010.

- Chapter 01: Introduction to Pattern Classification
- Chapter 02: Introduction to Ensemble Learning
- Chapter 03: Ensemble Classification
- Chapter 04: Ensemble Diversity
- Chapter 05: Ensemble Selection
- Chapter 06: Error Correcting Output Codes
- Chapter 07: Evaluating Ensembles of Classifiers

I like the level of this book. It is technical, but not overly so, and stays grounded in the concerns of using ensemble algorithms on supervised predictive modeling projects. I think it is a good textbook on ensemble learning for practitioners.

You can learn more about this book here:

The full title of this book is “Ensemble Learning: Pattern Classification Using Ensemble Methods” and it was written by Lior Rokach and published in 2019.

This is a direct update to the book “Pattern Classification Using Ensemble Methods” and given a different title.

The first edition of this book was published a decade ago. The book was well-received by the machine learning and data science communities and was translated into Chinese. […] The second edition aims to update the previously presented material on the fundamental areas, and to present new findings in the field; more than a third of this edition is comprised of new material.

— Page vii, Ensemble Learning: Pattern Classification Using Ensemble Methods, 2019.

- Chapter 01: Introduction to Machine Learning
- Chapter 02: Classification and Regression Trees
- Chapter 03: Introduction to Ensemble Learning
- Chapter 04: Ensemble Classification
- Chapter 05: Gradient Boosting Machines
- Chapter 06: Ensemble Diversity
- Chapter 07: Ensemble Selection
- Chapter 08: Error Correcting Output Codes
- Chapter 09: Evaluating Ensemble Classifiers

This is a great textbook on ensemble learning for students and practitioners and is preferred over “*Pattern Classification Using Ensemble Methods*” if you must choose between the two.

You can learn more about this book here:

The full title of this book is “Ensemble Methods in Data Mining: Improving Accuracy

Through Combining Predictions” and it was written by Giovanni Seni and John Elder and published in 2010.

This is a technical book on ensembles, although concepts are demonstrated with full examples in R.

This book is aimed at novice and advanced analytic researchers and practitioners – especially in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn why and how to employ this breakthrough method, and advanced practitioners will gain insight into building even more powerful models. Throughout, snippets of code in R are provided to illustrate the algorithms described and to encourage the reader to try the technique.

— Page i, Ensemble Methods in Data Mining, 2010.

- Chapter 01: Ensembles Discovered
- Chapter 02: Predictive Learning and Decision Trees
- Chapter 03: Model Complexity, Model Selection and Regularization
- Chapter 04: Importance Sampling and the Classic Ensemble Methods
- Chapter 05: Rule Ensembles and Interpretation Statistics
- Chapter 06: Ensemble Complexity
- Appendix A: AdaBoost Equivalence to FSF Procedure
- Appendix B: Gradient Boosting and Robust Loss Functions

I believe this is the first book I purchased on ensemble learning years ago. It is a good crash course in ensemble learning for practitioners, especially those already using R. It may be a little too mathematical for most practitioners; nevertheless, I think it might be a good smaller substitute for the above textbook on ensemble methods.

You can learn more about this book here:

The full title of this book is “Ensemble Methods: Foundations and Algorithms” and it was written by Zhi-Hua Zhou and published in 2012.

This is another focused textbook on the topic of ensemble learning targeted at students and academics.

This book provides researchers, students and practitioners with an introduction to ensemble methods. The book consists of eight chapters which naturally constitute three parts.

— Page vii, Ensemble Methods: Foundations and Algorithms, 2012.

- Chapter 01: Introduction
- Chapter 02: Boosting
- Chapter 03: Bagging
- Chapter 04: Combination Methods
- Chapter 05: Diversity
- Chapter 06: Ensemble Pruning
- Chapter 07: Clustering Ensembles
- Chapter 08: Advanced Topics

This book is well written and covers the main methods with good references. I think it’s another great jump-start on the basics of ensemble methods as long as the reader is comfortable with a little math. I liked the algorithm descriptions and worked examples.

You can learn more about this book here:

The full title of this book is “Ensemble Machine Learning: Methods and Applications” and it was edited by Cha Zhang and Yunqian Ma and published in 2012.

This book is a collection of academic papers written by a range of authors on the topic of applications of ensemble machine learning.

Despite the great success of ensemble learning methods recently, we found very few books that were dedicated to this topic, and even fewer that provided insights about how such methods shall be applied in real-world applications. The primary goal of this book is to fill the existing gap in the literature and comprehensively cover the state-of-the-art ensemble learning methods, and provide a set of applications that demonstrate the various usages of ensemble learning methods in the real world.

— Page v, Ensemble Machine Learning: Methods and Applications, 2012.

- Chapter 01: Ensemble Learning
- Chapter 02: Boosting Algorithms: A Review of Methods, Theory, and Applications
- Chapter 03: Boosting Kernel Estimators
- Chapter 04: Targeted Learning
- Chapter 05: Random Forests
- Chapter 06: Ensemble Learning by Negative Correlation Learning
- Chapter 07: Ensemble Nystrom
- Chapter 08: Object Detection
- Chapter 09: Classifier Boosting for Human Activity Recognition
- Chapter 10: Discriminative Learning for Anatomical Structure Detection and Segmentation
- Chapter 1: Random Forest for Bioinformatics

Like other collections of papers, I would generally not recommend this book unless you are an academic or one of the chapters is directly related to your current machine learning project. Nevertheless, many of the chapters provide a solid and compact introduction to ensemble methods and how to use them on specific applications.

You can learn more about this book here:

Many machine learning textbooks have sections on ensemble learning.

In this section, we will take a quick tour of some of the more popular textbooks and the relevant sections on ensemble learning.

The book “An Introduction to Statistical Learning with Applications in R” published in 2016 provides a solid introduction to boosting and bagging for decision trees in chapter 8.

- Section 8.2: Bagging, RandomForests, Boosting

The book “Applied Predictive Modeling” published in 2013 covers the most popular ensemble algorithms with examples in R, with a focus on the ensembles of decision trees.

- Chapter 8: Regression Trees and Rule-Based Model
- Chapter 14: Classification Trees and Rule-Based Models

The book “Data Mining: Practical Machine Learning Tools and Techniques” published in 2016 provides a chapter dedicated to ensemble learning and covers a range of popular techniques, including boosting, bagging, and stacking.

- Chapter 12: Ensemble Learning.

The book “Machine Learning: A Probabilistic Perspective” published in 2012 provides a number of sections on algorithms that perform ensembling, as well as a dedicated section on the topic focused on stacking and error-correcting output codes.

- Section 16.2: Classification and regression trees (CART)
- Section 16.4: Boosting
- Section 16.6: Ensemble learning

The book “The Elements of Statistical Learning” published in 2016 covers the key ensemble learning algorithms as well as the theory for ensemble learning generally.

- Chapter 8: Model Inference and Averaging
- Chapter 10: Boosting and Additive Trees
- Chapter 15: Random Forests
- Chapter 16: Ensemble Learning

**Did I miss your favorite machine learning textbook that has a section on ensemble learning?**

Let me know in the comments below.

I have a copy of each of these books as I like to read about a given topic from multiple perspectives.

If you are looking for a solid textbook dedicated to the topic of ensemble learning, I would recommend one of the following:

- Ensemble Methods: Foundations and Algorithms, 2012.
- Ensemble Learning: Pattern Classification Using Ensemble Methods, 2019.

A close runner-up is “*Ensemble Methods in Data Mining*” that mixes theory and examples in R.

Also, I recommend “*Pattern Classification Using Ensemble Methods*” if you cannot get your hands on the more recent “*Ensemble Learning: Pattern Classification Using Ensemble Methods*“.

In this post, you discovered a suite of books on the topic of ensemble machine learning.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post 6 Books on Ensemble Learning appeared first on Machine Learning Mastery.

]]>The post How to Develop a Gradient Boosting Machine Ensemble in Python appeared first on Machine Learning Mastery.

]]>Boosting is a general ensemble technique that involves sequentially adding models to the ensemble where subsequent models correct the performance of prior models. AdaBoost was the first algorithm to deliver on the promise of boosting.

Gradient boosting is a generalization of AdaBoosting, improving the performance of the approach and introducing ideas from bootstrap aggregation to further improve the models, such as randomly sampling the samples and features when fitting ensemble members.

Gradient boosting performs well, if not the best, on a wide range of tabular datasets, and versions of the algorithm like XGBoost and LightBoost often play an important role in winning machine learning competitions.

In this tutorial, you will discover how to develop Gradient Boosting ensembles for classification and regression.

After completing this tutorial, you will know:

- Gradient Boosting ensemble is an ensemble created from decision trees added sequentially to the model.
- How to use the Gradient Boosting ensemble for classification and regression with scikit-learn.
- How to explore the effect of Gradient Boosting model hyperparameters on model performance.

Let’s get started.

**Update Aug/2020**: Added a common questions section. Added grid search example.

This tutorial is divided into five parts; they are:

- Gradient Boosting Algorithm
- Gradient Boosting Scikit-Learn API
- Gradient Boosting for Classification
- Gradient Boosting for Regression

- Gradient Boosting Hyperparameters
- Explore Number of Trees
- Explore Number of Samples
- Explore Number of Features
- Explore Learning Rate
- Explore Tree Depth

- Grid Search Hyperparameters
- Common Questions

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines, or GBM for short.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “*gradient boosting*,” as the loss gradient is minimized as the model is fit, much like a neural network.

One way to produce a weighted combination of classifiers which optimizes [the cost] is by gradient descent in function space

— Boosting Algorithms as Gradient Descent in Function Space, 1999.

Naive gradient boosting is a greedy algorithm and can overfit the training dataset quickly.

It can benefit from regularization methods that penalize various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.

There are three types of enhancements to basic gradient boosting that can improve performance:

**Tree Constraints**: such as the depth of the trees and the number of trees used in the ensemble.**Weighted Updates**: such as a learning rate used to limit how much each tree contributes to the ensemble.**Random sampling**: such as fitting trees on random subsets of features and samples.

The use of random sampling often leads to a change in the name of the algorithm to “*stochastic gradient boosting*.”

… at each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.

— Stochastic Gradient Boosting, 1999.

Gradient boosting is an effective machine learning algorithm and is often the main, or one of the main, algorithms used to win machine learning competitions (like Kaggle) on tabular and similar structured datasets.

For more on the gradient boosting algorithm, see the tutorial:

Now that we are familiar with the gradient boosting algorithm, let’s look at how we can fit GBM models in Python.

Gradient Boosting ensembles can be implemented from scratch although can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of Gradient Boosting ensembles for machine learning.

The algorithm is available in a modern version of the library.

First, confirm that you are using a modern version of the library by running the following script:

# check scikit-learn version import sklearn print(sklearn.__version__)

Running the script will print your version of scikit-learn.

Your version should be the same or higher. If not, you must upgrade your version of the scikit-learn library.

0.22.1

Gradient boosting is provided via the GradientBoostingRegressor and GradientBoostingClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created and added to the ensemble.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop a Gradient Boosting ensemble for both classification and regression.

In this section, we will look at using Gradient Boosting for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate a Gradient Boosting algorithm on this dataset.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

# evaluate gradient boosting algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import GradientBoostingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) # define the model model = GradientBoostingClassifier() # define the evaluation method cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model on the dataset n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Gradient Boosting ensemble with default hyperparameters achieves a classification accuracy of about 89.9 percent on this test dataset.

Mean Accuracy: 0.899 (0.030)

We can also use the Gradient Boosting model as a final model and make predictions for classification.

First, the Gradient Boosting ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make predictions using gradient boosting for classification from sklearn.datasets import make_classification from sklearn.ensemble import GradientBoostingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) # define the model model = GradientBoostingClassifier() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [0.2929949, -4.21223056, -1.288332, -2.17849815, -0.64527665, 2.58097719, 0.28422388, -7.1827928, -1.91211104, 2.73729512, 0.81395695, 3.96973717, -2.66939799, 3.34692332, 4.19791821, 0.99990998, -0.30201875, -4.43170633, -2.82646737, 0.44916808] yhat = model.predict([row]) # summarize prediction print('Predicted Class: %d' % yhat[0])

Running the example fits the Gradient Boosting ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 1

Now that we are familiar with using Gradient Boosting for classification, let’s look at the API for regression.

In this section, we will look at using Gradient Boosting for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.

# test regression dataset from sklearn.datasets import make_regression # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate a Gradient Boosting algorithm on this dataset.

As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE are better and a perfect model has a MAE of 0.

The complete example is listed below.

# evaluate gradient boosting ensemble for regression from numpy import mean from numpy import std from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedKFold from sklearn.ensemble import GradientBoostingRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7) # define the model model = GradientBoostingRegressor() # define the evaluation procedure cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1) # report performance print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Gradient Boosting ensemble with default hyperparameters achieves a MAE of about 62.

MAE: -62.475 (3.254)

We can also use the Gradient Boosting model as a final model and make predictions for regression.

First, the Gradient Boosting ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

# gradient boosting ensemble for making predictions for regression from sklearn.datasets import make_regression from sklearn.ensemble import GradientBoostingRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7) # define the model model = GradientBoostingRegressor() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [0.20543991, -0.97049844, -0.81403429, -0.23842689, -0.60704084, -0.48541492, 0.53113006, 2.01834338, -0.90745243, -1.85859731, -1.02334791, -0.6877744, 0.60984819, -0.70630121, -1.29161497, 1.32385441, 1.42150747, 1.26567231, 2.56569098, -0.11154792] yhat = model.predict([row]) # summarize prediction print('Prediction: %d' % yhat[0])

Running the example fits the Gradient Boosting ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Prediction: 37

Now that we are familiar with using the scikit-learn API to evaluate and use Gradient Boosting ensembles, let’s look at configuring the model.

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the Gradient Boosting ensemble and their effect on model performance.

There are perhaps four key hyperparameters that have the biggest effect on model performance, they are the number of models in the ensemble, the learning rate, the variance of the model controlled via the size of the data sample used to train each model or features used in tree splits, and finally the depth of the decision tree.

We will take a closer look at the effect each of these hyperparameters in isolation in this section, although they all interact and should be tuned together or pairs, such as learning rate with ensemble size, and sample size/number of features with tree depth.

For more on tuning the hyperparameters of gradient boosting algorithms, see the tutorial:

An important hyperparameter for the Gradient Boosting ensemble algorithm is the number of decision trees used in the ensemble.

Recall that decision trees are added to the model sequentially in an effort to correct and improve upon the predictions made by prior trees. As such, more trees is often better. The number of trees must also be balanced with the learning rate, e.g. more trees may require a smaller learning rate, fewer trees may require a larger learning rate.

The number of trees can be set via the “*n_estimators*” argument and defaults to 100.

The example below explores the effect of the number of trees with values between 10 to 5,000.

# explore gradient boosting number of trees effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import GradientBoostingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) return X, y # get a list of models to evaluate def get_models(): models = dict() # define number of trees to consider n_trees = [10, 50, 100, 500, 1000, 5000] for n in n_trees: models[str(n)] = GradientBoostingClassifier(n_estimators=n) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured number of decision trees.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that that performance improves on this dataset until about 500 trees, after which performance appears to level off. Unlike AdaBoost, Gradient Boosting appears to not overfit as the number of trees is increased in this case.

>10 0.830 (0.037) >50 0.880 (0.033) >100 0.899 (0.030) >500 0.919 (0.025) >1000 0.919 (0.025) >5000 0.918 (0.026)

A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

We can see the general trend of increasing model performance and ensemble size.

The number of samples used to fit each tree can be varied. This means that each tree is fit on a randomly selected subset of the training dataset.

Using fewer samples introduces more variance for each tree, although it can improve the overall performance of the model.

The number of samples used to fit each tree is specified by the “*subsample*” argument and can be set to a fraction of the training dataset size. By default, it is set to 1.0 to use the entire training dataset.

The example below demonstrates the effect of the sample size on model performance.

# explore gradient boosting ensemble number of samples effect on performance from numpy import mean from numpy import std from numpy import arange from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import GradientBoostingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore sample ratio from 10% to 100% in 10% increments for i in arange(0.1, 1.1, 0.1): key = '%.1f' % i models[key] = GradientBoostingClassifier(subsample=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured sample size.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that mean performance is probably best for a sample size that is about half the size of the training dataset, such as 0.4 or higher.

>0.1 0.872 (0.033) >0.2 0.897 (0.032) >0.3 0.904 (0.029) >0.4 0.907 (0.032) >0.5 0.906 (0.027) >0.6 0.908 (0.030) >0.7 0.902 (0.032) >0.8 0.901 (0.031) >0.9 0.904 (0.031) >1.0 0.899 (0.030)

A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

We can see the general trend of increasing model performance perhaps peaking around 0.4 and staying somewhat level.

The number of features used to fit each decision tree can be varied.

Like changing the number of samples, changing the number of features introduces additional variance into the model, which may improve performance, although it might require an increase in the number of trees.

The number of features used by each tree is taken as a random sample and is specified by the “*max_features*” argument and defaults to all features in the training dataset.

The example below explores the effect of the number of features on model performance for the test dataset between 1 and 20.

# explore gradient boosting number of features on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import GradientBoostingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore number of features from 1 to 20 for i in range(1,21): models[str(i)] = GradientBoostingClassifier(max_features=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured number of features.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that mean performance increases to about half the number of features and stays somewhat level after that. It’s surprising that removing half of the input variables has so little effect.

>1 0.864 (0.036) >2 0.885 (0.032) >3 0.891 (0.031) >4 0.893 (0.036) >5 0.898 (0.030) >6 0.898 (0.032) >7 0.892 (0.032) >8 0.901 (0.032) >9 0.900 (0.029) >10 0.895 (0.034) >11 0.899 (0.032) >12 0.899 (0.030) >13 0.898 (0.029) >14 0.900 (0.033) >15 0.901 (0.032) >16 0.897 (0.028) >17 0.902 (0.034) >18 0.899 (0.032) >19 0.899 (0.032) >20 0.899 (0.030)

A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

We can see the general trend of increasing model performance perhaps peaking around eight or nine features and staying somewhat level.

Learning rate controls the amount of contribution that each model has on the ensemble prediction.

Smaller rates may require more decision trees in the ensemble, whereas larger rates may require an ensemble with fewer trees. It is common to explore learning rate values on a log scale, such as between a very small value like 0.0001 and 1.0.

The learning rate can be controlled via the “*learning_rate*” argument and defaults to 0.1.

The example below explores the learning rate and compares the effect of values between 0.0001 and 1.0.

# explore gradient boosting ensemble learning rate effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import GradientBoostingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) return X, y # get a list of models to evaluate def get_models(): models = dict() # define learning rates to explore for i in [0.0001, 0.001, 0.01, 0.1, 1.0]: key = '%.4f' % i models[key] = GradientBoostingClassifier(learning_rate=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured learning rate.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a larger learning rate results in better performance on this dataset. We would expect that adding more trees to the ensemble for the smaller learning rates would further lift performance.

This highlights the trade-off between the number of trees (speed of training) and learning rate, e.g. we can fit a model faster by using fewer trees and a larger learning rate.

>0.0001 0.761 (0.043) >0.0010 0.781 (0.034) >0.0100 0.836 (0.034) >0.1000 0.899 (0.030) >1.0000 0.908 (0.025)

We can see the general trend of increasing model performance with the increase in learning rate.

Like varying the number of samples and features used to fit each decision tree, varying the depth of each tree is another important hyperparameter for gradient boosting.

The tree depth controls how specialized each tree is to the training dataset: how general or overfit it might be. Trees are preferred that are not too shallow and general (like AdaBoost) and not too deep and specialized (like bootstrap aggregation).

Gradient boosting performs well with trees that have a modest depth finding a balance between skill and generality.

Tree depth is controlled via the “*max_depth*” argument and defaults to 3.

The example below explores tree depths between 1 and 10 and the effect on model performance.

# explore gradient boosting tree depth effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import GradientBoostingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) return X, y # get a list of models to evaluate def get_models(): models = dict() # define max tree depths to explore between 1 and 10 for i in range(1,11): models[str(i)] = GradientBoostingClassifier(max_depth=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured tree depth.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that performance improves with tree depth, perhaps peaking around a depth of 3 to 6, after which the deeper, more specialized trees result in worse performance.

>1 0.834 (0.031) >2 0.877 (0.029) >3 0.899 (0.030) >4 0.905 (0.032) >5 0.916 (0.030) >6 0.912 (0.031) >7 0.908 (0.033) >8 0.888 (0.031) >9 0.853 (0.036) >10 0.835 (0.034)

A box and whisker plot is created for the distribution of accuracy scores for each configured tree depth.

We can see the general trend of increasing model performance with the tree depth to a point, after which performance begins to degrade rapidly with the over-specialized trees.

Gradient boosting can be challenging to configure as the algorithm as many key hyperparameters that influence the behavior of the model on training data and the hyperparameters interact with each other.

As such, it is a good practice to use a search process to discover a configuration of the model hyperparameters that works well or best for a given predictive modeling problem. Popular search processes include a random search and a grid search.

In this section we will look at grid searching common ranges for the key hyperparameters for the gradient boosting algorithm that you can use as starting point for your own projects. This can be achieving using the *GridSearchCV* class and specifying a dictionary that maps model hyperparameter names to the values to search.

In this case, we will grid search four key hyperparameters for gradient boosting: the number of trees used in the ensemble, the learning rate, subsample size used to train each tree, and the maximum depth of each tree. We will use a range of popular well performing values for each hyperparameter.

Each configuration combination will be evaluated using repeated k-fold cross-validation and configurations will be compared using the mean score, in this case, classification accuracy.

The complete example of grid searching the key hyperparameters of the gradient boosting algorithm on our synthetic classification dataset is listed below.

# example of grid searching key hyperparameters for gradient boosting on a classification dataset from sklearn.datasets import make_classification from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import GridSearchCV from sklearn.ensemble import GradientBoostingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7) # define the model with default hyperparameters model = GradientBoostingClassifier() # define the grid of values to search grid = dict() grid['n_estimators'] = [10, 50, 100, 500] grid['learning_rate'] = [0.0001, 0.001, 0.01, 0.1, 1.0] grid['subsample'] = [0.5, 0.7, 1.0] grid['max_depth'] = [3, 7, 9] # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # define the grid search procedure grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy') # execute the grid search grid_result = grid_search.fit(X, y) # summarize the best score and configuration print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) # summarize all scores that were evaluated means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param))

Running the example many take a while depending on your hardware. At the end of the run, the configuration that achieved the best score is reported first, followed by the scores for all other configurations that were considered.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a configuration with a learning rate of 0.1, max depth of 7 levels, 500 trees and a subsample of 70% performed the best with a classification accuracy of about 94.6 percent.

The model may perform even better with more trees such as 1,000 or 5,000 although these configurations were not tested in this case to ensure that the grid search completed in a reasonable time.

Best: 0.946667 using {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 500, 'subsample': 0.7} 0.529667 (0.089012) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.5} 0.525667 (0.077875) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.7} 0.524000 (0.072874) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 1.0} 0.772667 (0.037500) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.5} 0.767000 (0.037696) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.7} ...

In this section we will take a closer look at some common sticking points you may have with the gradient boosting ensemble procedure.

**Q. What algorithm should be used in the ensemble?**

Technically any high variance algorithm that support instance weighting can be used as the basis for the ensemble.

The most common algorithm to use for speed and model performance is a decision tree with a limited tree depth, such as between 4 and 8 levels.

**Q. How many ensemble members should be used?**

The number of trees in the ensemble should be tuned based on the specific of the dataset and other hyperparametres such as the learning rate.

**Q. Won’t the ensemble overfit with too many trees?**

Yes, gradient boosting models can overfit.

It is important to carefully choose model hyperparameters using a search procedure, such as a grid search.

The learning rate, also called shrinkage, can be set to smaller values in order to slow down the rate of learning with the increase of the number of models used in the ensemble and in turn reduce the effect of overfitting.

**Q. What are the downsides of gradient boosting?**

Gradient boosting can be challenging to configure, often requiring a grid search or similar search procedure.

It can be very slow to train a gradient boosting model as trees must be added sequentially, unlike bagging and stacking based models where ensemble members can be trained in parallel.

**Q. What problems are well suited to boosting?**

Gradient boosting performs well on a wide range of regression and classification predictive modeling problems.

It might be one of the most popular algorithms for structured data (tabular data) given that it performs so well on average.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- How to Configure the Gradient Boosting Algorithm
- Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost

- Arcing the edge, 1998.
- Stochastic Gradient Boosting, 1999.
- Boosting Algorithms as Gradient Descent in Function Space, 1999.

In this tutorial, you discovered how to develop Gradient Boosting ensembles for classification and regression.

Specifically, you learned:

- Gradient Boosting ensemble is an ensemble created from decision trees added sequentially to the model.
- How to use the Gradient Boosting ensemble for classification and regression with scikit-learn.
- How to explore the effect of Gradient Boosting model hyperparameters on model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop a Gradient Boosting Machine Ensemble in Python appeared first on Machine Learning Mastery.

]]>The post How to Develop an AdaBoost Ensemble in Python appeared first on Machine Learning Mastery.

]]>A weak learner is a model that is very simple, although has some skill on the dataset. Boosting was a theoretical concept long before a practical algorithm could be developed, and the AdaBoost (adaptive boosting) algorithm was the first successful approach for the idea.

The AdaBoost algorithm involves using very short (one-level) decision trees as weak learners that are added sequentially to the ensemble. Each subsequent model attempts to correct the predictions made by the model before it in the sequence. This is achieved by weighing the training dataset to put more focus on training examples on which prior models made prediction errors.

In this tutorial, you will discover how to develop AdaBoost ensembles for classification and regression.

After completing this tutorial, you will know:

- AdaBoost ensemble is an ensemble created from decision trees added sequentially to the model
- How to use the AdaBoost ensemble for classification and regression with scikit-learn.
- How to explore the effect of AdaBoost model hyperparameters on model performance.

Let’s get started.

**Updated Aug/2020**: Added example of grid searching model hyperparameters.

This tutorial is divided into four parts; they are:

- AdaBoost Ensemble Algorithm
- AdaBoost Scikit-Learn API
- AdaBoost for Classification
- AdaBoost for Regression

- AdaBoost Hyperparameters
- Explore Number of Trees
- Explore Weak Learner
- Explore Learning Rate
- Explore Alternate Algorithm

- Grid Search AdaBoost Hyperparameters

Boosting refers to a class of machine learning ensemble algorithms where models are added sequentially and later models in the sequence correct the predictions made by earlier models in the sequence.

AdaBoost, short for “*Adaptive Boosting*,” is a boosting ensemble machine learning algorithm, and was one of the first successful boosting approaches.

We call the algorithm AdaBoost because, unlike previous algorithms, it adjusts adaptively to the errors of the weak hypotheses

— A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting, 1996.

AdaBoost combines the predictions from short one-level decision trees, called decision stumps, although other algorithms can also be used. Decision stump algorithms are used as the AdaBoost algorithm seeks to use many weak models and correct their predictions by adding additional weak models.

The training algorithm involves starting with one decision tree, finding those examples in the training dataset that were misclassified, and adding more weight to those examples. Another tree is trained on the same data, although now weighted by the misclassification errors. This process is repeated until a desired number of trees are added.

If a training data point is misclassified, the weight of that training data point is increased (boosted). A second classifier is built using the new weights, which are no longer equal. Again, misclassified training data have their weights boosted and the procedure is repeated.

— Multi-class AdaBoost, 2009.

The algorithm was developed for classification and involves combining the predictions made by all decision trees in the ensemble. A similar approach was also developed for regression problems where predictions are made by using the average of the decision trees. The contribution of each model to the ensemble prediction is weighted based on the performance of the model on the training dataset.

… the new algorithm needs no prior knowledge of the accuracies of the weak hypotheses. Rather, it adapts to these accuracies and generates a weighted majority hypothesis in which the weight of each weak hypothesis is a function of its accuracy.

— A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting, 1996.

Now that we are familiar with the AdaBoost algorithm, let’s look at how we can fit AdaBoost models in Python.

AdaBoost ensembles can be implemented from scratch, although this can be challenging for beginners.

For an example, see the tutorial:

The scikit-learn Python machine learning library provides an implementation of AdaBoost ensembles for machine learning.

It is available in a modern version of the library.

First, confirm that you are using a modern version of the library by running the following script:

# check scikit-learn version import sklearn print(sklearn.__version__)

Running the script will print your version of scikit-learn.

Your version should be the same or higher. If not, you must upgrade your version of the scikit-learn library.

0.22.1

AdaBoost is provided via the AdaBoostRegressor and AdaBoostClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop an AdaBoost ensemble for both classification and regression.

In this section, we will look at using AdaBoost for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate an AdaBoost algorithm on this dataset.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

# evaluate adaboost algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import AdaBoostClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6) # define the model model = AdaBoostClassifier() # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the AdaBoost ensemble with default hyperparameters achieves a classification accuracy of about 80 percent on this test dataset.

Accuracy: 0.806 (0.041)

We can also use the AdaBoost model as a final model and make predictions for classification.

First, the AdaBoost ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make predictions using adaboost for classification from sklearn.datasets import make_classification from sklearn.ensemble import AdaBoostClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6) # define the model model = AdaBoostClassifier() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [[-3.47224758,1.95378146,0.04875169,-0.91592588,-3.54022468,1.96405547,-7.72564954,-2.64787168,-1.81726906,-1.67104974,2.33762043,-4.30273117,0.4839841,-1.28253034,-10.6704077,-0.7641103,-3.58493721,2.07283886,0.08385173,0.91461126]] yhat = model.predict(row) print('Predicted Class: %d' % yhat[0])

Running the example fits the AdaBoost ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 0

Now that we are familiar with using AdaBoost for classification, let’s look at the API for regression.

In this section, we will look at using AdaBoost for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.

# test regression dataset from sklearn.datasets import make_regression # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=6) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate an AdaBoost algorithm on this dataset.

As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE are better and a perfect model has a MAE of 0.

The complete example is listed below.

# evaluate adaboost ensemble for regression from numpy import mean from numpy import std from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedKFold from sklearn.ensemble import AdaBoostRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=6) # define the model model = AdaBoostRegressor() # evaluate the model cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise') # report performance print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the AdaBoost ensemble with default hyperparameters achieves a MAE of about 100.

MAE: -72.327 (4.041)

We can also use the AdaBoost model as a final model and make predictions for regression.

First, the AdaBoost ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

# adaboost ensemble for making predictions for regression from sklearn.datasets import make_regression from sklearn.ensemble import AdaBoostRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=6) # define the model model = AdaBoostRegressor() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [[1.20871625,0.88440466,-0.9030013,-0.22687731,-0.82940077,-1.14410988,1.26554256,-0.2842871,1.43929072,0.74250241,0.34035501,0.45363034,0.1778756,-1.75252881,-1.33337384,-1.50337215,-0.45099008,0.46160133,0.58385557,-1.79936198]] yhat = model.predict(row) print('Prediction: %d' % yhat[0])

Running the example fits the AdaBoost ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Prediction: -10

Now that we are familiar with using the scikit-learn API to evaluate and use AdaBoost ensembles, let’s look at configuring the model.

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the AdaBoost ensemble and their effect on model performance.

An important hyperparameter for AdaBoost algorithm is the number of decision trees used in the ensemble.

Recall that each decision tree used in the ensemble is designed to be a weak learner. That is, it has skill over random prediction, but is not highly skillful. As such, one-level decision trees are used, called decision stumps.

The number of trees added to the model must be high for the model to work well, often hundreds, if not thousands.

The number of trees can be set via the “*n_estimators*” argument and defaults to 50.

The example below explores the effect of the number of trees with values between 10 to 5,000.

# explore adaboost ensemble number of trees effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import AdaBoostClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6) return X, y # get a list of models to evaluate def get_models(): models = dict() # define number of trees to consider n_trees = [10, 50, 100, 500, 1000, 5000] for n in n_trees: models[str(n)] = AdaBoostClassifier(n_estimators=n) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured number of decision trees.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that that performance improves on this dataset until about 50 trees and declines after that. This might be a sign of the ensemble overfitting the training dataset after additional trees are added.

>10 0.773 (0.039) >50 0.806 (0.041) >100 0.801 (0.032) >500 0.793 (0.028) >1000 0.791 (0.032) >5000 0.782 (0.031)

We can see the general trend of model performance and ensemble size.

A decision tree with one level is used as the weak learner by default.

We can make the models used in the ensemble less weak (more skillful) by increasing the depth of the decision tree.

The example below explores the effect of increasing the depth of the DecisionTreeClassifier weak learner on the AdBoost ensemble.

# explore adaboost ensemble tree depth effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import AdaBoostClassifier from sklearn.tree import DecisionTreeClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore depths from 1 to 10 for i in range(1,11): # define base model base = DecisionTreeClassifier(max_depth=i) # define ensemble model models[str(i)] = AdaBoostClassifier(base_estimator=base) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured weak learner tree depth.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that as the depth of the decision trees is increased, the performance of the ensemble is also increased on this dataset.

>1 0.806 (0.041) >2 0.864 (0.028) >3 0.867 (0.030) >4 0.889 (0.029) >5 0.909 (0.021) >6 0.923 (0.020) >7 0.927 (0.025) >8 0.928 (0.028) >9 0.923 (0.017) >10 0.926 (0.030)

A box and whisker plot is created for the distribution of accuracy scores for each configured weak learner depth.

We can see the general trend of model performance and weak learner depth.

AdaBoost also supports a learning rate that controls the contribution of each model to the ensemble prediction.

This is controlled by the “*learning_rate*” argument and by default is set to 1.0 or full contribution. Smaller or larger values might be appropriate depending on the number of models used in the ensemble. There is a balance between the contribution of the models and the number of trees in the ensemble.

More trees may require a smaller learning rate; fewer trees may require a larger learning rate. It is common to use values between 0 and 1 and sometimes very small values to avoid overfitting such as 0.1, 0.01 or 0.001.

The example below explores learning rate values between 0.1 and 2.0 in 0.1 increments.

# explore adaboost ensemble learning rate effect on performance from numpy import mean from numpy import std from numpy import arange from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import AdaBoostClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore learning rates from 0.1 to 2 in 0.1 increments for i in arange(0.1, 2.1, 0.1): key = '%.3f' % i models[key] = AdaBoostClassifier(learning_rate=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.xticks(rotation=45) pyplot.show()

Running the example first reports the mean accuracy for each configured learning rate.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see similar values between 0.5 to 1.0 and a decrease in model performance after that.

>0.100 0.767 (0.049) >0.200 0.786 (0.042) >0.300 0.802 (0.040) >0.400 0.798 (0.037) >0.500 0.805 (0.042) >0.600 0.795 (0.031) >0.700 0.799 (0.035) >0.800 0.801 (0.033) >0.900 0.805 (0.032) >1.000 0.806 (0.041) >1.100 0.801 (0.037) >1.200 0.800 (0.030) >1.300 0.799 (0.041) >1.400 0.793 (0.041) >1.500 0.790 (0.040) >1.600 0.775 (0.034) >1.700 0.767 (0.054) >1.800 0.768 (0.040) >1.900 0.736 (0.047) >2.000 0.682 (0.048)

A box and whisker plot is created for the distribution of accuracy scores for each configured learning rate.

We can see the general trend of decreasing model performance with a learning rate larger than 1.0 on this dataset.

The default algorithm used in the ensemble is a decision tree, although other algorithms can be used.

The intent is to use very simple models, called weak learners. Also, the scikit-learn implementation requires that any models used must also support weighted samples, as they are how the ensemble is created by fitting models based on a weighted version of the training dataset.

The base model can be specified via the “*base_estimator*” argument. The base model must also support predicting probabilities or probability-like scores in the case of classification. If the specified model does not support a weighted training dataset, you will see an error message as follows:

ValueError: KNeighborsClassifier doesn't support sample_weight.

One example of a model that supports a weighted training is the logistic regression algorithm.

The example below demonstrates an AdaBoost algorithm with a LogisticRegression weak learner.

# evaluate adaboost algorithm with logistic regression weak learner for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import AdaBoostClassifier from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6) # define the model model = AdaBoostClassifier(base_estimator=LogisticRegression()) # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the AdaBoost ensemble with a logistic regression weak model achieves a classification accuracy of about 79 percent on this test dataset.

Accuracy: 0.794 (0.032)

AdaBoost can be challenging to configure as the algorithm as many key hyperparameters that influence the behavior of the model on training data and the hyperparameters interact with each other.

As such, it is a good practice to use a search process to discover a configuration of the model hyperparameters that works well or best for a given predictive modeling problem. Popular search processes include a random search and a grid search.

In this section we will look at grid searching common ranges for the key hyperparameters for the AdaBoost algorithm that you can use as starting point for your own projects. This can be achieving using the *GridSearchCV* class and specifying a dictionary that maps model hyperparameter names to the values to search.

In this case, we will grid search two key hyperparameters for AdaBoost: the number of trees used in the ensemble and the learning rate. We will use a range of popular well performing values for each hyperparameter.

Each configuration combination will be evaluated using repeated k-fold cross-validation and configurations will be compared using the mean score, in this case, classification accuracy.

The complete example of grid searching the key hyperparameters of the AdaBoost algorithm on our synthetic classification dataset is listed below.

# example of grid searching key hyperparameters for adaboost on a classification dataset from sklearn.datasets import make_classification from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import GridSearchCV from sklearn.ensemble import AdaBoostClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6) # define the model with default hyperparameters model = AdaBoostClassifier() # define the grid of values to search grid = dict() grid['n_estimators'] = [10, 50, 100, 500] grid['learning_rate'] = [0.0001, 0.001, 0.01, 0.1, 1.0] # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # define the grid search procedure grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy') # execute the grid search grid_result = grid_search.fit(X, y) # summarize the best score and configuration print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) # summarize all scores that were evaluated means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param))

Running the example many take a while depending on your hardware. At the end of the run, the configuration that achieved the best score is reported first, followed by the scores for all other configurations that were considered.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a configuration with 500 trees and a learning rate of 0.1 performed the best with a classification accuracy of about 81.3 percent.

The model may perform even better with more trees such as 1,000 or 5,000 although these configurations were not tested in this case to ensure that the grid search completed in a reasonable time.

Best: 0.813667 using {'learning_rate': 0.1, 'n_estimators': 500} 0.646333 (0.036376) with: {'learning_rate': 0.0001, 'n_estimators': 10} 0.646667 (0.036545) with: {'learning_rate': 0.0001, 'n_estimators': 50} 0.646667 (0.036545) with: {'learning_rate': 0.0001, 'n_estimators': 100} 0.647000 (0.038136) with: {'learning_rate': 0.0001, 'n_estimators': 500} 0.646667 (0.036545) with: {'learning_rate': 0.001, 'n_estimators': 10} 0.647000 (0.038136) with: {'learning_rate': 0.001, 'n_estimators': 50} 0.654333 (0.045511) with: {'learning_rate': 0.001, 'n_estimators': 100} 0.672667 (0.046543) with: {'learning_rate': 0.001, 'n_estimators': 500} 0.648333 (0.042197) with: {'learning_rate': 0.01, 'n_estimators': 10} 0.671667 (0.045613) with: {'learning_rate': 0.01, 'n_estimators': 50} 0.715000 (0.053213) with: {'learning_rate': 0.01, 'n_estimators': 100} 0.767667 (0.045948) with: {'learning_rate': 0.01, 'n_estimators': 500} 0.716667 (0.048876) with: {'learning_rate': 0.1, 'n_estimators': 10} 0.767000 (0.049271) with: {'learning_rate': 0.1, 'n_estimators': 50} 0.784667 (0.042874) with: {'learning_rate': 0.1, 'n_estimators': 100} 0.813667 (0.032092) with: {'learning_rate': 0.1, 'n_estimators': 500} 0.773333 (0.038759) with: {'learning_rate': 1.0, 'n_estimators': 10} 0.806333 (0.040701) with: {'learning_rate': 1.0, 'n_estimators': 50} 0.801000 (0.032491) with: {'learning_rate': 1.0, 'n_estimators': 100} 0.792667 (0.027560) with: {'learning_rate': 1.0, 'n_estimators': 500}

This section provides more resources on the topic if you are looking to go deeper.

- A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting, 1996.
- Multi-class AdaBoost, 2009.
- Improving Regressors using Boosting Techniques, 1997.

In this tutorial, you discovered how to develop AdaBoost ensembles for classification and regression.

Specifically, you learned:

- AdaBoost ensemble is an ensemble created from decision trees added sequentially to the model.
- How to use the AdaBoost ensemble for classification and regression with scikit-learn.
- How to explore the effect of AdaBoost model hyperparameters on model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop an AdaBoost Ensemble in Python appeared first on Machine Learning Mastery.

]]>The post How to Develop a Bagging Ensemble with Python appeared first on Machine Learning Mastery.

]]>It is also easy to implement given that it has few key hyperparameters and sensible heuristics for configuring these hyperparameters.

Bagging performs well in general and provides the basis for a whole field of ensemble of decision tree algorithms such as the popular random forest and extra trees ensemble algorithms, as well as the lesser-known Pasting, Random Subspaces, and Random Patches ensemble algorithms.

In this tutorial, you will discover how to develop Bagging ensembles for classification and regression.

After completing this tutorial, you will know:

- Bagging ensemble is an ensemble created from decision trees fit on different samples of a dataset.
- How to use the Bagging ensemble for classification and regression with scikit-learn.
- How to explore the effect of Bagging model hyperparameters on model performance.

Let’s get started.

**Update Aug/2020**: Added a common questions section.

This tutorial is divided into five parts; they are:

- Bagging Ensemble Algorithm
- Bagging Scikit-Learn API
- Bagging for Classification
- Bagging for Regression

- Bagging Hyperparameters
- Explore Number of Trees
- Explore Number of Samples
- Explore Alternate Algorithm

- Bagging Extensions
- Pasting Ensemble
- Random Subspaces Ensemble
- Random Patches Ensemble

- Common Questions

Bootstrap Aggregation, or Bagging for short, is an ensemble machine learning algorithm.

Specifically, it is an ensemble of decision tree models, although the bagging technique can also be used to combine the predictions of other types of models.

As its name suggests, bootstrap aggregation is based on the idea of the “*bootstrap*” sample.

A bootstrap sample is a sample of a dataset with replacement. Replacement means that a sample drawn from the dataset is replaced, allowing it to be selected again and perhaps multiple times in the new sample. This means that the sample may have duplicate examples from the original dataset.

The bootstrap sampling technique is used to estimate a population statistic from a small data sample. This is achieved by drawing multiple bootstrap samples, calculating the statistic on each, and reporting the mean statistic across all samples.

An example of using bootstrap sampling would be estimating the population mean from a small dataset. Multiple bootstrap samples are drawn from the dataset, the mean calculated on each, then the mean of the estimated means is reported as an estimate of the population.

Surprisingly, the bootstrap method provides a robust and accurate approach to estimating statistical quantities compared to a single estimate on the original dataset.

This same approach can be used to create an ensemble of decision tree models.

This is achieved by drawing multiple bootstrap samples from the training dataset and fitting a decision tree on each. The predictions from the decision trees are then combined to provide a more robust and accurate prediction than a single decision tree (typically, but not always).

Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. […] The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets

— Bagging predictors, 1996.

Predictions are made for regression problems by averaging the prediction across the decision trees. Predictions are made for classification problems by taking the majority vote prediction for the classes from across the predictions made by the decision trees.

The bagged decision trees are effective because each decision tree is fit on a slightly different training dataset, which in turn allows each tree to have minor differences and make slightly different skillful predictions.

Technically, we say that the method is effective because the trees have a low correlation between predictions and, in turn, prediction errors.

Decision trees, specifically unpruned decision trees, are used as they slightly overfit the training data and have a high variance. Other high-variance machine learning algorithms can be used, such as a k-nearest neighbors algorithm with a low *k* value, although decision trees have proven to be the most effective.

If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.

— Bagging predictors, 1996.

Bagging does not always offer an improvement. For low-variance models that already perform well, bagging can result in a decrease in model performance.

The evidence, both experimental and theoretical, is that bagging can push a good but unstable procedure a significant step towards optimality. On the other hand, it can slightly degrade the performance of stable procedures.

— Bagging predictors, 1996.

Bagging ensembles can be implemented from scratch, although this can be challenging for beginners.

For an example, see the tutorial:

The scikit-learn Python machine learning library provides an implementation of Bagging ensembles for machine learning.

It is available in modern versions of the library.

First, confirm that you are using a modern version of the library by running the following script:

# check scikit-learn version import sklearn print(sklearn.__version__)

Running the script will print your version of scikit-learn.

Your version should be the same or higher. If not, you must upgrade your version of the scikit-learn library.

0.22.1

Bagging is provided via the BaggingRegressor and BaggingClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop a Bagging ensemble for both classification and regression.

In this section, we will look at using Bagging for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate a Bagging algorithm on this dataset.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

# evaluate bagging algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import BaggingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) # define the model model = BaggingClassifier() # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Bagging ensemble with default hyperparameters achieves a classification accuracy of about 85 percent on this test dataset.

Accuracy: 0.856 (0.037)

We can also use the Bagging model as a final model and make predictions for classification.

First, the Bagging ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make predictions using bagging for classification from sklearn.datasets import make_classification from sklearn.ensemble import BaggingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) # define the model model = BaggingClassifier() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [[-4.7705504,-1.88685058,-0.96057964,2.53850317,-6.5843005,3.45711663,-7.46225013,2.01338213,-0.45086384,-1.89314931,-2.90675203,-0.21214568,-0.9623956,3.93862591,0.06276375,0.33964269,4.0835676,1.31423977,-2.17983117,3.1047287]] yhat = model.predict(row) print('Predicted Class: %d' % yhat[0])

Running the example fits the Bagging ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 1

Now that we are familiar with using Bagging for classification, let’s look at the API for regression.

In this section, we will look at using Bagging for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.

# test regression dataset from sklearn.datasets import make_regression # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=5) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate a Bagging algorithm on this dataset.

As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE are better and a perfect model has a MAE of 0.

The complete example is listed below.

# evaluate bagging ensemble for regression from numpy import mean from numpy import std from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedKFold from sklearn.ensemble import BaggingRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=5) # define the model model = BaggingRegressor() # evaluate the model cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise') # report performance print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the Bagging ensemble with default hyperparameters achieves a MAE of about 100.

MAE: -101.133 (9.757)

We can also use the Bagging model as a final model and make predictions for regression.

First, the Bagging ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

# bagging ensemble for making predictions for regression from sklearn.datasets import make_regression from sklearn.ensemble import BaggingRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=5) # define the model model = BaggingRegressor() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [[0.88950817,-0.93540416,0.08392824,0.26438806,-0.52828711,-1.21102238,-0.4499934,1.47392391,-0.19737726,-0.22252503,0.02307668,0.26953276,0.03572757,-0.51606983,-0.39937452,1.8121736,-0.00775917,-0.02514283,-0.76089365,1.58692212]] yhat = model.predict(row) print('Prediction: %d' % yhat[0])

Running the example fits the Bagging ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Prediction: -134

Now that we are familiar with using the scikit-learn API to evaluate and use Bagging ensembles, let’s look at configuring the model.

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the Bagging ensemble and their effect on model performance.

An important hyperparameter for the Bagging algorithm is the number of decision trees used in the ensemble.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Bagging and related ensemble of decision trees algorithms (like random forest) appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.

The number of trees can be set via the “*n_estimators*” argument and defaults to 100.

The example below explores the effect of the number of trees with values between 10 to 5,000.

# explore bagging ensemble number of trees effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import BaggingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) return X, y # get a list of models to evaluate def get_models(): models = dict() # define number of trees to consider n_trees = [10, 50, 100, 500, 500, 1000, 5000] for n in n_trees: models[str(n)] = BaggingClassifier(n_estimators=n) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured number of decision trees.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that that performance improves on this dataset until about 100 trees and remains flat after that.

>10 0.855 (0.037) >50 0.876 (0.035) >100 0.882 (0.037) >500 0.885 (0.041) >1000 0.885 (0.037) >5000 0.885 (0.038)

We can see the general trend of no further improvement beyond about 100 trees.

The size of the bootstrap sample can also be varied.

The default is to create a bootstrap sample that has the same number of examples as the original dataset. Using a smaller dataset can increase the variance of the resulting decision trees and could result in better overall performance.

The number of samples used to fit each decision tree is set via the “*max_samples*” argument.

The example below explores different sized samples as a ratio of the original dataset from 10 percent to 100 percent (the default).

# explore bagging ensemble number of samples effect on performance from numpy import mean from numpy import std from numpy import arange from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import BaggingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore ratios from 10% to 100% in 10% increments for i in arange(0.1, 1.1, 0.1): key = '%.1f' % i models[key] = BaggingClassifier(max_samples=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each sample set size.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that performance generally improves with an increase in the sample size, highlighting that the default of 100 percent the size of the training dataset is sensible.

It might also be interesting to explore a smaller sample size with a corresponding increase in the number of trees in an effort to reduce the variance of the individual models.

>0.1 0.810 (0.036) >0.2 0.836 (0.044) >0.3 0.844 (0.043) >0.4 0.843 (0.041) >0.5 0.852 (0.034) >0.6 0.855 (0.042) >0.7 0.858 (0.042) >0.8 0.861 (0.033) >0.9 0.866 (0.041) >1.0 0.864 (0.042)

A box and whisker plot is created for the distribution of accuracy scores for each sample size.

We see a general trend of increasing accuracy with sample size.

Decision trees are the most common algorithm used in a bagging ensemble.

The reason for this is that they are easy to configure to have a high variance and because they perform well in general.

Other algorithms can be used with bagging and must be configured to have a modestly high variance. One example is the k-nearest neighbors algorithm where the *k* value can be set to a low value.

The algorithm used in the ensemble is specified via the “*base_estimator*” argument and must be set to an instance of the algorithm and algorithm configuration to use.

The example below demonstrates using a KNeighborsClassifier as the base algorithm used in the bagging ensemble. Here, the algorithm is used with default hyperparameters where *k* is set to 5.

# evaluate bagging with knn algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import BaggingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) # define the model model = BaggingClassifier(base_estimator=KNeighborsClassifier()) # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Bagging ensemble with KNN and default hyperparameters achieves a classification accuracy of about 88 percent on this test dataset.

Accuracy: 0.888 (0.036)

We can test different values of k to find the right balance of model variance to achieve good performance as a bagged ensemble.

The below example tests bagged KNN models with *k* values between 1 and 20.

# explore bagging ensemble k for knn effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import BaggingClassifier from sklearn.neighbors import KNeighborsClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) return X, y # get a list of models to evaluate def get_models(): models = dict() # evaluate k values from 1 to 20 for i in range(1,21): # define the base model base = KNeighborsClassifier(n_neighbors=i) # define the ensemble model models[str(i)] = BaggingClassifier(base_estimator=base) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each k value.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest a small k value such as two to four results in the best mean accuracy when used in a bagging ensemble.

>1 0.884 (0.025) >2 0.890 (0.029) >3 0.886 (0.035) >4 0.887 (0.033) >5 0.878 (0.037) >6 0.879 (0.042) >7 0.877 (0.037) >8 0.877 (0.036) >9 0.871 (0.034) >10 0.877 (0.033) >11 0.876 (0.037) >12 0.877 (0.030) >13 0.874 (0.034) >14 0.871 (0.039) >15 0.875 (0.034) >16 0.877 (0.033) >17 0.872 (0.034) >18 0.873 (0.036) >19 0.876 (0.034) >20 0.876 (0.037)

A box and whisker plot is created for the distribution of accuracy scores for each *k* value.

We see a general trend of increasing accuracy with sample size in the beginning, then a modest decrease in performance as the variance of the individual KNN models used in the ensemble is increased with larger *k* values.

There are many modifications and extensions to the bagging algorithm in an effort to improve the performance of the approach.

Perhaps the most famous is the random forest algorithm.

There is a number of less famous, although still effective, extensions to bagging that may be interesting to investigate.

This section demonstrates some of these approaches, such as pasting ensemble, random subspace ensemble, and the random patches ensemble.

We are not racing these extensions on the dataset, but rather providing working examples of how to use each technique that you can copy-paste and try with your own dataset.

The Pasting Ensemble is an extension to bagging that involves fitting ensemble members based on random samples of the training dataset instead of bootstrap samples.

The approach is designed to use smaller sample sizes than the training dataset in cases where the training dataset does not fit into memory.

The procedure takes small pieces of the data, grows a predictor on each small piece and then pastes these predictors together. A version is given that scales up to terabyte data sets. The methods are also applicable to on-line learning.

— Pasting Small Votes for Classification in Large Databases and On-Line, 1999.

The example below demonstrates the Pasting ensemble by setting the “*bootstrap*” argument to “*False*” and setting the number of samples used in the training dataset via “*max_samples*” to a modest value, in this case, 50 percent of the training dataset size.

# evaluate pasting ensemble algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import BaggingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) # define the model model = BaggingClassifier(bootstrap=False, max_samples=0.5) # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Pasting ensemble achieves a classification accuracy of about 84 percent on this dataset.

Accuracy: 0.848 (0.039)

A Random Subspace Ensemble is an extension to bagging that involves fitting ensemble members based on datasets constructed from random subsets of the features in the training dataset.

It is similar to the random forest except the data samples are random rather than a bootstrap sample and the subset of features is selected for the entire decision tree rather than at each split point in the tree.

The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces.

— The Random Subspace Method For Constructing Decision Forests, 1998.

The example below demonstrates the Random Subspace ensemble by setting the “*bootstrap*” argument to “*False*” and setting the number of features used in the training dataset via “*max_features*” to a modest value, in this case, 10.

# evaluate random subspace ensemble algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import BaggingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) # define the model model = BaggingClassifier(bootstrap=False, max_features=10) # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Random Subspace ensemble achieves a classification accuracy of about 86 percent on this dataset.

Accuracy: 0.862 (0.040)

We would expect that there would be a number of features in the random subspace that provides the right balance of model variance and model skill.

The example below demonstrates the effect of using different numbers of features in the random subspace ensemble from 1 to 20.

# explore random subspace ensemble ensemble number of features effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import BaggingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) return X, y # get a list of models to evaluate def get_models(): models = dict() for i in range(1, 21): models[str(i)] = BaggingClassifier(bootstrap=False, max_features=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model, X, y) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each number of features.

In this case, the results suggest that using about half the number of features in the dataset (e.g. between 9 and 13) might give the best results for the random subspace ensemble on this dataset.

>1 0.583 (0.047) >2 0.659 (0.048) >3 0.731 (0.038) >4 0.775 (0.045) >5 0.815 (0.044) >6 0.820 (0.040) >7 0.838 (0.034) >8 0.841 (0.035) >9 0.854 (0.036) >10 0.854 (0.041) >11 0.857 (0.034) >12 0.863 (0.035) >13 0.860 (0.043) >14 0.856 (0.038) >15 0.848 (0.043) >16 0.847 (0.042) >17 0.839 (0.046) >18 0.831 (0.044) >19 0.811 (0.043) >20 0.802 (0.048)

A box and whisker plot is created for the distribution of accuracy scores for each random subspace size.

We see a general trend of increasing accuracy with the number of features to about 10 to 13 where it is approximately level, then a modest decreasing trend in performance after that.

The Random Patches Ensemble is an extension to bagging that involves fitting ensemble members based on datasets constructed from random subsets of rows (samples) and columns (features) of the training dataset.

It does not use bootstrap samples and might be considered an ensemble that combines both the random sampling of the dataset of the Pasting ensemble and the random sampling of features of the Random Subspace ensemble.

We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset.

— Ensembles on Random Patches, 2012.

The example below demonstrates the Random Patches ensemble with decision trees created from a random sample of the training dataset limited to 50 percent of the size of the training dataset, and with a random subset of 10 features.

# evaluate random patches ensemble algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import BaggingClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5) # define the model model = BaggingClassifier(bootstrap=False, max_features=10, max_samples=0.5) # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Random Patches ensemble achieves a classification accuracy of about 84 percent on this dataset.

Accuracy: 0.845 (0.036)

In this section we will take a closer look at some common sticking points you may have with the bagging ensemble procedure.

**Q. What algorithm should be used in the ensemble?**

The algorithm should have a moderate variance, meaning it is moderately dependent upon the specific training data.

The decision tree is the default model to use because it works well in practice. Other algorithms can be used as long as they are configured to have a moderate variance.

The chosen algorithm should be moderately stable, not unstable like a decision stump and not very stable like a pruned decision tree, typically an unpruned decision tree is used.

… it is well known that Bagging should be used with unstable learners, and generally, the more unstable, the larger the performance improvement.

— Page 52, Ensemble Methods, 2012.

**Q. How many ensemble members should be used?**

The performance of the model will converge with the increase of the number of decision trees to a point then remain level.

… the performance of Bagging converges as the ensemble size, i.e., the number of base learners, grows large …

— Page 52, Ensemble Methods, 2012.

Therefore, keep increasing the number of trees until the performance stabilizes on your dataset.

**Q. Won’t the ensemble overfit with too many trees?**

No. Bagging ensembles (do not) are very unlikely to overfit in general.

**Q. How large should the bootstrap sample be?**

It is good practice to make the bootstrap sample as large as the original dataset size.

That is 100% the size or an equal number of rows as the original dataset.

**Q. What problems are well suited to bagging?**

Generally, bagging is well suited to problems with small or modest sized datasets. But this is a rough guide.

Bagging is best suited for problems with relatively small available training datasets.

— Page 12, Ensemble Machine Learning, 2012.

Try it and see.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Bootstrap Method
- How to Implement Bagging From Scratch With Python
- How to Create a Bagging Ensemble of Deep Learning Models in Keras
- Bagging and Random Forest Ensemble Algorithms for Machine Learning

- Bagging predictors, 1996.
- Pasting Small Votes for Classification in Large Databases and On-Line, 1999.
- The Random Subspace Method For Constructing Decision Forests, 1998.
- Ensembles on Random Patches, 2012.

In this tutorial, you discovered how to develop Bagging ensembles for classification and regression.

Specifically, you learned:

- Bagging ensemble is an ensemble created from decision trees fit on different samples of a dataset.
- How to use the Bagging ensemble for classification and regression with scikit-learn.
- How to explore the effect of Bagging model hyperparameters on model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop a Bagging Ensemble with Python appeared first on Machine Learning Mastery.

]]>The post How to Develop an Extra Trees Ensemble with Python appeared first on Machine Learning Mastery.

]]>It is related to the widely used random forest algorithm. It can often achieve as-good or better performance than the random forest algorithm, although it uses a simpler algorithm to construct the decision trees used as members of the ensemble.

It is also easy to use given that it has few key hyperparameters and sensible heuristics for configuring these hyperparameters.

In this tutorial, you will discover how to develop Extra Trees ensembles for classification and regression.

After completing this tutorial, you will know:

- Extra Trees ensemble is an ensemble of decision trees and is related to bagging and random forest.
- How to use the Extra Trees ensemble for classification and regression with scikit-learn.
- How to explore the effect of Extra Trees model hyperparameters on model performance.

Let’s get started.

This tutorial is divided into three parts; they are:

- Extra Trees Algorithm
- Extra Trees Scikit-Learn API
- Extra Trees for Classification
- Extra Trees for Regression

- Extra Trees Hyperparameters
- Explore Number of Trees
- Explore Number of Features
- Explore Minimum Samples per Split

**Extremely Randomized Trees**, or Extra Trees for short, is an ensemble machine learning algorithm.

Specifically, it is an ensemble of decision trees and is related to other ensembles of decision trees algorithms such as bootstrap aggregation (bagging) and random forest.

The Extra Trees algorithm works by creating a large number of unpruned decision trees from the training dataset. Predictions are made by averaging the prediction of the decision trees in the case of regression or using majority voting in the case of classification.

**Regression**: Predictions made by averaging predictions from decision trees.**Classification**: Predictions made by majority voting from decision trees.

The predictions of the trees are aggregated to yield the final prediction, by majority vote in classification problems and arithmetic average in regression problems.

— Extremely Randomized Trees, 2006.

Unlike bagging and random forest that develop each decision tree from a bootstrap sample of the training dataset, the Extra Trees algorithm fits each decision tree on the whole training dataset.

Like random forest, the Extra Trees algorithm will randomly sample the features at each split point of a decision tree. Unlike random forest, which uses a greedy algorithm to select an optimal split point, the Extra Trees algorithm selects a split point at random.

The Extra-Trees algorithm builds an ensemble of unpruned decision or regression trees according to the classical top-down procedure. Its two main differences with other tree-based ensemble methods are that it splits nodes by choosing cut-points fully at random and that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees.

— Extremely Randomized Trees, 2006.

As such, there are three main hyperparameters to tune in the algorithm; they are the number of decision trees in the ensemble, the number of input features to randomly select and consider for each split point, and the minimum number of samples required in a node to create a new split point.

It has two parameters: K, the number of attributes randomly selected at each node and nmin, the minimum sample size for splitting a node. […] we denote by M the number of trees of this ensemble.

— Extremely Randomized Trees, 2006.

The random selection of split points makes the decision trees in the ensemble less correlated, although this increases the variance of the algorithm. This increase in variance can be countered by increasing the number of trees used in the ensemble.

The parameters K, nmin and M have different effects: K determines the strength of the attribute selection process, nmin the strength of averaging output noise, and M the strength of the variance reduction of the ensemble model aggregation.

— Extremely Randomized Trees, 2006.

Extra Trees ensembles can be implemented from scratch, although this can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of Extra Trees for machine learning.

It is available in a recent version of the library.

First, confirm that you are using a modern version of the library by running the following script:

# check scikit-learn version import sklearn print(sklearn.__version__)

Running the script will print your version of scikit-learn.

Your version should be the same or higher.

If not, you must upgrade your version of the scikit-learn library.

0.22.1

Extra Trees is provided via the ExtraTreesRegressor and ExtraTreesClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop an Extra Trees ensemble for both classification and regression.

In this section, we will look at using Extra Trees for a classification problem.

The complete example is listed below.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate an Extra Trees algorithm on this dataset.

# evaluate extra trees algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import ExtraTreesClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4) # define the model model = ExtraTreesClassifier() # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Extra Trees ensemble with default hyperparameters achieves a classification accuracy of about 91 percent on this test dataset.

Accuracy: 0.910 (0.027)

We can also use the Extra Trees model as a final model and make predictions for classification.

First, the Extra Trees ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make predictions using extra trees for classification from sklearn.datasets import make_classification from sklearn.ensemble import ExtraTreesClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4) # define the model model = ExtraTreesClassifier() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [[-3.52169364,4.00560592,2.94756812,-0.09755101,-0.98835896,1.81021933,-0.32657994,1.08451928,4.98150546,-2.53855736,3.43500614,1.64660497,-4.1557091,-1.55301045,-0.30690987,-1.47665577,6.818756,0.5132918,4.3598337,-4.31785495]] yhat = model.predict(row) print('Predicted Class: %d' % yhat[0])

Running the example fits the Extra Trees ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 0

Now that we are familiar with using Extra Trees for classification, let’s look at the API for regression.

In this section, we will look at using Extra Trees for a regression problem.

The complete example is listed below.

# test regression dataset from sklearn.datasets import make_regression # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=3) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate an Extra Trees algorithm on this dataset.

As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds.

The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE are better and a perfect model has a MAE of 0.

The complete example is listed below.

# evaluate extra trees ensemble for regression from numpy import mean from numpy import std from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedKFold from sklearn.ensemble import ExtraTreesRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=3) # define the model model = ExtraTreesRegressor() # evaluate the model cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise') # report performance print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Extra Trees ensemble with default hyperparameters achieves a MAE of about 70.

MAE: -69.561 (5.616)

We can also use the Extra Trees model as a final model and make predictions for regression.

First, the Extra Trees ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

# extra trees for making predictions for regression from sklearn.datasets import make_regression from sklearn.ensemble import ExtraTreesRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=3) # define the model model = ExtraTreesRegressor() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [[-0.56996683,0.80144889,2.77523539,1.32554027,-1.44494378,-0.80834175,-0.84142896,0.57710245,0.96235932,-0.66303907,-1.13994112,0.49887995,1.40752035,-0.2995842,-0.05708706,-2.08701456,1.17768469,0.13474234,0.09518152,-0.07603207]] yhat = model.predict(row) print('Prediction: %d' % yhat[0])

Running the example fits the Extra Trees ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Prediction: 53

Now that we are familiar with using the scikit-learn API to evaluate and use Extra Trees ensembles, let’s look at configuring the model.

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the Extra Trees ensemble and their effect on model performance.

An important hyperparameter for Extra Trees algorithm is the number of decision trees used in the ensemble.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Bagging, Random Forest, and Extra Trees algorithms appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.

The number of trees can be set via the “*n_estimators*” argument and defaults to 100.

The example below explores the effect of the number of trees with values between 10 to 5,000.

# explore extra trees number of trees effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import ExtraTreesClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4) return X, y # get a list of models to evaluate def get_models(): models = dict() # define number of trees to consider n_trees = [10, 50, 100, 500, 1000, 5000] for n in n_trees: models[str(n)] = ExtraTreesClassifier(n_estimators=n) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured number of decision trees.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that performance rises and stays flat after about 100 trees. Mean accuracy scores fluctuate across 100, 500, and 1,000 trees and this may be statistical noise.

>10 0.860 (0.029) >50 0.904 (0.027) >100 0.908 (0.026) >500 0.910 (0.027) >1000 0.910 (0.026) >5000 0.912 (0.026)

We can see the general trend of increasing performance with the number of trees, perhaps leveling out after 100 trees.

The number of features that is randomly sampled for each split point is perhaps the most important feature to configure for Extra Trees, as it is for Random Forest.

Like Random Forest, the Extra Trees algorithm is not sensitive to the specific value used, although it is an important hyperparameter to tune.

It is set via the *max_features* argument and defaults to the square root of the number of input features. In this case for our test dataset, this would be sqrt(20) or about four features.

The example below explores the effect of the number of features randomly selected at each split point on model accuracy. We will try values from 1 to 20 and would expect a small value around four to perform well based on the heuristic.

# explore extra trees number of features effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import ExtraTreesClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore number of features from 1 to 20 for i in range(1, 21): models[str(i)] = ExtraTreesClassifier(max_features=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each feature set size.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that a value between four and nine would be appropriate, confirming the sensible default of four on this dataset.

A value of nine might even be better given the larger mean and smaller standard deviation in classification accuracy, although the differences in scores may or may not be statistically significant.

>1 0.901 (0.028) >2 0.909 (0.028) >3 0.901 (0.026) >4 0.909 (0.030) >5 0.909 (0.028) >6 0.910 (0.025) >7 0.908 (0.030) >8 0.907 (0.025) >9 0.912 (0.024) >10 0.904 (0.029) >11 0.904 (0.025) >12 0.908 (0.026) >13 0.908 (0.026) >14 0.906 (0.030) >15 0.909 (0.024) >16 0.908 (0.023) >17 0.910 (0.021) >18 0.909 (0.023) >19 0.907 (0.025) >20 0.903 (0.025)

A box and whisker plot is created for the distribution of accuracy scores for each feature set size.

We see a trend in performance rising and peaking with values between four and nine and falling or staying flat as larger feature set sizes are considered.

A final interesting hyperparameter is the number of samples in a node of the decision tree before adding a split.

New splits are only added to a decision tree if the number of samples is equal to or exceeds this value. It is set via the “*min_samples_split*” argument and defaults to two samples (the lowest value). Smaller numbers of samples result in more splits and a deeper, more specialized tree. In turn, this can mean lower correlation between the predictions made by trees in the ensemble and potentially lift performance.

The example below explores the effect of Extra Trees minimum samples before splitting on model performance, test values between two and 14.

# explore extra trees minimum number of samples for a split effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import ExtraTreesClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore the number of samples per split from 2 to 14 for i in range(2, 15): models[str(i)] = ExtraTreesClassifier(min_samples_split=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured maximum tree depth.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that small values result in better performance, confirming the sensible default of two.

>2 0.909 (0.025) >3 0.907 (0.026) >4 0.907 (0.026) >5 0.902 (0.028) >6 0.902 (0.027) >7 0.904 (0.024) >8 0.899 (0.026) >9 0.896 (0.029) >10 0.896 (0.027) >11 0.897 (0.028) >12 0.894 (0.026) >13 0.890 (0.026) >14 0.892 (0.027)

A box and whisker plot is created for the distribution of accuracy scores for each configured maximum tree depth.

In this case, we can see a trend of improved performance with fewer minimum samples for a split, as we might expect.

This section provides more resources on the topic if you are looking to go deeper.

- Extremely Randomized Trees, 2006.

In this tutorial, you discovered how to develop Extra Trees ensembles for classification and regression.

Specifically, you learned:

- Extra Trees ensemble is an ensemble of decision trees and is related to bagging and random forest.
- How to use the Extra Trees ensemble for classification and regression with scikit-learn.
- How to explore the effect of Extra Trees model hyperparameters on model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop an Extra Trees Ensemble with Python appeared first on Machine Learning Mastery.

]]>The post How to Develop a Random Forest Ensemble in Python appeared first on Machine Learning Mastery.

]]>It is perhaps the most popular and widely used machine learning algorithm given its good or excellent performance across a wide range of classification and regression predictive modeling problems.

It is also easy to use given that it has few key hyperparameters and sensible heuristics for configuring these hyperparameters.

In this tutorial, you will discover how to develop a random forest ensemble for classification and regression.

After completing this tutorial, you will know:

- Random forest ensemble is an ensemble of decision trees and a natural extension of bagging.
- How to use the random forest ensemble for classification and regression with scikit-learn.
- How to explore the effect of random forest model hyperparameters on model performance.

Let’s get started.

**Update Aug/2020**: Added a common questions section.

This tutorial is divided into four parts; they are:

- Random Forest Algorithm
- Random Forest Scikit-Learn API
- Random Forest for Classification
- Random Forest for Regression

- Random Forest Hyperparameters
- Explore Number of Samples
- Explore Number of Features
- Explore Number of Trees
- Explore Tree Depth

- Common Questions

Random forest is an ensemble of decision tree algorithms.

It is an extension of bootstrap aggregation (bagging) of decision trees and can be used for classification and regression problems.

In bagging, a number of decision trees are created where each tree is created from a different bootstrap sample of the training dataset. A bootstrap sample is a sample of the training dataset where a sample may appear more than once in the sample, referred to as **sampling with replacement**.

Bagging is an effective ensemble algorithm as each decision tree is fit on a slightly different training dataset, and in turn, has a slightly different performance. Unlike normal decision tree models, such as classification and regression trees (CART), trees used in the ensemble are unpruned, making them slightly overfit to the training dataset. This is desirable as it helps to make each tree more different and have less correlated predictions or prediction errors.

Predictions from the trees are averaged across all decision trees resulting in better performance than any single tree in the model.

Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the forest’s prediction

— Page 199, Applied Predictive Modeling, 2013.

A prediction on a regression problem is the average of the prediction across the trees in the ensemble. A prediction on a classification problem is the majority vote for the class label across the trees in the ensemble.

**Regression**: Prediction is the average prediction across the decision trees.**Classification**: Prediction is the majority vote class label predicted across the decision trees.

As with bagging, each tree in the forest casts a vote for the classification of a new sample, and the proportion of votes in each class across the ensemble is the predicted probability vector.

— Page 387, Applied Predictive Modeling, 2013.

Random forest involves constructing a large number of decision trees from bootstrap samples from the training dataset, like bagging.

Unlike bagging, random forest also involves selecting a subset of input features (columns or variables) at each split point in the construction of trees. Typically, constructing a decision tree involves evaluating the value for each input variable in the data in order to select a split point. By reducing the features to a random subset that may be considered at each split point, it forces each decision tree in the ensemble to be more different.

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. […] But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

— Page 320, An Introduction to Statistical Learning with Applications in R, 2014.

The effect is that the predictions, and in turn, prediction errors, made by each tree in the ensemble are more different or less correlated. When the predictions from these less correlated trees are averaged to make a prediction, it often results in better performance than bagged decision trees.

Perhaps the most important hyperparameter to tune for the random forest is the number of random features to consider at each split point.

Random forests’ tuning parameter is the number of randomly selected predictors, k, to choose from at each split, and is commonly referred to as mtry. In the regression context, Breiman (2001) recommends setting mtry to be one-third of the number of predictors.

— Page 199, Applied Predictive Modeling, 2013.

A good heuristic for regression is to set this hyperparameter to 1/3 the number of input features.

- num_features_for_split = total_input_features / 3

For classification problems, Breiman (2001) recommends setting mtry to the square root of the number of predictors.

— Page 387, Applied Predictive Modeling, 2013.

A good heuristic for classification is to set this hyperparameter to the square root of the number of input features.

- num_features_for_split = sqrt(total_input_features)

Another important hyperparameter to tune is the depth of the decision trees. Deeper trees are often more overfit to the training data, but also less correlated, which in turn may improve the performance of the ensemble. Depths from 1 to 10 levels may be effective.

Finally, the number of decision trees in the ensemble can be set. Often, this is increased until no further improvement is seen.

Random Forest ensembles can be implemented from scratch, although this can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of Random Forest for machine learning.

It is available in modern versions of the library.

First, confirm that you are using a modern version of the library by running the following script:

# check scikit-learn version import sklearn print(sklearn.__version__)

Running the script will print your version of scikit-learn.

0.22.1

Random Forest is provided via the RandomForestRegressor and RandomForestClassifier classes.

Let’s take a look at how to develop a Random Forest ensemble for both classification and regression tasks.

In this section, we will look at using Random Forest for a classification problem.

The complete example is listed below.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate a random forest algorithm on this dataset.

# evaluate random forest algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import RandomForestClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3) # define the model model = RandomForestClassifier() # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the random forest ensemble with default hyperparameters achieves a classification accuracy of about 90.5 percent.

Accuracy: 0.905 (0.025)

We can also use the random forest model as a final model and make predictions for classification.

First, the random forest ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make predictions using random forest for classification from sklearn.datasets import make_classification from sklearn.ensemble import RandomForestClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3) # define the model model = RandomForestClassifier() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [[-8.52381793,5.24451077,-12.14967704,-2.92949242,0.99314133,0.67326595,-0.38657932,1.27955683,-0.60712621,3.20807316,0.60504151,-1.38706415,8.92444588,-7.43027595,-2.33653219,1.10358169,0.21547782,1.05057966,0.6975331,0.26076035]] yhat = model.predict(row) print('Predicted Class: %d' % yhat[0])

Running the example fits the random forest ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 0

Now that we are familiar with using random forest for classification, let’s look at the API for regression.

In this section, we will look at using random forests for a regression problem.

The complete example is listed below.

# test regression dataset from sklearn.datasets import make_regression # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=2) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate a random forest algorithm on this dataset.

The complete example is listed below.

# evaluate random forest ensemble for regression from numpy import mean from numpy import std from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedKFold from sklearn.ensemble import RandomForestRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=2) # define the model model = RandomForestRegressor() # evaluate the model cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise') # report performance print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation MAE of the model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the random forest ensemble with default hyperparameters achieves a MAE of about 90.

MAE: -90.149 (7.924)

We can also use the random forest model as a final model and make predictions for regression.

First, the random forest ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

# random forest for making predictions for regression from sklearn.datasets import make_regression from sklearn.ensemble import RandomForestRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=2) # define the model model = RandomForestRegressor() # fit the model on the whole dataset model.fit(X, y) # make a single prediction row = [[-0.89483109,-1.0670149,-0.25448694,-0.53850126,0.21082105,1.37435592,0.71203659,0.73093031,-1.25878104,-2.01656886,0.51906798,0.62767387,0.96250155,1.31410617,-1.25527295,-0.85079036,0.24129757,-0.17571721,-1.11454339,0.36268268]] yhat = model.predict(row) print('Prediction: %d' % yhat[0])

Running the example fits the random forest ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Prediction: -173

Now that we are familiar with using the scikit-learn API to evaluate and use random forest ensembles, let’s look at configuring the model.

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the random forest ensemble and their effect on model performance.

Each decision tree in the ensemble is fit on a bootstrap sample drawn from the training dataset.

This can be turned off by setting the “*bootstrap*” argument to *False*, if you desire. In that case, the whole training dataset will be used to train each decision tree. **This is not recommended**.

The “*max_samples*” argument can be set to a float between 0 and 1 to control the percentage of the size of the training dataset to make the bootstrap sample used to train each decision tree.

For example, if the training dataset has 100 rows, the *max_samples* argument could be set to 0.5 and each decision tree will be fit on a bootstrap sample with (100 * 0.5) or 50 rows of data.

A smaller sample size will make trees more different, and a larger sample size will make the trees more similar. Setting *max_samples* to “*None*” will make the sample size the same size as the training dataset and this is the default.

The example below demonstrates the effect of different bootstrap sample sizes from 10 percent to 100 percent on the random forest algorithm.

# explore random forest bootstrap sample size on performance from numpy import mean from numpy import std from numpy import arange from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import RandomForestClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore ratios from 10% to 100% in 10% increments for i in arange(0.1, 1.1, 0.1): key = '%.1f' % i # set max_samples=None to use 100% if i == 1.0: i = None models[key] = RandomForestClassifier(max_samples=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each dataset size.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that using a bootstrap sample size that is equal to the size of the training dataset achieves the best results on this dataset.

This is the default and it should probably be used in most cases.

>10 0.856 (0.031) >20 0.873 (0.029) >30 0.881 (0.021) >40 0.891 (0.033) >50 0.893 (0.025) >60 0.897 (0.030) >70 0.902 (0.024) >80 0.903 (0.024) >90 0.900 (0.026) >100 0.903 (0.027)

A box and whisker plot is created for the distribution of accuracy scores for each bootstrap sample size.

In this case, we can see a general trend that the larger the sample, the better the performance of the model.

You might like to extend this example and see what happens if the bootstrap sample size is larger or even much larger than the training dataset (e.g. you can set an integer value as the number of samples instead of a float percentage of the training dataset size).

The number of features that is randomly sampled for each split point is perhaps the most important feature to configure for random forest.

It is set via the *max_features* argument and defaults to the square root of the number of input features. In this case, for our test dataset, this would be *sqrt(20)* or about four features.

The example below explores the effect of the number of features randomly selected at each split point on model accuracy. We will try values from 1 to 7 and would expect a small value, around four, to perform well based on the heuristic.

# explore random forest number of features effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import RandomForestClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3) return X, y # get a list of models to evaluate def get_models(): models = dict() # explore number of features from 1 to 7 for i in range(1,8): models[str(i)] = RandomForestClassifier(max_features=i) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each feature set size.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that a value between three and five would be appropriate, confirming the sensible default of four on this dataset. A value of five might even be better given the smaller standard deviation in classification accuracy as compared to a value of three or four.

>1 0.897 (0.023) >2 0.900 (0.028) >3 0.903 (0.027) >4 0.903 (0.022) >5 0.903 (0.019) >6 0.898 (0.025) >7 0.900 (0.024)

A box and whisker plot is created for the distribution of accuracy scores for each feature set size.

We can see a trend in performance rising and peaking with values between three and five and falling again as larger feature set sizes are considered.

The number of trees is another key hyperparameter to configure for the random forest.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Both bagging and random forest algorithms appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.

The number of trees can be set via the “*n_estimators*” argument and defaults to 100.

The example below explores the effect of the number of trees with values between 10 to 1,000.

# explore random forest number of trees effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import RandomForestClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3) return X, y # get a list of models to evaluate def get_models(): models = dict() # define number of trees to consider n_trees = [10, 50, 100, 500, 1000] for n in n_trees: models[str(n)] = RandomForestClassifier(n_estimators=n) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured number of trees.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that performance rises and stays flat after about 100 trees. Mean accuracy scores fluctuate across 100, 500, and 1,000 trees and this may be statistical noise.

>10 0.870 (0.036) >50 0.900 (0.028) >100 0.910 (0.024) >500 0.904 (0.024) >1000 0.906 (0.023)

A final interesting hyperparameter is the maximum depth of decision trees used in the ensemble.

By default, trees are constructed to an arbitrary depth and are not pruned. This is a sensible default, although we can also explore fitting trees with different fixed depths.

The maximum tree depth can be specified via the *max_depth* argument and is set to *None* (no maximum depth) by default.

The example below explores the effect of random forest maximum tree depth on model performance.

# explore random forest tree depth effect on performance from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.ensemble import RandomForestClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3) return X, y # get a list of models to evaluate def get_models(): models = dict() # consider tree depths from 1 to 7 and None=full depths = [i for i in range(1,8)] + [None] for n in depths: models[str(n)] = RandomForestClassifier(max_depth=n) return models # evaluate a given model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the results scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize the performance along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured maximum tree depth.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that larger depth results in better model performance, with the default of no maximum depth achieving the best performance on this dataset.

>1 0.771 (0.040) >2 0.807 (0.037) >3 0.834 (0.034) >4 0.857 (0.030) >5 0.872 (0.025) >6 0.887 (0.024) >7 0.890 (0.025) >None 0.903 (0.027)

A box and whisker plot is created for the distribution of accuracy scores for each configured maximum tree depth.

In this case, we can see a trend of improved performance with increase in tree depth, supporting the default of no maximum depth.

In this section we will take a closer look at some common sticking points you may have with the radom forest ensemble procedure.

**Q. What algorithm should be used in the ensemble?**

Random forest is designed to be an ensemble of decision tree algorithms.

**Q. How many ensemble members should be used?**

The number of trees should be increased until no further improvement in performance is seen on your dataset.

As a starting point, we suggest using at least 1,000 trees. If the cross-validation performance profiles are still improving at 1,000 trees, then incorporate more trees until performance levels off.

— Page 200, Applied Predictive Modeling, 2013.

**Q. Won’t the ensemble overfit with too many trees?**

No. Random forest ensembles (do not) are very unlikely to overfit in general.

Another claim is that random forests “cannot overfit” the data. It is certainly true that increasing [the number of trees] does not cause the random forest sequence to overfit …

— Page 596, The Elements of Statistical Learning, 2016.

**Q. How large should the bootstrap sample be?**

It is good practice to make the bootstrap sample as large as the original dataset size.

That is 100% the size or an equal number of rows as the original dataset.

**Q. How many features should be chosen at each split point?**

The best practice is to test a suite of different values and discover what works best for your dataset.

As a heuristic, you can use:

**Classification**: Square root of the number of features.**Regression**: One third of the number of features.

**Q. What problems are well suited to random forest?**

Random forest is known to work well or even best on a wide range of classification and regression problems. Try it and see.

The authors make grand claims about the success of random forests: “most accurate”, “most interpretable”, and the like. In our experience random forests do remarkably well, with very little tuning required.

— Page 590, The Elements of Statistical Learning, 2016.

This section provides more resources on the topic if you are looking to go deeper.

- Random Forests, 2001.

- Applied Predictive Modeling, 2013.
- The Elements of Statistical Learning, 2016.
- An Introduction to Statistical Learning with Applications in R, 2014.

In this tutorial, you discovered how to develop random forest ensembles for classification and regression.

Specifically, you learned:

- Random forest ensemble is an ensemble of decision trees and a natural extension of bagging.
- How to use the random forest ensemble for classification and regression with scikit-learn.
- How to explore the effect of random forest model hyperparameters on model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop a Random Forest Ensemble in Python appeared first on Machine Learning Mastery.

]]>The post How to Develop Voting Ensembles With Python appeared first on Machine Learning Mastery.

]]>For regression, a voting ensemble involves making a prediction that is the average of multiple other regression models.

In classification, a **hard voting** ensemble involves summing the votes for crisp class labels from other models and predicting the class with the most votes. A **soft voting** ensemble involves summing the predicted probabilities for class labels and predicting the class label with the largest sum probability.

In this tutorial, you will discover how to create voting ensembles for machine learning algorithms in Python.

After completing this tutorial, you will know:

- A voting ensemble involves summing the predictions made by classification models or averaging the predictions made by regression models.
- How voting ensembles work, when to use voting ensembles, and the limitations of the approach.
- How to implement a hard voting ensemble and soft voting ensemble for classification predictive modeling.

Let’s get started.

This tutorial is divided into four parts; they are:

- Voting Ensembles
- Voting Ensemble Scikit-Learn API
- Voting Ensemble for Classification
- Hard Voting Ensemble for Classification
- Soft Voting Ensemble for Classification

- Voting Ensemble for Regression

A voting ensemble (or a “*majority voting ensemble*“) is an ensemble machine learning model that combines the predictions from multiple other models.

It is a technique that may be used to improve model performance, ideally achieving better performance than any single model used in the ensemble.

A voting ensemble works by combining the predictions from multiple models. It can be used for classification or regression. In the case of regression, this involves calculating the average of the predictions from the models. In the case of classification, the predictions for each label are summed and the label with the majority vote is predicted.

**Regression Voting Ensemble**: Predictions are the average of contributing models.**Classification Voting Ensemble**: Predictions are the majority vote of contributing models.

There are two approaches to the majority vote prediction for classification; they are hard voting and soft voting.

Hard voting involves summing the predictions for each class label and predicting the class label with the most votes. Soft voting involves summing the predicted probabilities (or probability-like scores) for each class label and predicting the class label with the largest probability.

**Hard Voting**. Predict the class with the largest sum of votes from models**Soft Voting**. Predict the class with the largest summed probability from models.

A voting ensemble may be considered a meta-model, a model of models.

As a meta-model, it could be used with any collection of existing trained machine learning models and the existing models do not need to be aware that they are being used in the ensemble. This means you could explore using a voting ensemble on any set or subset of fit models for your predictive modeling task.

A voting ensemble is appropriate when you have two or more models that perform well on a predictive modeling task. The models used in the ensemble must mostly agree with their predictions.

One way to combine outputs is by voting—the same mechanism used in bagging. However, (unweighted) voting only makes sense if the learning schemes perform comparably well. If two of the three classifiers make predictions that are grossly incorrect, we will be in trouble!

— Page 497, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Use voting ensembles when:

- All models in the ensemble have generally the same good performance.
- All models in the ensemble mostly already agree.

Hard voting is appropriate when the models used in the voting ensemble predict crisp class labels. Soft voting is appropriate when the models used in the voting ensemble predict the probability of class membership. Soft voting can be used for models that do not natively predict a class membership probability, although may require calibration of their probability-like scores prior to being used in the ensemble (e.g. support vector machine, k-nearest neighbors, and decision trees).

- Hard voting is for models that predict class labels.
- Soft voting is for models that predict class membership probabilities.

The voting ensemble is not guaranteed to provide better performance than any single model used in the ensemble. If any given model used in the ensemble performs better than the voting ensemble, that model should probably be used instead of the voting ensemble.

This is not always the case. A voting ensemble can offer lower variance in the predictions made over individual models. This can be seen in a lower variance in prediction error for regression tasks. This can also be seen in a lower variance in accuracy for classification tasks. This lower variance may result in a lower mean performance of the ensemble, which might be desirable given the higher stability or confidence of the model.

Use a voting ensemble if:

- It results in better performance than any model used in the ensemble.
- It results in a lower variance than any model used in the ensemble.

A voting ensemble is particularly useful for machine learning models that use a stochastic learning algorithm and result in a different final model each time it is trained on the same dataset. One example is neural networks that are fit using stochastic gradient descent.

For more on this topic, see the tutorial:

Another particularly useful case for voting ensembles is when combining multiple fits of the same machine learning algorithm with slightly different hyperparameters.

Voting ensembles are most effective when:

- Combining multiple fits of a model trained using stochastic learning algorithms.
- Combining multiple fits of a model with different hyperparameters.

A limitation of the voting ensemble is that it treats all models the same, meaning all models contribute equally to the prediction. This is a problem if some models are good in some situations and poor in others.

An extension to the voting ensemble to address this problem is to use a weighted average or weighted voting of the contributing models. This is sometimes called blending. A further extension is to use a machine learning model to learn when and how much to trust each model when making predictions. This is referred to as stacked generalization, or stacking for short.

Extensions to voting ensembles:

- Weighted Average Ensemble (blending).
- Stacked Generalization (stacking).

Now that we are familiar with voting ensembles, let’s take a closer look at how to create voting ensemble models.

Voting ensembles can be implemented from scratch, although it can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of voting for machine learning.

It is available in version 0.22 of the library and higher.

First, confirm that you are using a modern version of the library by running the following script:

# check scikit-learn version import sklearn print(sklearn.__version__)

Running the script will print your version of scikit-learn.

0.22.1

Voting is provided via the VotingRegressor and VotingClassifier classes.

Both models operate the same way and take the same arguments. Using the model requires that you specify a list of estimators that make predictions and are combined in the voting ensemble.

A list of base models is provided via the “*estimators*” argument. This is a Python list where each element in the list is a tuple with the name of the model and the configured model instance. Each model in the list must have a unique name.

For example, below defines two base models:

... models = [('lr',LogisticRegression()),('svm',SVC())] ensemble = VotingClassifier(estimators=models)

Each model in the list may also be a Pipeline, including any data preparation required by the model prior to fitting the model on the training dataset.

For example:

... models = [('lr',LogisticRegression()),('svm',make_pipeline(StandardScaler(),SVC()))] ensemble = VotingClassifier(estimators=models)

When using a voting ensemble for classification, the type of voting, such as hard voting or soft voting, can be specified via the “*voting*” argument and set to the string ‘*hard*‘ (the default) or ‘*soft*‘.

For example:

... models = [('lr',LogisticRegression()),('svm',SVC())] ensemble = VotingClassifier(estimators=models, voting='soft')

Now that we are familiar with the voting ensemble API in scikit-learn, let’s look at some worked examples.

In this section, we will look at using stacking for a classification problem.

The complete example is listed below.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=2) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we will demonstrate hard voting and soft voting for this dataset.

We can demonstrate hard voting with a k-nearest neighbor algorithm.

We can fit five different versions of the KNN algorithm, each with a different number of neighbors used when making predictions. We will use 1, 3, 5, 7, and 9 neighbors (odd numbers in an attempt to avoid ties).

Our expectation is that by combining the predicted class labels predicted by each different KNN model that the hard voting ensemble will achieve a better predictive performance than any standalone model used in the ensemble, on average.

First, we can create a function named *get_voting()* that creates each KNN model and combines the models into a hard voting ensemble.

# get a voting ensemble of models def get_voting(): # define the base models models = list() models.append(('knn1', KNeighborsClassifier(n_neighbors=1))) models.append(('knn3', KNeighborsClassifier(n_neighbors=3))) models.append(('knn5', KNeighborsClassifier(n_neighbors=5))) models.append(('knn7', KNeighborsClassifier(n_neighbors=7))) models.append(('knn9', KNeighborsClassifier(n_neighbors=9))) # define the voting ensemble ensemble = VotingClassifier(estimators=models, voting='hard') return ensemble

We can then create a list of models to evaluate, including each standalone version of the KNN model configurations and the hard voting ensemble.

This will help us directly compare each standalone configuration of the KNN model with the ensemble in terms of the distribution of classification accuracy scores. The *get_models()* function below creates the list of models for us to evaluate.

# get a list of models to evaluate def get_models(): models = dict() models['knn1'] = KNeighborsClassifier(n_neighbors=1) models['knn3'] = KNeighborsClassifier(n_neighbors=3) models['knn5'] = KNeighborsClassifier(n_neighbors=5) models['knn7'] = KNeighborsClassifier(n_neighbors=7) models['knn9'] = KNeighborsClassifier(n_neighbors=9) models['hard_voting'] = get_voting() return models

Each model will be evaluated using repeated k-fold cross-validation.

The *evaluate_model()* function below takes a model instance and returns as a list of scores from three repeats of stratified 10-fold cross-validation.

# evaluate a give model using cross-validation def evaluate_model(model, X, y): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores

We can then report the mean performance of each algorithm, and also create a box and whisker plot to compare the distribution of accuracy scores for each algorithm.

Tying this together, the complete example is listed below.

# compare hard voting to standalone classifiers from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import VotingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=2) return X, y # get a voting ensemble of models def get_voting(): # define the base models models = list() models.append(('knn1', KNeighborsClassifier(n_neighbors=1))) models.append(('knn3', KNeighborsClassifier(n_neighbors=3))) models.append(('knn5', KNeighborsClassifier(n_neighbors=5))) models.append(('knn7', KNeighborsClassifier(n_neighbors=7))) models.append(('knn9', KNeighborsClassifier(n_neighbors=9))) # define the voting ensemble ensemble = VotingClassifier(estimators=models, voting='hard') return ensemble # get a list of models to evaluate def get_models(): models = dict() models['knn1'] = KNeighborsClassifier(n_neighbors=1) models['knn3'] = KNeighborsClassifier(n_neighbors=3) models['knn5'] = KNeighborsClassifier(n_neighbors=5) models['knn7'] = KNeighborsClassifier(n_neighbors=7) models['knn9'] = KNeighborsClassifier(n_neighbors=9) models['hard_voting'] = get_voting() return models # evaluate a give model using cross-validation def evaluate_model(model, X, y): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model, X, y) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean and standard deviation accuracy for each model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see the hard voting ensemble achieves a better classification accuracy of about 90.2% compared to all standalone versions of the model.

>knn1 0.873 (0.030) >knn3 0.889 (0.038) >knn5 0.895 (0.031) >knn7 0.899 (0.035) >knn9 0.900 (0.033) >hard_voting 0.902 (0.034)

A box-and-whisker plot is then created comparing the distribution accuracy scores for each model, allowing us to clearly see that hard voting ensemble performing better than all standalone models on average.

If we choose a hard voting ensemble as our final model, we can fit and use it to make predictions on new data just like any other model.

First, the hard voting ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make a prediction with a hard voting ensemble from sklearn.datasets import make_classification from sklearn.ensemble import VotingClassifier from sklearn.neighbors import KNeighborsClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=2) # define the base models models = list() models.append(('knn1', KNeighborsClassifier(n_neighbors=1))) models.append(('knn3', KNeighborsClassifier(n_neighbors=3))) models.append(('knn5', KNeighborsClassifier(n_neighbors=5))) models.append(('knn7', KNeighborsClassifier(n_neighbors=7))) models.append(('knn9', KNeighborsClassifier(n_neighbors=9))) # define the hard voting ensemble ensemble = VotingClassifier(estimators=models, voting='hard') # fit the model on all available data ensemble.fit(X, y) # make a prediction for one example data = [[5.88891819,2.64867662,-0.42728226,-1.24988856,-0.00822,-3.57895574,2.87938412,-1.55614691,-0.38168784,7.50285659,-1.16710354,-5.02492712,-0.46196105,-0.64539455,-1.71297469,0.25987852,-0.193401,-5.52022952,0.0364453,-1.960039]] yhat = ensemble.predict(data) print('Predicted Class: %d' % (yhat))

Running the example fits the hard voting ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 1

We can demonstrate soft voting with the support vector machine (SVM) algorithm.

The SVM algorithm does not natively predict probabilities, although it can be configured to predict probability-like scores by setting the “*probability*” argument to “*True*” in the SVC class.

We can fit five different versions of the SVM algorithm with a polynomial kernel, each with a different polynomial degree, set via the “*degree*” argument. We will use degrees 1-5.

Our expectation is that by combining the predicted class membership probability scores predicted by each different SVM model that the soft voting ensemble will achieve a better predictive performance than any standalone model used in the ensemble, on average.

First, we can create a function named *get_voting()* that creates the SVM models and combines them into a soft voting ensemble.

# get a voting ensemble of models def get_voting(): # define the base models models = list() models.append(('svm1', SVC(probability=True, kernel='poly', degree=1))) models.append(('svm2', SVC(probability=True, kernel='poly', degree=2))) models.append(('svm3', SVC(probability=True, kernel='poly', degree=3))) models.append(('svm4', SVC(probability=True, kernel='poly', degree=4))) models.append(('svm5', SVC(probability=True, kernel='poly', degree=5))) # define the voting ensemble ensemble = VotingClassifier(estimators=models, voting='soft') return ensemble

We can then create a list of models to evaluate, including each standalone version of the SVM model configurations and the soft voting ensemble.

This will help us directly compare each standalone configuration of the SVM model with the ensemble in terms of the distribution of classification accuracy scores. The *get_models()* function below creates the list of models for us to evaluate.

# get a list of models to evaluate def get_models(): models = dict() models['svm1'] = SVC(probability=True, kernel='poly', degree=1) models['svm2'] = SVC(probability=True, kernel='poly', degree=2) models['svm3'] = SVC(probability=True, kernel='poly', degree=3) models['svm4'] = SVC(probability=True, kernel='poly', degree=4) models['svm5'] = SVC(probability=True, kernel='poly', degree=5) models['soft_voting'] = get_voting() return models

We can evaluate and report model performance using repeated k-fold cross-validation as we did in the previous section.

Tying this together, the complete example is listed below.

# compare soft voting ensemble to standalone classifiers from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.svm import SVC from sklearn.ensemble import VotingClassifier from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=2) return X, y # get a voting ensemble of models def get_voting(): # define the base models models = list() models.append(('svm1', SVC(probability=True, kernel='poly', degree=1))) models.append(('svm2', SVC(probability=True, kernel='poly', degree=2))) models.append(('svm3', SVC(probability=True, kernel='poly', degree=3))) models.append(('svm4', SVC(probability=True, kernel='poly', degree=4))) models.append(('svm5', SVC(probability=True, kernel='poly', degree=5))) # define the voting ensemble ensemble = VotingClassifier(estimators=models, voting='soft') return ensemble # get a list of models to evaluate def get_models(): models = dict() models['svm1'] = SVC(probability=True, kernel='poly', degree=1) models['svm2'] = SVC(probability=True, kernel='poly', degree=2) models['svm3'] = SVC(probability=True, kernel='poly', degree=3) models['svm4'] = SVC(probability=True, kernel='poly', degree=4) models['svm5'] = SVC(probability=True, kernel='poly', degree=5) models['soft_voting'] = get_voting() return models # evaluate a give model using cross-validation def evaluate_model(model, X, y): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model, X, y) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean and standard deviation accuracy for each model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see the soft voting ensemble achieves a better classification accuracy of about 92.4% compared to all standalone versions of the model.

>svm1 0.855 (0.035) >svm2 0.859 (0.034) >svm3 0.890 (0.035) >svm4 0.808 (0.037) >svm5 0.850 (0.037) >soft_voting 0.924 (0.028)

A box-and-whisker plot is then created comparing the distribution accuracy scores for each model, allowing us to clearly see that soft voting ensemble performing better than all standalone models on average.

If we choose a soft voting ensemble as our final model, we can fit and use it to make predictions on new data just like any other model.

First, the soft voting ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make a prediction with a soft voting ensemble from sklearn.datasets import make_classification from sklearn.ensemble import VotingClassifier from sklearn.svm import SVC # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=2) # define the base models models = list() models.append(('svm1', SVC(probability=True, kernel='poly', degree=1))) models.append(('svm2', SVC(probability=True, kernel='poly', degree=2))) models.append(('svm3', SVC(probability=True, kernel='poly', degree=3))) models.append(('svm4', SVC(probability=True, kernel='poly', degree=4))) models.append(('svm5', SVC(probability=True, kernel='poly', degree=5))) # define the soft voting ensemble ensemble = VotingClassifier(estimators=models, voting='soft') # fit the model on all available data ensemble.fit(X, y) # make a prediction for one example data = [[5.88891819,2.64867662,-0.42728226,-1.24988856,-0.00822,-3.57895574,2.87938412,-1.55614691,-0.38168784,7.50285659,-1.16710354,-5.02492712,-0.46196105,-0.64539455,-1.71297469,0.25987852,-0.193401,-5.52022952,0.0364453,-1.960039]] yhat = ensemble.predict(data) print('Predicted Class: %d' % (yhat))

Running the example fits the soft voting ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 1

In this section, we will look at using voting for a regression problem.

The complete example is listed below.

# test regression dataset from sklearn.datasets import make_regression # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

We can demonstrate ensemble voting for regression with a decision tree algorithm, sometimes referred to as a classification and regression tree (CART) algorithm.

We can fit five different versions of the CART algorithm, each with a different maximum depth of the decision tree, set via the “*max_depth*” argument. We will use depths of 1-5.

Our expectation is that by combining the values predicted by each different CART model that the voting ensemble will achieve a better predictive performance than any standalone model used in the ensemble, on average.

First, we can create a function named *get_voting()* that creates each CART model and combines the models into a voting ensemble.

# get a voting ensemble of models def get_voting(): # define the base models models = list() models.append(('cart1', DecisionTreeRegressor(max_depth=1))) models.append(('cart2', DecisionTreeRegressor(max_depth=2))) models.append(('cart3', DecisionTreeRegressor(max_depth=3))) models.append(('cart4', DecisionTreeRegressor(max_depth=4))) models.append(('cart5', DecisionTreeRegressor(max_depth=5))) # define the voting ensemble ensemble = VotingRegressor(estimators=models) return ensemble

We can then create a list of models to evaluate, including each standalone version of the CART model configurations and the soft voting ensemble.

This will help us directly compare each standalone configuration of the CART model with the ensemble in terms of the distribution of error scores. The *get_models()* function below creates the list of models for us to evaluate.

# get a list of models to evaluate def get_models(): models = dict() models['cart1'] = DecisionTreeRegressor(max_depth=1) models['cart2'] = DecisionTreeRegressor(max_depth=2) models['cart3'] = DecisionTreeRegressor(max_depth=3) models['cart4'] = DecisionTreeRegressor(max_depth=4) models['cart5'] = DecisionTreeRegressor(max_depth=5) models['voting'] = get_voting() return models

We can evaluate and report model performance using repeated k-fold cross-validation as we did in the previous section.

Models are evaluated using mean absolute error (MAE). The scikit-learn makes the score negative so that it can be maximized. This means that the reported MAE scores are negative, larger values are better, and 0 represents no error.

Tying this together, the complete example is listed below.

# compare voting ensemble to each standalone models for regression from numpy import mean from numpy import std from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedKFold from sklearn.tree import DecisionTreeRegressor from sklearn.ensemble import VotingRegressor from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1) return X, y # get a voting ensemble of models def get_voting(): # define the base models models = list() models.append(('cart1', DecisionTreeRegressor(max_depth=1))) models.append(('cart2', DecisionTreeRegressor(max_depth=2))) models.append(('cart3', DecisionTreeRegressor(max_depth=3))) models.append(('cart4', DecisionTreeRegressor(max_depth=4))) models.append(('cart5', DecisionTreeRegressor(max_depth=5))) # define the voting ensemble ensemble = VotingRegressor(estimators=models) return ensemble # get a list of models to evaluate def get_models(): models = dict() models['cart1'] = DecisionTreeRegressor(max_depth=1) models['cart2'] = DecisionTreeRegressor(max_depth=2) models['cart3'] = DecisionTreeRegressor(max_depth=3) models['cart4'] = DecisionTreeRegressor(max_depth=4) models['cart5'] = DecisionTreeRegressor(max_depth=5) models['voting'] = get_voting() return models # evaluate a give model using cross-validation def evaluate_model(model, X, y): cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model, X, y) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean and standard deviation accuracy for each model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see the voting ensemble achieves a better mean squared error of about -136.338, which is larger (better) compared to all standalone versions of the model.

>cart1 -161.519 (11.414) >cart2 -152.596 (11.271) >cart3 -142.378 (10.900) >cart4 -140.086 (12.469) >cart5 -137.641 (12.240) >voting -136.338 (11.242)

A box-and-whisker plot is then created comparing the distribution negative MAE scores for each model, allowing us to clearly see that voting ensemble performing better than all standalone models on average.

If we choose a voting ensemble as our final model, we can fit and use it to make predictions on new data just like any other model.

First, the voting ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make a prediction with a voting ensemble from sklearn.datasets import make_regression from sklearn.tree import DecisionTreeRegressor from sklearn.ensemble import VotingRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1) # define the base models models = list() models.append(('cart1', DecisionTreeRegressor(max_depth=1))) models.append(('cart2', DecisionTreeRegressor(max_depth=2))) models.append(('cart3', DecisionTreeRegressor(max_depth=3))) models.append(('cart4', DecisionTreeRegressor(max_depth=4))) models.append(('cart5', DecisionTreeRegressor(max_depth=5))) # define the voting ensemble ensemble = VotingRegressor(estimators=models) # fit the model on all available data ensemble.fit(X, y) # make a prediction for one example data = [[0.59332206,-0.56637507,1.34808718,-0.57054047,-0.72480487,1.05648449,0.77744852,0.07361796,0.88398267,2.02843157,1.01902732,0.11227799,0.94218853,0.26741783,0.91458143,-0.72759572,1.08842814,-0.61450942,-0.69387293,1.69169009]] yhat = ensemble.predict(data) print('Predicted Value: %.3f' % (yhat))

Running the example fits the voting ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Value: 141.319

This section provides more resources on the topic if you are looking to go deeper.

- How to Develop a Weighted Average Ensemble for Deep Learning Neural Networks
- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras

- Ensemble methods scikit-learn API.
- sklearn.ensemble.VotingClassifier API.
- sklearn.ensemble.VotingRegressor API.

In this tutorial, you discovered how to create voting ensembles for machine learning algorithms in Python.

Specifically, you learned:

- A voting ensemble involves summing the predictions made by classification models or averaging the predictions made by regression models.
- How voting ensembles work, when to use voting ensembles, and the limitations of the approach.
- How to implement a hard voting ensemble and soft voting ensembles for classification predictive modeling.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Voting Ensembles With Python appeared first on Machine Learning Mastery.

]]>The post One-vs-Rest and One-vs-One for Multi-Class Classification appeared first on Machine Learning Mastery.

]]>Algorithms such as the Perceptron, Logistic Regression, and Support Vector Machines were designed for binary classification and do not natively support classification tasks with more than two classes.

One approach for using binary classification algorithms for multi-classification problems is to split the multi-class classification dataset into multiple binary classification datasets and fit a binary classification model on each. Two different examples of this approach are the One-vs-Rest and One-vs-One strategies.

In this tutorial, you will discover One-vs-Rest and One-vs-One strategies for multi-class classification.

After completing this tutorial, you will know:

- Binary classification models like logistic regression and SVM do not support multi-class classification natively and require meta-strategies.
- The One-vs-Rest strategy splits a multi-class classification into one binary classification problem per class.
- The One-vs-One strategy splits a multi-class classification into one binary classification problem per each pair of classes.

Let’s get started.

This tutorial is divided into three parts; they are:

- Binary Classifiers for Multi-Class Classification
- One-Vs-Rest for Multi-Class Classification
- One-Vs-One for Multi-Class Classification

Classification is a predictive modeling problem that involves assigning a class label to an example.

Binary classification are those tasks where examples are assigned exactly one of two classes. Multi-class classification is those tasks where examples are assigned exactly one of more than two classes.

**Binary Classification**: Classification tasks with two classes.**Multi-class Classification**: Classification tasks with more than two classes.

Some algorithms are designed for binary classification problems. Examples include:

- Logistic Regression
- Perceptron
- Support Vector Machines

As such, they cannot be used for multi-class classification tasks, at least not directly.

Instead, heuristic methods can be used to split a multi-class classification problem into multiple binary classification datasets and train a binary classification model each.

Two examples of these heuristic methods include:

- One-vs-Rest (OvR)
- One-vs-One (OvO)

Let’s take a closer look at each.

One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification.

It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem and predictions are made using the model that is the most confident.

For example, given a multi-class classification problem with examples for each class ‘*red*,’ ‘*blue*,’ and ‘*green*‘. This could be divided into three binary classification datasets as follows:

**Binary Classification Problem 1**: red vs [blue, green]**Binary Classification Problem 2**: blue vs [red, green]**Binary Classification Problem 3**: green vs [red, blue]

A possible downside of this approach is that it requires one model to be created for each class. For example, three classes requires three models. This could be an issue for large datasets (e.g. millions of rows), slow models (e.g. neural networks), or very large numbers of classes (e.g. hundreds of classes).

The obvious approach is to use a one-versus-the-rest approach (also called one-vs-all), in which we train C binary classifiers, fc(x), where the data from class c is treated as positive, and the data from all the other classes is treated as negative.

— Page 503, Machine Learning: A Probabilistic Perspective, 2012.

This approach requires that each model predicts a class membership probability or a probability-like score. The argmax of these scores (class index with the largest score) is then used to predict a class.

This approach is commonly used for algorithms that naturally predict numerical class membership probability or score, such as:

- Logistic Regression
- Perceptron

As such, the implementation of these algorithms in the scikit-learn library implements the OvR strategy by default when using these algorithms for multi-class classification.

We can demonstrate this with an example on a 3-class classification problem using the LogisticRegression algorithm. The strategy for handling multi-class classification can be set via the “*multi_class*” argument and can be set to “*ovr*” for the one-vs-rest strategy.

The complete example of fitting a logistic regression model for multi-class classification using the built-in one-vs-rest strategy is listed below.

# logistic regression for multi-class classification using built-in one-vs-rest from sklearn.datasets import make_classification from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1) # define model model = LogisticRegression(multi_class='ovr') # fit model model.fit(X, y) # make predictions yhat = model.predict(X)

The scikit-learn library also provides a separate OneVsRestClassifier class that allows the one-vs-rest strategy to be used with any classifier.

This class can be used to use a binary classifier like Logistic Regression or Perceptron for multi-class classification, or even other classifiers that natively support multi-class classification.

It is very easy to use and requires that a classifier that is to be used for binary classification be provided to the *OneVsRestClassifier* as an argument.

The example below demonstrates how to use the *OneVsRestClassifier* class with a *LogisticRegression* class used as the binary classification model.

# logistic regression for multi-class classification using a one-vs-rest from sklearn.datasets import make_classification from sklearn.linear_model import LogisticRegression from sklearn.multiclass import OneVsRestClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1) # define model model = LogisticRegression() # define the ovr strategy ovr = OneVsRestClassifier(model) # fit model ovr.fit(X, y) # make predictions yhat = ovr.predict(X)

One-vs-One (OvO for short) is another heuristic method for using binary classification algorithms for multi-class classification.

Like one-vs-rest, one-vs-one splits a multi-class classification dataset into binary classification problems. Unlike one-vs-rest that splits it into one binary dataset for each class, the one-vs-one approach splits the dataset into one dataset for each class versus every other class.

For example, consider a multi-class classification problem with four classes: ‘*red*,’ ‘*blue*,’ and ‘*green*,’ ‘*yellow*.’ This could be divided into six binary classification datasets as follows:

**Binary Classification Problem 1**: red vs. blue**Binary Classification Problem 2**: red vs. green**Binary Classification Problem 3**: red vs. yellow**Binary Classification Problem 4**: blue vs. green**Binary Classification Problem 5**: blue vs. yellow**Binary Classification Problem 6**: green vs. yellow

This is significantly more datasets, and in turn, models than the one-vs-rest strategy described in the previous section.

The formula for calculating the number of binary datasets, and in turn, models, is as follows:

- (NumClasses * (NumClasses – 1)) / 2

We can see that for four classes, this gives us the expected value of six binary classification problems:

- (NumClasses * (NumClasses – 1)) / 2
- (4 * (4 – 1)) / 2
- (4 * 3) / 2
- 12 / 2
- 6

Each binary classification model may predict one class label and the model with the most predictions or votes is predicted by the one-vs-one strategy.

An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions.

— Page 183, Pattern Recognition and Machine Learning, 2006.

Similarly, if the binary classification models predict a numerical class membership, such as a probability, then the argmax of the sum of the scores (class with the largest sum score) is predicted as the class label.

Classically, this approach is suggested for support vector machines (SVM) and related kernel-based algorithms. This is believed because the performance of kernel methods does not scale in proportion to the size of the training dataset and using subsets of the training data may counter this effect.

The support vector machine implementation in the scikit-learn is provided by the SVC class and supports the one-vs-one method for multi-class classification problems. This can be achieved by setting the “*decision_function_shape*” argument to ‘*ovo*‘.

The example below demonstrates SVM for multi-class classification using the one-vs-one method.

# SVM for multi-class classification using built-in one-vs-one from sklearn.datasets import make_classification from sklearn.svm import SVC # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1) # define model model = SVC(decision_function_shape='ovo') # fit model model.fit(X, y) # make predictions yhat = model.predict(X)

The scikit-learn library also provides a separate OneVsOneClassifier class that allows the one-vs-one strategy to be used with any classifier.

This class can be used with a binary classifier like SVM, Logistic Regression or Perceptron for multi-class classification, or even other classifiers that natively support multi-class classification.

It is very easy to use and requires that a classifier that is to be used for binary classification be provided to the *OneVsOneClassifier* as an argument.

The example below demonstrates how to use the *OneVsOneClassifier* class with an SVC class used as the binary classification model.

# SVM for multi-class classification using one-vs-one from sklearn.datasets import make_classification from sklearn.svm import SVC from sklearn.multiclass import OneVsOneClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1) # define model model = SVC() # define ovo strategy ovo = OneVsOneClassifier(model) # fit model ovo.fit(X, y) # make predictions yhat = ovo.predict(X)

This section provides more resources on the topic if you are looking to go deeper.

- Pattern Recognition and Machine Learning, 2006.
- Machine Learning: A Probabilistic Perspective, 2012.

- Multiclass and multilabel algorithms, scikit-learn API.
- sklearn.multiclass.OneVsRestClassifier API.
- sklearn.multiclass.OneVsOneClassifier API.

In this tutorial, you discovered One-vs-Rest and One-vs-One strategies for multi-class classification.

Specifically, you learned:

- Binary classification models like logistic regression and SVM do not support multi-class classification natively and require meta-strategies.
- The One-vs-Rest strategy splits a multi-class classification into one binary classification problem per class.
- The One-vs-One strategy splits a multi-class classification into one binary classification problem per each pair of classes.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post One-vs-Rest and One-vs-One for Multi-Class Classification appeared first on Machine Learning Mastery.

]]>