How to Develop an Extra Trees Ensemble with Python

Extra Trees is an ensemble machine learning algorithm that combines the predictions from many decision trees.

It is related to the widely used random forest algorithm. It can often achieve as-good or better performance than the random forest algorithm, although it uses a simpler algorithm to construct the decision trees used as members of the ensemble.

It is also easy to use given that it has few key hyperparameters and sensible heuristics for configuring these hyperparameters.

In this tutorial, you will discover how to develop Extra Trees ensembles for classification and regression.

After completing this tutorial, you will know:

  • Extra Trees ensemble is an ensemble of decision trees and is related to bagging and random forest.
  • How to use the Extra Trees ensemble for classification and regression with scikit-learn.
  • How to explore the effect of Extra Trees model hyperparameters on model performance.


Let’s get started.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Extra Trees Algorithm
  2. Extra Trees Scikit-Learn API
    1. Extra Trees for Classification
    2. Extra Trees for Regression
  3. Extra Trees Hyperparameters
    1. Explore Number of Trees
    2. Explore Number of Features
    3. Explore Minimum Samples per Split

Extra Trees Algorithm

Extremely Randomized Trees, or Extra Trees for short, is an ensemble machine learning algorithm.

Specifically, it is an ensemble of decision trees and is related to other decision tree ensemble algorithms such as bootstrap aggregation (bagging) and random forest.

The Extra Trees algorithm works by creating a large number of unpruned decision trees from the training dataset. Predictions are made by averaging the prediction of the decision trees in the case of regression or using majority voting in the case of classification.

  • Regression: Predictions made by averaging predictions from decision trees.
  • Classification: Predictions made by majority voting from decision trees.

The predictions of the trees are aggregated to yield the final prediction, by majority vote in classification problems and arithmetic average in regression problems.

Extremely Randomized Trees, 2006.
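To make the regression case concrete, the sketch below fits a small ExtraTreesRegressor from scikit-learn (the class introduced later in this tutorial) and checks that the ensemble prediction matches the arithmetic mean of the individual trees' predictions. The dataset sizes and the choice of 10 trees are illustrative assumptions.

```python
# sketch: an Extra Trees regression prediction is the mean of the tree predictions
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# small synthetic dataset (sizes chosen only for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=1)
model = ExtraTreesRegressor(n_estimators=10, random_state=1)
model.fit(X, y)
# collect the prediction of each fitted tree for the first three rows
tree_preds = np.array([tree.predict(X[:3]) for tree in model.estimators_])
# average over the trees by hand
manual_average = tree_preds.mean(axis=0)
print(manual_average)
print(model.predict(X[:3]))
```

The two printed arrays should agree. For classification, the scikit-learn implementation averages the class probabilities predicted by the trees rather than counting hard votes, which usually gives the same label as a majority vote.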

Unlike bagging and random forest, which develop each decision tree from a bootstrap sample of the training dataset, the Extra Trees algorithm fits each decision tree on the whole training dataset.

Like random forest, the Extra Trees algorithm will randomly sample the features at each split point of a decision tree. Unlike random forest, which uses a greedy algorithm to select an optimal split point, the Extra Trees algorithm selects a split point at random.

The Extra-Trees algorithm builds an ensemble of unpruned decision or regression trees according to the classical top-down procedure. Its two main differences with other tree-based ensemble methods are that it splits nodes by choosing cut-points fully at random and that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees.

Extremely Randomized Trees, 2006.
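The split selection described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation: the function name pick_random_split is hypothetical, and the plain variance-reduction score used to compare the candidate splits is an assumption chosen for a regression target.

```python
# sketch: choose a split the Extra Trees way (random features, random cut-points)
import numpy as np

def pick_random_split(X, y, k, rng):
    """Return (feature_index, cut_point, score) or None if no valid split is found."""
    # randomly select k candidate features, as random forest also does
    candidates = rng.choice(X.shape[1], size=k, replace=False)
    best = None
    for j in candidates:
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            continue  # constant feature, nothing to split
        # draw the cut-point uniformly at random instead of searching for the best one
        cut = rng.uniform(lo, hi)
        left = X[:, j] < cut
        right = ~left
        if left.sum() == 0 or right.sum() == 0:
            continue
        # variance reduction of the target achieved by this split
        score = y.var() - (left.mean() * y[left].var() + right.mean() * y[right].var())
        if best is None or score > best[2]:
            best = (int(j), float(cut), float(score))
    return best

# small demonstration on made-up data where only the first feature matters
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)
split = pick_random_split(X_demo, y_demo, k=2, rng=rng)
print(split)
```

Only one random cut-point is tried per candidate feature, and the best of those k candidates is kept. A random forest would instead search each candidate feature for its best cut-point.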

As such, there are three main hyperparameters to tune in the algorithm; they are the number of decision trees in the ensemble, the number of input features to randomly select and consider for each split point, and the minimum number of samples required in a node to create a new split point.

It has two parameters: K, the number of attributes randomly selected at each node and nmin, the minimum sample size for splitting a node. […] we denote by M the number of trees of this ensemble.

Extremely Randomized Trees, 2006.

The random selection of split points makes the decision trees in the ensemble less correlated, although it also makes each individual tree a noisier, higher-variance predictor. This increase in variance can be countered by increasing the number of trees used in the ensemble.

The parameters K, nmin and M have different effects: K determines the strength of the attribute selection process, nmin the strength of averaging output noise, and M the strength of the variance reduction of the ensemble model aggregation.

Extremely Randomized Trees, 2006.

Extra Trees Scikit-Learn API

Extra Trees ensembles can be implemented from scratch, although this can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of Extra Trees for machine learning.

It is available in a recent version of the library.

First, confirm that you are using a modern version of the library by running the following script:
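A minimal version check might look like the following. It uses only the standard `__version__` attribute of the scikit-learn package.

```python
# check scikit-learn version
import sklearn
print(sklearn.__version__)
```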

Running the script will print your version of scikit-learn.

Your version should be the same or higher.

If not, you must upgrade your version of the scikit-learn library.

Extra Trees is provided via the ExtraTreesRegressor and ExtraTreesClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop an Extra Trees ensemble for both classification and regression.

Extra Trees for Classification

In this section, we will look at using Extra Trees for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.
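A sketch of such an example follows. The number of examples and input features come from the description above. The remaining arguments (n_informative, n_redundant, and random_state) are assumptions chosen for illustration.

```python
# synthetic binary classification dataset
from sklearn.datasets import make_classification

# 1,000 examples with 20 input features; other arguments are assumed values
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=4)
# summarize the shape of the input and output components
print(X.shape, y.shape)
```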

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate an Extra Trees algorithm on this dataset.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.
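One way to write this evaluation is sketched below, using RepeatedStratifiedKFold and cross_val_score from scikit-learn. The dataset arguments other than its size, and the random_state values, are assumptions.

```python
# evaluate an Extra Trees classifier with repeated stratified k-fold cross-validation
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# synthetic dataset (arguments other than the size are assumptions)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=4)
# model with default hyperparameters
model = ExtraTreesClassifier()
# 10 folds, three repeats, so 30 accuracy scores in total
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
# report mean and standard deviation of accuracy
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```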

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see the Extra Trees ensemble with default hyperparameters achieves a classification accuracy of about 91 percent on this test dataset.

We can also use the Extra Trees model as a final model and make predictions for classification.

First, the Extra Trees ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.
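A sketch of fitting a final model and making one prediction is shown here. The dataset arguments other than its size are assumptions, and the all-zero input row is a hypothetical stand-in for a new observation.

```python
# fit a final Extra Trees classifier and make a single prediction
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# synthetic dataset (arguments other than the size are assumptions)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=4)
# fit the model on all available data
model = ExtraTreesClassifier()
model.fit(X, y)
# hypothetical new row with 20 feature values
row = [[0.0] * 20]
yhat = model.predict(row)
print('Predicted Class: %d' % yhat[0])
```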

Running the example fits the Extra Trees ensemble model on the entire dataset and then uses it to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using Extra Trees for classification, let’s look at the API for regression.

Extra Trees for Regression

In this section, we will look at using Extra Trees for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.
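A sketch of such an example follows. The number of examples and input features come from the description above. The values for n_informative, noise, and random_state are treated here as illustrative settings.

```python
# synthetic regression dataset
from sklearn.datasets import make_regression

# 1,000 examples with 20 input features
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=3)
# summarize the shape of the input and output components
print(X.shape, y.shape)
```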

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate an Extra Trees algorithm on this dataset.

As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds.

The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that negative MAE values closer to zero are better and a perfect model has a MAE of 0.

The complete example is listed below.
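One way to write this evaluation is sketched below, using RepeatedKFold and the 'neg_mean_absolute_error' scoring string from scikit-learn. The dataset settings and random_state values are assumptions.

```python
# evaluate an Extra Trees regressor with repeated k-fold cross-validation
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# synthetic dataset (settings are assumptions apart from the size)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=3)
# model with default hyperparameters
model = ExtraTreesRegressor()
# 10 folds, three repeats, so 30 scores in total
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# scores are negated MAE values, so they are all zero or below
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```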

Running the example reports the mean and standard deviation of the MAE of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see the Extra Trees ensemble with default hyperparameters achieves a MAE of about 70.

We can also use the Extra Trees model as a final model and make predictions for regression.

First, the Extra Trees ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.
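A sketch of fitting a final regression model and making one prediction is shown here. The dataset settings are assumptions apart from its size, and the all-zero input row is a hypothetical stand-in for a new observation.

```python
# fit a final Extra Trees regressor and make a single prediction
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# synthetic dataset (settings are assumptions apart from the size)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=3)
# fit the model on all available data
model = ExtraTreesRegressor()
model.fit(X, y)
# hypothetical new row with 20 feature values
row = [[0.0] * 20]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])
```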

Running the example fits the Extra Trees ensemble model on the entire dataset and then uses it to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using the scikit-learn API to evaluate and use Extra Trees ensembles, let’s look at configuring the model.

Extra Trees Hyperparameters

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the Extra Trees ensemble and their effect on model performance.

Explore Number of Trees

An important hyperparameter for the Extra Trees algorithm is the number of decision trees used in the ensemble.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Bagging, Random Forest, and Extra Trees algorithms appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.

The number of trees can be set via the “n_estimators” argument and defaults to 100.

The example below explores the effect of the number of trees with values from 10 to 5,000.
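A reduced sketch of this experiment is shown below. To keep the runtime short it makes two assumptions that differ from the description: the grid stops at 500 trees, and a single pass of 5-fold cross-validation is used. Extend the list up to 5,000 and raise n_splits and n_repeats to reproduce the full experiment. The dataset arguments other than its size are also assumptions.

```python
# sketch: effect of the number of trees on Extra Trees accuracy
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# synthetic dataset (arguments other than the size are assumptions)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=4)
# reduced evaluation (assumption): one pass of 5-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
# accuracy scores keyed by the number of trees
results = {}
for n_trees in [10, 50, 100, 500]:
    model = ExtraTreesClassifier(n_estimators=n_trees)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    results[n_trees] = scores
    print('>%d %.3f (%.3f)' % (n_trees, mean(scores), std(scores)))
```

The lists of scores in results can be passed to a plotting library to draw one box and whisker plot per configuration.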

Running the example first reports the mean accuracy for each configured number of decision trees.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that performance rises and stays flat after about 100 trees. Mean accuracy scores fluctuate across 100, 500, and 1,000 trees and this may be statistical noise.

A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

We can see the general trend of increasing performance with the number of trees, perhaps leveling out after 100 trees.

Box Plot of Extra Trees Ensemble Size vs. Classification Accuracy

Explore Number of Features

The number of features that is randomly sampled for each split point is perhaps the most important hyperparameter to configure for Extra Trees, as it is for Random Forest.

Like Random Forest, the Extra Trees algorithm is not highly sensitive to the exact value used, although the hyperparameter is still worth tuning.

It is set via the max_features argument and, for the classifier, defaults to the square root of the number of input features. In this case for our test dataset, this would be sqrt(20) or about four features.

The example below explores the effect of the number of features randomly selected at each split point on model accuracy. We will try values from 1 to 20 and would expect a small value around four to perform well based on the heuristic.
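A reduced sketch of this experiment is shown below. To keep the runtime short it makes two assumptions that differ from the description: only a subset of the values from 1 to 20 is tried, and a single pass of 5-fold cross-validation is used. Replace the list with range(1, 21) and raise n_splits and n_repeats to reproduce the full experiment. The dataset arguments other than its size are also assumptions.

```python
# sketch: effect of the number of features per split on Extra Trees accuracy
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# synthetic dataset (arguments other than the size are assumptions)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=4)
# reduced evaluation (assumption): one pass of 5-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
# accuracy scores keyed by the feature set size
results = {}
for n_features in [1, 2, 4, 6, 9, 12, 16, 20]:
    model = ExtraTreesClassifier(max_features=n_features)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    results[n_features] = scores
    print('>%d %.3f (%.3f)' % (n_features, mean(scores), std(scores)))
```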

Running the example first reports the mean accuracy for each feature set size.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, the results suggest that a value between four and nine would be appropriate, confirming the sensible default of four on this dataset.

A value of nine might even be better given the larger mean and smaller standard deviation in classification accuracy, although the differences in scores may or may not be statistically significant.

A box and whisker plot is created for the distribution of accuracy scores for each feature set size.

We see a trend in performance rising and peaking with values between four and nine and falling or staying flat as larger feature set sizes are considered.

Box Plot of Extra Trees Feature Set Size vs. Classification Accuracy

Explore Minimum Samples per Split

A final interesting hyperparameter is the number of samples in a node of the decision tree before adding a split.

New splits are only added to a decision tree if the number of samples is equal to or exceeds this value. It is set via the “min_samples_split” argument and defaults to two samples (the lowest value). Smaller numbers of samples result in more splits and a deeper, more specialized tree. In turn, this can mean lower correlation between the predictions made by trees in the ensemble and potentially lift performance.

The example below explores the effect of the minimum number of samples required before splitting on Extra Trees model performance, testing values from two to 14.
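A reduced sketch of this experiment is shown below. It tries each value from 2 to 14, but uses a single pass of 5-fold cross-validation to keep the runtime short, which is an assumption. Raise n_splits and n_repeats for a more reliable estimate. The dataset arguments other than its size are also assumptions.

```python
# sketch: effect of min_samples_split on Extra Trees accuracy
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# synthetic dataset (arguments other than the size are assumptions)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=4)
# reduced evaluation (assumption): one pass of 5-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
# accuracy scores keyed by the minimum samples per split
results = {}
for min_samples in range(2, 15):
    model = ExtraTreesClassifier(min_samples_split=min_samples)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    results[min_samples] = scores
    print('>%d %.3f (%.3f)' % (min_samples, mean(scores), std(scores)))
```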

Running the example first reports the mean accuracy for each configured minimum number of samples per split.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that small values result in better performance, confirming the sensible default of two.

A box and whisker plot is created for the distribution of accuracy scores for each configured minimum number of samples per split.

In this case, we can see a trend of improved performance with fewer minimum samples for a split, as we might expect.

Box Plot of Extra Trees Minimum Samples per Split vs. Classification Accuracy

Summary

In this tutorial, you discovered how to develop Extra Trees ensembles for classification and regression.

Specifically, you learned:

  • Extra Trees ensemble is an ensemble of decision trees and is related to bagging and random forest.
  • How to use the Extra Trees ensemble for classification and regression with scikit-learn.
  • How to explore the effect of Extra Trees model hyperparameters on model performance.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

14 Responses to How to Develop an Extra Trees Ensemble with Python

  1. Mark Littlewood April 22, 2020 at 7:33 am #

    Is there a facility within this algo or others to specify the selection of features rather than randomly select. Let me explain more clearly. If I have features a,b,c,d,e,f,g,h lets suppose I feel that a,b,c are logically related in terms of predictive power and the others belong to a second group. Now I could create two models and then some how pool the predictions but I am wondering if there is a module to do all this under the operation of one model…thanks

    • Jason Brownlee April 22, 2020 at 7:48 am #

      No. It selects features that result in the best performing model.

      If you wanted, you could perform feature selection as a pre-processing step.

  2. Vandana July 12, 2020 at 8:50 pm #

    Hi Jason, I would like to plot the confusion matrix for the accuracy received from “Explore Number of Features”. Each feature shows separate accuracy, so how should I fit the model and plot the confusion matrix with the ‘y-test’ and ‘y-predict’. Thank you.

    • Jason Brownlee July 13, 2020 at 6:02 am #

      You can choose one configuration, then split your dataset into train and test sets, fit on the train set and predict for the test set and calculate the confusion matrix for those predictions.

  3. Vandana August 16, 2020 at 1:30 am #

    Hi Jason, What could be the possible reason for any model performing poorly after optimizing the hyperparameters. In my Extra Tree algo, the model scored 72% but after optimizing the hyperparameters, it could score only 63%. Pls can you suggest the reason?

    • Jason Brownlee August 16, 2020 at 5:55 am #

      Perhaps your test harness has a high variance?

      Try repeated stratified k-fold cross-validation, with many repeats, like 10 or 30.

  4. gamze September 5, 2020 at 7:53 am #

    How can I cite this page in my PH.D Thesis?

  5. Kami November 29, 2020 at 11:21 pm #

    When I use BaggingClassifier like:

    tree = DecisionTreeClassifier(max_depth = 1, splitter = "random", max_features = "sqrt")
    extra_trees = BaggingClassifier(base_estimator = tree, n_estimators = 50,
    bootstrap = False)

    and the same effect is when I use ExtraTreesClassifier like:

    extra_trees2=ExtraTreesClassifier(n_estimators=50, max_features="sqrt")

    What are the benefits of using an ExtraTreesClassifier, not a BaggingClassifier, beyond the obvious simplicity?

    • Jason Brownlee November 30, 2020 at 6:37 am #

      There’s only really a benefit if it results in better performance for your specific prediction task.

  6. Esther December 5, 2020 at 4:26 am #

    Thank you so much for your time and expertise! How can I use my own dataset in the code (instead of creating a sample dataset using “X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=3)”)? I have a dataset of 45 predictive features that are all numeric data. Thanks!

  7. Esther December 5, 2020 at 12:20 pm #

    Thank you so much, Dr. Brownlee, I was just unsure as to whether I needed to define additional aspects in the load code for a custom dataset. Have a great weekend! 😀
