How to Develop a Random Subspace Ensemble With Python

Random Subspace Ensemble is a machine learning algorithm that combines the predictions from multiple decision trees trained on different subsets of columns in the training dataset.

Randomly varying the columns used to train each contributing member of the ensemble has the effect of introducing diversity into the ensemble and, in turn, can lift performance over using a single decision tree.

It is related to other ensembles of decision trees such as bootstrap aggregation (bagging) that creates trees using different samples of rows from the training dataset, and random forest that combines ideas from bagging and the random subspace ensemble.

Although decision trees are often used, the general random subspace method can be used with any machine learning model whose performance varies meaningfully with the choice of input features.

In this tutorial, you will discover how to develop random subspace ensembles for classification and regression.

After completing this tutorial, you will know:

  • Random subspace ensembles are created from decision trees fit on different samples of features (columns) in the training dataset.
  • How to use the random subspace ensemble for classification and regression with scikit-learn.
  • How to explore the effect of random subspace model hyperparameters on model performance.

Let’s get started.

How to Develop a Random Subspace Ensemble With Python
Photo by Marsel Minga, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Random Subspace Ensemble
  2. Random Subspace Ensemble via Bagging
    1. Random Subspace Ensemble for Classification
    2. Random Subspace Ensemble for Regression
  3. Random Subspace Ensemble Hyperparameters
    1. Explore Number of Trees
    2. Explore Number of Features
    3. Explore Alternate Algorithm

Random Subspace Ensemble

A predictive modeling problem consists of one or more input variables and a target variable.

A variable is a column in the data and is also often referred to as a feature. We can consider all input features together as defining an n-dimensional vector space, where n is the number of input features and each example (input row of data) is a point in the feature space.

This is a common conceptualization in machine learning. As the input feature space gains dimensions, the distance between points in the space increases and the data becomes sparse, a problem generally known as the curse of dimensionality.

A subset of input features can, therefore, be thought of as a subset of the input feature space, or a subspace.

Selecting features is a way of defining a subspace of the input feature space. For example, feature selection refers to an attempt to reduce the number of dimensions of the input feature space by selecting a subset of features to keep or a subset of features to delete, often based on their relationship to the target variable.

Alternatively, we can select random subsets of input features to define random subspaces. This can be used as the basis for an ensemble learning algorithm, where a model can be fit on each random subspace of features. This is referred to as a random subspace ensemble or the random subspace method.

The training data is usually described by a set of features. Different subsets of features, or called subspaces, provide different views on the data. Therefore, individual learners trained from different subspaces are usually diverse.

— Page 116, Ensemble Methods, 2012.

It was proposed by Tin Kam Ho in the 1998 paper titled “The Random Subspace Method For Constructing Decision Forests” where a decision tree is fit on each random subspace.

More generally, it is a diversity technique for ensemble learning that belongs to a class of methods that change the training dataset for each model in the attempt to reduce the correlation between the predictions of the models in the ensemble.

The procedure is as simple as selecting a random subset of input features (columns) for each model in the ensemble and fitting the model on those features using the entire training dataset. It can be augmented with additional changes, such as using a bootstrap or random sample of the rows in the training dataset.

The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces.

The Random Subspace Method For Constructing Decision Forests, 1998.
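
The procedure described above can be sketched from scratch. This is a minimal illustration, not the implementation from the paper: the helper names fit_random_subspace and predict_random_subspace, the small synthetic dataset, and the choice to combine members by averaging their predicted class probabilities are all assumptions made for the example.

```python
# from-scratch sketch of a random subspace ensemble (illustrative only)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_random_subspace(X, y, n_models=10, n_sub=10, seed=1):
    """Fit one tree per random subset of columns; every tree sees all rows."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_models):
        # choose n_sub distinct column indices for this member
        cols = rng.choice(X.shape[1], size=n_sub, replace=False)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
        tree.fit(X[:, cols], y)
        # remember the columns so the same subspace is used at prediction time
        members.append((cols, tree))
    return members

def predict_random_subspace(members, X):
    """Average member class probabilities (an assumed combination rule)."""
    probs = np.mean([tree.predict_proba(X[:, cols]) for cols, tree in members], axis=0)
    classes = members[0][1].classes_
    return classes[np.argmax(probs, axis=1)]

# small synthetic dataset; all parameter values here are assumptions
X, y = make_classification(n_samples=200, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)
members = fit_random_subspace(X, y)
yhat = predict_random_subspace(members, X)
print(yhat[:5])
```

Each member stores its column indices alongside the fitted tree, because a prediction is only meaningful when the tree is given the same subspace it was trained on.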

As such, the random subspace ensemble is related to bootstrap aggregation (bagging), which introduces diversity by training each model, often a decision tree, on a different random sample of the training dataset drawn with replacement (e.g. the bootstrap sampling method). The random forest ensemble may also be considered a hybrid of both the bagging and random subspace ensemble methods.

Algorithms that use different feature subsets are commonly referred to as random subspace methods …

— Page 21, Ensemble Machine Learning, 2012.

The random subspace method can be used with any machine learning algorithm, although it is well suited to models that are sensitive to large changes to the input features, such as decision trees and k-nearest neighbors.

It is appropriate for datasets that have a large number of input features, as it can result in good performance with good efficiency. If the dataset contains many irrelevant input features, it may be better to use feature selection as a data preparation technique as the prevalence of irrelevant features in subspaces may hurt the performance of the ensemble.

For data with a lot of redundant features, training a learner in a subspace will be not only effective but also efficient.

— Page 116, Ensemble Methods, 2012.

Now that we are familiar with the random subspace ensemble, let’s explore how we can implement the approach.

Random Subspace Ensemble via Bagging

We can implement the random subspace ensemble using bagging in scikit-learn.

Bagging is provided via the BaggingRegressor and BaggingClassifier classes.

We can configure bagging to be a random subspace ensemble by setting the “bootstrap” argument to “False” to turn off sampling of the training dataset rows and setting the maximum number of features to a given value via the “max_features” argument.

The default model for bagging is a decision tree, but it can be changed to any model we like.

We can demonstrate using bagging to implement a random subspace ensemble with decision trees for classification and regression.

Random Subspace Ensemble for Classification

In this section, we will look at developing a random subspace ensemble using bagging for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.
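
A minimal sketch of such an example follows. The text only fixes the number of examples and input features; the n_informative, n_redundant, and random_state values are assumptions chosen for illustration.

```python
# synthetic binary classification dataset
from sklearn.datasets import make_classification

# 1,000 examples and 20 input features; the other arguments are assumed values
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)
# summarize the shape of the input and output components
print(X.shape, y.shape)  # (1000, 20) (1000,)
```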

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can configure a bagging model to be a random subspace ensemble for decision trees on this dataset.

Each model will be fit on a random subspace of 10 input features; the value 10 was chosen arbitrarily.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.
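
The evaluation described above might look like the following sketch. The dataset arguments other than the number of examples and features, and the random_state values, are assumptions; the bagging configuration follows the text (row sampling turned off, 10 features per member).

```python
# evaluate a random subspace ensemble via bagging for classification
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier

# dataset (n_informative, n_redundant and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)
# bootstrap=False keeps all rows; max_features=10 draws 10 columns per member
model = BaggingClassifier(bootstrap=False, max_features=10)
# repeated stratified k-fold: 10 folds, 3 repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
# report mean and standard deviation of accuracy across all folds and repeats
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```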

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the random subspace ensemble with default hyperparameters achieves a classification accuracy of about 85.4 percent on this test dataset.

We can also use the random subspace ensemble model as a final model and make predictions for classification.

First, the ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.
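
A sketch of this step is shown here. The dataset arguments beyond the number of examples and features are assumptions, and the first row of the dataset is used as a stand-in for a new row of data.

```python
# fit a random subspace ensemble on all data and make one prediction
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# dataset (n_informative, n_redundant and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)
# random subspace configuration: all rows, 10 random features per member
model = BaggingClassifier(bootstrap=False, max_features=10)
# fit on the entire dataset
model.fit(X, y)
# stand-in for a new row; predict() expects a 2D array of shape (n_rows, n_features)
row = X[0:1]
yhat = model.predict(row)
print('Predicted Class: %d' % yhat[0])
```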

Running the example fits the random subspace ensemble model on the entire dataset. The model is then used to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using bagging for classification, let’s look at the API for regression.

Random Subspace Ensemble for Regression

In this section, we will look at using bagging for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.
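
A minimal sketch of such an example follows. Only the number of examples and input features come from the text; the n_informative, noise, and random_state values are assumptions chosen for illustration.

```python
# synthetic regression dataset
from sklearn.datasets import make_regression

# 1,000 examples and 20 input features; the other arguments are assumed values
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=5)
# summarize the shape of the input and output components
print(X.shape, y.shape)  # (1000, 20) (1000,)
```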

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a random subspace ensemble via bagging on this dataset.

As before, we must configure bagging to use all rows of the training dataset and specify the number of input features to randomly select.

As in the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that negative MAE values closer to zero are better and a perfect model has a MAE of 0.

The complete example is listed below.
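
One way to write this evaluation is sketched here. The dataset arguments other than the number of examples and features, the random_state values, and the choice of 10 features per member (carried over from the classification section) are assumptions.

```python
# evaluate a random subspace ensemble via bagging for regression
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.ensemble import BaggingRegressor

# dataset (n_informative, noise and random_state are assumed values)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=5)
# bootstrap=False keeps all rows; max_features=10 draws 10 columns per member
model = BaggingRegressor(bootstrap=False, max_features=10)
# repeated k-fold: 10 folds, 3 repeats
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# scikit-learn reports MAE as a negative number so that larger is better
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```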

Running the example reports the mean and standard deviation of the MAE of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the random subspace ensemble with default hyperparameters achieves a MAE of about 114.

We can also use the random subspace ensemble model as a final model and make predictions for regression.

First, the ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.
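
A sketch of this step is shown here. The dataset arguments beyond the number of examples and features are assumptions, and the first row of the dataset is used as a stand-in for a new row of data.

```python
# fit a random subspace ensemble on all data and make one regression prediction
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# dataset (n_informative, noise and random_state are assumed values)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=5)
# random subspace configuration: all rows, 10 random features per member
model = BaggingRegressor(bootstrap=False, max_features=10)
# fit on the entire dataset
model.fit(X, y)
# stand-in for a new row; predict() expects a 2D array of shape (n_rows, n_features)
row = X[0:1]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])
```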

Running the example fits the random subspace ensemble model on the entire dataset. The model is then used to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using the scikit-learn API to evaluate and use random subspace ensembles, let’s look at configuring the model.

Random Subspace Ensemble Hyperparameters

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the random subspace ensemble and their effect on model performance.

Explore Number of Trees

An important hyperparameter for the random subspace method is the number of decision trees used in the ensemble. Each tree sees only a random subset of the features, which introduces diversity but also makes individual predictions more variable. Adding more trees averages out this variance and stabilizes the predictions of the ensemble.

The number of trees can be set via the “n_estimators” argument and defaults to 10.

The example below explores the effect of the number of trees with values from 10 to 5,000.
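
The sweep over ensemble sizes could be sketched as follows. To keep the run short, this sketch stops at 500 trees and uses a single stratified 5-fold pass; both choices, along with the helper names, the dataset arguments, and the 10 features per member, are assumptions. Larger values such as 1,000 or 5,000 can be appended to the list at the cost of a longer run.

```python
# explore the number of trees in a random subspace ensemble (shortened sweep)
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import BaggingClassifier

def get_dataset():
    # dataset arguments other than n_samples and n_features are assumed values
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=5)

def get_models(n_trees_options):
    # one random subspace ensemble per candidate ensemble size
    return {str(n): BaggingClassifier(n_estimators=n, bootstrap=False, max_features=10)
            for n in n_trees_options}

def evaluate_model(model, X, y):
    # lighter evaluation than three repeats of 10 folds, to limit runtime
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
models = get_models([10, 50, 100, 500])
results = {}
for name, model in models.items():
    results[name] = evaluate_model(model, X, y)
    print('>%s %.3f (%.3f)' % (name, mean(results[name]), std(results[name])))
```

The per-configuration score arrays collected in results are what a box and whisker plot would be drawn from, for example with matplotlib.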

Running the example first reports the mean accuracy for each configured number of decision trees.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that performance appears to continue to improve as the number of ensemble members is increased to 5,000.

A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

We can see the general trend of further improvement with the number of decision trees used in the ensemble.

Box Plot of Random Subspace Ensemble Size vs. Classification Accuracy

Explore Number of Features

The number of features selected for each random subspace controls the diversity of the ensemble.

Fewer features mean more diversity, whereas more features mean less diversity. More diversity may require more trees to reduce the variance of predictions made by the model.

We can vary the diversity of the ensemble by varying the number of random features selected by setting the “max_features” argument.

The example below varies the value from 1 to 20 with a fixed number of trees in the ensemble.
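
A shortened version of this sweep is sketched below. It evaluates only a few feature counts between 1 and 20 and uses a single stratified 5-fold pass so that it runs quickly; those choices, the helper names, and the dataset arguments are assumptions. The ensemble size is fixed at 100 trees.

```python
# explore the number of features per subspace (shortened sweep)
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import BaggingClassifier

def get_dataset():
    # dataset arguments other than n_samples and n_features are assumed values
    return make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=5)

def get_models(feature_counts, n_trees=100):
    # an integer max_features is the number of columns drawn for each member
    return {str(n): BaggingClassifier(n_estimators=n_trees, bootstrap=False, max_features=n)
            for n in feature_counts}

def evaluate_model(model, X, y):
    # lighter evaluation than three repeats of 10 folds, to limit runtime
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv)

X, y = get_dataset()
# use range(1, 21) to cover every value from 1 to 20 (slower)
models = get_models([1, 5, 10, 15, 20])
results = {}
for name, model in models.items():
    results[name] = evaluate_model(model, X, y)
    print('>%s %.3f (%.3f)' % (name, mean(results[name]), std(results[name])))
```

With max_features=20 every member sees all columns and all rows, so the members differ only through the randomness inside each tree; this is the low-diversity end of the range.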

Running the example first reports the mean accuracy for each number of features.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that perhaps using 8 to 11 features in the random subspaces might be appropriate on this dataset when using 100 decision trees. This might suggest increasing the number of trees to a large value first, then tuning the number of features selected in each subset.

A box and whisker plot is created for the distribution of accuracy scores for each number of random subset features.

We can see a general trend of increasing accuracy to a point and a steady decrease in performance after 11 features.

Box Plot of Random Subspace Ensemble Features vs. Classification Accuracy

Explore Alternate Algorithm

Decision trees are the most common algorithm used in a random subspace ensemble.

The reason for this is that they are easy to configure and work well on most problems.

Other algorithms can be used to construct random subspaces and must be configured to have a modestly high variance. One example is the k-nearest neighbors algorithm where the k value can be set to a low value.

The algorithm used in the ensemble is specified via the “base_estimator” argument (renamed “estimator” in scikit-learn 1.2 and later) and must be set to an instance of the algorithm and algorithm configuration to use.

The example below demonstrates using a KNeighborsClassifier as the base algorithm used in the random subspace ensemble via the bagging class. Here, the algorithm is used with default hyperparameters where k is set to 5.

The complete example is listed below.
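
A sketch of such an example follows. The dataset arguments beyond the number of examples and features, the random_state values, and the 10 features per member are assumptions. The base model is passed as the first positional argument, which avoids depending on the keyword name, since that name differs between scikit-learn versions.

```python
# evaluate a random subspace ensemble with k-nearest neighbors members
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier

# dataset (n_informative, n_redundant and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)
# KNN with default hyperparameters (k=5), passed positionally as the base model
model = BaggingClassifier(KNeighborsClassifier(), bootstrap=False, max_features=10)
# repeated stratified k-fold: 10 folds, 3 repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```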

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the random subspace ensemble with KNN and default hyperparameters achieves a classification accuracy of about 90 percent on this test dataset.

Summary

In this tutorial, you discovered how to develop random subspace ensembles for classification and regression.

Specifically, you learned:

  • Random subspace ensembles are created from decision trees fit on different samples of features (columns) in the training dataset.
  • How to use the random subspace ensemble for classification and regression with scikit-learn.
  • How to explore the effect of random subspace model hyperparameters on model performance.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

28 Responses to How to Develop a Random Subspace Ensemble With Python

  1. marco October 30, 2020 at 9:42 pm #

    Hi Jason,
    I’m thinking about prediction of values based on past information (numerical).
    I’ve seen some methods (VAR i.e. Vector Autoregressive Model), or Prophet or neural net using Keras GRU for a Timeseries forecasting for weather prediction (as an example).
    I’d like to predict job demand for a prototype.
    What is your suggestion?
    What are differences between using a VAR vs. a GRU model for a prediction?
    Is it possible to apply a GRU model to predict (it seems simpler)?
    How many years back (historical data) are necessary to build a prototype?
    Thanks,
    Marco

    • Jason Brownlee October 31, 2020 at 6:48 am #

      I recommend testing a suite of different models and discovering what works best for your dataset.

      Start with linear models, then try machine learning models, then try deep learning models. All models must outperform a simple persistence model in order to have skill.

  2. marco October 30, 2020 at 9:48 pm #

    Hi Jason,
    one more question If I’d like to predict job demand and how many students will graduate and plotting the results using two curves (to show the gap).
    Do I have to implement two separate algorithms for predictions?
    Which is the easiest way?
    Thanks,
    Marco

  3. marco November 2, 2020 at 4:03 am #

    Hi Jason,
    thanks for the answer above.
    What is a linear model?
    Is something related to a linear regression?
    Why do you say “start with linear model, then try machine learning models”?
    I usually used linear regression with sklearn (that is a machine learning
    library). You mean implementing a linear regression without any library?
    Thanks,
    Marco

    • Jason Brownlee November 2, 2020 at 6:43 am #

      A linear model is a weighted sum of inputs, like linear regression and logistic regression.

      When I suggest starting with a linear model for time series, I mean SARIMA and ETS.

      I recommend using libraries, not coding algorithms from scratch – unless your goal is to learn how to code algorithms from scratch.

  4. marco November 2, 2020 at 4:03 am #

    Hi Jason,
    one more question can XGBoost used also for timeseries? Do you have an example?
    Thanks,
    Marco

  5. marco November 3, 2020 at 8:01 am #

    Hi Jason,
    one more question what kind of metric is used for time-series with ARIMA and GRU or LSTM model to test model effectiveness?
    Thanks,
    Marco

  6. marco November 3, 2020 at 8:02 am #

    Jason,
    is it possible to use multiple features with ARIMA (so far I’ve seen example
    with one feature). Is there any example? On the contrary, I’ve seen example of GRU/ LSTM models with 14 features (e.g. Timeseries forecasting for weather prediction
    https://keras.io/examples/timeseries/timeseries_weather_forecasting/).
    Thanks,
    Marco

    • Jason Brownlee November 3, 2020 at 10:09 am #

      Not really. You can have exogenous variables, so called ARIMAX and SARIMAX.

  7. marco November 16, 2020 at 7:10 am #

    Hello Jason,
    I’m working on time series, I would like to use linear models (e.g. SARIMA) and deep learning (Keras GRU).
    I’d like metrics like MAE and RMSE . What is the difference (in simple word) between them? Do you suggest any others metrics?
    Marco

    • Jason Brownlee November 16, 2020 at 7:35 am #

      MAE: mean absolute error.
      RMSE: root mean squared error.

      Either of these metrics is a good starting point.

  8. marco November 16, 2020 at 7:11 am #

    Jason,
    is it possible to plot metrics like MAE and RMSE? Do you have any example?
    Thanks,
    Marco

  9. marco November 17, 2020 at 2:37 am #

    Jason,
    what are main differences between ARIMA and a deep learning (e.g. GRU o LSTM)
    in term of accuracy?

    • Jason Brownlee November 17, 2020 at 6:32 am #

      ARIMA is a linear model.

      Deep learning can learn any function, in theory (given the right architecture and learning configuration).

  10. marco November 17, 2020 at 2:37 am #

    Jason,
    What is easier to use ARIMA vs. GRU / LSTM?
    Marco

  11. marco November 17, 2020 at 5:44 am #

    Jason,
    what Python library do you suggest to use for ARIMA, SARIMA and SARIMAX?
    Thank you,
    Marco

  12. marco November 17, 2020 at 5:46 am #

    Jason,
    one more question, what are major differences between using XGBRegressor and FBProphet for forecasts based on time series?
    Thanks,
    Marco

    • Jason Brownlee November 17, 2020 at 6:35 am #

      XGBoost is a regression model, not specialised for time series.

      Prophet is a time series model.

  13. marco November 17, 2020 at 7:27 am #

    Jason,
    thanks a lot for you answers.
    A VAR (Vector Autoregressive Model) is something similar to ARIMA? At the end can they produce similar results?
    Thanks,
    Marco

  14. marco November 17, 2020 at 7:28 am #

    Jason,
    in order to predict job demand what is better to use VAR o ARIMA models?
    Thanks,
    Marco

    • Jason Brownlee November 17, 2020 at 7:36 am #

      I recommend testing a suite of models and discovering what works well or best for your specific dataset.
