How to Develop a Feature Selection Subspace Ensemble in Python

Random subspace ensembles consist of the same model fit on different randomly selected groups of input features (columns) in the training dataset.

There are many ways to choose groups of features in the training dataset, and feature selection is a popular class of data preparation techniques designed specifically for this purpose. The features selected by different configurations of the same feature selection method, or by entirely different feature selection methods, can be used as the basis for ensemble learning.

In this tutorial, you will discover how to develop feature selection subspace ensembles with Python.

After completing this tutorial, you will know:

  • Feature selection provides an alternative to random subspaces for selecting groups of input features.
  • How to develop and evaluate ensembles composed of features selected by single feature selection techniques.
  • How to develop and evaluate ensembles composed of features selected by multiple different feature selection techniques.

Let’s get started.

Photo by Bernard Spragg. NZ, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Feature Selection Subspace Ensemble
  2. Single Feature Selection Method Ensembles
    1. ANOVA F-statistic Ensemble
    2. Mutual Information Ensemble
    3. Recursive Feature Selection Ensemble
  3. Combined Feature Selection Ensembles
    1. Ensemble With Fixed Number of Features
    2. Ensemble With Contiguous Number of Features

Feature Selection Subspace Ensemble

The random subspace method or random subspace ensemble is an approach to ensemble learning that fits a model on different groups of randomly selected columns in the training dataset.

The difference in the choice of columns used to train each model in the ensemble results in a diversity of models and their predictions. Each model performs well, although each performs differently, making different errors.

The training data is usually described by a set of features. Different subsets of features, or called subspaces, provide different views on the data. Therefore, individual learners trained from different subspaces are usually diverse.

— Page 116, Ensemble Methods, 2012.

The random subspace method is often used with decision trees and the predictions made by each tree are then combined using simple statistics, such as calculating the mode class label for classification or the mean prediction for regression.

Feature selection is a data preparation technique that attempts to select the subset of columns in a dataset that is most relevant to the target variable. Popular approaches include scoring each feature with a statistical measure, such as mutual information, and model-based methods that repeatedly fit a model and discard the least important features, such as recursive feature elimination, or RFE for short.

Each feature selection method will have a different idea or informed guess about what features are most relevant to the target variable. Further, feature selection methods can be tailored to select a specific number of features from 1 to the total number of columns in the dataset, a hyperparameter that can be tuned as part of model selection.

Each set of selected features may be considered as a subset of the input feature space, much like a random subspace ensemble, although chosen using a metric instead of randomly. We can use features chosen by feature selection methods as a type of ensemble model.

There may be many ways that this could be implemented, but perhaps two natural approaches include:

  • One Method: Generate a feature subspace for each number of features from 1 to the number of columns in the dataset, fit a model on each, and combine their predictions.
  • Multiple Methods: Generate a feature subspace using multiple different feature selection methods, fit a model on each, and combine their predictions.

For lack of a better name, we can refer to this as a “Feature Selection Subspace Ensemble.”

We will explore this idea in this tutorial.

Let’s define a test problem as the basis for this exploration and establish a baseline in performance, so we can see whether the ensemble offers a benefit over a single model.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features, five of which are redundant.

The complete example is listed below.
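A minimal sketch of this step, assuming scikit-learn is installed. The 1,000 rows, 20 input features, and five redundant features come from the description above; n_informative=15 and random_state=5 are assumed values chosen for illustration, so results computed on this data may differ from the figures quoted in this tutorial.

```python
# synthetic binary classification dataset
from sklearn.datasets import make_classification

# 1,000 rows, 20 input features, 5 of them redundant (as described above);
# n_informative=15 and random_state=5 are assumptions made for this sketch
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# summarize the shape of the input and output arrays
print(X.shape, y.shape)
```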

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can establish a baseline in performance. We will develop a decision tree for the dataset and evaluate it using repeated stratified k-fold cross-validation with three repeats and 10 folds.

The results will be reported as the mean and standard deviation of the classification accuracy across all repeats and folds.

The complete example is listed below.
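One way this baseline might be written is sketched below. The dataset arguments n_informative=15 and random_state=5, and the cross-validation seed random_state=1, are assumptions made for the sketch; the 10 folds and three repeats follow the description above.

```python
# baseline: a single decision tree evaluated with repeated stratified k-fold CV
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# the model to evaluate
model = DecisionTreeClassifier()

# 10 folds, 3 repeats -> 30 accuracy scores
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

# report mean and standard deviation of accuracy
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```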

Running the example reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a single decision tree model achieves a classification accuracy of approximately 79.4 percent. We can use this as a baseline in performance to see if our feature selection ensembles are able to achieve better performance.

Next, let’s explore using different feature selection methods as the basis for ensembles.

Single Feature Selection Method Ensembles

In this section, we will explore creating an ensemble from the features selected by individual feature selection methods.

For a given feature selection method, we will apply it repeatedly with different numbers of selected features to create multiple feature subspaces. We will then train a model on each, in this case, a decision tree, and combine the predictions.

There are many ways to combine the predictions, but to keep things simple, we will use a voting ensemble that can be configured to use hard or soft voting for classification, or averaging for regression. To keep the examples simple, we will focus on classification and use hard voting, as the decision trees do not predict calibrated probabilities, making soft voting less appropriate.
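As a small illustration of what these combination rules compute, the sketch below applies hard voting (the mode of the predicted labels) and averaging to made-up predictions from three hypothetical ensemble members. The numbers are invented for the example and do not come from any model in this tutorial.

```python
import numpy as np

# hypothetical class labels predicted by three ensemble members for four rows
member_labels = np.array([
    [0, 1, 1, 0],  # member 1
    [0, 1, 0, 0],  # member 2
    [1, 1, 1, 0],  # member 3
])

# hard voting: the most frequent label in each column (one column per row of data)
hard_vote = np.array([np.bincount(col).argmax() for col in member_labels.T])
print(hard_vote)  # [0 1 1 0]

# regression analogue: average the numeric predictions of the members
member_values = np.array([
    [2.0, 3.0],
    [4.0, 5.0],
])
print(member_values.mean(axis=0))  # [3. 4.]
```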

Each model in the voting ensemble will be a Pipeline where the first step is a feature selection method, configured to select a specific number of features, followed by a decision tree classifier model.

We will create one feature selection subspace for each number of columns in the input dataset from 1 to the number of columns. This was chosen arbitrarily for simplicity and you might want to experiment with different numbers of features in the ensemble, such as odd numbers of features, or more elaborate methods.

As such, we can define a helper function named get_ensemble() that creates a voting ensemble with feature selection-based members for a given number of input features. We can then use this function as a template to explore using different feature selection methods.
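A sketch of what such a helper might look like is given below. It is an assumed implementation, not the original listing: the step names 'fs' and 'm', the member names 'k1' to 'k20', and the use of SelectKBest with f_classif as a placeholder selector are all choices made for illustration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def get_ensemble(n_features):
    """Hard-voting ensemble with one pipeline per subspace size 1..n_features."""
    models = []
    for k in range(1, n_features + 1):
        # placeholder selector: any feature selection transform could be used here
        fs = SelectKBest(score_func=f_classif, k=k)
        # each member selects k features, then fits a decision tree on them
        pipe = Pipeline([('fs', fs), ('m', DecisionTreeClassifier())])
        models.append(('k%d' % k, pipe))
    return VotingClassifier(estimators=models, voting='hard')


# one member per number of columns in a 20-feature dataset
ensemble = get_ensemble(20)
print(len(ensemble.estimators))  # 20
```

Swapping the selector line is then the only change needed to try a different feature selection method.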

Given that we are working with a classification dataset, we will explore three different feature selection methods:

  • ANOVA F-statistic.
  • Mutual Information.
  • Recursive Feature Selection.

Let’s take a closer look at each.

ANOVA F-statistic Ensemble

ANOVA is an acronym for “analysis of variance” and is a parametric statistical hypothesis test for determining whether the means from two or more samples of data (often three or more) come from the same distribution or not.

An F-statistic, or F-test, is a class of statistical tests that calculate the ratio between variance values, such as the variance from two different samples or the explained and unexplained variance by a statistical test, like ANOVA. The ANOVA method is a type of F-statistic referred to here as an ANOVA F-test.

The scikit-learn machine learning library provides an implementation of the ANOVA F-test in the f_classif() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.
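A minimal sketch of this selection step on a synthetic dataset is shown below. The value k=5 and the dataset arguments n_informative=15 and random_state=5 are arbitrary assumptions. The sketch also prints the indices of the columns that were kept, which is one way to inspect what a fitted selector chose.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# F-statistic and p-value for every input column
f_scores, p_values = f_classif(X, y)

# keep the 5 columns with the largest F-statistic (k=5 is arbitrary)
fs = SelectKBest(score_func=f_classif, k=5)
X_sel = fs.fit_transform(X, y)

print(X_sel.shape)
# indices of the selected columns in the original dataset
print(fs.get_support(indices=True))
```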

Tying this together, the example below evaluates a voting ensemble composed of models fit on feature subspaces selected by the ANOVA F-statistic.
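The sketch below is one way such an example might look. It is an assumed reconstruction rather than the original listing: the dataset arguments n_informative=15 and random_state=5, the cross-validation seed, and the member names are illustrative choices, so the reported accuracy may not match the figure quoted below.

```python
# voting ensemble of decision trees fit on ANOVA F-statistic feature subspaces
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def get_ensemble(n_features):
    """One (SelectKBest -> decision tree) pipeline per subspace size 1..n_features."""
    models = []
    for k in range(1, n_features + 1):
        fs = SelectKBest(score_func=f_classif, k=k)
        pipe = Pipeline([('fs', fs), ('m', DecisionTreeClassifier())])
        models.append(('anova%d' % k, pipe))
    return VotingClassifier(estimators=models, voting='hard')


# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# one ensemble member per number of input columns
ensemble = get_ensemble(X.shape[1])

# repeated stratified 10-fold cross-validation with 3 repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```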

Running the example reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a lift in performance over a single model that achieved an accuracy of about 79.4 percent to about 83.2 percent using an ensemble of models on features selected by the ANOVA F-statistic.

Next, let’s explore using mutual information.

Mutual Information Ensemble

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable. It is straightforward when considering the distribution of two discrete (categorical or ordinal) variables, such as categorical input and categorical output data. Nevertheless, it can be adapted for use with numerical input and categorical output.

The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numeric input and categorical output variables via the mutual_info_classif() function. Like f_classif(), it can be used in the SelectKBest feature selection strategy (and other strategies).
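A minimal sketch of mutual information used for selection is shown below, with k=5 and the dataset arguments as arbitrary assumptions. The estimator adds a small amount of random noise to continuous inputs, so the sketch fixes its random_state through functools.partial to make the selection repeatable.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# estimated mutual information between each column and the class label
mi_scores = mutual_info_classif(X, y, random_state=1)

# wrap the score function so the same seed is used inside SelectKBest
score_func = partial(mutual_info_classif, random_state=1)
fs = SelectKBest(score_func=score_func, k=5)
X_sel = fs.fit_transform(X, y)

print(X_sel.shape)
# indices of the selected columns in the original dataset
print(fs.get_support(indices=True))
```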

Tying this together, the example below evaluates a voting ensemble composed of models fit on feature subspaces selected by mutual information.
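A sketch of how this might be written follows. It is an assumed reconstruction, not the original listing. Mutual information is comparatively slow to compute, so to keep the run time short this sketch uses a single 5-fold stratified cross-validation; replacing it with RepeatedStratifiedKFold(n_splits=10, n_repeats=3) would match the evaluation procedure described earlier. The dataset arguments and seeds are also assumptions.

```python
# voting ensemble of decision trees fit on mutual information feature subspaces
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def get_ensemble(n_features):
    """One (SelectKBest -> decision tree) pipeline per subspace size 1..n_features."""
    models = []
    for k in range(1, n_features + 1):
        fs = SelectKBest(score_func=mutual_info_classif, k=k)
        pipe = Pipeline([('fs', fs), ('m', DecisionTreeClassifier())])
        models.append(('mutinfo%d' % k, pipe))
    return VotingClassifier(estimators=models, voting='hard')


# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

ensemble = get_ensemble(X.shape[1])

# reduced evaluation (assumption): 5 folds, no repeats, to limit run time
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```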

Running the example reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a lift in performance over using a single model, with a mean accuracy of about 82.7 percent, although this is slightly lower than the result for the feature subspaces selected with the ANOVA F-statistic.

Next, let’s explore subspaces selected using RFE.

Recursive Feature Selection Ensemble

Recursive Feature Elimination, or RFE for short, works by searching for a subset of features by starting with all features in the training dataset and successively removing features until the desired number remains.

This is achieved by fitting the given machine learning algorithm used in the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.

The RFE method is available via the RFE class in scikit-learn and can be used for feature selection directly; there is no need to combine it with the SelectKBest class.
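A minimal sketch of RFE used on its own is shown below. Wrapping a decision tree, selecting five features, and the seeds are all assumptions made for illustration. The fitted object exposes support_ (a mask of kept columns) and ranking_ (1 for kept columns, larger values for columns eliminated earlier).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# eliminate one feature at a time until 5 remain (5 is arbitrary)
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1),
          n_features_to_select=5)
rfe.fit(X, y)

# boolean mask of the selected columns and the elimination ranking
print(rfe.support_)
print(rfe.ranking_)

# reduce the dataset to the selected columns
X_sel = rfe.transform(X)
print(X_sel.shape)
```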

Tying this together, the example below evaluates a voting ensemble composed of models fit on feature subspaces selected by RFE.
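One possible form of this example is sketched below; it is an assumed reconstruction, not the original listing. RFE refits the tree many times per member, so this sketch uses a single 5-fold stratified cross-validation to limit run time; RepeatedStratifiedKFold(n_splits=10, n_repeats=3) would match the procedure described earlier. The dataset arguments and seeds are assumptions.

```python
# voting ensemble of decision trees fit on RFE-selected feature subspaces
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def get_ensemble(n_features):
    """One (RFE -> decision tree) pipeline per subspace size 1..n_features."""
    models = []
    for k in range(1, n_features + 1):
        # RFE ranks features with its own decision tree
        fs = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=k)
        pipe = Pipeline([('fs', fs), ('m', DecisionTreeClassifier())])
        models.append(('rfe%d' % k, pipe))
    return VotingClassifier(estimators=models, voting='hard')


# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

ensemble = get_ensemble(X.shape[1])

# reduced evaluation (assumption): 5 folds, no repeats, to limit run time
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```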

Running the example reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the mean accuracy is similar to that seen with mutual information feature selection, with a score of about 82.3 percent.

This is a good start, and it might be interesting to see if better results can be achieved using ensembles composed of fewer members, e.g. every second, third, or fifth number of selected features.

Next, let’s see if we can improve results by combining models fit on feature subspaces selected by different feature selection methods.

Combined Feature Selection Ensembles

In the previous section, we saw that we can get a lift in performance over a single model by using a single feature selection method as the basis of an ensemble prediction for a dataset.

We would expect the predictions of many of the ensemble members to be correlated. This could be addressed by using a sparser set of feature counts as the basis for the ensemble (for example, every second or fifth value), rather than every number of features from 1 to the number of columns.

An alternative approach to introducing diversity is to select feature subspaces using different feature selection methods.

We will explore two versions of this approach. With the first, we will select the same number of features from each method, and with the second, we will select a contiguous number of features from 1 to the number of columns for multiple methods.

Ensemble With Fixed Number of Features

In this section, we will make our first attempt at devising an ensemble using features selected by multiple feature selection techniques.

We will select an arbitrary number of features from the dataset, then use each of the three feature selection methods to select a feature subspace, fit a model on each, and use them as the basis for a voting ensemble.

The get_ensemble() function below implements this, taking the specified number of features to select with each method as an argument. The hope is that the features selected by each method are sufficiently different and sufficiently skillful to result in an effective ensemble.

Tying this together, the example below evaluates an ensemble of a fixed number of features selected using different feature selection methods.
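A sketch of such an example is given below as an assumed reconstruction. The choice of 15 features per method is arbitrary, and the dataset arguments, seeds, and member names are likewise assumptions, so the accuracy it prints may differ from the figure quoted below.

```python
# ensemble of three members, each using a different feature selection method
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def get_ensemble(n_features):
    """Three (selector -> decision tree) pipelines that each keep n_features columns."""
    selectors = [
        ('anova', SelectKBest(score_func=f_classif, k=n_features)),
        ('mutinfo', SelectKBest(score_func=mutual_info_classif, k=n_features)),
        ('rfe', RFE(estimator=DecisionTreeClassifier(),
                    n_features_to_select=n_features)),
    ]
    models = []
    for name, fs in selectors:
        pipe = Pipeline([('fs', fs), ('m', DecisionTreeClassifier())])
        models.append((name, pipe))
    return VotingClassifier(estimators=models, voting='hard')


# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# 15 selected features per method is an arbitrary choice
ensemble = get_ensemble(15)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```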

Running the example reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a modest lift in performance over the techniques considered in the previous section, resulting in a mean classification accuracy of about 83.9 percent.

A fairer comparison might be to compare this result to each individual model that comprises the ensemble.

The updated example performs exactly this comparison.
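One way to set up this comparison is sketched below. The helper name get_models(), the choice of 15 features, the dataset arguments, and the seeds are assumptions made for illustration. Each member pipeline is evaluated on its own, followed by the voting ensemble built from the same three pipelines.

```python
# compare each single (selector -> tree) pipeline against the combined ensemble
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def get_models(n_features):
    """Return the three member pipelines plus the ensemble, keyed by name."""
    selectors = [
        ('anova', SelectKBest(score_func=f_classif, k=n_features)),
        ('mutinfo', SelectKBest(score_func=mutual_info_classif, k=n_features)),
        ('rfe', RFE(estimator=DecisionTreeClassifier(),
                    n_features_to_select=n_features)),
    ]
    members = [(name, Pipeline([('fs', fs), ('m', DecisionTreeClassifier())]))
               for name, fs in selectors]
    models = dict(members)
    # the ensemble is added last so it is evaluated and reported last
    models['ensemble'] = VotingClassifier(estimators=members, voting='hard')
    return models


# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
results = {}
for name, model in get_models(15).items():
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    results[name] = scores
    print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))
```

To draw box and whisker plots of the four score distributions, the arrays in results can be passed to matplotlib's pyplot.boxplot(); that call is left out here to keep the sketch free of a plotting dependency.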

Running the example reports the mean performance of each single model fit on the selected features and ends with the performance of the ensemble that combines all three models.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that the ensemble of the models fit on the selected features performs better than any single model in the ensemble, as we might hope.

A figure is created to show box and whisker plots for each set of results, allowing the distributions of accuracy scores to be compared directly.

We can see that the distribution for the ensemble both skews higher and has a larger median classification accuracy (orange line), visually confirming the finding.

Box and Whisker Plots of Accuracy of Single Models Fit on Selected Features vs. Ensemble

Next, let’s explore adding multiple members for each feature selection method.

Ensemble With Contiguous Number of Features

We can combine the experiments from the previous section with the above experiment.

Specifically, we can select multiple feature subspaces using each feature selection method, fit a model on each, and add all of the models to a single ensemble.

We will select subspaces as we did in the previous section, with one subspace for each number of features from 1 to the number of columns in the dataset, but this time we will repeat the process with each feature selection method.

The hope is that the diversity of the selected features across the feature selection methods results in a further lift in ensemble performance.

Tying this together, the complete example is listed below.
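A sketch of the combined ensemble is given below as an assumed reconstruction, not the original listing. With 20 columns and three methods the ensemble has 60 members, which is slow to fit, so this sketch uses a single 3-fold stratified cross-validation; RepeatedStratifiedKFold(n_splits=10, n_repeats=3) would match the procedure described earlier at a much higher run time. The dataset arguments, seeds, and member names are assumptions.

```python
# ensemble with one member per (feature selection method, number of features)
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def get_ensemble(n_features):
    """Pipelines for k = 1..n_features with each of the three selection methods."""
    models = []
    for k in range(1, n_features + 1):
        # fresh selector objects for every subspace size
        selectors = {
            'anova': SelectKBest(score_func=f_classif, k=k),
            'mutinfo': SelectKBest(score_func=mutual_info_classif, k=k),
            'rfe': RFE(estimator=DecisionTreeClassifier(),
                       n_features_to_select=k),
        }
        for name, fs in selectors.items():
            pipe = Pipeline([('fs', fs), ('m', DecisionTreeClassifier())])
            models.append(('%s%d' % (name, k), pipe))
    return VotingClassifier(estimators=models, voting='hard')


# dataset (n_informative and random_state are assumed values)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# 3 methods x 20 subspace sizes = 60 members
ensemble = get_ensemble(X.shape[1])

# reduced evaluation (assumption): 3 folds, no repeats, to limit run time
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```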

Running the example reports the mean and standard deviation classification accuracy of the ensemble.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a further lift of performance as we hoped, where the combined ensemble resulted in a mean classification accuracy of about 86.0 percent.

The use of feature selection for selecting subspaces of input features may provide an interesting alternative or perhaps complement to selecting random subspaces.

Summary

In this tutorial, you discovered how to develop feature selection subspace ensembles with Python.

Specifically, you learned:

  • Feature selection provides an alternative to random subspaces for selecting groups of input features.
  • How to develop and evaluate ensembles composed of features selected by single feature selection techniques.
  • How to develop and evaluate ensembles composed of features selected by multiple different feature selection techniques.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

27 Responses to How to Develop a Feature Selection Subspace Ensemble in Python

  1. Bartosz November 21, 2020 at 9:47 pm #

    Hey, thanks for that knowledge. However, I’ve got some questions. You wrote:

    “We can see that the distribution for the ensemble both skews higher and has a larger median classification accuracy (orange line), visually confirming the finding”.

    What is the median of accuracy? Is it the median of the accuracies from the different subsets created by cross-validation for one method? How should the skewness be interpreted in that case? What should the best case look like?

  2. Igors Papka November 22, 2020 at 5:49 am #

    Dear Dr. Jason,
    Thank you for the tutorial.
    You wrote: “There are many ways to combine the predictions…”
    Can you list some of them and their implementation in Python? So far I have found and tried only one other: StackingClassifier (and StackingRegressor).

    • Jason Brownlee November 22, 2020 at 7:00 am #

      Yes, I have some posts on this topic scheduled. Stay tuned.

  3. Imran December 6, 2020 at 2:20 am #

    Thank you very much for the detailed treatment of this topic along with code snippets. Learned a lot.

  4. Mojtaba February 26, 2021 at 2:42 am #

    Hey Dr. Jason,
    Thank you for this useful tutorial.
    I have a question:
    How can I see which features each model in the ensemble selected?

    • Jason Brownlee February 26, 2021 at 5:02 am #

      You’re welcome.

      You can summarize each model separately if you like. What problem are you having precisely?

  5. Mojtaba February 27, 2021 at 1:58 am #

    Thank you.
    My main question: Is there a function in the ensemble methods (scikit-learn) that shows the selected features in each estimator separately, or does this function have to be defined manually?

    • Jason Brownlee February 27, 2021 at 6:05 am #

      You can interrogate your fit models in order to find out how features were ranked, but why do you need to know?

      • Mojtaba March 1, 2021 at 5:17 am #

        I want to use different models to get different subsets, vote between them, and then get the final subset.
        The goal is to have a reliable subset to use in any ML algorithm.

        • Jason Brownlee March 1, 2021 at 5:41 am #

          Interesting, let me know how you go.

          • Mojtaba March 1, 2021 at 8:13 pm #

            Of course, I will email the results to you.

        • Sk September 20, 2021 at 8:20 pm #

          Hi… did you work out how to extract the feature subset from the ensemble model?
          I am also working on this ensemble feature selection method but am stuck on extracting the final feature subset.
          If you have successfully extracted the features from the ensemble, then please let me know how it went.

          Thanks

  6. Hugo Souza March 23, 2021 at 6:04 am #

    Hi Jason!

    Once again congratulations on the work and thanks for the posts!

    I have two questions:
    Can I add more classifier models? Like SVM and Random Forest?
    I tried in this same example of yours to introduce chi-squared as follows:
    fs = SelectKBest(score_func=chi2, k=n_features)
    chi2 = Pipeline([('fs', fs), ('m', DecisionTreeClassifier())])
    models.append(('chi2', chi2))
    names.append('chi2')
    But it is giving the following error:

    UnboundLocalError: local variable 'chi2' referenced before assignment

    What am I doing wrong? I already declared it in the return too

    • Jason Brownlee March 24, 2021 at 5:43 am #

      Thanks.

      You must have all data preparation occur first, then have one model make predictions.

  7. shashank kumar singh October 18, 2021 at 3:10 pm #

    Hello Dr. Jason,

    Thanks for the wonderful information given in all of your posts.

    Regarding this post, can you please tell me how to list the names of the features selected in the ensemble?

    • Adrian Tam October 20, 2021 at 9:42 am #

      If you can get hold into the fitted feature selector, fs.get_feature_names_out() will print you the names.

      • Esther December 11, 2021 at 10:24 am #

        In response to:

        “If you can get hold into the fitted feature selector, fs.get_feature_names_out() will print you the names.”

        Adrian, any idea as to how to do this? I’ve been unsuccessful :/

        • Adrian Tam December 15, 2021 at 5:47 am #

          Any error message?

  8. Esther December 17, 2021 at 3:55 am #

    NameError: name 'fs' is not defined

    • Adrian Tam December 17, 2021 at 7:31 am #

      fs should be a feature selection transform, e.g., RFE object

    • James Carmichael December 21, 2021 at 12:02 pm #

      Hi Esther…Please provide the exact code for which you are receiving this error.

      Regards,

  9. Esther December 18, 2021 at 12:23 pm #

    Since it produced the highest accuracy (86%), I wanted to get the list of features that the ensemble selected for the code under the “Ensemble With Contiguous Number of Features” section of this tutorial. I just don’t know where to insert the line “fs.get_feature_names_out()” into the code block.

  10. Felipe de Morais September 20, 2022 at 5:20 am #

    Dear Dr. Brownlee.
    First of all, thank you for the rich post about feature selection ensembles.
    I really appreciate all your content about machine learning.
    I have a question about feature selection methods. Is it possible to aggregate feature selection methods?
    For example, suppose I have almost 300 features. Could I use a SelectKBest model with f_classif to select the 100 best features, then use a SequentialFeatureSelector to select the 10 best features from these 100? What do you think about this approach?
    Once again, congratulations on your posts and thank you for your help.
    Sincerely, Felipe.

    • James Carmichael September 20, 2022 at 9:39 am #

      Hi Felipe…You are very welcome! I see no issue with your suggestion. Please proceed with it and let us know your findings.
