Feature Selection with Stochastic Optimization Algorithms

Typically, a simpler and better-performing machine learning model can be developed by removing input features (columns) from the training dataset.

This is called feature selection and there are many different types of algorithms that can be used.

It is possible to frame the problem of feature selection as an optimization problem. In the case that there are few input features, all possible combinations of input features can be evaluated and the best subset found definitively. In the case of a vast number of input features, a stochastic optimization algorithm can be used to explore the search space and find an effective subset of features.

In this tutorial, you will discover how to use optimization algorithms for feature selection in machine learning.

After completing this tutorial, you will know:

  • The problem of feature selection can be broadly defined as an optimization problem.
  • How to enumerate all possible subsets of input features for a dataset.
  • How to apply stochastic optimization to select an optimal subset of input features.

Let’s get started.

How to Use Optimization for Feature Selection

How to Use Optimization for Feature Selection
Photo by Gregory “Slobirdr” Smith, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Optimization for Feature Selection
  2. Enumerate All Feature Subsets
  3. Optimize Feature Subsets

Optimization for Feature Selection

Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. There are many different types of feature selection algorithms, although they can broadly be grouped into two main types: wrapper and filter methods.

Wrapper feature selection methods create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.

Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.

  • Wrapper Feature Selection: Search for well-performing subsets of features.
  • Filter Feature Selection: Select subsets of features based on their relationship with the target.

For more on choosing feature selection algorithms, see the tutorial:

A popular wrapper method is the Recursive Feature Elimination, or RFE, algorithm.

RFE works by searching for a subset of features by starting with all features in the training dataset and successfully removing features until the desired number remains.

This is achieved by fitting the given machine learning algorithm used in the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.

For more on RFE, see the tutorial:

The problem of wrapper feature selection can be framed as an optimization problem. That is, find a subset of input features that result in the best model performance.

RFE is one approach to solving this problem systematically, although it may be limited by a large number of features.

An alternative approach would be to use a stochastic optimization algorithm, such as a stochastic hill climbing algorithm, when the number of features is very large. When the number of features is relatively small, it may be possible to enumerate all possible subsets of features.

  • Few Input Variables: Enumerate all possible subsets of features.
  • Many Input Features: Stochastic optimization algorithm to find good subsets of features.

Now that we are familiar with the idea that feature selection may be explored as an optimization problem, let’s look at how we might enumerate all possible feature subsets.

Enumerate All Feature Subsets

When the number of input variables is relatively small and the model evaluation is relatively fast, then it may be possible to enumerate all possible subsets of input variables.

This means evaluating the performance of a model using a test harness given every possible unique group of input variables.

We will explore how to do this with a worked example.

First, let’s define a small binary classification dataset with few input features. We can use the make_classification() function to define a dataset with five input variables, two of which are informative, and 1,000 rows.

The example below defines the dataset and summarizes its shape.

Running the example creates the dataset and confirms that it has the desired shape.

Next, we can establish a baseline in performance using a model evaluated on the entire dataset.

We will use a DecisionTreeClassifier as the model because its performance is quite sensitive to the choice of input variables.

We will evaluate the model using good practices, such as repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.

Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 80.5 percent.

Next, we can try to improve model performance by using a subset of the input features.

First, we must choose a representation to enumerate.

In this case, we will enumerate a list of boolean values, with one value for each input feature: True if the feature is to be used and False if the feature is not to be used as input.

For example, with the five input features the sequence [True, True, True, True, True] would use all input features, and [True, False, False, False, False] would only use the first input feature as input.

We can enumerate all sequences of boolean values with the length=5 using the product() Python function. We must specify the valid values [True, False] and the number of steps in the sequence, which is equal to the number of input variables.

The function returns an iterable that we can enumerate directly for each sequence.

For a given sequence of boolean values, we can enumerate it and transform it into a sequence of column indexes for each True in the sequence.

If the sequence has no column indexes (in the case of all False values), then we can skip that sequence.

We can then use the column indexes to choose the columns in the dataset.

And this subset of the dataset can then be evaluated as we did before.

If the accuracy for the model is better than the best sequence found so far, we can store it.

And that’s it.

Tying this together, the complete example of feature selection by enumerating all possible feature subsets is listed below.

Running the example reports the mean classification accuracy of the model for each subset of features considered. The best subset is then reported at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best subset of features involved features at indexes [2, 3, 4] that resulted in a mean classification accuracy of about 83.0 percent, which is better than the result reported previously using all input features.

Now that we know how to enumerate all possible feature subsets, let’s look at how we might use a stochastic optimization algorithm to choose a subset of features.

Optimize Feature Subsets

We can apply a stochastic optimization algorithm to the search space of subsets of input features.

First, let’s define a larger problem that has many more features, making model evaluation too slow and the search space too large for enumerating all subsets.

We will define a classification problem with 10,000 rows and 500 input features, 10 of which are relevant and the remaining 490 are redundant.

Running the example creates the dataset and confirms that it has the desired shape.

We can establish a baseline in performance by evaluating a model on the dataset with all input features.

Because the dataset is large and the model is slow to evaluate, we will modify the evaluation of the model to use 3-fold cross-validation, e.g. fewer folds and no repeats.

The complete example is listed below.

Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 91.3 percent.

This provides a baseline that we would expect to outperform using feature selection.

We will use a simple stochastic hill climbing algorithm as the optimization algorithm.

First, we must define the objective function. It will take the dataset and a subset of features to use as input and return an estimated model accuracy from 0 (worst) to 1 (best). It is a maximizing optimization problem.

This objective function is simply the decoding of the sequence and model evaluation step from the previous section.

The objective() function below implements this and returns both the score and the decoded subset of columns used for helpful reporting.

We also need a function that can take a step in the search space.

Given an existing solution, it must modify it and return a new solution in close proximity. In this case, we will achieve this by randomly flipping the inclusion/exclusion of columns in subsequence.

Each position in the sequence will be considered independently and will be flipped probabilistically where the probability of flipping is a hyperparameter.

The mutate() function below implements this given a candidate solution (sequence of booleans) and a mutation hyperparameter, creating and returning a modified solution (a step in the search space).

The larger the p_mutate value (in the range 0 to 1), the larger the step in the search space.

We can now implement the hill climbing algorithm.

The initial solution is a randomly generated sequence, which is then evaluated.

We then loop for a fixed number of iterations, creating mutated versions of the current solution, evaluating them, and saving them if the score is better.

The hillclimbing() function below implements this, taking the dataset, objective function, and hyperparameters as arguments and returns the best subset of dataset columns and the estimated performance of the model.

We can then call this function and pass in our synthetic dataset to perform optimization for feature selection.

In this case, we will run the algorithm for 100 iterations and make about five flips to the sequence for a given mutation, which is quite conservative.

At the end of the run, we will convert the boolean sequence into column indexes (so we could fit a final model if we wanted) and report the performance of the best subsequence.

Tying this all together, the complete example is listed below.

Running the example reports the mean classification accuracy of the model for each subset of features considered. The best subset is then reported at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best performance was achieved with a subset of 239 features and a classification accuracy of approximately 91.8 percent.

This is better than a model evaluated on all input features.

Although the result is better, we know we can do a lot better, perhaps with tuning of the hyperparameters of the optimization algorithm or perhaps by using an alternate optimization algorithm.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Summary

In this tutorial, you discovered how to use optimization algorithms for feature selection in machine learning.

Specifically, you learned:

  • The problem of feature selection can be broadly defined as an optimization problem.
  • How to enumerate all possible subsets of input features for a dataset.
  • How to apply stochastic optimization to select an optimal subset of input features.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

17 Responses to Feature Selection with Stochastic Optimization Algorithms

  1. fabou December 25, 2020 at 5:38 am #

    Nice Chistmas post. Eager to try what is shown.

    • Jason Brownlee December 25, 2020 at 5:45 am #

      Thanks!

      Let me know how you go with it.

      • Fabrice BOUCAHREL December 31, 2020 at 10:36 pm #

        After a while, I have decided to use Optimization for grouping categories from a categorical feature. I am working on it.
        Its seems to me the mutation step is quite important. I chose to mutate a category grouping in 3 random ways : merge 2 groups, split 1 group or put a category from one group to another. This has to be down for all the categorical features in the dataset.
        The evaluation has of course to be down using cross validation.

  2. Saber Abid December 25, 2020 at 6:14 am #

    Thank you Jason for this article which is very relevant,
    I allow myself to ask a question, indeed
    if we would like to model this problem and associate it with a mathematical model, what will be the constraints and the objective function?

    thanks again

    Saber

    • Keith December 25, 2020 at 7:42 pm #

      For an objective function, see the paper entitled Feature importance ranking for deep learning presented in NeurIPS 2020.

    • Jason Brownlee December 26, 2020 at 5:01 am #

      You’re welcome.

      Not sure what you mean exactly. The objective function is the evaluation of a model on a dataset.

  3. Jhon Connor December 25, 2020 at 7:41 am #

    Is there any article in which a pseudo code of the algorithm is shown?

  4. Van Tai December 25, 2020 at 11:32 pm #

    Nice work. Thanks for that and merry Christmas!

  5. marco December 26, 2020 at 4:34 am #

    Hi Jason,
    merry Christmas.
    I’ ve a question about for scatterplot matrix.
    I’m wondering if it is applicable to classification or regression?
    Thanks

    • Jason Brownlee December 26, 2020 at 5:13 am #

      Yes.

      You can create pair-wise scatter plots of any data you like, e.g. pairs of input variables for regression or classification datasets.

  6. marco December 26, 2020 at 4:35 am #

    Another question,
    I’ve seen a lot of enhancements in new version of sklearn (0.24) for HistGradientBoostingClassifier.
    What are major differences between XGBoost and HistGradientBoostingClassifier?
    Do you have any example for HistGradientBoostingClassifier?
    Is better to use XGBoost or HistGradientBoostingClassifier?
    Thanks
    Marco

  7. Md Sajid December 26, 2020 at 7:27 pm #

    Thanks For sharing , very informative article

Leave a Reply