Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

Filter-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and output variables.

As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.

In this post, you will discover how to choose statistical measures for filter-based feature selection with numerical and categorical data.

After reading this post, you will know:

- There are two main types of feature selection techniques: wrapper and filter methods.
- Filter-based feature selection methods use statistical measures to score the correlation or dependence between each input variable and the target variable, and these scores can be used to filter the inputs down to the most relevant features.
- Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

Let’s get started.

**Update Nov/2019**: Added some worked examples for classification and regression.

## Overview

This tutorial is divided into 4 parts; they are:

- Feature Selection Methods
- Statistics for Filter Feature Selection Methods
    - Numerical Input, Numerical Output
    - Numerical Input, Categorical Output
    - Categorical Input, Numerical Output
    - Categorical Input, Categorical Output
- Tips and Tricks for Feature Selection
    - Correlation Statistics
    - Selection Method
    - Transform Variables
    - What Is the Best Method?
- Worked Examples
    - Regression Feature Selection
    - Classification Feature Selection

## 1. Feature Selection Methods

Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable.

Some predictive modeling problems have a large number of variables that can slow the development and training of models and require a large amount of system memory. Additionally, the performance of some models can degrade when including input variables that are not relevant to the target variable.

There are two main types of feature selection algorithms: wrapper methods and filter methods.

- Wrapper Feature Selection Methods.
- Filter Feature Selection Methods.

**Wrapper feature selection methods** create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.

— Page 490, Applied Predictive Modeling, 2013.
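As a rough illustration of the wrapper idea, the sketch below runs RFE around a logistic regression model. The choice of model, the dataset sizes, and the number of features to keep are all assumptions made for the example, not recommendations.

```python
# wrapper-style feature selection with RFE (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# generate a synthetic dataset with 10 inputs, 3 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)
# repeatedly fit the model and drop the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_selected = rfe.fit_transform(X, y)
print(X_selected.shape)
# boolean mask of the features that were kept
print(rfe.support_)
```

Note that the model is fit many times here, which is why wrapper methods tend to be slower than filter methods.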

**Filter feature selection methods** use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.

Filter methods evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion.

— Page 490, Applied Predictive Modeling, 2013.

It is common to use correlation type statistical measures between input and output variables as the basis for filter feature selection. As such, the choice of statistical measures is highly dependent upon the variable data types.

Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided such as integer and floating point for numerical variables, and boolean, ordinal, or nominal for categorical variables.

Common input variable data types:

- **Numerical Variables**
    - Integer Variables.
    - Floating Point Variables.
- **Categorical Variables**
    - Boolean Variables (dichotomous).
    - Ordinal Variables.
    - Nominal Variables.

The more that is known about the data type of a variable, the easier it is to choose an appropriate statistical measure for a filter-based feature selection method.

In the next section, we will review some of the statistical measures that may be used for filter-based feature selection with different input and output variable data types.

## 2. Statistics for Filter-Based Feature Selection Methods

In this section, we will consider two broad categories of variable types, numerical and categorical, and two main groups of variables, input and output.

Input variables are those that are provided as input to a model. In feature selection, it is this group of variables that we wish to reduce in size. Output variables are those that a model is intended to predict, often called the response variable.

The type of response variable typically indicates the type of predictive modeling problem being performed. For example, a numerical output variable indicates a regression predictive modeling problem, and a categorical output variable indicates a classification predictive modeling problem.

- **Numerical Output**: Regression predictive modeling problem.
- **Categorical Output**: Classification predictive modeling problem.

The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures. This may mean that any interaction between input variables is not considered in the filtering process.

Most of these techniques are univariate, meaning that they evaluate each predictor in isolation. In this case, the existence of correlated predictors makes it possible to select important, but redundant, predictors. The obvious consequences of this issue are that too many predictors are chosen and, as a result, collinearity problems arise.

— Page 499, Applied Predictive Modeling, 2013.
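To make the redundancy issue concrete, here is a small sketch on made-up data. The third column is a near copy of the first, and a univariate filter scores each column on its own, so it has no way to notice the duplication. The variable names and noise levels are assumptions for illustration only.

```python
# univariate filters can select redundant features (illustrative sketch)
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
# build two independent inputs and a target that depends mostly on the first
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
y = 3.0 * x0 + 1.0 * x1 + rng.normal(scale=0.1, size=200)
# column 2 is an almost exact copy of column 0
X = np.column_stack([x0, x1, x0 + rng.normal(scale=0.01, size=200)])
# keep the two highest scoring columns
fs = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = fs.get_support(indices=True)
print(selected)
```

Both copies of the strong feature are kept and the weaker but genuinely different input is dropped.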

With this framework, let’s review some univariate statistical measures that can be used for filter-based feature selection.

### Numerical Input, Numerical Output

This is a regression predictive modeling problem with numerical input variables.

The most common techniques are to use a correlation coefficient, such as Pearson’s for a linear correlation, or rank-based methods for a nonlinear correlation.

- Pearson’s correlation coefficient (linear).
- Spearman’s rank coefficient (nonlinear)
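The difference between the two coefficients can be seen on a synthetic relationship that is monotonic but far from linear. This is only a sketch; the exponential relationship and the sample size are assumptions chosen to make the contrast visible.

```python
# compare pearson's and spearman's correlation on a nonlinear relationship
import numpy as np
from scipy.stats import pearsonr, spearmanr
# generate an input and a monotonic, nonlinear output
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = np.exp(x)
# calculate both coefficients (the second returned value is the p-value)
pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)
print('Pearson: %.3f' % pearson_r)
print('Spearman: %.3f' % spearman_r)
```

Spearman's works on ranks, so it reports a perfect relationship here, whereas Pearson's is pulled below 1 by the curvature.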

### Numerical Input, Categorical Output

This is a classification predictive modeling problem with numerical input variables.

This might be the most common example of a classification problem.

Again, the most common techniques are correlation based, although in this case, they must take the categorical target into account.

- ANOVA F-statistic (linear).
- Kendall’s rank coefficient (nonlinear).

Kendall does assume that the categorical variable is ordinal.
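A minimal sketch of both measures on made-up data is shown below. It assumes a three-level ordinal class label, one input that shifts with the class, and one input that is pure noise; none of these come from a real dataset.

```python
# ANOVA F-statistic and kendall's tau for numeric input, categorical output
import numpy as np
from scipy.stats import kendalltau
from sklearn.feature_selection import f_classif
# ordinal class label with three levels, 50 rows each
rng = np.random.default_rng(2)
y = np.repeat([0, 1, 2], 50)
# one input whose mean shifts with the class, one that ignores it
informative = y + rng.normal(scale=0.5, size=150)
noise = rng.normal(size=150)
X = np.column_stack([informative, noise])
# ANOVA F-statistic for each column against the class label
f_scores, p_values = f_classif(X, y)
print(f_scores)
# kendall's tau treats the class label as ordered
tau_inf, _ = kendalltau(informative, y)
tau_noise, _ = kendalltau(noise, y)
print('tau informative=%.3f, tau noise=%.3f' % (tau_inf, tau_noise))
```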

### Categorical Input, Numerical Output

This is a regression predictive modeling problem with categorical input variables.

This is a strange example of a regression problem (i.e. you would not encounter it often).

Nevertheless, you can use the same “*Numerical Input, Categorical Output*” methods (described above), but in reverse.
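One way to read "in reverse" is to group the numerical target by the levels of the categorical input and run a one-way ANOVA across the groups. The sketch below does this with SciPy's f_oneway() on synthetic data; the category labels and group means are assumptions for illustration.

```python
# one-way ANOVA for a categorical input and a numerical output
import numpy as np
from scipy.stats import f_oneway
# categorical input with three levels, 40 rows each
rng = np.random.default_rng(3)
category = np.repeat(['a', 'b', 'c'], 40)
# numerical target whose mean depends on the category
y = np.concatenate([rng.normal(loc=m, scale=1.0, size=40) for m in (0.0, 2.0, 4.0)])
# split the target into one group per category level
groups = [y[category == level] for level in np.unique(category)]
# a large F-statistic suggests the category is related to the target
f_stat, p_value = f_oneway(*groups)
print('F=%.3f, p=%.3g' % (f_stat, p_value))
```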

### Categorical Input, Categorical Output

This is a classification predictive modeling problem with categorical input variables.

The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.

- Chi-Squared test (contingency tables).
- Mutual Information.

In fact, mutual information is a powerful method that may prove useful for both categorical and numerical data, i.e. it is agnostic to the data types.
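The sketch below scores two made-up categorical inputs against a binary class label with both measures. The column meanings (a colour that depends on the class and a shape that does not) are hypothetical. The inputs are ordinal encoded first, which is reasonable here because each has only two levels; with more levels, a one-hot encoding may be a safer choice for chi-squared.

```python
# chi-squared and mutual information for categorical input and output
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import chi2, mutual_info_classif
# binary class label
rng = np.random.default_rng(4)
n = 300
y = rng.integers(0, 2, size=n)
# 'colour' depends on the class, 'shape' is independent of it
colour = np.where(y == 1,
                  rng.choice(['red', 'blue'], size=n, p=[0.9, 0.1]),
                  rng.choice(['red', 'blue'], size=n, p=[0.1, 0.9]))
shape = rng.choice(['round', 'square'], size=n)
X_raw = np.column_stack([colour, shape])
# encode the string categories as integers
X = OrdinalEncoder().fit_transform(X_raw)
# score each column against the class label
chi_scores, chi_p = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(chi_scores)
print(mi_scores)
```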

## 3. Tips and Tricks for Feature Selection

This section provides some additional considerations when using filter-based feature selection.

### Correlation Statistics

The scikit-learn library provides an implementation of most of the useful statistical measures.

For example:

- Pearson’s Correlation Coefficient: f_regression()
- ANOVA: f_classif()
- Chi-Squared: chi2()
- Mutual Information: mutual_info_classif() and mutual_info_regression()

Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau (kendalltau) and Spearman’s rank correlation (spearmanr).

### Selection Method

The scikit-learn library also provides many different filtering methods once statistics have been calculated for each input variable with the target.

Two of the more popular methods include:

- Select the top k variables: SelectKBest
- Select the top percentile variables: SelectPercentile

I often use *SelectKBest* myself.
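For comparison, here is a short sketch of SelectPercentile on the same kind of synthetic classification data used later in this tutorial. The percentile value is an arbitrary assumption; keeping the top 10 percent of 20 features leaves 2.

```python
# select the top 10 percent of features by ANOVA F-statistic
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
# generate dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2, random_state=1)
# define feature selection
fs = SelectPercentile(score_func=f_classif, percentile=10)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```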

### Transform Variables

Consider transforming the variables in order to access different statistical methods.

For example, you can transform a categorical variable to ordinal, even if it is not, and see if any interesting results come out.

You can also make a numerical variable discrete (e.g. by binning) and try categorical-based measures.

Some statistical measures assume properties of the variables, such as Pearson’s, which assumes a Gaussian probability distribution of the observations and a linear relationship. You can transform the data to meet the expectations of the test, or try the test regardless of the expectations, and compare results.
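As one possible example of the binning idea, the sketch below discretizes two numerical inputs with KBinsDiscretizer and then scores the binned values with mutual information, treating them as categories. The number of bins, the binning strategy, and the synthetic data are all assumptions.

```python
# bin numerical inputs, then score them with a categorical-based measure
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import mutual_info_classif
# binary class label, one related input and one unrelated input
rng = np.random.default_rng(5)
y = rng.integers(0, 2, size=300)
informative = y * 2.0 + rng.normal(size=300)
noise = rng.normal(size=300)
X = np.column_stack([informative, noise])
# replace each value with the index of its bin
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)
# score the binned columns as discrete variables
scores = mutual_info_classif(X_binned, y, discrete_features=True, random_state=0)
print(scores)
```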

### What Is the Best Method?

There is no best feature selection method.

Just like there is no best set of input variables or best machine learning algorithm. At least not universally.

Instead, you must discover what works best for your specific problem using careful systematic experimentation.

Try a range of different models fit on different subsets of features chosen via different statistical measures and discover what works best for your specific problem.
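A simple way to run this kind of experiment is to put the feature selection and the model in a Pipeline and compare cross-validated scores for different settings. The sketch below varies only the number of selected features; the model, the candidate values of k, and the dataset are assumptions, and the same loop could be extended to other statistical measures.

```python
# compare different numbers of selected features with cross-validation
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# generate dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=1)
results = {}
for k in (2, 4, 8, 20):
    # selection happens inside each fold, which avoids leaking test data
    pipeline = Pipeline([
        ('select', SelectKBest(score_func=f_classif, k=k)),
        ('model', LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
    results[k] = scores.mean()
    print('k=%d accuracy=%.3f' % (k, results[k]))
```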

## 4. Worked Examples of Feature Selection

It can be helpful to have some worked examples that you can copy-and-paste and adapt for your own project.

This section provides worked examples of feature selection cases that you can use as a starting point.

### Regression Feature Selection:

(*Numerical Input, Numerical Output*)

This section demonstrates feature selection for a regression problem that has numerical inputs and numerical outputs.

A test regression problem is prepared using the make_regression() function.

Feature selection is performed using Pearson’s Correlation Coefficient via the f_regression() function.

```python
# pearson's correlation feature selection for numeric input and numeric output
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
# generate dataset
X, y = make_regression(n_samples=100, n_features=100, n_informative=10)
# define feature selection
fs = SelectKBest(score_func=f_regression, k=10)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```

Running the example first creates the regression dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

```
(100, 10)
```
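If you also need to know which columns were kept, or to apply the same selection to new rows, the fitted SelectKBest object can help. The sketch below uses get_support() to recover the selected columns and transform() to filter new data. The feature names are hypothetical placeholders, since the synthetic dataset has none.

```python
# report which features were selected and apply the selection to new data
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
# generate dataset
X, y = make_regression(n_samples=100, n_features=100, n_informative=10, random_state=1)
# hypothetical column names for illustration
names = ['feature_%d' % i for i in range(X.shape[1])]
# fit the feature selection on the training data
fs = SelectKBest(score_func=f_regression, k=10).fit(X, y)
# boolean mask with one entry per input column
mask = fs.get_support()
selected_names = [name for name, keep in zip(names, mask) if keep]
print(selected_names)
# new rows with the same 100 columns are reduced to the same 10 columns
X_new, _ = make_regression(n_samples=5, n_features=100, n_informative=10, random_state=2)
X_new_selected = fs.transform(X_new)
print(X_new_selected.shape)
```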

### Classification Feature Selection:

(*Numerical Input, Categorical Output*)

This section demonstrates feature selection for a classification problem that has numerical inputs and categorical outputs.

A test classification problem is prepared using the make_classification() function.

Feature selection is performed using ANOVA F measure via the f_classif() function.

```python
# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# generate dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2)
# define feature selection
fs = SelectKBest(score_func=f_classif, k=2)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```

Running the example first creates the classification dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

```
(100, 2)
```

### Classification Feature Selection:

(*Categorical Input, Categorical Output*)

For examples of feature selection with categorical inputs and categorical outputs, see the tutorial:

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Posts

- How to Calculate Nonparametric Rank Correlation in Python
- How to Calculate Correlation Between Variables in Python
- Feature Selection For Machine Learning in Python
- An Introduction to Feature Selection

### Articles

- Feature selection, scikit-learn API.
- What are the feature selection options for categorical data? Quora.

## Summary

In this post, you discovered how to choose statistical measures for filter-based feature selection with numerical and categorical data.

Specifically, you learned:

- There are two main types of feature selection techniques: wrapper and filter methods.
- Filter-based feature selection methods use statistical measures to score the correlation or dependence between each input variable and the target variable, and these scores can be used to filter the inputs down to the most relevant features.
- Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Hi Jason,

Thank you for the nice blog. Do you have a summary of unsupervised feature selection methods?

All of the statistical methods listed in the post are unsupervised.

Hi

Actually I’ve got the same question like Mehmet above. Please correct me if I’m wrong, the talk in this article is about input variables and target variables. With that I understand features and labels of a given supervised learning problem.

But in your answer it says unsupervised! I’m a bit confused.

Thanks.

They are statistical tests applied to two variables, there is no supervised learning model involved.

I think by unsupervised you mean no target variable. In that case you cannot do feature selection. But you can do other things, like dimensionality reduction, e.g. SVD and PCA.

But the two code samples you’re providing for feature selection _are_ from the area of supervised learning:

– Regression Feature Selection (Numerical Input, Numerical Output)

– Classification Feature Selection (Numerical Input, Categorical Output)

Do you maybe mean that supervised learning is _one_ possible area one can make use of for feature selection BUT this is not necessarily the only field of using it?

Perhaps I am saying that this type of feature selection only makes sense on supervised learning, but it is not a supervised learning type algorithm – the procedure is applied in an unsupervised manner.

OK I guess now I understand what you mean.

Feature selection methods are used by the supervised learning problems to reduce the number of input features (or as you call them “the input variables”), however ALL of these methods themselves work in an unsupervised manner to do so.

That is my understanding.

What do you mean by unsupervised – like feature selection for clustering?

Thanks again for short and excellent post. How about Lasso, RF, XGBoost and PCA? These can also be used to identify best features.

Yes, but in this post we are focused on univariate statistical methods, so-called filter feature selection methods.

Thanks for your time for the clarification.

Thanks for sharing. Actually I was looking for such a great blog since a long time.

Thanks!

I hope it helps.

Please give two reasons why it may be desirable to perform feature selection in connection with document classification.

What would feature selection for document classification look like exactly? Do you mean reducing the size of the vocab?

quite an informative article with great content

Thanks!

Hi Jason! Thanks for this informative post! I’m trying to apply this knowledge to the Housing Price prediction problem where the regressors include both numeric features and categorical features. In your graph, (Categorical Inputs, Numerical Output) also points to ANOVA. To use ANOVA correctly in this Housing Price case, do I have to encode my Categorical Inputs before SelectKBest?

Yes, categorical variables will need to be label/integer encoded at the least.

Hi Jason! I have dataset with both numerical and categorical features. The label is categorical in nature. Which is the best possible approach to find feature importance? Should I OneHotEncode my categorical features before applying ANOVA/Kendall’s?

Use separate statistical feature selection methods for different variable types.

Or try RFE.

Hey Jason,

Thanks a lot for this detailed article.

I have a question, after one hot encoding my categorical feature, the created columns just have 0 and 1. My output variable is numerical and all other predictors are also numerical. Can i use pearson/spearman correlation for feature selection here (and for removing multicollinearity as well) ??

Now since one hot encoded column has some ordinality (0 – Absence, 1- Presence) i guess correlation matrix will be useful.

I tried this and the output is making sense business wise. Just wanted to know your thoughts on this, is this fundamentally correct ??

No, spearman/pearson correlation on binary attributes does not make sense.

You perform feature selection on the categorical variables directly.

Thanks a lot for your nice post. I’m way new to ML so I have a really rudimentary question. Suppose I have a set of tweets which labeled as negative and positive. I want to perform some sentiment analysis. I extracted 3 basic features: 1. Emotion icons 2.Exclamation marks 3. Intensity words(very, really). My question is: How should I use these features with SVM or other ML algorithms? In other words, how should I apply the extracted features in SVM algorithm?

should I train my dataset each time with one feature? I read several articles and they are just saying: we should extract features and deploy them in our algorithms but HOW?

Help me, please

The text will need a numeric representation, such as a bag of words.

This is called natural language processing, you can get started here:

https://machinelearningmastery.com/start-here/#nlp

Thank you sir

You’re welcome.

Hey jason,

Can you please say why should we use univariate selection method for feature selection?

Cause we should use correlation matrix which gives correlation between each dependent feature and independent feature,as well as correlation between two independent features.

So, using correlation matrix we can remove collinear or redundant features also.

So can you please say when should we use univariate selection over correlation matrix?

Yes, filter methods like statistical test are fast and easy to test.

You can move on to wrapper methods like RFE later.

hi Jason

thnx for your helpful post

i want to know which method to use?

input variables are

1. age

2.sex(but it has numbers as 1 for males and 2 for females)

3. working hours-

4. school attainment (also numbers)

the output is numeric

could you plz help

Perhaps note whether each variable is numeric or categorical then follow the above guide.

Hi, Jason!

Do you mean you need to perform feature selection for each variable according to input and output parameters as illustrated above? Is there any shortcuts where I just feed the data and produce feature scores without worrying on the type of input and output data?

Yes, different feature selection for different variable types.

A short cut would be to use a different approach, like RFE, or an algorithm that does feature selection for you like xgboost/random forest.

Hello Jason,

Thank you for your nice blogs, I read several and find them truly helpful.

I have a quick question related to feature selection:

if I want to select some features via VarianceThreshold, does this method only apply to numerical inputs?

Can I encode categorical inputs and apply VarianceThreshold to them as well?

Many thanks!

Thanks!

Yes, numerical only as far as I would expect.

Hi Jason!

Is there any way to display the names of the features that were selected by SelectKBest?

In your example it just returns a numpy array with no column names.

Yes, you can loop through the list of column names and the features and print whether they were selected or not using information from the attributes on the SelectKBest class.

Hi Jason,

Many thanks for this detailed blog. A quick question on the intuition of the f_classif method.

Why do we select feature with high F value? Say if y takes two classes [0,1], and feature1 was selected because it has high F-statistic in a univariate ANOVA with y, does it mean that the mean of feature 1 when y = 0, is statistically different from the mean of feature 1 when y = 1? and therefore feature 1 likely to be useful in predicting y?

Yes, large values. But don’t do it manually use a built-in selection method.

See the worked examples at the end of the tutorial as a template.

hi jason,

so im working with more than 100 thousand samples dota2 dataset which consist of the winner and the “hero” composition from each match. I was trying to build winner of the match prediction model similiar to this [http://jmcauley.ucsd.edu/cse255/projects/fa15/018.pdf]. so the vector input is

Xi= 1 if hero i on radiant side, 0 otherwise.

X(119+i) = 1 if hero i on dire side, 0 otherwise

The vector X consists of 238 entries since there are 119 kinds of heroes. Each vector represents the composition of the heroes played within each match. Each match always consists of exactly 10 heroes (5 radiant side, 5 dire side).

From this set up i would have a binary matrix of 100k times (222 + 1) dimension with row represent how many samples and columns represent features, +1 columns for the label vector (0 and 1, 1 meaning radiant side win)

so if i dot product between two column vector of my matrix, i can get how many times hero i played with hero j on all the samples.

so if i hadamard product between two column vector of my matrix and the result of that we dot product to the vector column label i can get how many times hero i played with hero j and win.

with this i can calculate the total weight of each entry per sample that corresponds to the label vector. i could get very high correlation between these “new features” and the label vector, but i can't find any references to this problem in statistics textbooks on binary data.

Not sure I can offer good advice off the cuff, sorry.

Hi Jason, Thanks for this article. I totally understand this different methodologies. I have one question.

If lets say. I have 3 variables. X,Y,Z

X= categorical

Y= Numerical

Z= Categorical, Dependent(Value I want to predict)

Now, I did not get any relationship between Y and Z and I got the Relationship between Y and Z. Is it possible that if we include X, Y both together to predict Z, Y might get the relationship with Z.

If is there any statistical method or research around please do mention them. Thanks

I would recommend simply testing each combination of input variables and using the combination that results in the best performance for predicting the target – it’s a lot simpler than multivariate statistics.

When having a dataset that contains only categorical variables including nominal, ordinal & dichotomous variables, is it incorrect if I use either Cramér’s V or Theil’s U (Uncertainty Coefficient) to get the correlation between features?

Thanks

San

I don’t know off-hand, sorry.

Very good article.

I have detected outliers and wondering how can I estimate contribution of each feature on a single outlier?

We are talking about only one observation and it’s label, not whole dataset.

I couldn’t find any reference for that.

This sounds like an open question.

Perhaps explore distance measures from a centroid or to inliers?

Or univariate distribution measures for each feature?

Thank you for quick response.

That’s one class multivalve application.

For a single observation, I need to find out the first n features that have the most impact on being in that class.

From most articles, I can find the most important features over all observations, but here I need to know that over a selected observations.

Simply fit the model on your subset of instances.

1) In case of feature selection algorithm (XGBoost, GA, and PCA) what kind of method we can consider wrapper or filter?

2) what is the difference between feature selection and dimension reduction?

XGBoost would be used as a filter, GA would be a wrapper, PCA is not a feature selection method.

Feature selection chooses features in the data. Dimensionality reduction like PCA transforms or projects the features into lower dimensional space.

Technically deleting features could be considered dimensionality reduction.

Thank you so much for your time to respond. Would you like to share some of the material on the same (so I can use it for my thesis as a reference)?

In addition, I am excited to know the advantages and disadvantages in this respect; I mean when I use XGBoost as a filter feature selection and GA as a wrapper feature selection and PCA as a dimensional reduction, then what may be the possible advantages and disadvantages?

best regards!

If you need theory of feature selection, I recommend performing a literature review.

I cannot help you with advantages/disadvantages – it’s mostly a waste of time. I recommend using what “does” work best on a specific dataset, not what “might” work best.

I didn’t get your point.

I have 1 record which is outlier. and wanted to know which features had the most contribution on that record to get outlier.

Thank you and sorry if question is confusing

I suggested that it is an open question – as in, there are no obvious answers.

I suggested to take it on as a research project and discover what works best.

Does that help, which part is confusing – perhaps I can elaborate?

Thank you

Should research on that

Thanks for the suggestion.

Do the above algorithms keep track of ‘which’ features have been selected, or only select the ‘best’ feature data? After having identified the ‘best k features’, how do we extract those features, ideally only those, from new inputs?

Yes, you can discover which features are selected according to their column index.