How to Perform Feature Selection with Categorical Data

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable.

Feature selection is often straightforward when working with real-valued data, for example by using Pearson’s correlation coefficient, but can be challenging when working with categorical data.

The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the chi-squared statistic and the mutual information statistic.

In this tutorial, you will discover how to perform feature selection with categorical input data.

After completing this tutorial, you will know:

  • The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
  • How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
  • How to perform feature selection for categorical data when fitting and evaluating a classification model.

Let’s get started.

How to Perform Feature Selection with Categorical Data
Photo by Phil Dolby, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Breast Cancer Categorical Dataset
  2. Categorical Feature Selection
  3. Modeling With Selected Features

Breast Cancer Categorical Dataset

As the basis of this tutorial, we will use the so-called “Breast cancer” dataset that has been widely studied as a machine learning dataset since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A naive model can achieve an accuracy of 70% on this dataset. A good score is about 76% +/- 3%. We will aim for this region, but note that the models in this tutorial are not optimized; they are designed to demonstrate feature selection.

You can download the dataset and save the file as “breast-cancer.csv” in your current working directory.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings; some are ordinal and some are not.

We can load this dataset into memory using the Pandas library.

Once loaded, we can split the columns into input (X) and output (y) for modeling.

Finally, we can force all fields in the input data to be strings, just in case Pandas tried to map some automatically to numbers (it does try).

We can tie all of this together into a helpful function that we can reuse later.
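As a minimal sketch of such a function, the steps above could be combined as follows. The three sample rows are hypothetical stand-ins (not real records, and with fewer columns than the real file) so that the snippet runs without the CSV; in practice you would pass the path to “breast-cancer.csv” instead. The assumption that the file has no header row and that the class label is the last column is mine.

```python
from io import StringIO

import pandas as pd


def load_dataset(filename):
    """Load a header-less CSV and split it into string inputs X and target y."""
    # Assumption: no header row, so every line is read as data
    data = pd.read_csv(filename, header=None)
    dataset = data.values
    # Assumption: the last column is the class label, the rest are inputs
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # Force every input field to a string, undoing any automatic numeric parsing
    X = X.astype(str)
    return X, y


# Hypothetical rows for illustration only; the third column is unquoted on
# purpose so that Pandas parses it as an integer before we cast it back
sample = StringIO(
    "'40-49','premeno',3,'recurrence-events'\n"
    "'50-59','ge40',1,'no-recurrence-events'\n"
    "'30-39','premeno',2,'no-recurrence-events'\n"
)
X, y = load_dataset(sample)
print(X.shape, y.shape)
print(X[0])
```

Note that the surrounding single quotes in the file are kept as part of each string, which is harmless for the encoders used later.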

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a learning model.

We will use the train_test_split() function from scikit-learn and use 67% of the data for training and 33% for testing.

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 191 examples for training and 95 for testing.
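The split sizes can be checked with a short sketch. The array below is random stand-in data with the same shape as the dataset (286 rows, nine inputs), not the real records, and the random_state value is an arbitrary choice of mine to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the same shape as the breast cancer dataset
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(286, 9)).astype(str)
y = rng.integers(0, 2, size=286)

# Hold back 33% of the rows for testing; random_state is an arbitrary seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
```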

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.
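To illustrate the difference an explicit order makes, here is a small sketch on a hypothetical single-column variable with the values low, medium, and high (these category names are made up for the example and do not come from the dataset).

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal variable with a natural order: low < medium < high
sizes = np.array([['medium'], ['low'], ['high'], ['low']])

# Default behaviour: categories are sorted alphabetically (high, low, medium)
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sizes)

# Explicit order: pass one list of categories per input column
ordered_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered_codes = ordered_enc.fit_transform(sizes)

print(default_codes.ravel())
print(ordered_codes.ravel())
```

With the default, “high” receives the smallest integer simply because it sorts first alphabetically; with the explicit list, the integers follow the natural order.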

Note: I will leave it as an exercise to you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

The function below named prepare_inputs() takes the input data for the train and test sets and encodes it using an ordinal encoding.
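A minimal sketch of such a function is shown here. The tiny arrays are made-up inputs used only to show the behaviour; every category in the toy test set also appears in the toy training set.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder


def prepare_inputs(X_train, X_test):
    """Ordinal-encode the inputs, fitting the encoder on the training set only."""
    oe = OrdinalEncoder()
    # Learn the category-to-integer mapping from the training data
    oe.fit(X_train)
    # Apply the same mapping to both sets
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc


# Made-up categorical inputs for illustration
X_train = np.array([['a', 'x'], ['b', 'y'], ['c', 'x']])
X_test = np.array([['b', 'x'], ['a', 'y']])

X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
print(X_train_enc)
print(X_test_enc)
```

By default, OrdinalEncoder raises an error if the test set contains a category that was not seen during fitting, so a rare category that lands only in the test split needs to be handled separately.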

We also need to prepare the target variable.

It is a binary classification problem, so we need to map the two class labels to 0 and 1. This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose. We could just as easily use the OrdinalEncoder and achieve the same result, although the LabelEncoder is designed for encoding a single variable.

The prepare_targets() function integer encodes the output data for the train and test sets.
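One way to sketch this function is shown below. The label strings are hypothetical placeholders for the two classes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def prepare_targets(y_train, y_test):
    """Integer-encode the class labels, fitting on the training labels only."""
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc


# Hypothetical class labels for illustration
y_train = np.array(['no-recurrence', 'recurrence', 'no-recurrence'])
y_test = np.array(['recurrence', 'no-recurrence'])

y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
print(y_train_enc, y_test_enc)
```

LabelEncoder assigns integers in sorted label order, so here “no-recurrence” maps to 0 and “recurrence” maps to 1.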

We can call these functions to prepare our data.

Tying this all together, the complete example of loading and encoding the input and output variables for the breast cancer categorical dataset is listed below.

Now that we have loaded and prepared the breast cancer dataset, we can explore feature selection.

Categorical Feature Selection

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable.

They are:

  • Chi-Squared Statistic.
  • Mutual Information Statistic.

Let’s take a closer look at each in turn.

Chi-Squared Feature Selection

Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset.

The scikit-learn machine learning library provides an implementation of the chi-squared test in the chi2() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.

For example, we can define the SelectKBest class to use the chi2() function and select all features, then transform the train and test sets.

We can then print the scores for each variable (largest is better), and plot the scores for each variable as a bar graph to get an idea of how many features we should select.
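The following is a sketch of this step, not the tutorial's full listing. It uses synthetic, already integer-encoded data in place of the breast cancer dataset: nine inputs, of which only feature 0 is constructed to be related to the target. The data, the seed, and the helper name select_features() as written here are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split


def select_features(X_train, y_train, X_test):
    """Score every feature with chi-squared and keep them all (k='all')."""
    fs = SelectKBest(score_func=chi2, k='all')
    # Learn the scores from the training data only
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs


# Synthetic stand-in for the encoded dataset (NOT the real data):
# features 1..8 are random noise, feature 0 agrees with the class ~90% of the time
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=286)
X = rng.integers(0, 4, size=(286, 9))
noise = rng.random(286) < 0.1
X[:, 0] = np.where(noise, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)

# Larger scores suggest a stronger relationship with the target
for i, score in enumerate(fs.scores_):
    print('Feature %d: %f' % (i, score))
```

The same fs.scores_ array can be passed to a plotting call such as matplotlib's pyplot.bar() to draw the bar chart.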

Tying this together with the data preparation for the breast cancer dataset in the previous section, the complete example is listed below.

Running the example first prints the scores calculated for each input feature and the target variable.

Note: your specific results may differ. Try running the example a few times.

In this case, we can see the scores are small and it is hard to get an idea from the number alone as to which features are more relevant.

Perhaps features 3, 4, 5, and 8 are most relevant.

A bar chart of the feature importance scores for each input feature is created.

This clearly shows that feature 3 might be the most relevant (according to chi-squared) and that perhaps four of the nine input features are the most relevant.

We could set k=4 when configuring the SelectKBest to select these top four features.

Bar Chart of the Input Features (x) vs The Chi-Squared Feature Importance (y)

Mutual Information Feature Selection

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the mutual_info_classif() function.

Like chi2(), it can be used in the SelectKBest feature selection strategy (and other strategies).

We can perform feature selection using mutual information on the breast cancer set and print and plot the scores (larger is better) as we did in the previous section.
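A sketch of this step on synthetic stand-in data is given below. One choice here is my own assumption rather than something stated above: the inputs are flagged as discrete by wrapping mutual_info_classif() with functools.partial and discrete_features=True. Without that flag, scikit-learn treats dense inputs as continuous and uses a nearest-neighbor estimator, which is one reason scores can change from run to run.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset (NOT the real data):
# feature 0 agrees with the class ~90% of the time, the rest are noise
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=286)
X = rng.integers(0, 4, size=(286, 9))
noise = rng.random(286) < 0.1
X[:, 0] = np.where(noise, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Assumption: tell the scorer that every input column is discrete
score_func = partial(mutual_info_classif, discrete_features=True)
fs = SelectKBest(score_func=score_func, k='all')
fs.fit(X_train, y_train)

# Larger scores mean knowing the feature removes more uncertainty about the class
for i, score in enumerate(fs.scores_):
    print('Feature %d: %f' % (i, score))
```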

The complete example of using mutual information for categorical feature selection is listed below.

Running the example first prints the scores calculated for each input feature and the target variable.

Note: your specific results may differ. Try running the example a few times.

In this case, we can see that some of the features have a very low score, suggesting that perhaps they can be removed.

Perhaps features 3, 6, 2, and 5 are most relevant.

A bar chart of the feature importance scores for each input feature is created.

Importantly, a different mixture of features is promoted.

Bar Chart of the Input Features (x) vs The Mutual Information Feature Importance (y)

Now that we know how to perform feature selection on categorical data for a classification predictive modeling problem, we can try developing a model using the selected features and compare the results.

Modeling With Selected Features

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

In this section, we will evaluate a Logistic Regression model with all features compared to a model built from features selected by chi-squared and those features selected via mutual information.

Logistic regression is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model.

Model Built Using All Features

As a first step, we will evaluate a LogisticRegression model using all the available features.

The model is fit on the training dataset and evaluated on the test dataset.
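As a sketch of this baseline, the snippet below fits and evaluates the model on synthetic, already encoded stand-in data (not the breast cancer dataset), so the accuracy it prints is not comparable to the figures quoted in this tutorial.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset (NOT the real data)
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=286)
X = rng.integers(0, 4, size=(286, 9))
noise = rng.random(286) < 0.1
X[:, 0] = np.where(noise, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Fit on the training set using all nine features
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)

# Evaluate on the held-out test set
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))
```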

The complete example is listed below.

Running the example prints the accuracy of the model on the test dataset.

Note: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieves a classification accuracy of about 75%.

We would prefer to use a subset of features that achieves a classification accuracy that is as good or better than this.

Model Built Using Chi-Squared Features

We can use the chi-squared test to score the features and select the four most relevant features.

The select_features() function below is updated to achieve this.
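A sketch of the updated function, followed by fitting the model on the reduced inputs, might look like this. As before, the data is a synthetic stand-in in which only feature 0 is informative, so the printed accuracy says nothing about the real dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def select_features(X_train, y_train, X_test):
    """Keep the four features with the largest chi-squared scores."""
    fs = SelectKBest(score_func=chi2, k=4)
    fs.fit(X_train, y_train)
    return fs.transform(X_train), fs.transform(X_test), fs


# Synthetic stand-in for the encoded dataset (NOT the real data)
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=286)
X = rng.integers(0, 4, size=(286, 9))
noise = rng.random(286) < 0.1
X[:, 0] = np.where(noise, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)

# Fit and evaluate the model on the four selected columns only
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_fs, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_fs))

print('Selected feature indexes:', np.flatnonzero(fs.get_support()))
print('Accuracy: %.2f' % (accuracy * 100))
```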

The complete example of evaluating a logistic regression model fit and evaluated on data using this feature selection method is listed below.

Running the example reports the performance of the model on just four of the nine input features selected using the chi-squared statistic.

Note: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we see that the model achieved an accuracy of about 74%, a slight drop in performance.

It is possible that some of the features removed are, in fact, adding value directly or in concert with the selected features.

At this stage, we would probably prefer to use all of the input features.

Model Built Using Mutual Information Features

We can repeat the experiment and select the top four features using a mutual information statistic.

The updated version of the select_features() function to achieve this is listed below.
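One possible version is sketched below on the same kind of synthetic stand-in data. Passing discrete_features=True through functools.partial is my own assumption, made because the encoded inputs are categorical; dropping it falls back to scikit-learn's default handling of dense inputs as continuous.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def select_features(X_train, y_train, X_test):
    """Keep the four features with the largest mutual information scores."""
    # Assumption: treat every encoded input column as discrete
    score_func = partial(mutual_info_classif, discrete_features=True)
    fs = SelectKBest(score_func=score_func, k=4)
    fs.fit(X_train, y_train)
    return fs.transform(X_train), fs.transform(X_test), fs


# Synthetic stand-in for the encoded dataset (NOT the real data)
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=286)
X = rng.integers(0, 4, size=(286, 9))
noise = rng.random(286) < 0.1
X[:, 0] = np.where(noise, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)

# Fit and evaluate the model on the four selected columns only
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_fs, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_fs))
print('Accuracy: %.2f' % (accuracy * 100))
```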

The complete example of using mutual information for feature selection to fit a logistic regression model is listed below.

Running the example fits the model on the four top selected features chosen using mutual information.

Note: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see a small lift in classification accuracy to 76%.

To be sure that the effect is real, it would be a good idea to repeat each experiment multiple times and compare the mean performance. It may also be a good idea to explore using k-fold cross-validation instead of a simple train/test split.
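A sketch of that comparison using repeated stratified k-fold cross-validation is shown below. Placing the feature selection step inside a Pipeline means it is refit on each training fold, which avoids leaking information from the held-out fold. The data is again a synthetic stand-in, and the choices of 10 folds, 3 repeats, and the seed are arbitrary assumptions; with the real dataset, the encoding step would also belong inside the pipeline.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the encoded dataset (NOT the real data)
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=286)
X = rng.integers(0, 4, size=(286, 9))
noise = rng.random(286) < 0.1
X[:, 0] = np.where(noise, 1 - y, y)

# Assumption: treat the encoded inputs as discrete for mutual information
mi = partial(mutual_info_classif, discrete_features=True)

# The three configurations compared in this section
candidates = {
    'all features': LogisticRegression(solver='lbfgs'),
    'chi-squared, k=4': Pipeline([
        ('fs', SelectKBest(score_func=chi2, k=4)),
        ('lr', LogisticRegression(solver='lbfgs')),
    ]),
    'mutual information, k=4': Pipeline([
        ('fs', SelectKBest(score_func=mi, k=4)),
        ('lr', LogisticRegression(solver='lbfgs')),
    ]),
}

# 10 folds repeated 3 times gives 30 accuracy estimates per configuration
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    results[name] = scores
    print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))
```

Comparing the mean and standard deviation across configurations gives a more reliable picture than a single train/test split.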

Summary

In this tutorial, you discovered how to perform feature selection with categorical input data.

Specifically, you learned:

  • The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
  • How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
  • How to perform feature selection for categorical data when fitting and evaluating a classification model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

22 Responses to How to Perform Feature Selection with Categorical Data

  1. Alan November 25, 2019 at 7:04 am #

    Hi Jason,

    I have performed feature selection on the columns of my data (after cleaning it). The method I used is the Extra Tree Classifier for Feature Selection (from sklearn.ensemble).

    I fit the model and get the n most relevant features of my data. However, every time I run it, different features come up as the most important ones. I’m not sure why this is happening, shouldn’t the method always output the same n top features?

    Thanks for your inputs!

    • Jason Brownlee November 25, 2019 at 2:06 pm #

      Perhaps the choice of model has too much variance.

      Perhaps try an alternate model or alternate model configuration with less variance?

      Perhaps try evaluating the strategy “on average” rather than for a single run?

      Perhaps compare the approach on average to other less stochastic methods, like RFE or chi squared?

  2. Carolinne Magalhaes November 25, 2019 at 8:11 am #

    Very useful! Thanks!

  3. Thorsten Henrich November 26, 2019 at 12:05 am #

    Why do the two methods give completely different values for feature importance?
    Maybe the numbering of features is not matching?

  4. Ram November 26, 2019 at 2:42 am #

    Jason:
    I am a long time fan of your blog posts. I want to bring to your attention a new library that I have developed called Auto_ViML which performs feature selection and model tuning automatically using many great Kaggle techniques. I’d like you to try it on this breast-cancer.csv file and report results if you can.
    Here’s the Medium article that describes it:

    https://towardsdatascience.com/why-automl-is-an-essential-new-tool-for-data-scientists-2d9ab4e25e46?gi=7814502b6fb8

    Please try It and let me know. Thanks.

  5. Sean November 26, 2019 at 3:02 am #

    Thank you for this a nice post with an illustrative code demo.
    I have a question on the Chi-Squared Feature Selection:
    Are the importance (score) of variable and the Chi-squared score same thing, OR different things in opposite direction?
    The reason I am puzzled is by below understanding (misunderstanding?). The null hypothesis of Chi-Squared test is that the two inputs have similar distribution (i.e. related or dependent). Thus, higher Chi-Squared score implies less dependency. For the sake of feature selection, we want to use those variables that are related to target variable for predicting. Or in your text, “those features that are independent of the target variable can be REMOVED from the dataset”.
    My understanding is that variables with larger Chi-square score should be removed, since they are independent to the target variable. Thus, I have the impression that the importance of variable be inverse to the raw Chi-square score.

  6. marco November 26, 2019 at 6:59 am #

    Hello Jason,
    I have a question.
    Regularization methods increase accuracy and reduce overfitting or are just for accuracy?
    Thanks

    • Jason Brownlee November 26, 2019 at 1:27 pm #

      Regularization reduces complexity, which often reduces overfitting and generalization error.

  7. marco November 27, 2019 at 2:54 am #

    Hello Jason,
    one more question is about using sklearn Precision, Recall, F1 metrics and ROC/ AUC to evaluate Keras model.
    Is up to you correct to use the above metrics to evaluate Keras models?
    Does Keras provide any other function to evaluate models?
    Thanks

  8. Saurabh December 4, 2019 at 1:24 am #

    Hello Jason,

    Thanks for sharing the interesting blog!

  9. Deependra December 6, 2019 at 3:48 am #

    Hi Jason, I have a question.

    I have read it somewhere that we should use dummy variable (using one hot encoding ) for those categorical features which have more than 2 categories. But in this logistic regression model you have not done so which makes me confused about when exactly should we use dummy variable and when it is not necessary.

    Please help.

    • Jason Brownlee December 6, 2019 at 5:24 am #

      Because the test problem has 2 classes.

      • Deependra December 6, 2019 at 6:04 pm #

        But there are many features with more than 2 classes. (e.g. breast-quad has 5 classes: left-up, left-low, right-up, right-low, central).

        So my question is that whether we should use dummy variable for such columns or not.

        • Jason Brownlee December 7, 2019 at 5:36 am #

          They are input variables.

          The class is the output variable. There is one class and it has two values.

  10. Subhankar Hotta December 10, 2019 at 4:52 am #

    I’m using the same piece of code and getting this stuck at this error when I am encoding my inputs using ordinalencoder : “Input contains NaN, infinity or a value too large for dtype(‘float64’).” Could u suggest what can be done to resolve this ???
