Feature Selection For Machine Learning in Python

The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.

Irrelevant or partially relevant features can negatively impact model performance.

In this post you will discover automatic feature selection techniques that you can use to prepare your machine learning data in Python with scikit-learn.

Let’s get started.

Update Dec/2016: Fixed a typo in the RFE section regarding the chosen variables. Thanks Anderson.


Feature Selection

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection before modeling your data are:

  • Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
  • Improves Accuracy: Less misleading data means modeling accuracy improves.
  • Reduces Training Time: Less data means that algorithms train faster.

You can learn more about feature selection with scikit-learn in the article Feature selection.


Feature Selection for Machine Learning

This section lists 4 feature selection recipes for machine learning in Python.

Each recipe was designed to be complete and standalone so that you can copy-and-paste it directly into your project and use it immediately.

Each recipe uses the Pima Indians onset of diabetes dataset to demonstrate the feature selection method. This is a binary classification problem where all of the attributes are numeric.

1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi squared (chi^2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.

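A minimal sketch of this recipe is shown below. It assumes the dataset has been saved locally as pima-indians-diabetes.data.csv with no header row (the filename is just an assumption; point it at wherever you keep your copy of the CSV).

# Feature selection with univariate statistical tests (chi-squared)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest, chi2

# load data (filename is an assumption; adjust to your local copy)
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]   # input attributes
Y = array[:, 8]     # target class

# select the 4 attributes with the highest chi-squared scores
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
set_printoptions(precision=3)
print(fit.scores_)            # score per attribute
features = fit.transform(X)   # reduce X to the 4 selected attributes
print(features[0:5, :])       # summarize the selected features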
You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age.

2. Recursive Feature Elimination

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.

It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.

You can learn more about the RFE class in the scikit-learn documentation.

The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

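A minimal sketch of this recipe follows, assuming the same locally saved CSV as in the first recipe (recent versions of scikit-learn require the n_features_to_select keyword argument, so it is used here).

# Feature extraction with RFE and logistic regression
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# load data (filename is an assumption; adjust to your local copy)
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# recursively eliminate attributes until only 3 remain
model = LogisticRegression(solver='liblinear')
rfe = RFE(estimator=model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)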
You can see that RFE chose the top 3 features as preg, mass and pedi.

These are marked True in the support_ array and given a rank of 1 in the ranking_ array.

3. Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.

In the example below, we use PCA and select 3 principal components.

Learn more about the PCA class in scikit-learn by reviewing the PCA API. Dive deeper into the math behind PCA on the Principal Component Analysis Wikipedia article.

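A minimal sketch of this recipe, assuming the same locally saved CSV as in the earlier recipes:

# Feature extraction with PCA
from pandas import read_csv
from sklearn.decomposition import PCA

# load data (filename is an assumption; adjust to your local copy)
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]

# project the 8 input attributes onto 3 principal components
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)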
You can see that the transformed dataset (3 principal components) bears little resemblance to the source data.

4. Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.

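A minimal sketch of this recipe, assuming the same locally saved CSV as in the earlier recipes (the number of trees below is an arbitrary choice):

# Feature importance with an ensemble of extremely randomized trees
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier

# load data (filename is an assumption; adjust to your local copy)
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# fit the ensemble and report an importance score per attribute
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X, Y)
print(model.feature_importances_)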
You can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. The scores suggest the importance of plas, age and mass.

Summary

In this post you discovered feature selection for preparing machine learning data in Python with scikit-learn.

You learned about 4 different automatic feature selection techniques:

  • Univariate Selection.
  • Recursive Feature Elimination.
  • Principal Component Analysis.
  • Feature Importance.


Do you have any questions about feature selection or this post? Ask your questions in the comments and I will do my best to answer them.


60 Responses to Feature Selection For Machine Learning in Python

  1. Juliet September 16, 2016 at 8:57 pm #

    Hi Jason! Thanks for this – really useful post! I’m sure I’m just missing something simple, but looking at your Univariate Analysis, the features you have listed as being the most correlated seem to have the highest values in the printed score summary. Is that just a quirk of the way this function outputs results? Thanks again for a great access-point into feature selection.

    • Jason Brownlee September 17, 2016 at 9:29 am #

      Hi Juliet, it might just be coincidence. If you uncover something different, please let me know.

  2. Ansh October 11, 2016 at 12:16 pm #

    For the Recursive Feature Elimination, are the features of high importance (preg,mass,pedi)?
The ranking array has value 1 for them.

    • Jason Brownlee October 12, 2016 at 9:11 am #

      Hi Ansh, I believe the features with the 1 are preg, pedi and age as mentioned in the post. These are the first ranked features.

      • Ansh October 12, 2016 at 12:29 pm #

        Thanks for the reply Jason. I seem to have made a mistake, my bad. Great post 🙂

        • Jason Brownlee October 13, 2016 at 8:33 am #

          No problem Ansh.

          • Anderson Neves December 15, 2016 at 6:52 am #

            Hi all,

            I agree with Ansh. There are 8 features and the indexes with True and 1 match with preg, mass and pedi.

            [ ‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’ ]
            [ True, False, False, False, False, True, True, False]
            [ 1, 2, 3, 5, 6, 1, 1, 4 ]

            Jason, could you explain better how you see that preg, pedi and age are the first ranked features?

            Thank you for the post, it was very useful and direct to the point. Congratulations.

          • Jason Brownlee December 15, 2016 at 8:31 am #

            Hi Anderson, they have a “true” in their column index and are all ranked “1” at their respective column index.

            Does that help?

          • Anderson Neves December 16, 2016 at 12:00 am #

            Hi Jason,

            That is exactly what I mean. I believe that the best features would be preg, pedi and age in the scenario below

            Features:
            [ ‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’ ]

            RFE result:
            [ True, False, False, False, False, False, True, True ]
            [ 1, 2, 3, 5, 6, 4, 1, 1 ]

            However, the result was

            Features:
            [ ‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’ ]

            RFE result:
            [ True, False, False, False, False, True, True, False]
            [ 1, 2, 3, 5, 6, 1, 1, 4 ]

            Did you consider the target column ‘class’ by mistake?

            Thank you for the quick reply,
            Anderson Neves

          • Jason Brownlee December 16, 2016 at 5:48 am #

            Hi Anderson,

            I see, you’re saying you have a different result when you run the code?

            The code is correct and does not include the class as an input.

            Re-running now I see the same result:

            Perhaps I don’t understand the problem you’ve noticed?

          • Anderson Neves December 17, 2016 at 12:22 am #

            Hi Jason,

            Your code is correct and my result is the same as yours. My point is that the best features found with RFE are preg, mass and pedi. So, I suggest you fix the text “You can see that RFE chose the the top 3 features as preg, pedi and age.”. If you add the code below at the end of your code you will see what I mean.

# find best features
best_features = []
i = 0
for is_best_feature in fit.support_:
    if is_best_feature:
        best_features.append(names[i])
    i += 1
print('\nSelected features:')
print(best_features)

            Sorry if I am bothering somehow,
            Thanks again,
            Anderson Neves

          • Jason Brownlee December 17, 2016 at 11:18 am #

            Got it Anderson.
            Thanks for being patient with me and helping to make this post more useful. I really appreciate it!

            I’ve fixed up the example above.

  3. Narasimman October 14, 2016 at 9:18 pm #

    from the rfe, how do I form a new dataframe for the features which has true value?

    • Jason Brownlee October 15, 2016 at 10:22 am #

      Great question Narasimman.

From memory, you can use numpy.concatenate() to collect the columns you want.
      http://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html

    • Iain Dinwoodie November 1, 2016 at 12:52 am #

      Thanks for useful tutorial.

      Narasimman – ‘from the rfe, how do I form a new dataframe for the features which has true value?’

      You can just apply rfe directly to the dataframe then select based on columns:

      df = read_csv(url, names=names)
      X = df.iloc[:, 0:8]
      Y = df.iloc[:, 8]
      # feature extraction
      model = LogisticRegression()
      rfe = RFE(model, 3)
      fit = rfe.fit(X, Y)
print("Num Features: {}".format(fit.n_features_))
print("Selected Features: {}".format(fit.support_))
print("Feature Ranking: {}".format(fit.ranking_))

      X = X[X.columns[fit.support_]]

  4. MLBeginner October 25, 2016 at 1:07 am #

    Hi Jason,

Really appreciate your post! Really great! I have a quick question about the PCA method. How do I get the column headers for the selected 3 principal components? They are just simple column numbers there, so it is hard to know which attributes they finally correspond to.

    Thanks,

    • Jason Brownlee October 25, 2016 at 8:29 am #

      Thanks MLBeginner, I’m glad you found it useful.

      There is no column header, they are “new” features that summarize the data. I hope that helps.

  5. sadiq October 25, 2016 at 1:51 am #

Hi, Jason! Please, I want to ask if I can use PSO for feature selection in sentiment analysis with Python.

    • Jason Brownlee October 25, 2016 at 8:29 am #

      Sure, try it and see how the results compare (as in the models trained on selected features) to other feature selection methods.

  6. Vignesh Sureshbabu Kishore November 15, 2016 at 5:07 pm #

Hey Jason, can the univariate chi2 test for feature selection be applied to both continuous and categorical data?

    • Jason Brownlee November 16, 2016 at 9:25 am #

      Hi Vignesh, I believe just continuous data. But I may be wrong – try and see.

      • Vignesh Sureshbabu Kishore November 16, 2016 at 1:07 pm #

        Hey Jason, Thanks for the reply. In the univariate selection to perform the chi-square test you are fetching the array from df.values. In that case, each element of the array will be each row in the data frame.

To perform feature selection, we should ideally have fetched the values from each column of the dataframe to check the independence of each feature with the class variable. Is it an inbuilt functionality of sklearn.preprocessing because of which you fetch the values as each row?

        Please suggest me on this.

        • Jason Brownlee November 17, 2016 at 9:49 am #

          I’m not sure I follow Vignesh. Generally, yes, we are using built-in functions to perform the tests.

  7. Vineet December 2, 2016 at 5:11 am #

    Hi Jason,

I am trying to do image classification on a CPU machine, and I have a very large training matrix of 3800*200000, meaning 200000 features. Please suggest how I can reduce the dimensionality.

    • Jason Brownlee December 2, 2016 at 8:19 am #

      Consider working with a sample of the dataset.

      Consider using the feature selection methods in this post.

Consider projection methods like PCA, Sammon's mapping, etc.

      I hope that helps as a start.

  8. tvmanikandan December 15, 2016 at 5:49 pm #

    Jason,
when you use "SelectKBest", can you please explain how you get the below scores?

[ 111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304 ]

    -Mani

  9. tvmanikandan December 16, 2016 at 2:48 am #

    jason,
    Please explain how the below scores are achieved using chi2.

[ 111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304 ]

    -Mani

  10. Natheer Alabsi December 28, 2016 at 8:35 pm #

    Jason, how can we get feature names from their rankings?

    • Jason Brownlee December 29, 2016 at 7:15 am #

      Hi Natheer,

Map the feature rank to the index of the column name from the header row on the DataFrame, or what have you.

  11. Jason January 9, 2017 at 2:40 am #

    Hi Jason,

    Thank you for this nice blog

    I have a regression problem and I need to convert a bunch of categorical variables into dummy data, which will generate over 200 new columns. Should I do the feature selection before this step or after this step?
    Thanks

    • Jason Brownlee January 9, 2017 at 7:52 am #

      Try and see.

      That is a lot of new binary variables. Your resulting dataset will be sparse (lots of zeros). Feature selection prior might be a good idea, also try after.

  12. Mohit Tiwari February 13, 2017 at 3:37 pm #

    Hi Jason,

    I am bit stuck in selecting the appropriate feature selection algorithm for my data.

I have about 900 attributes (columns) in my data and about 60 records. The values are nothing but counts of the attributes.
Basically, I am taking counts of API calls of a portable file.

    My data is like this:

    File, dangerous, API 1,API 2,API 3,API 4,API 5,API 6…..API 900
    ABC, yes, 1,0,2,1,0,0,….
    DEF, no,0,1,0,0,1,2
    FGH,yes,0,0,0,1,2,3
    .
    .
    .
    Till 60

Can you please suggest a suitable feature selection method for my data?

    • Jason Brownlee February 14, 2017 at 10:03 am #

      Hi Mohit,

      Consider trying a few different methods, as well as some projection methods and see which “views” of your data result in more accurate predictive models.

  13. Esu February 15, 2017 at 12:01 am #

Hello!

Once I have the reduced version of my data as a result of using PCA, how can I feed it to my classifier?

Example: the original data is of size 100 rows by 5000 columns.
If I reduce to 200 features I will get 100 by 200 dimension data, right?
Then I create arrays of

a=array[:,0:199]
b=array[:,99]

but when I test my classifier its score is 0% in both test and training accuracy.
Any idea?

    • Jason Brownlee February 15, 2017 at 11:35 am #

Sounds like you're on the right track, but a zero accuracy is a red flag.

Did you accidentally include the class output variable in the data when doing the PCA? It should be excluded.

  14. Kamal February 20, 2017 at 6:20 pm #

    Hello sir,
I have a question in my mind.
Each of these feature selection algorithms uses some predefined number, like 3 in the case of PCA. So how do we come to know that my dataset contains only 3 (or any other predefined number of) important features? It does not automatically select the number of features on its own.

    • Jason Brownlee February 21, 2017 at 9:33 am #

      Great question Kamal.

      No, you must select the number of features. I would recommend using a sensitivity analysis and try a number of different features and see which results in the best performing model.

  15. Massimo March 9, 2017 at 5:29 am #

    Hi jason,
    I have a question about the RFECV approach.
I'm dealing with a project where I have to use different estimators (regression models). Is it correct to use RFECV with these models, or is it enough to use only one of them? Once I have selected the best features, could I use them for each regression model?
To better explain:
– I have used RFECV on the whole dataset in combination with one of the following regression models [LinearRegression, Ridge, Lasso].
– Then I have compared the r2 and chosen the better model, and used its selected features in order to do other things.
– Practically, I use the same 'best' features in each regression model.
Sorry for my bad English.

    • Jason Brownlee March 9, 2017 at 9:58 am #

      Good question.

      You can embed different models in RFE and see if the results tell the same or different stories in terms of what features to pick.

      You can build a model from each set of features and combine the predictions.

You can pick one set of features and build one or more models from them.

      My advice is to try everything you can think of and see what gives the best results on your validation dataset.

      • Massimo March 11, 2017 at 2:41 am #

        Thank you man. You’re great.

  16. gevra March 22, 2017 at 1:49 am #

    Hi Jason.

    Thanks for the post, but I think going with Random Forests straight away will not work if you have correlated features.

    Check this paper:
    https://academic.oup.com/bioinformatics/article/27/14/1986/194387/Classification-with-correlated-features

    I am not sure about the other methods, but feature correlation is an issue that needs to be addressed before assessing feature importance.

    • Jason Brownlee March 22, 2017 at 8:08 am #

      Makes sense, thanks for the note and the reference.

  17. ogunleye March 30, 2017 at 4:29 am #

    Hello sir,
    Thank you for the informative post. My questions are
1) How do you handle NaN values in a dataset for feature selection purposes?
2) I am getting an error with RFE(model, 3). It is telling me I supplied 2 arguments instead of 1.

    Thank you very much once again.

  18. ogunleye March 30, 2017 at 4:33 am #

I solved my problem, sir. I had named a function RFE in my main code. I would love to hear your response to my first question.

  19. Sam April 20, 2017 at 3:49 am #

How do I load nested JSON into a data frame?

    • Jason Brownlee April 20, 2017 at 9:32 am #

      I don’t know off hand, perhaps post to StackOverflow Sam?

  20. Federico Carmona April 20, 2017 at 6:10 am #

    good afternoon

How can I know with PCA what the main components are?

    • Jason Brownlee April 20, 2017 at 9:34 am #

      PCA will calculate and return the principal components.

      • Federico Carmona April 20, 2017 at 10:53 am #

Yes, but PCA does not tell me which are the most relevant variables, e.g. mass, test, etc.

        • Jason Brownlee April 21, 2017 at 8:27 am #

          Not sure I follow you sorry.

          You could apply a feature selection or feature importance method to the PCA results if you wanted. It might be overkill though.

  21. Lehyu April 23, 2017 at 6:44 pm #

In RFE we should input an estimator, so before I do feature selection, should I fine-tune the model or just use the default parameter settings? Thanks.

    • Jason Brownlee April 24, 2017 at 5:33 am #

      You can, but that is not really required. As long as the estimator is reasonably skillful on the problem, the selected features will be valuable.

      • Lehyu April 25, 2017 at 12:41 am #

        I was suck here for days. Thanks a lot.

        • Lehyu April 25, 2017 at 1:09 am #

          stuck…

        • Jason Brownlee April 25, 2017 at 7:49 am #

          I’m glad to hear the advice helped.

          I’m here to help if you get stuck again, just post your questions.
