Feature Selection in Python with Scikit-Learn

Not all data attributes are created equal. More is not always better when it comes to attributes or columns in your dataset.

In this post you will discover how to select attributes in your data before creating a machine learning model using the scikit-learn library.

Update: For a more recent tutorial on feature selection in Python see the post:


Cut Down on Your Options with Feature Selection
Photo by Josh Friedman, some rights reserved

Select Features

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Having too many irrelevant features in your data can decrease the accuracy of the models. Three benefits of performing feature selection before modeling your data are:

  • Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
  • Improves Accuracy: Less misleading data means modeling accuracy improves.
  • Reduces Training Time: Less data means that algorithms train faster.

Two different feature selection methods provided by the scikit-learn Python library are Recursive Feature Elimination and feature importance ranking.


Recursive Feature Elimination

The Recursive Feature Elimination (RFE) method is a feature selection approach. It works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.

This recipe shows the use of RFE on the iris flowers dataset to select 3 attributes.
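
A minimal sketch of such a recipe might look like the following; LogisticRegression is an assumed choice of wrapped estimator, and any estimator that exposes coef_ or feature_importances_ would work.

# Recursive Feature Elimination (RFE) on the iris dataset, keeping 3 of the 4 attributes
# (sketch only: the wrapped estimator is an assumption, not a prescribed choice)
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# load the iris dataset
X, y = load_iris(return_X_y=True)

# create the RFE selector around a base model and fit it to the data
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# summarize which attributes were selected
print("Selected features:", rfe.support_)
print("Feature ranking:", rfe.ranking_)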

For more information see the RFE method in the API documentation.

Feature Importance

Methods that use ensembles of decision trees (like Random Forest or Extra Trees) can also compute the relative importance of each attribute. These importance values can be used to inform a feature selection process.

This recipe shows the construction of an Extra Trees ensemble on the iris flowers dataset and the display of the relative feature importances.
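
A minimal sketch of this recipe might look like the following; the number of trees and the random seed are arbitrary illustrative choices.

# Feature importance from an Extra Trees ensemble on the iris dataset
# (sketch only: n_estimators and random_state are arbitrary choices)
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

# load the iris dataset
X, y = load_iris(return_X_y=True)

# fit an Extra Trees ensemble to the data
model = ExtraTreesClassifier(n_estimators=100, random_state=7)
model.fit(X, y)

# display the relative importance of each attribute
print(model.feature_importances_)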

For more information, see the ExtraTreesClassifier method in the API documentation.

Summary

Feature selection methods can give you useful information on the relative importance or relevance of features for a given problem. You can use this information to create filtered versions of your dataset and increase the accuracy of your models.

In this post you discovered two feature selection methods you can apply in Python using the scikit-learn library.


59 Responses to Feature Selection in Python with Scikit-Learn

  1. Harsh October 9, 2014 at 4:51 pm #

    Nice post. How are RFE and feature selection methods like chi2 different? I mean, in the end they achieve the same goal, right?

    • jasonb October 10, 2014 at 6:52 am #

      Both seek to reduce the number of features, but they do so using different methods. Chi-squared is a univariate statistical measure that can be used to rank features, whereas RFE tests different subsets of features.
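
      For example, a rough sketch of the difference on the iris data (LogisticRegression here is just an assumed wrapped estimator):

      # univariate chi-squared ranking vs. RFE subset search (sketch)
      from sklearn.datasets import load_iris
      from sklearn.feature_selection import SelectKBest, chi2, RFE
      from sklearn.linear_model import LogisticRegression

      X, y = load_iris(return_X_y=True)

      # chi2 scores each feature independently against the target
      print(SelectKBest(score_func=chi2, k=2).fit(X, y).scores_)

      # RFE recursively drops the weakest feature according to the wrapped model
      print(RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y).ranking_)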

      • Enny November 29, 2018 at 8:04 am #

        Are there any benchmarks, for example p-value, F-score, or R-squared, that can be used to score the importance of features?

        • Jason Brownlee November 29, 2018 at 2:33 pm #

          No, the scores are relative and specific to a given problem.

    • mitillo September 2, 2017 at 7:27 pm #

      Hello,

      I read and view a lot about machine learning, but you are amazing.
      You are able to explain everything in a simple way and write code that everyone can understand and ‘play’ with, and you give good resources for anyone who wants to go deeper into the topic.

      You are a good teacher.

      Thank you for your work.

  2. Bozhidar June 26, 2015 at 11:04 pm #

    Hello,

    Can you tell me which feature selection methods you suggest for time-series data?

    • Alex January 19, 2017 at 8:55 am #

      Please see tsfresh – it’s a new approach to feature selection designed for time series data.

  3. Max January 30, 2016 at 7:22 pm #

    Great site Jason!

  4. Alan February 24, 2016 at 9:48 am #

    Thanks for that good post. Just wondering whether RFE is also usable for linear regression? How is the model accuracy measured?

  5. Carmen January 4, 2017 at 1:31 am #

    Jason, quick question that may help someone else stumbling across this post.

    The example above does RFE using an untuned model. When would it, or would it not, make sense to find some optimised hyperparameters of the model using grid search *first*, and THEN do RFE? In your experience, is this a good idea/helpful thing to do? If not, then why?

    • Jason Brownlee January 4, 2017 at 8:58 am #

      Hi Carmen, nice catch.

      Short answer: we are interested in relative difference of feature subsets, not absolute best performance.

      Generally, it is a good idea to use a robust method for feature selection – that is, a method that performs well on most problems with little or no tuning. This provides a baseline, and a wrapper method like RFE can focus on the relative difference in the feature subsets rather than on the optimized best performance of each subset.

      There are those cases where your general method (say a random forest) falls down. In those cases, you may want to try RFE with a suite of 3-5 different wrapped methods and see what falls out. I expect that this is overkill on most problems.

      Does that help?

  6. Carmen January 6, 2017 at 7:58 pm #

    Thanks that helps. The only reason I’d mentioned tuning a model first (light tuning) is that, as you mentioned in your “spot checking” post, you want to give algorithms a chance to put their best foot forward. If that applies there, I don’t see why it shouldn’t apply to RFE.

    So I figured light tuning (only on the most common hyperparameter with the most common grid values) may help here. But I see your point. Once I’ve got my code all sorted out I may try both and report back 🙂

    • Jason Brownlee January 7, 2017 at 8:30 am #

      You’re absolutely right Carmen.

      There is a cost/benefit here and ultimately it will come down to experience and the “taste” of the practitioner.

      In fact, much of industrial machine learning comes down to taste 🙂
      Most top methods perform just as well, say at the 90-95% effort-result level. The really hard work is trying to get above that; Kaggle comps are a good case in point.

  7. akram June 13, 2017 at 3:38 am #

    thanks so much for your post Jason

    I’m a beginner in scikit-learn and I have a little problem when using the feature selection module VarianceThreshold. The problem is when I set the variance Var[X]=.8*(1-.8):

    It is supposed to remove all features (that have the same value in all samples) which have a probability p>0.8.
    In my case the fifth column should be removed, since p=8/10 > (threshold=0.7).

    #####################################

    from sklearn.feature_selection import VarianceThreshold
    X=[[0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
    [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
    [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
    [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
    [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00],
    [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.01,0.00,0.00,0.00,0.00,0.00],
    [0,1,2,1,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0.00,0.00,0.00,0.00,0.50,1.00,0.00,10,3,0.30,0.30,0.30,0.00,0.00,0.00,0.00,0.00],
    [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
    [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
    [0,2,3,1,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,4,0.00,0.00,0.00,0.00,1.00,0.00,0.00,71,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00]]
    sel=VarianceThreshold(threshold=(.7*(1-.7)))

    and this is what i get when running the script

    >>> sel.fit_transform(X)

    array([[ 1., 105., 146., 1., 1., 255., 254.],
    [ 1., 105., 146., 1., 1., 255., 254.],
    [ 1., 105., 146., 1., 1., 255., 254.],
    [ 1., 105., 146., 2., 2., 255., 254.],
    [ 1., 105., 146., 2., 2., 255., 254.],
    [ 1., 105., 146., 2., 2., 255., 255.],
    [ 2., 29., 0., 2., 1., 10., 3.],
    [ 1., 105., 146., 1., 1., 255., 253.],
    [ 1., 105., 146., 2., 2., 255., 254.],
    [ 3., 223., 185., 4., 4., 71., 255.]])
    The second column here should not appear.
    thanks;)

    • Jason Brownlee June 13, 2017 at 8:24 am #

      It is not clear to me what the fault could be. Consider posting to stackoverflow or similar?

  8. Ishaan July 4, 2017 at 10:12 pm #

    Hi Jason,

    I am performing feature selection (on a dataset with 100,000 rows and 32 features) using multinomial logistic regression in Python. Now, what would be the most efficient way to select features in order to build a model for a multiclass target variable (1,2,3,4,5,6,7,8,9,10)? I have used RFE for feature selection but it gives Rank=1 to all features. Do I consider all features for building the model? Is there any other method for this?
    Thanks in advance.

    • Jason Brownlee July 6, 2017 at 10:15 am #

      Try a suite of methods, build models based on the features and compare the performance of those models.

  9. Hemalatha S November 17, 2017 at 6:50 pm #

    can you tell me how to select features for clinical datasets from a csv file??

    • Jason Brownlee November 18, 2017 at 10:13 am #

      Try a suite of feature selection methods, build models based on selected features, use the set of features + model that results in the best model skill.

  10. Sufian November 26, 2017 at 4:35 am #

    Hi Jason, How can I print the feature name and the importance side by side?

    Thanks,
    Sufian

    • Jason Brownlee November 26, 2017 at 7:35 am #

      Yes, if you have an array of feature or column names you can use the same index into both arrays.
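
      Something like this rough sketch, assuming a fitted ensemble called model and a pandas DataFrame df holding the input columns:

      # pair each column name with its importance score (sketch; `model` and `df` are assumed)
      for name, score in zip(df.columns, model.feature_importances_):
          print(name, score)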

  11. Hemalatha December 1, 2017 at 2:03 am #

    What are the feature selection methods? And how do I build models based on the selected features?
    Can you help me with this? I am new to machine learning and Python.

  12. Praveen January 2, 2018 at 6:42 pm #

    I want to remove columns which are highly correlated, like the caret package’s preprocessing method does in R. How can I remove them using sklearn?

    • Jason Brownlee January 3, 2018 at 5:32 am #

      You might need to implement it yourself – e.g. calculate the correlation matrix and remove selected columns.
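
      For example, a rough sketch with pandas (the 0.95 threshold is arbitrary and df is an assumed DataFrame of input columns):

      # drop one column from each highly correlated pair (sketch)
      import numpy as np

      corr = df.corr().abs()
      upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
      to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
      df_reduced = df.drop(columns=to_drop)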

  13. Shabnam January 5, 2018 at 8:15 am #

    Does Keras have similar functionality to RFE that we can use?

    I am using Keras for my models. I created a model. Then, I wanted to use RFE for it. The first line (rfe=RFE(model, 3)) is fine, but as soon as I want to fit the data, I get the following error:

    TypeError: Cannot clone object ” (type ): it does not seem to be a scikit-learn estimator as it does not implement a ‘get_params’ methods.

  14. Smitha January 16, 2018 at 12:33 am #

    Hi Jason,

    Can Random Forest’s feature importance be considered as a wrapper based approach?

  15. Beytullah January 20, 2018 at 9:40 pm #

    Hi Jason,

    Do you know how is feature importance calculated?

  16. Fawad January 26, 2018 at 4:52 pm #

    I feel that in recursive feature selection it is more prudent to use CV and let the algorithm decide how many features to retain.
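
    e.g. something like this rough sketch with RFECV (LogisticRegression is just a placeholder estimator):

    # let cross-validation choose how many features to keep (sketch)
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    selector = RFECV(LogisticRegression(max_iter=1000), cv=5)
    selector.fit(X, y)
    print("Optimal number of features:", selector.n_features_)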

    • Jason Brownlee January 27, 2018 at 5:54 am #

      Yes. I often keep all features and use subspaces or ensembles of feature selection methods.

  17. kumar February 26, 2018 at 4:19 pm #

    I need to select the best features from my own data set using a feature selection wrapper approach; the learning algorithm is ant colony optimization and the classifier is SVM. Does anyone have any idea?

  18. Kagne March 23, 2018 at 8:30 pm #

    Nice post!

    But I still have a question.

    I entered the Kaggle competition recently, and I evaluated my dataset by using the methods you have posted (the model is RandomForest).

    Then I deleted the worst feature. And my score decreased from 0.79904 to 0.78947. Then I was confused. Should I build more features? And what should I do to get a higher score (change the model? expand features or more?), or where can I learn those?

    Thanks a lot.

  19. Rimi March 29, 2018 at 7:38 pm #

    Hi Jason,

    I wanted to know if there are any existing Python library/libraries that can be used to rank all the features in a specific dataset based on a specific attribute for various methods like Gain Ratio, Information Gain, Chi2, rank correlation, linear correlation, symmetric uncertainty. If not, can you please provide some steps to proceed with the same?

    Thanks

    • Jason Brownlee March 30, 2018 at 6:35 am #

      Perhaps?

      Each method will have a different “view” on what is important in the data. You can test each view to see what is real/useful to developing a skilful model.

  20. Abbas April 11, 2018 at 11:48 pm #

    What about the feature importance attribute from the decision tree classifier? Could it be used for feature selection?
    http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

  21. Chris May 13, 2018 at 10:52 pm #

    Could this method be used to perform feature subset selection on groups of subsets that have to be considered together? For instance, after performing a FeatureHasher transformation you have a fixed length hash which takes up say 256 columns which have to be considered as a group. Do you have any resources for this case?

    • Jason Brownlee May 14, 2018 at 6:36 am #

      Perhaps. Try it. Sorry, I don’t have material on this topic. Try a search on scholar.google.com

  22. Aman May 18, 2018 at 5:15 am #

    Regarding the ensemble learning model, I used it to reduce the features. But how can I know how many features I need to select?

  23. Jeremy Dohmann July 14, 2018 at 9:12 am #

    How large can your feature set be before the efficacy of this algorithm breaks down?

    Or, because it uses subsets, does it return a reasonable feature ranking even if you fit over a large number of features?

    Thanks!

  24. Junaid July 22, 2018 at 12:50 pm #

    I am using the tree classifier on my dataset and it gives different values each time I run the script. Is this a problem? Or does it differ because of the different ways the features are linked by the tree?

  25. sajid nawaz October 15, 2018 at 2:15 am #

    Does anyone have Python code for feature selection for classification and regression analysis?

  26. hwanhee October 26, 2018 at 6:53 pm #

    Is there a way to find the best number of features for each data set?

    • Jason Brownlee October 27, 2018 at 5:57 am #

      Yes, try a suite of feature selection methods, and a suite of models and use the combination of features and model that give the best performance.

      • hwanhee October 27, 2018 at 12:06 pm #

        For example, which algorithm can find the optimal number of features?

      • hwanhee October 27, 2018 at 12:09 pm #

        For example, there are 500 features. Is there any way to know the number of features that show the highest classification accuracy when performing a feature selection algorithm?

        • Jason Brownlee October 28, 2018 at 6:06 am #

          Test different subsets of features by building a model from them and evaluate the performance of the model. The features that lead to a model with the best performance are the features that you should use.
