An Introduction to Feature Selection

Which features should you use to create a predictive model?

This is a difficult question that may require deep knowledge of the problem domain.

It is possible to automatically select those features in your data that are most useful or most relevant for the problem you are working on. This is a process called feature selection.

In this post you will discover feature selection, the types of methods that you can use and a handy checklist that you can follow the next time that you need to select features for a machine learning model.


An Introduction to Feature Selection
Photo by John Tann, some rights reserved

What is Feature Selection

Feature selection is also called variable selection or attribute selection.

It is the automatic selection of attributes in your data (such as columns in tabular data) that are most relevant to the predictive modeling problem you are working on.

feature selection… is the process of selecting a subset of relevant features for use in model construction

Feature Selection, Wikipedia entry.

Feature selection is different from dimensionality reduction. Both methods seek to reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them.

Examples of dimensionality reduction methods include Principal Component Analysis, Singular Value Decomposition and Sammon’s Mapping.
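To make the contrast concrete, here is a minimal dimensionality reduction sketch (scikit-learn is assumed): PCA builds new attributes as linear combinations of the originals rather than picking a subset of them.

```python
# PCA creates new features; no original column survives unchanged.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)

# Each of the 2 new columns mixes all 4 original measurements.
print(X.shape, "->", X_new.shape)  # (150, 4) -> (150, 2)
```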

Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren’t useful in addition to your existing features.

— Robert Neuhaus, in answer to “How valuable do you think feature selection is in machine learning?”

The Problem That Feature Selection Solves

Feature selection methods aid you in your mission to create an accurate predictive model. They help you by choosing features that will give you as good or better accuracy whilst requiring less data.

Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease the accuracy of the model.

Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.

— Guyon and Elisseeff in “An Introduction to Variable and Feature Selection” (PDF)

Feature Selection Algorithms

There are three general classes of feature selection algorithms: filter methods, wrapper methods and embedded methods.

Filter Methods

Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable.

Some examples of filter methods include the Chi-squared test, information gain and correlation coefficient scores.
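A minimal filter-method sketch using scikit-learn (assumed available): score each feature against the target with the chi-squared test and keep only the top k.

```python
# Filter method: rank features by a univariate statistic, keep the best k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
```

Note that chi2 requires non-negative feature values; for real-valued data with negative values, `f_classif` or mutual information scores are alternatives.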

Wrapper Methods

Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.

The search process may be methodical, such as a best-first search; it may be stochastic, such as a random hill-climbing algorithm; or it may use heuristics, like forward and backward passes to add and remove features.

An example of a wrapper method is the recursive feature elimination algorithm.

Embedded Methods

Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common type of embedded feature selection methods are regularization methods.

Regularization methods, also called penalization methods, introduce additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients).

Examples of regularization algorithms are the LASSO, Elastic Net and Ridge Regression.
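A minimal embedded-method sketch (scikit-learn assumed): the L1 penalty used by the LASSO can shrink the coefficients of weak features to exactly zero, so selection happens as a side effect of training.

```python
# Embedded method: Lasso's L1 penalty zeroes out coefficients of
# features that contribute little, selecting features during fitting.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of non-zero coefficients
print("kept", len(selected), "of", X.shape[1], "features")
```

The `alpha` value here is illustrative; larger values give sparser models, and in practice it is tuned by cross-validation (e.g. `LassoCV`).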

Feature Selection Tutorials and Recipes

We have seen a number of examples of feature selection before on this blog.

A Trap When Selecting Features

Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget.

It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models which can result in overfitting.

… should do feature selection on a different dataset than you train [your predictive model] on … the effect of not doing this is you will overfit your training data.

— Ben Allison in answer to “Is using the same data for feature selection and cross-validation biased or not?”

For example, you must include feature selection within the inner loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.
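One way to get this right with scikit-learn (assumed here) is to wrap the selection step and the model in a Pipeline, so that selection is re-fit on the training portion of every cross-validation fold and never sees the held-out data.

```python
# Correct procedure: feature selection happens inside each CV fold,
# because the Pipeline is re-fit on each fold's training split.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("mean accuracy: %.3f" % scores.mean())
```

The mistake described above would be calling `SelectKBest(...).fit_transform(X, y)` on all of the data first and only then cross-validating the model on the reduced matrix.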

If we adopt the proper procedure, and perform feature selection in each fold, there is no longer any information about the held out cases in the choice of features used in that fold.

— Dikran Marsupial in answer to “Feature selection for final model when performing cross-validation in machine learning”

The reason is that the decisions made to select the features were made on the entire training set, and in turn are passed on to the model. This may cause a model that is enhanced by the selected features to get seemingly better results than the other models being tested, when in fact the result is biased.

If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure was also used to choose the features and this is what biases the performance analysis.

— Dikran Marsupial in answer to “Feature selection and cross-validation”

Feature Selection Checklist

Isabelle Guyon and Andre Elisseeff, the authors of “An Introduction to Variable and Feature Selection” (PDF), provide an excellent checklist that you can use the next time you need to select data features for your predictive modeling problem.

I have reproduced the salient parts of the checklist here:

  1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.
  2. Are your features commensurate? If no, consider normalizing them.
  3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
  4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.
  5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
  6. Do you need a predictor? If no, stop.
  7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.
  8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion or use the 0-norm embedded method for comparison, following the ranking of step 5, construct a sequence of predictors of same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
  9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.
  10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.

Further Reading

Do you need help with feature selection on a specific platform? Below are some tutorials that can get you started fast:

To go deeper into the topic, you could pick up a dedicated book on the topic, such as any of the following:

Feature Selection is a sub-topic of Feature Engineering. You might like to take a deeper look at feature engineering in the post:

112 Responses to An Introduction to Feature Selection

  1. Zvi Boger October 2, 2015 at 12:05 am #

    People can use my automatic feature dimension reduction algorithm published in:

    Z. Boger and H. Guterman, Knowledge extraction from artificial neural networks models. Proceedings of the IEEE International Conference on Systems Man and Cybernetics, SMC’97, Orlando, Florida, Oct. 1997, pp. 3030-3035.

    or contact me at to get a copy of the paper.

    The algorithm analyzes the “activities” of the trained model’s hidden neurons outputs. If a feature does not contribute to these activities, it is either “flat” in the data, or the connection weights assigned to it are too small.

    In both cases it can be safely discarded and the ANN retrained with the reduced dimensions.

  2. Joseph December 29, 2015 at 2:38 pm #

    Nice Post Jason, This is an eye opener for me and I have been looking for this for quite a while. But my challenge is quite different I think, my dataset is still in raw form and comprises different relational tables. How to select best features and how to form a new matrix for my predictive modelling are the major challenges I am facing.


    • Jason Brownlee December 29, 2015 at 4:12 pm #

      Thanks Joseph.

      I wonder if you might get more out of the post on feature engineering (linked above)?

  3. doug February 9, 2016 at 4:22 pm #

    very nice synthesis of some of the ‘primary sources’ out there (Guyon et al) on f/s.

  4. bura February 9, 2016 at 4:58 pm #

    Can we use selection teqnique for the best features in the dataset that is value numbering?

    • Jason Brownlee July 20, 2016 at 5:27 am #

      Hi bura, if you mean integer values, then yes you can.

  5. swati March 6, 2016 at 10:07 pm #

    how can chi statistics feature selection algorithm work in data reduction.

    • Jason Brownlee July 20, 2016 at 5:31 am #

      The calculated chi-squared statistic can be used within a filter selection method.

      • Poornima July 21, 2017 at 2:40 pm #

        Which is the best tool for chi square feature selection

        • Jason Brownlee July 22, 2017 at 8:30 am #

          It is supported on most platforms.

          • Poornima July 24, 2017 at 3:36 pm #

            Actually i want to apply Chi square to find the independence between two attributes to find the redundancy between the two. The tools supporting CHI square feature selection only compute the level of independence between the attribute and the class attribute. May question is…what is the exact formula to use Chi square to find the level of independence between two attributes….? PS: I cannot use an existing tool…thanks

          • Jason Brownlee July 25, 2017 at 9:32 am #

            Sorry, I don’t have the formula at hand. I’d recommend checking a good stats text or perhaps Wikipedia.

  6. Ralf May 2, 2016 at 5:56 pm #

    which category does Random Forest’s feature importance criterion belong as a feature selection technique?

    • Jason Brownlee July 20, 2016 at 5:29 am #

      Great question Ralf.

      Relative feature importance scores from RandomForest and Gradient Boosting can be used within a filter method.

      If the scores are normalized between 0-1, a cut-off can be specified for the importance scores when filtering.

  7. swati June 23, 2016 at 10:58 pm #

    Is the CHI feature selection algorithm NP-hard or NP-complete?

  8. Mohammed AbdelAal June 26, 2016 at 9:53 pm #

    Hi all,
    Thanks Jason Brownlee for this wonderful article.

    I have a small question. While performing feature selection inside the inner loop of cross-validation, what if the feature selection method selects NO features?. Do I have to pass all features to the classifier or What??

    • Jason Brownlee June 27, 2016 at 5:42 am #

      Good question. If this happens, you will need to have a strategy. Selecting all features sounds like a good one to me.

  9. Dado July 19, 2016 at 10:20 pm #

    Hello Jason!

    Great site and great article. I’m confused about how the feature selection methods are categorized though:

    Do filter methods always perform ranking? Is it not possible for them to use some sort of subset search strategy such as ‘sequential forward selection’ or ‘best first’?’

    Is it not possible for wrapper or embedded methods to perform ranking? For example when I select ‘Linear SVM’ or “LASSO” as the estimator in sklearns ‘SelectFromModel’-function, it seems to me that it examines each feature individually. The documentation doesn’t mention anything about a search strategy.

    • Jason Brownlee July 20, 2016 at 5:34 am #

      Good question Dado.

      Feature subsets can be created and evaluated using a technique in the wrapper method, this would not be a filter method.

      You can use an embedded method within a wrapper method, but I expect the results would be less insightful.

  10. Youssef August 9, 2016 at 7:09 pm #

    Hi, thx all or your sharing
    I had a quation about the limitation of these methods in terms of number of features. In my scope we work on small sample size (n=20 to 40) with a lot of features (up to 50)
    some people suggested to do all combinations to get high performence in terms of prediction.
    what do you think?

    • Jason Brownlee August 15, 2016 at 11:14 am #

      I think try lots of techniques and see what works best for your problem.

  11. Jarking August 9, 2016 at 9:28 pm #

    hi,I’m now learning feature selection with hierarchical harmony search.but I don’t know how to
    begin with it?could you give me some ideas?

    • Jason Brownlee August 15, 2016 at 11:15 am #

      Consider starting with some off the shelf techniques first.

  12. L K September 3, 2016 at 3:06 am #

    i want to use feature extractor for detecting metals in food products through features such as amplitude and phase. Which algorithm or filter will be best suited?

  13. laxmi k September 3, 2016 at 2:05 pm #

    I want it in matlab.

  14. Jaggi September 20, 2016 at 5:53 am #

    Hello Jason,

    As per my understanding, when we speak of ‘dimensions’ we are actually referring to features or attributes. Curse of dimensionality is sort of sin where dimensions are too much, may be in tens of thousand and algorithms are not robust enough to handle such high dimensionality i.e. feature space.

    To reduce the dimension or features, we use algorithm such as Principle Component Analysis. It creates a combination of existing features which try to explain maximum of variance.

    Question: Since, these components are created using existing features and no feature is removed, then how complexity is reduced ? How it is beneficially?
    Say, there are 10000 features, and each component i.e. PC1 will be created using these 10000 features. Features didn’t reduced rather a mathematical combination of these features is created.

    Without PCA: GoodBye ~ 1*WorkDone + 1*Meeting + 1*MileStoneCompleted
    With PCA: Goodbye ~ PC1
    PC1=0.7*WorkDone + 0.2*Meeting +0.4*MileStoneCompleted

    • Jason Brownlee September 20, 2016 at 8:37 am #

      Yes Jaggi, features are dimensions.

      We are compressing the feature space, and some information (that we think we don’t need) is/may be lost.

      You do have an interesting point from a linalg perspective, but the ML algorithms are naive in feature space, generally. Deep learning may be different on the other hand, with feature learning. The hidden layers may be doing a PCA-like thing before getting to work.

  15. sai November 13, 2016 at 11:43 pm #

    Is there any Scope for pursuing PhD in feature selection?

    • Jason Brownlee November 14, 2016 at 7:43 am #

      There may be Sai, I would suggest talking to your advisor.

  16. Poornima December 6, 2016 at 6:29 pm #

    What would be the best strategy for feature selection in case of text mining or sentiment analysis to be more specific. The size of feature vector is around 28,000!

    • Jason Brownlee December 7, 2016 at 8:55 am #

      Sorry Poornima, I don’t know. I have not done my homework on feature selection in NLP.

  17. Lekan December 22, 2016 at 6:31 am #

    How many variables or features can we use in feature selection. I am working on features selection using Cuckoo Search algorithm on predicting students academic performance. Kindly assist pls sir.

    • Jason Brownlee December 22, 2016 at 6:39 am #

      There are no limits beyond your hardware or those of your tools.

  18. Arun January 11, 2017 at 2:21 am #

    can you give some java example code for feature selection using forest optimization algorithm

  19. Amina February 17, 2017 at 4:07 am #

    Pls is comprehensive measure feature selection also part of the methods of feature selection?

    • Jason Brownlee February 17, 2017 at 10:01 am #

      Hi Amina, I’ve not heard of “comprehensive measure feature selection” but it sounds like a feature selection method.

  20. Birendra February 28, 2017 at 10:06 pm #

    Hi Jason,
    I am new to Machine learning. I applied in sklearn RFE to SVM non linear kernels.
    It’s giving me error. Is there any way to reduce features in datasets.

    • Jason Brownlee March 1, 2017 at 8:37 am #

      Yes, this post describes many ways to reduce the number of features in a dataset.

      What is your error exactly? What platform are you using?

  21. Abdel April 6, 2017 at 6:37 am #

    Hi Jason,

    what is the best method between all this methods in prediction problem ??

    is LASSO method great for this type of problem ?

    • Jason Brownlee April 9, 2017 at 2:37 pm #

      I would recommend you try a suite of methods and see what works best on your problem.

  22. Al April 26, 2017 at 6:05 pm #

    Fantastic article Jason, really appreciate this in helping to learn more about feature selection.

    If, for example, I have run the below code for feature selection:

    test = SelectKBest(score_func=chi2, k=4)
    fit =, y_train.ravel())

    How do I then feed this into my KNN model? Is it simply a case of:

    knn = KNeighborsClassifier() –is this where the feature selection comes in?
    KNeighborsClassifier(algorithm=’auto’, leaf_size=30, metric=’minkowski’,
    metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights=’uniform’)
    predicted = knn.predict(X_test)

  23. Nisha t m May 14, 2017 at 2:21 am #

    I have multiple data set. I want to perform LASSO regression for feature selection for each subset. How I get [0,1] vector set as output?

    • Jason Brownlee May 14, 2017 at 7:31 am #

      That really depends on your chosen library or platform.

  24. Simone May 30, 2017 at 6:51 pm #

    Great post!

    If I have well understood step n°8, it’ s a good procedure *first* applying a linear predictor, and then use a non-linear predictor with the features found before. Is it correct?

    • Jason Brownlee June 2, 2017 at 12:34 pm #

      Try linear and nonlinear algorithms on raw and selected features and double down on what works best.

  25. akram June 10, 2017 at 6:03 am #

    hello Jason Brownlee and thank you for this post,
    i’am working on intrusion detection systems IDS, and i want you to advice me about the best features selection algorithm and why?
    thanks in advance.

    • Jason Brownlee June 10, 2017 at 8:30 am #

      Sorry intrusion detection is not my area of expertise.

      I would recommend going through the literature and compiling a list of common features used.

  26. karthika July 28, 2017 at 6:43 pm #

    please tell me the evaluation metrics for feature selection algorithms

    • Jason Brownlee July 29, 2017 at 8:10 am #

      Ultimately the skill of the model in making predictions.

      That is the goal of our project after all!

  27. Swati July 31, 2017 at 4:19 am #


    I have a set of around 3 million features. I want to apply LASSO for feature selection/dimensionality reduction. How do I do that? I’m using MATLAB.

    • Swati July 31, 2017 at 4:23 am #

      When I use the LASSO function in MATLAB, I give X (mxn Feature matrix) and Y (nx1 corresponding responses) as inputs, I obtain an nxp matrix as output but I don’t know how to exactly utilise this output.

      • Jason Brownlee July 31, 2017 at 8:21 am #

        Sorry, I cannot help you with the matlab implementations.

    • Jason Brownlee July 31, 2017 at 8:20 am #

      Perhaps use an off-the-shelf efficient implementation rather than coding it yourself in matlab?

      Perhaps Vowpal Wabbit:

      • Swati July 31, 2017 at 3:20 pm #


  28. Elakkiya September 5, 2017 at 8:45 pm #

    I need your suggestion on something. just assume i have 3 feature set and three models. for example f1, f2,f3 set. Each set produce different different output result in percentage. i need to assign weight to rank the feature set. any mathematical way to assign weight to the feature set based on three models output?

    • Jason Brownlee September 7, 2017 at 12:44 pm #

      Yes, this is what linear machine learning algorithms do, like a regression algorithm.

      • Elakkiya September 8, 2017 at 3:39 pm #

        Thank you. Still its difficult to find how regression algorithm will useful to assign weight . Can you suggest any material or link to read…

  29. Marie J October 4, 2017 at 5:18 am #

    Hi Jason! Thank you for your articles – you’ve been teaching me a lot over the past few weeks. 🙂

    Quick question – what is your experience with the best sample size to train the model? I have 290 features and about 500 rows in my dataset. Would this be considered adequate? Or is the rule of thumb to just try and see how well it performs?

    Many thanks!

  30. gene October 18, 2017 at 6:02 pm #

    Hello Jason,

    I am still confused about your point regarding the feature selection integration with model selection. From the first link you suggested, the advice was to take out a portion of the training set to do feature selection on. Next, start model selection on the remaining data in the training set.

  31. gene November 12, 2017 at 8:32 am #

    Hello again!
    my feature space is over 8000 attributes. When applying RFE, how can I select the right number of feature? By constructing multiple classfiers (NB, SVM, DT) each of which returns different results.

    • Jason Brownlee November 12, 2017 at 9:11 am #

      There is no one best set or no one best model, just many options for you to compare. That is the job of applied ML.

      Try building a model with each set and see which is more skillful on unseen data.

      • gene November 12, 2017 at 7:44 pm #

        I want to publish my results. Is it ok to report that for each model I used a different feature set with a different number of top features?

        • Jason Brownlee November 13, 2017 at 10:13 am #

          When reporting results, you should provide enough information so that someone else can reproduce your results.

  32. gene November 13, 2017 at 6:44 pm #

    Yes I understand that, but I meant does that outcome look reasonable?

  33. Hardik Sahi January 8, 2018 at 12:12 pm #

    I am getting a bit confused in the section of applying feature selection in cross validation step.
    I understand that we should perform feature selection on a different dataset [let’s call it FS set ] than the dataset we use to train the model [call it train set].

    I understand the following steps:
    1) perform Feature Selection on FS set.
    2) Use above selected features on the training set and fit the desired model like logistic regression model.
    3) Now, we want to evaluate the performance of the above fitted model on unseen data [out-of-sample data, hence perform CV]

    For each fold in CV phase, we have trainSet and ValidSet. Now we have to again perform feature selection for each fold [& get the features which may/ may not be same as features selected in step 1]. For this, I again have to perform Feature selection on a dataset different from the trainSet and ValidSet.

    This is performed for all the k folds and the accuracy is averaged out to get the out-of-sample accuracy for the model predicted in step 2.

    I have doubts in regards to how is the out-of-sample accuracy (from CV) an indicator of generalization accuracy of model in step 2. Clearly the feature sets used in both steps will be different.

    Also, once I have a model from Step 2 with m<p features selected. How will I test it on completely new data [TestData]? (TestData is having p features and the model is trained on data with m features. What happens to the remaining p-m features??)


    • Jason Brownlee January 8, 2018 at 3:54 pm #

      A simple approach is to use the training data for feature selection.

      I would suggest splitting the training data into train and validation. Perform feature selection within each CV fold (automatically).

      Once you pick a final model+procedure, fit on the training dataset and use the validation dataset as a sanity check.

  34. Molood March 9, 2018 at 8:07 am #

    Thank you Jason for your article, it was so helpful! I’m working on a set of data which I should to find a business policy among the variables. are any of these methods which you mentioned unsupervised? there’s no target label for my dataset. and if there is unsupervised machine learning method, do you know any ready code in github or in any repository for it?

  35. Rag March 15, 2018 at 5:24 pm #

    Sir, Is there any method to find the feature important measures for the neural network?

    • Jason Brownlee March 16, 2018 at 6:10 am #

      There may be, I am not across them sorry. Try searching on google scholar.

  36. Yosra March 22, 2018 at 4:35 am #

    Thank you for the helpful introduction. However, do you have any code using particle swar optmization for features selection ?

  37. Satya March 28, 2018 at 1:20 am #

    Good Morning Jason,
    A very nice article. I have a quick question. Please help me out. I am using the R code for Gradient Descent available on internet. The data set ‘women’ is one of the R data sets which has just 15 data points. Here is how I am calling the gradient descent.

    n = length(women[,1])
    x = women[,1]
    y = women[,2]

    gradientDesc(x, y, 0.00045, 0.0000001, n, 25000000)

    It takes these many iteration to converge and that small learning rate. It is not converging for any higher learning rates. Also I tried using the feature scaling (single feature) as follows but it did not help also – scaling may not be really applicable in case of a single freature

    x = (women[,1] – mean(women[,1]))/max(women[,1])

    Any help is highly appreciated


    • Jason Brownlee March 28, 2018 at 6:28 am #

      Perhaps ask the person who wrote the code about how it works?

  38. Satya March 28, 2018 at 1:22 am #

    By the way 0.00045 is the learning rate and 0.0000001 is the threshold

  39. Sara April 21, 2018 at 12:57 pm #

    Is it correct to say that PCA is not only a dimension reduction approach but also a feature reduction process too as in PCA, feature with lower loading should be excluded from the components?

    • Jason Brownlee April 22, 2018 at 5:56 am #

      Yes, we can treat dimensionality reduction and feature reduction as synonyms.

  40. Sarah June 11, 2018 at 6:39 am #

    Hi Jason,

    Great and useful article – thank you!!

    So I’ve been performing elastic net and gradient boosting machine analyses on my data. Those are methods of feature selection, correct? So, would it be advisable to choose the significant or most influential predictors and include those as the only predictors in a new elastic net or gradient boosting model?

    • Sarah June 11, 2018 at 6:57 am #

      Also, glmnet is finding far fewer significant features than is gbm. Should I just rely on the more conservative glmnet? Thank you!

      • Jason Brownlee June 11, 2018 at 1:50 pm #

        Simpler models are preferred in general. They are easier to understand, explain and often less likely to overfit.

    • Jason Brownlee June 11, 2018 at 1:49 pm #

      Generally, I recommend testing a suite of methods on your problem in order to discover what works best.

      • Sarah Paul June 12, 2018 at 8:59 am #

        Thank you for your answer!
        But, should I use the most influential predictors (as found via glmnet or gbm. etc.) as the only predictors in a new glmnet or gbm (or decision tree, random forest, etc.) model? That doesn’t seem to improve accuracy for me. And/or, is it advisable to use them as input in a non-machine learning statistical analysis (e.g., multinomial regression)?
        Thank you again!

        • Jason Brownlee June 12, 2018 at 2:26 pm #

          Ideally, you only want to use the variables required to make a skilful model.

          Try the suggested parameters and compare the skill of a fit model to a model trained on all parameters.

  41. Pratik June 14, 2018 at 4:59 pm #

    Hi Jason thanks for a wonderful article!!

    I need to find the correlation between specific set of features and class label. Is it possible to find the correlation of all the features with respect to only class label?

  42. Anthony The Koala June 27, 2018 at 8:46 pm #

    Dear Dr Jason,
    Could you please make the distinction between feature selection (to reduce factors) for predictive modelling and pruning convolutional neural networks (CNN) to improve execution and computation speed please.
    Thank you,
    Anthony of Sydney

    • Jason Brownlee June 28, 2018 at 6:18 am #

      What is “pruning CNNs”?

      • Anthony The Koala June 28, 2018 at 11:50 am #

        As I understand, pruning CNNs or pruning convolutional neural networks is a method of reducing the size of a CNN to make the CNN smaller and fast to compute. The idea behind pruning a CNN is to remove nodes which contribute little to the final CNN output. Each node is assigned a weight and ranked.

        Those nodes with little weight are eliminated. The result of a smaller CNN is faster computation. However there is a tradeoff in accuracy of matching an actual object to the stored CNN. Software and papers indicate that there is not one method of pruning:

        Eg 1

        Eg 2 an implementation in keras,

        Eg 3 a paper it is not the only paper on pruning.

        What is the corollary of pruning CNNs and your (this) article? Answer: pruning CNNs involve removing redundant nodes of a CNN while pruning variables in a model as in Weka reduces the number of variables in a model you wish to predict.

        Yes they are completely different topics, but the idea is (i) reduce computation, (ii) parsimony.

        Thank you,
        Anthony of Sydney

        • Jason Brownlee June 28, 2018 at 2:10 pm #

          I see, like classical neural network pruning from the ’90s.

          Pruning operates on the learned model, in whatever shape or form. Feature selection operates on the input to the model.

          That is the difference, model and input data.

          They are reducing complexity but in different ways, ‘a mapping that has already been learned’ and ‘what could be mapped’ to an output. One fiddles in a small area of hypothesis space for the mapping, the other limits the hypothesis space that is being searched.

  43. Ellen July 17, 2018 at 3:36 am #

    Hi, Jason. I find your articles really helpful. I have a problem that’s highly related to feature selection, but not the same. So there are correlations between my features which I think if I apply standard feature selection methods, would result in that some features get small weights/importance because they are correlated with a chosen one and thus seem redundant. But I was wondering if you have suggestions for methods that do take into account of feature correlation and assign relatively equal weights/importance to features that are highly correlated?

    • Jason Brownlee July 17, 2018 at 6:20 am #

      Ensembles of decision trees are good at handing irrelevant features, e.g. gradient boosting machines and random forest.

  44. Chris August 7, 2018 at 10:26 am #

    Nice write up. What I’ve found is that the most important features (Boruta and Recursive feature elimination) in my data tend to have the lowest correlation coefficients, and vice versa. Can you shed some light on what’s going on?

    • Jason Brownlee August 7, 2018 at 2:31 pm #


      It’s hard to tell, perhaps a quirk of your dataset?

  45. Harsh August 16, 2018 at 2:51 pm #

    nice article , really very helpful

    you have written inadvertently introduce bias into your models which can result in overfitting. but high bais model mainly underfit the traning data

    please correct me if i am worng

    • Jason Brownlee August 17, 2018 at 6:24 am #

      Can you elaborate on what I have “inadvertently written”?

  46. Guru October 18, 2018 at 11:37 pm #

    Hi Jason, I have one query regarding the below statement

    “It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models which can result in overfitting.”

    If we have the bias in our model then it should underfits, just trying to understand the above statement how does bias results in overfitting.

    • Jason Brownlee October 19, 2018 at 6:05 am #

      No, a bias can also lead to an overfit. A bias is like a limit on variance, in either a helpful or hurtful direction.

      Using the test set to train a model as well as the training dataset is a helpful bias that will make your model perform better, but any evaluation on the test set less useful – an extreme example of data leakage. More here:

  47. Yoshitaka November 21, 2018 at 5:38 pm #

    How do you determine the cut off value when using the feature selection from RandomForest width Scikit-learn and XGBoost’s feature importance methods?

    I just choose by heuristic, just feeling.
    I thought using grid search or some other optimized methods are better.

    • Jason Brownlee November 22, 2018 at 6:22 am #

      Trial and error and go with the cut-off that results in the most skillful model.
