Feature Importance and Feature Selection With XGBoost in Python

A benefit of using ensembles of decision tree methods like gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model.

In this post you will discover how you can estimate the importance of features for a predictive modeling problem using the XGBoost library in Python.

After reading this post you will know:

  • How feature importance is calculated using the gradient boosting algorithm.
  • How to plot feature importance in Python calculated by the XGBoost model.
  • How to use feature importance calculated by XGBoost to perform feature selection.

Let’s get started.

  • Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.
  • Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.
Photo by Keith Roper, some rights reserved.


Feature Importance in Gradient Boosting

A benefit of using gradient boosting is that after the boosted trees are constructed, it is relatively straightforward to retrieve importance scores for each attribute.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points or another more specific error function.

The feature importances are then averaged across all of the decision trees within the model.

For more technical information on how feature importance is calculated in boosted decision trees, see Section 10.13.1 “Relative Importance of Predictor Variables” of the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction, page 367.

Also, see Matthew Drury’s answer to the StackOverflow question “Relative variable importance for Boosting”, where he provides a very detailed and practical answer.

Manually Plot Feature Importance

A trained XGBoost model automatically calculates feature importance on your predictive modeling problem.

These importance scores are available in the feature_importances_ member variable of the trained model. For example, they can be printed directly as follows:
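A minimal sketch, assuming model is an XGBClassifier that has already been fit on the training data:

# print one importance score per input feature (assumes "model" has already been fit)
print(model.feature_importances_)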

We can plot these scores on a bar chart directly to get a visual indication of the relative importance of each feature in the dataset. For example:
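A sketch using matplotlib, again assuming a fitted model:

# plot the importance scores as a bar chart, one bar per input feature
from matplotlib import pyplot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()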

We can demonstrate this by training an XGBoost model on the Pima Indians onset of diabetes dataset and creating a bar chart from the calculated feature importances (update: download from here).
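A sketch of such a listing, assuming the dataset has been saved locally as pima-indians-diabetes.csv (the file name is an assumption):

# train an XGBoost model on the Pima Indians dataset and plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot
# load the data and split into input (X) and output (y) columns
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
# fit the model on all of the data
model = XGBClassifier()
model.fit(X, y)
# print the importance scores
print(model.feature_importances_)
# plot the scores in input-column order
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()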

Running this example first outputs the importance scores:

We also get a bar chart of the relative importances.

Manual Bar Chart of XGBoost Feature Importance

A downside of this plot is that the features are ordered by their input index rather than their importance. We could sort the features before plotting.
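For example, a minimal sketch of sorting the scores manually before plotting (assuming the fitted model from above):

# sort the importance scores, keeping track of the original column indices
from numpy import argsort
from matplotlib import pyplot
importances = model.feature_importances_
order = argsort(importances)  # feature indices, least to most important
pyplot.bar(range(len(order)), importances[order])
pyplot.xticks(range(len(order)), order)  # label each bar with its original column index
pyplot.show()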

Thankfully, there is a built-in plot function to help us.

Using the Built-in XGBoost Feature Importance Plot

The XGBoost library provides a built-in function to plot features ordered by their importance.

The function is called plot_importance() and can be used as follows:
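A minimal usage sketch, assuming model is an already-fitted XGBClassifier:

# plot features ordered by their importance using the built-in helper
from xgboost import plot_importance
from matplotlib import pyplot
plot_importance(model)
pyplot.show()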

For example, below is a complete code listing plotting the feature importance for the Pima Indians dataset using the built-in plot_importance() function.
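A sketch of such a listing, with the same assumption about the local CSV file name as before:

# plot feature importance for the Pima Indians dataset using plot_importance()
from numpy import loadtxt
from xgboost import XGBClassifier, plot_importance
from matplotlib import pyplot
# load the data and split into input (X) and output (y) columns
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
# fit the model on all of the data
model = XGBClassifier()
model.fit(X, y)
# plot the features ordered by importance
plot_importance(model)
pyplot.show()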

Running the example gives us a more useful bar chart.

XGBoost Feature Importance Bar Chart

You can see that features are automatically named according to their index in the input array (X) from F0 to F7.

Manually mapping these indices to names in the problem description, we can see that the plot shows F5 (body mass index) has the highest importance and F3 (skin fold thickness) has the lowest importance.

Feature Selection with XGBoost Feature Importance Scores

Feature importance scores can be used for feature selection in scikit-learn.

This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.

This class can take a pre-trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the transform() method on the SelectFromModel instance to consistently select the same features on the training dataset and the test dataset.

In the example below, we first train an XGBoost model on the entire training dataset and evaluate it on the test dataset.

Using the feature importances calculated from the training dataset, we then wrap the model in a SelectFromModel instance. We use this to select features on the training dataset, train a model on the selected subset of features, and then evaluate that model on the test set, subject to the same feature selection scheme.

For example:
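A minimal sketch, assuming the same local CSV file and an arbitrary example threshold of 0.15:

# select features by importance with SelectFromModel, then retrain and evaluate
from numpy import loadtxt
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
# load the data and split into train and test sets
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
# fit a model on all training features and evaluate it on the test set
model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Accuracy: %.2f%%" % (accuracy_score(y_test, predictions) * 100.0))
# keep only features whose importance is at least the threshold (0.15 here is arbitrary)
selection = SelectFromModel(model, threshold=0.15, prefit=True)
select_X_train = selection.transform(X_train)
# train a new model on the selected subset of features
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# apply the same selection to the test set and evaluate
select_X_test = selection.transform(X_test)
selection_predictions = selection_model.predict(select_X_test)
print("n=%d, Accuracy: %.2f%%" % (select_X_train.shape[1], accuracy_score(y_test, selection_predictions) * 100.0))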

For interest, we can test multiple thresholds for selecting features by feature importance. Specifically, we can use the importance score of each input variable in turn as the threshold, essentially allowing us to test each subset of features by importance, starting with all features and ending with the subset containing only the most important feature.

The complete code listing is provided below.
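A sketch of the full loop, with the same assumptions as above (local CSV file, fixed train/test split):

# evaluate model accuracy for each possible feature-importance threshold
from numpy import loadtxt, sort
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
# load the data and split into train and test sets
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
# fit the model on all training features
model = XGBClassifier()
model.fit(X_train, y_train)
# use each feature's importance as a selection threshold, from smallest to largest
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features with importance >= thresh
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train and evaluate a model on the selected features only
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))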

Running this example prints the following output:

We can see that the performance of the model generally decreases as the number of selected features decreases.

On this problem there is a trade-off between the number of features and test set accuracy, and we could decide to take a less complex model (fewer attributes, such as n=4) and accept a modest decrease in estimated accuracy from 77.95% down to 76.38%.

This is likely to be a wash on such a small dataset, but may be a more useful strategy on a larger dataset when using cross-validation as the model evaluation scheme.

Summary

In this post you discovered how to access and use feature importance in a trained XGBoost gradient boosting model.

Specifically, you learned:

  • What feature importance is and generally how it is calculated in XGBoost.
  • How to access and plot feature importance scores from an XGBoost model.
  • How to use feature importance from an XGBoost model for feature selection.

Do you have any questions about feature importance in XGBoost or about this post? Ask your questions in the comments and I will do my best to answer them.




46 Responses to Feature Importance and Feature Selection With XGBoost in Python

  1. Trupti December 9, 2016 at 5:23 pm #

    Hi. I am running “select_X_train = selection.transform(X_train)” where x_train is the data with dependent variables in few rows.

    The error I am getting is “select_X_train = selection.transform(X_train)”

    Request your help.

    Thanks!

  2. Trupti December 9, 2016 at 5:28 pm #

    sorry the error is “TypeError: only length-1 arrays can be converted to Python scalars”.

    • Jason Brownlee December 10, 2016 at 8:04 am #

      Check the shape of your X_train, e.g. print(X_train.shape)

      You may need to reshape it into a matrix.

  3. sa January 5, 2017 at 3:44 pm #

    I tried to select features for xgboost based on this post (last part which uses thresholds) but since I am using gridsearch and pipeline, this error is reported:
    select_X_train = selection.transform(X_train)
    File “C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py”, line 76, in transform
    mask = self.get_support()
    File “C:\Users\MM.co\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py”, line 47, in get_support
    mask = self._get_support_mask()
    File “C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\feature_selection\from_model.py”, line 201, in _get_support_mask
    scores = _get_feature_importances(estimator)
    File “C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\feature_selection\from_model.py”, line 32, in _get_feature_importances
    % estimator.__class__.__name__)
    ValueError: The underlying estimator method has no coef_ or feature_importances_ attribute. Either pass a fitted estimator to SelectFromModel or call fit before calling transform.

    regards,

    • Jason Brownlee January 6, 2017 at 9:05 am #

      Hi sa,

      Consider trying the example without Pipelines first, get it working, then try adding in additional complexity.

      • sa January 6, 2017 at 5:53 pm #

        Hello Mr. Brownlee
        Thanks

        I already tried the example without Pipelines , and it works well. After adding pipeline, it could extract feature importance but after that it fails. Thanks.

        Best regards,

  4. Johnn January 15, 2017 at 12:28 pm #

    Thanks for the post. I don’t understand the meaning of “F-score” on the x-axis of the feature importance plot… And what is the number next to each of the bars?

    • Jason Brownlee January 16, 2017 at 10:36 am #

      Hi Johnn,

      You can learn more about the F1 score here:
      https://en.wikipedia.org/wiki/F1_score

      The number is a scaled importance, it really only has meaning relative to other features.

      • Gonçalo Abreu June 5, 2017 at 10:32 pm #

        Hey Jason,

        Are you sure the F score on the graph is related to the traditional F1-score?

        I found this github page where the owner presents many ways to extract feature importance meaning from xgb. His explanation about the F measure seems to have no relation to F1:
        https://github.com/Far0n/xgbfi

        • Jason Brownlee June 6, 2017 at 9:36 am #

          Importance scores are different from F scores. The above tutorial focuses on feature importance scores.

  5. Soyoung Kim April 20, 2017 at 2:39 am #

    Hi Jason,

    Your postings are always amazing for me to learn ML techniques!
    Especially this XGBoost post really helped me work on my ongoing interview project.
    The task is not for the Kaggle competition but for my technical interview! 🙂

    I used your code to generate a feature importance ranking and some of the explanations you used to describe techniques.
    You can find it here: https://www.kaggle.com/soyoungkim/two-sigma-connect-rental-listing-inquiries/rent-interest-classifier

    I also put your link in the reference section.

    Please let me know if it is not appropriate for me to use your code.

    • Jason Brownlee April 20, 2017 at 9:31 am #

      Well done.

      As long as you cite the source, I am happy.

  6. zttara April 27, 2017 at 3:42 pm #

    Hi Jason,
    I have some questions about feature importance.

    I want to use the features that selected by XGBoost in other classification models, and
    I got confused about how to get the right scores for the features. I mean, is it necessary to tune the parameters to get the best model and then obtain the corresponding scores of the features? In other words, how can I get the right scores of features in the model?

    Thanks a lot.

    • Jason Brownlee April 28, 2017 at 7:36 am #

      The scores are relative.

      You can use them as a filter and select all features with a score above x, e.g. 0.5.

      • max January 14, 2018 at 12:00 pm #

        Hi Jason, I know that choosing a threshold (like 0.5) is always arbitrary… but is there a rule of thumb for this?

        thanks a lot.

  7. Omogbehin May 16, 2017 at 10:23 am #

    Hello sir,

    For the XGBoost feature selection, how do I change the Y axis to the names of my attributes? Kind regards sir.

    • Jason Brownlee May 17, 2017 at 8:23 am #

      Great question, I’m not sure off-hand. You may need to use the xgboost API directly.

  8. Simone June 21, 2017 at 11:14 pm #

    Hi Jason,

    Is it possible using “feature_importances_” in XGBRegressor() ?

    • Jason Brownlee June 22, 2017 at 6:06 am #

      I’m not sure off the cuff, sorry.

      • Simone June 22, 2017 at 7:06 am #

        Ok, I will try another method for features selection.

        Thanks

  9. Richard July 22, 2017 at 8:04 pm #

    Hello Jason, I use the XGBRegressor and want to do some feature selection. However, although the ‘plot_importance(model)’ command works, when I want to retrieve the values using model.feature_importances_, it says ‘AttributeError: ‘XGBRegressor’ object has no attribute ‘feature_importances_’’. Any hints how to retrieve the feature importances for regression?

    • Jason Brownlee July 23, 2017 at 6:23 am #

      Sorry to hear that Richard. I’m not sure of the cause.

  10. Long.Ye August 23, 2017 at 10:41 am #

    Hi Jason,

    Do you know some methods to quantify variable importance in an RNN or LSTM? Could the XGBoost method be used in regression problems of RNN or LSTM? Thanks a lot.

  11. Edward August 25, 2017 at 3:57 pm #

    Can you explain how decision tree feature importance works as well?

  12. Biswajit September 9, 2017 at 10:36 pm #

    Hi Jason, while trying to fit my model in an Xgboost object it is showing the below error

    OSError: [WinError -529697949] Windows Error 0xe06d7363

    i am using 32 bit anaconda

    import platform
    platform.architecture()
    (’32bit’, ‘WindowsPE’)

    Please suggest how to get over this issue.

    • Jason Brownlee September 11, 2017 at 12:01 pm #

      Sorry, I have not seen this error.

      Perhaps you can post to stackoverflow?

      • kim tae in September 23, 2017 at 4:51 pm #

        Hi Jason.

        SelectFromModel(model, threshold=thresh, prefit=True)

        I wonder what prefit = true means in this section. I checked on the sklearn site, but I do not understand.

        • Jason Brownlee September 24, 2017 at 5:15 am #

          It specifies not to fit the model again, that we have already fit it prior.

  13. Reed Guo January 19, 2018 at 2:14 am #

    Hi, Jason

    Can you get feature importance of artificial neural network?

    If you can, how?

    Thanks very much.

  14. Zhang January 25, 2018 at 11:41 pm #

    Hi, Jason. I am doing a project with Stochastic gradient boosting. My database is clinical data and I think the ranking of feature importance can feed clinicians back with clinical knowledge, i.e., machine can tell us which clinical features are most important in distinguishing phenotypes of the diseases. What I did is to predict the phenotypes of the diseases with all the variables of the database using SGB in the training set, and then test the performance of the model in testing set. If the testing is good (e.g., high accuracy and kappa), then I would like to say the ranking of the feature importance is reasonable as machine can make good prediction using this ranking information (i.e., the feature importance is the knowledge machine learns from the database and it is correct because machine uses this knowledge to make good classification). Vice versa, if the prediction is poor I would like to say the ranking of feature importance is bad or even wrong. In this case we cannot trust the ‘knowledge’ feed back by the machine. In other words, it wastes time to do feature selection in this case because the feature importance is not correct (either because of the poor data quality or the machine learning algorithm is not suitable). May I ask whether my thinking above is reasonable?
    My second question is that I did not do feature selection to identify a subset of features as you did in your post. I just treat the few features on the top of the ranking list as the most important clinical features and then did classical analysis like t test to confirm these features are statistically different in different phenotypes. Can I still name it as feature selection or feature extraction? I am little bit confused about these terms. Thanks and I am waiting for your reply.

    • Jason Brownlee January 26, 2018 at 5:42 am #

      Sorry, I’m not sure I follow. Perhaps you can distil your question into one or two lines?

      Yes, you could still call this feature selection.

  15. Zhang January 26, 2018 at 8:09 pm #

    Thanks for your reply.

    As you may know, stochastic gradient boosting (SGB) is a model with built-in feature selection, which is thought to be more efficient in feature selection than wrapper methods and filter methods. But I doubt whether we can always trust the feature selected by SGB because the importance (relative influence) of the features are still provided by the model when the model has bad performance (e.g., very poor accuracy in testing). In this case, the model may be even wrong, so the selected features may be also wrong. So I would like to hear some comment from you regarding to this issue.

    Thanks.

    • Jason Brownlee January 27, 2018 at 5:57 am #

      Perhaps compare models fit with different subsets of features to see if it is lifting skill.

      Try using an ensemble of models fit on different subsets of features to see if you can lift skill further.

  16. Sa January 29, 2018 at 3:25 pm #

    Hi, Jason.

    Could you please let me know if the feature selection method that you used here, is classified as filter, wrapper or embedded feature selection method?

    Regards,

    • Jason Brownlee January 30, 2018 at 9:47 am #

      Here we are doing feature importance or feature scoring. It would be a filter.

  17. Youcai Wang February 2, 2018 at 9:37 am #

    Hi Brownlee, if I have a dataset with 118 variables, but the target variable is in 116, and I want to use 6-115 and 117-118 variables as dependent variables, how can I modify the code X = dataset[:,0:8]
    y = dataset[:,8]
    to get X and Y?

    I did not figure out this simple question. Please help

    Thanks,

  18. Nick March 18, 2018 at 9:47 am #

    Hi Jason,

    Thanks for the tutorial.

    Did you notice that the values of the importances were very different when you used model.get_importances_ versus xgb.plot_importance(model)?

    I used these two methods on a model I just trained and it looks like they are completely different. Moreover, the numpy array feature_importances does not directly correspond to the indexes that are returned from the plot_importance function.

    In other words, these two methods give me qualitatively different results. Any idea why?

    • Jason Brownlee March 19, 2018 at 6:03 am #

      I have not noticed that. Perhaps post a ticket on the xgboost user group or on the project? Sounds like a fault?

    • Nick March 31, 2018 at 4:01 am #

      There is a typo in my question:

      It should be model.feature_importances, not model.get_importances_.

  19. Eran M April 3, 2018 at 9:19 pm #

    Better importance estimation:

    model.feature_importances_ uses the
    Booster.get_fscore() which uses
    Booster.get_score(importance_type=’weight’)

    This is an estimation of ‘gain’ (in terms of how many times all trees used a certain feature).
    I think it would be better to use Booster.get_score(importance_type=’gain’) to get a more precise evaluation of how important a feature is.
    In general, it describes how good it was to split branches by that feature.

  20. dasgupso May 24, 2018 at 10:01 pm #

    Hi Jason
    I need to know the feature importance calculations by different methods like “weight”, “gain”, or “cover” etc. in Xgboost.

    Please let me know how we can do it. Can it be done the same way as you developed the model here (using XGBClassifier)?

    Also, what is the default method that gives variable importance as per your code
    (model.feature_importances_)?
    I need to save importances for a very large set of features (around 225) using “weight”, “gain”, or “cover” etc. in Xgboost.

    • Jason Brownlee May 25, 2018 at 9:25 am #

      I’m not sure xgboost can present this, you might have to implement it yourself.
