How to Calculate Feature Importance With Python

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

There are many types and sources of feature importance scores, although popular examples include statistical correlation scores, coefficients calculated as part of linear models, decision trees, and permutation importance scores.

Feature importance scores play an important role in a predictive modeling project, including providing insight into the data, insight into the model, and the basis for dimensionality reduction and feature selection that can improve the efficiency and effectiveness of a predictive model on the problem.

In this tutorial, you will discover feature importance scores for machine learning in Python.

After completing this tutorial, you will know:

  • The role of feature importance in a predictive modeling problem.
  • How to calculate and review feature importance from linear models and decision trees.
  • How to calculate and review permutation feature importance scores.

Let’s get started.

  • Update May/2020: Added example of feature selection using importance.
How to Calculate Feature Importance With Python
Photo by Bonnie Moreland, some rights reserved.

Tutorial Overview

This tutorial is divided into six parts; they are:

  1. Feature Importance
  2. Preparation
    1. Check Scikit-Learn Version
    2. Test Datasets
  3. Coefficients as Feature Importance
    1. Linear Regression Feature Importance
    2. Logistic Regression Feature Importance
  4. Decision Tree Feature Importance
    1. CART Feature Importance
    2. Random Forest Feature Importance
    3. XGBoost Feature Importance
  5. Permutation Feature Importance
    1. Permutation Feature Importance for Regression
    2. Permutation Feature Importance for Classification
  6. Feature Selection with Importance

Feature Importance

Feature importance refers to a class of techniques for assigning scores to input features to a predictive model that indicates the relative importance of each feature when making a prediction.

Feature importance scores can be calculated for problems that involve predicting a numerical value, called regression, and those problems that involve predicting a class label, called classification.

The scores are useful and can be used in a range of situations in a predictive modeling problem, such as:

  • Better understanding the data.
  • Better understanding a model.
  • Reducing the number of input features.

Feature importance scores can provide insight into the dataset. The relative scores can highlight which features may be most relevant to the target, and the converse, which features are the least relevant. This may be interpreted by a domain expert and could be used as the basis for gathering more or different data.

Feature importance scores can provide insight into the model. Most importance scores are calculated by a predictive model that has been fit on the dataset. Inspecting the importance score provides insight into that specific model and which features are the most important and least important to the model when making a prediction. This is a type of model interpretation that can be performed for those models that support it.

Feature importance can be used to improve a predictive model. This can be achieved by using the importance scores to select those features to delete (lowest scores) or those features to keep (highest scores). This is a type of feature selection and can simplify the problem that is being modeled, speed up the modeling process (deleting features is called dimensionality reduction), and in some cases, improve the performance of the model.

Often, we desire to quantify the strength of the relationship between the predictors and the outcome. […] Ranking predictors in this manner can be very useful when sifting through large amounts of data.

— Page 463, Applied Predictive Modeling, 2013.

Feature importance scores can be fed to a wrapper model, such as the SelectFromModel class, to perform feature selection.

There are many ways to calculate feature importance scores and many models that can be used for this purpose.

Perhaps the simplest way is to calculate simple coefficient statistics between each feature and the target variable. For more on this approach, see the tutorial:

In this tutorial, we will look at three main types of more advanced feature importance; they are:

  • Feature importance from model coefficients.
  • Feature importance from decision trees.
  • Feature importance from permutation testing.

Let’s take a closer look at each.

Preparation

Before we dive in, let’s confirm our environment and prepare some test datasets.

Check Scikit-Learn Version

First, confirm that you have a modern version of the scikit-learn library installed.

This is important because some of the models we will explore in this tutorial require a modern version of the library.

You can check the version of the library you have installed with the following code example:
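A minimal sketch of such a check is below; it simply imports the library and prints the reported version string.

# check scikit-learn version
import sklearn
print(sklearn.__version__)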

Running the example will print the version of the library. At the time of writing, this is about version 0.22.

You need to be using this version of scikit-learn or higher.

Test Datasets

Next, let’s define some test datasets that we can use as the basis for demonstrating and exploring feature importance scores.

Each test problem has five important and five unimportant features, and it may be interesting to see which methods are consistent at finding or differentiating the features based on their importance.

Classification Dataset

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples, with 10 input features, five of which will be informative and the remaining five will be redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.

An example of creating and summarizing the dataset is listed below.
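A sketch of this example is below; the specific seed value (random_state=1) is an arbitrary choice used only to make the run repeatable.

# create and summarize the synthetic classification dataset
from sklearn.datasets import make_classification
# 1,000 rows, 10 inputs: 5 informative and 5 redundant features
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)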

Running the example creates the dataset and confirms the expected number of samples and features.

Regression Dataset

We will use the make_regression() function to create a test regression dataset.

Like the classification dataset, the regression dataset will have 1,000 examples, with 10 input features, five of which will be informative and the remaining five will be redundant.
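
A sketch of creating and summarizing this dataset is below; with make_regression(), the five non-informative features are simply noise columns, and again the fixed seed value is an arbitrary choice.

# create and summarize the synthetic regression dataset
from sklearn.datasets import make_regression
# 1,000 rows, 10 inputs, only 5 of which carry signal
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)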

Running the example creates the dataset and confirms the expected number of samples and features.

Next, let’s take a closer look at coefficients as importance scores.

Coefficients as Feature Importance

Linear machine learning algorithms fit a model where the prediction is the weighted sum of the input values.

Examples include linear regression, logistic regression, and extensions that add regularization, such as ridge regression and the elastic net.

All of these algorithms find a set of coefficients to use in the weighted sum in order to make a prediction. These coefficients can be used directly as a crude type of feature importance score.

Let’s take a closer look at using coefficients as feature importance for classification and regression. We will fit a model on the dataset to find the coefficients, then summarize the importance scores for each input feature and finally create a bar chart to get an idea of the relative importance of the features.

Linear Regression Feature Importance

We can fit a LinearRegression model on the regression dataset and retrieve the coef_ property that contains the coefficients found for each input variable.

These coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting a model.

The complete example of linear regression coefficients for feature importance is listed below.
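A sketch of what this example can look like is below; the dataset definition matches the one above, and the bar chart at the end is one simple way to visualize the scores.

# linear regression coefficients as feature importance scores
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot
# define the regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define and fit the model
model = LinearRegression()
model.fit(X, y)
# use the learned coefficients as importance scores
importance = model.coef_
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()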

Running the example fits the model, then reports the coefficient value for each feature.

The scores suggest that the model found the five important features and marked all other features with a zero coefficient, essentially removing them from the model.

A bar chart is then created for the feature importance scores.

Bar Chart of Linear Regression Coefficients as Feature Importance Scores

This approach may also be used with Ridge and ElasticNet models.

Logistic Regression Feature Importance

We can fit a LogisticRegression model on the classification dataset and retrieve the coef_ property that contains the coefficients found for each input variable.

These coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting a model.

The complete example of logistic regression coefficients for feature importance is listed below.
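A sketch of this example is below; because this is a binary classification problem, coef_ has one row, so we take that row as the importance scores.

# logistic regression coefficients as feature importance scores
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot
# define the classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = LogisticRegression()
model.fit(X, y)
# coef_ has shape (1, n_features) for a binary problem, so take the first row
importance = model.coef_[0]
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()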

Running the example fits the model, then reports the coefficient value for each feature.

Recall this is a classification problem with classes 0 and 1. Notice that the coefficients are both positive and negative. The positive scores indicate a feature that predicts class 1, whereas the negative scores indicate a feature that predicts class 0.

No clear pattern of important and unimportant features can be identified from these results, at least from what I can tell.

A bar chart is then created for the feature importance scores.

Bar Chart of Logistic Regression Coefficients as Feature Importance Scores

Now that we have seen the use of coefficients as importance scores, let’s look at the more common example of decision-tree-based importance scores.

Decision Tree Feature Importance

Decision tree algorithms like classification and regression trees (CART) offer importance scores based on the reduction in the criterion used to select split points, like Gini or entropy.

This same approach can be used for ensembles of decision trees, such as the random forest and stochastic gradient boosting algorithms.

Let’s take a look at a worked example of each.

CART Feature Importance

We can use the CART algorithm for feature importance implemented in scikit-learn as the DecisionTreeRegressor and DecisionTreeClassifier classes.

After being fit, the model provides a feature_importances_ property that can be accessed to retrieve the relative importance scores for each input feature.

Let’s take a look at an example of this for regression and classification.

CART Regression Feature Importance

The complete example of fitting a DecisionTreeRegressor and summarizing the calculated feature importance scores is listed below.
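A sketch of this example is below; the tree uses default hyperparameters, and scores can vary slightly between runs because the splitter breaks ties randomly.

# decision tree regression feature importance
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from matplotlib import pyplot
# define the regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define and fit the model
model = DecisionTreeRegressor()
model.fit(X, y)
# get the importance scores
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()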

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps three of the 10 features as being important to prediction.

A bar chart is then created for the feature importance scores.

Bar Chart of DecisionTreeRegressor Feature Importance Scores

CART Classification Feature Importance

The complete example of fitting a DecisionTreeClassifier and summarizing the calculated feature importance scores is listed below.
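A sketch of the classification version is below; it mirrors the regression example, swapping in the classification dataset and classifier.

# decision tree classification feature importance
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot
# define the classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = DecisionTreeClassifier()
model.fit(X, y)
# get the importance scores
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()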

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps four of the 10 features as being important to prediction.

A bar chart is then created for the feature importance scores.

Bar Chart of DecisionTreeClassifier Feature Importance Scores

Random Forest Feature Importance

We can use the Random Forest algorithm for feature importance implemented in scikit-learn as the RandomForestRegressor and RandomForestClassifier classes.

After being fit, the model provides a feature_importances_ property that can be accessed to retrieve the relative importance scores for each input feature.

This approach can also be used with the bagging and extra trees algorithms.

Let’s take a look at an example of this for regression and classification.

Random Forest Regression Feature Importance

The complete example of fitting a RandomForestRegressor and summarizing the calculated feature importance scores is listed below.
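A sketch of this example is below; the ensemble uses default hyperparameters, and the exact scores will vary between runs because the algorithm is stochastic.

# random forest regression feature importance
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot
# define the regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define and fit the model
model = RandomForestRegressor()
model.fit(X, y)
# get the importance scores
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()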

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

A bar chart is then created for the feature importance scores.

Bar Chart of RandomForestRegressor Feature Importance Scores

Random Forest Classification Feature Importance

The complete example of fitting a RandomForestClassifier and summarizing the calculated feature importance scores is listed below.
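A sketch of the classification version is below, again mirroring the regression example.

# random forest classification feature importance
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
# define the classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = RandomForestClassifier()
model.fit(X, y)
# get the importance scores
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()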

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

A bar chart is then created for the feature importance scores.

Bar Chart of RandomForestClassifier Feature Importance Scores

XGBoost Feature Importance

XGBoost is a library that provides an efficient and effective implementation of the stochastic gradient boosting algorithm.

This algorithm can be used with scikit-learn via the XGBRegressor and XGBClassifier classes.

After being fit, the model provides a feature_importances_ property that can be accessed to retrieve the relative importance scores for each input feature.

This algorithm is also provided by scikit-learn via the GradientBoostingClassifier and GradientBoostingRegressor classes, and the same approach to feature importance can be used.

First, install the XGBoost library, such as with pip:
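For example (the exact command depends on your environment):

pip install xgboost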

Then confirm that the library was installed correctly and works by checking the version number.
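A minimal sketch of the version check:

# check xgboost version
import xgboost
print(xgboost.__version__)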

Running the example, you should see the following version number or higher.

For more on the XGBoost library, start here:

Let’s take a look at an example of XGBoost for feature importance on regression and classification problems.

XGBoost Regression Feature Importance

The complete example of fitting an XGBRegressor and summarizing the calculated feature importance scores is listed below.
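A sketch of this example is below, using the XGBRegressor with default hyperparameters.

# xgboost regression feature importance
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from matplotlib import pyplot
# define the regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define and fit the model
model = XGBRegressor()
model.fit(X, y)
# get the importance scores
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()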

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

A bar chart is then created for the feature importance scores.

Bar Chart of XGBRegressor Feature Importance Scores

XGBoost Classification Feature Importance

The complete example of fitting an XGBClassifier and summarizing the calculated feature importance scores is listed below.
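A sketch of the classification version is below, using the XGBClassifier with default hyperparameters.

# xgboost classification feature importance
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from matplotlib import pyplot
# define the classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = XGBClassifier()
model.fit(X, y)
# get the importance scores
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()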

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps seven of the 10 features as being important to prediction.

A bar chart is then created for the feature importance scores.

Bar Chart of XGBClassifier Feature Importance Scores

Permutation Feature Importance

Permutation feature importance is a technique for calculating relative importance scores that is independent of the model used.

First, a model is fit on the dataset, such as a model that does not support native feature importance scores. Then the model is used to make predictions on a dataset, although the values of a feature (column) in the dataset are scrambled. This is repeated for each feature in the dataset. Then this whole process is repeated 3, 5, 10 or more times. The result is a mean importance score for each input feature (and distribution of scores given the repeats).

This approach can be used for regression or classification and requires that a performance metric be chosen as the basis of the importance score, such as the mean squared error for regression and accuracy for classification.

Permutation feature importance can be calculated via the permutation_importance() function, which takes a fit model, a dataset (the train or test dataset is fine), and a scoring function.

Let’s take a look at this approach with an algorithm that does not support feature importance natively, specifically k-nearest neighbors.

Permutation Feature Importance for Regression

The complete example of fitting a KNeighborsRegressor and summarizing the calculated permutation feature importance scores is listed below.
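A sketch of this example is below; the negated mean squared error scorer, the 10 repeats, and the seed are reasonable choices rather than requirements.

# permutation feature importance with knn for regression
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.inspection import permutation_importance
from matplotlib import pyplot
# define the regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define and fit the model
model = KNeighborsRegressor()
model.fit(X, y)
# perform permutation importance with a negated-MSE scoring, repeated 10 times
results = permutation_importance(model, X, y, scoring='neg_mean_squared_error', n_repeats=10, random_state=1)
# get the mean importance scores across repeats
importance = results.importances_mean
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()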

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

A bar chart is then created for the feature importance scores.

Bar Chart of KNeighborsRegressor With Permutation Feature Importance Scores

Permutation Feature Importance for Classification

The complete example of fitting a KNeighborsClassifier and summarizing the calculated permutation feature importance scores is listed below.
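A sketch of the classification version is below; here classification accuracy is used as the scoring metric.

# permutation feature importance with knn for classification
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import permutation_importance
from matplotlib import pyplot
# define the classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = KNeighborsClassifier()
model.fit(X, y)
# perform permutation importance with accuracy scoring, repeated 10 times
results = permutation_importance(model, X, y, scoring='accuracy', n_repeats=10, random_state=1)
# get the mean importance scores across repeats
importance = results.importances_mean
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()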

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

A bar chart is then created for the feature importance scores.

Bar Chart of KNeighborsClassifier With Permutation Feature Importance Scores

Feature Selection with Importance

Feature importance scores can be used to help interpret the data, but they can also be used directly to help rank and select features that are most useful to a predictive model.

We can demonstrate this with a small example.

Recall, our synthetic dataset has 1,000 examples each with 10 input variables, five of which are redundant and five of which are important to the outcome. We can use feature importance scores to help select the five variables that are relevant and only use them as inputs to a predictive model.

First, we can split the dataset into train and test sets, train a model on the training set, make predictions on the test set, and evaluate the result using classification accuracy. We will use a logistic regression model as the predictive model.

This provides a baseline for comparison when we remove some features using feature importance scores.

The complete example of evaluating a logistic regression model using all features as input on our synthetic dataset is listed below.
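A sketch of this baseline is below; the 67/33 train/test split is one reasonable choice, not a requirement.

# logistic regression evaluated on all 10 input features
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define the dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate predictions on the held-out test set
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))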

Running the example first fits the logistic regression model on the training dataset and then evaluates it on the test set.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved a classification accuracy of about 84.55 percent using all features in the dataset.

Given that we created the dataset, we would expect better or the same results with half the number of input variables.

We could use any of the feature importance scores explored above, but in this case we will use the feature importance scores provided by random forest.

We can use the SelectFromModel class to define both the model we wish to calculate importance scores, RandomForestClassifier in this case, and the number of features to select, 5 in this case.

We can fit the feature selection method on the training dataset.

This will calculate the importance scores that can be used to rank all input features. We can then apply the method as a transform to select a subset of the five most important features from the dataset. This transform will be applied to the training dataset and the test set.

Tying this all together, the complete example of using random forest feature importance for feature selection is listed below.
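A sketch of the end-to-end example is below. The selector configuration (n_estimators=200 for the random forest, an explicit -inf threshold so that exactly max_features features are kept, and the 67/33 split) reflects reasonable assumptions rather than the only valid settings.

# feature selection with random forest importance, then evaluation with logistic regression
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define the dataset and split it as before
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# rank features by random forest importance and keep the top 5
# (threshold=-np.inf ensures exactly max_features features are selected)
fs = SelectFromModel(RandomForestClassifier(n_estimators=200), max_features=5, threshold=-np.inf)
fs.fit(X_train, y_train)
# apply the transform to both the train and test sets
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)
# fit and evaluate the model on the selected features
model = LogisticRegression()
model.fit(X_train_fs, y_train)
yhat = model.predict(X_test_fs)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))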

Running the example first performs feature selection on the dataset, then fits and evaluates the logistic regression model as before.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieves the same performance on the dataset, although with half the number of input features. As expected, the feature importance scores calculated by random forest allowed us to accurately rank the input features and delete those that were not relevant to the target variable.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Books

APIs

Summary

In this tutorial, you discovered feature importance scores for machine learning in Python.

Specifically, you learned:

  • The role of feature importance in a predictive modeling problem.
  • How to calculate and review feature importance from linear models and decision trees.
  • How to calculate and review permutation feature importance scores.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

47 Responses to How to Calculate Feature Importance With Python

  1. Martin March 30, 2020 at 6:35 pm #

    This tutorial lacks the most important thing – comparison between feature importance and permutation importance. Which to choose and why?

    For interested: https://explained.ai/rf-importance/

The best method to compare feature importance in generalized linear models (linear regression, logistic regression, etc.) is multiplying feature coefficients by the standard deviation of the variable. It gives you standardized betas, which aren’t affected by the variable’s scale of measurement. Thanks to that, they are comparable.

Scaling or standardizing variables works only if you have ONLY numeric data, which in practice… never happens.

    • Jason Brownlee March 31, 2020 at 7:59 am #

Comparison requires a context, e.g. a specific dataset that you’re interested in solving and a suite of models.

  2. Oliver Tomic March 30, 2020 at 7:54 pm #

    Hi Jason!

    Thanks for the nice coding examples and explanation. A little comment though, regarding the Random Forest feature importances: would it be worth mentioning that the feature importance using

    importance = model.feature_importances_

    could potentially provide importances that are biased toward continuous features and high-cardinality categorical features?

    best wishes
    Oliver

    • Jason Brownlee March 31, 2020 at 8:06 am #

      It may be, what makes you say that?

      • Oliver Tomic March 31, 2020 at 5:48 pm #

I came across this post a couple of years ago when it got published; it discusses how you have to be careful interpreting feature importances from Random Forest in general. This was exemplified using scikit-learn and some other package in R.

        https://explained.ai/rf-importance/index.html

        This is the same that Martin mentioned above.

        best wishes
        Oliver

  3. Aventinus March 30, 2020 at 11:22 pm #

    Thank you, Jason, that was very informative.

As a newbie in data science I have a question:

Is the concept of Feature Importance applicable to all methods? What about DL methods (CNNs, LSTMs)? What about BERT? I’m thinking that, intuitively, a similar function should be available no matter the method used, but when searching online I find that the answer is not clear. I guess I lack some basic, key knowledge here.

    • Jason Brownlee March 31, 2020 at 8:12 am #

      Yes, but different methods are used.

      Here, we are focused on tabular data.

  4. Alex March 31, 2020 at 1:04 am #

Hi, I am a freshman and I am wondering, with the development of deep learning that can find features automatically, is the feature engineering that helps construct features manually and efficiently going to be out of date? If not, where can we use feature engineering better than deep learning?

    • Jason Brownlee March 31, 2020 at 8:13 am #

      It performs feature extraction automatically.

      Even so, such models may or may not perform better than other methods.

  5. Fotis April 1, 2020 at 7:28 am #

    Hi, I am freshman too. I would like to ask if there is any way to implement “Permutation Feature Importance for Classification” using deep NN with Keras?

    • Jason Brownlee April 1, 2020 at 8:10 am #

      I don’t see why not. Use the Keras wrapper class for your model.

  6. Ruud Goorden April 3, 2020 at 6:10 am #

    Hi. Just a little addition to your review. Beware of feature importance in RFs using standard feature importance metrics. See: https://explained.ai/rf-importance/
    Keep up the good work!

  7. Bill April 3, 2020 at 7:10 am #

    Hi Jason,

    Any plans please to post some practical stuff on Knowledge Graph (Embedding)?

Thanks, Bill

  8. Ricardo April 5, 2020 at 10:31 pm #

    Your tutorials are so interesting.

  9. Van-Hau Nguyen April 6, 2020 at 1:57 pm #

    Hi Jason,

    thank you very much for your post. It is very interesting as always!
    May I conclude that each method ( Linear, Logistic, Random Forest, XGBoost, etc.) can lead to its own way to Calculate Feature Importance?

    Thank you.

    • Jason Brownlee April 7, 2020 at 5:36 am #

      Thanks.

      Yes, we can get many different views on what is important.

  10. Mayank April 11, 2020 at 9:01 pm #

    Hey Jason!!

Does this method work for data having both categorical and continuous features? Or do we have to separate those features and then compute feature importance, which I think would not be good practice?

And an off-topic question: can we apply P.C.A to categorical features? If not, is there any equivalent method for categorical features?

    • Jason Brownlee April 12, 2020 at 6:20 am #

      I believe so.

      No. PCA is for numeric data.

      • Mayank April 12, 2020 at 1:56 pm #

        And L.D.A is for categorical values??

        • Jason Brownlee April 13, 2020 at 6:09 am #

          LDA – linear discriminant analysis – no it’s for numerical values too.

  11. Dina April 13, 2020 at 1:04 pm #

Hi Jason, I learnt a lot from your website about machine learning. By the way, do you have an idea of how to get feature importance when using a Keras model?

  12. Sam April 18, 2020 at 3:05 am #

    Hi Jason,

    I’m a Data Analytics grad student from Colorado and your website has been a great resource for my learning!

    I have a question about the order in which one would do feature selection in the machine learning process. My dataset is heavily imbalanced (95%/5%) and has many NaN’s that require imputation. A professor also recommended doing PCA along with feature selection. Where would you recommend placing feature selection? My initial plan was imputation -> feature selection -> SMOTE -> scaling -> PCA.

    For some more context, the data is 1.8 million rows by 65 columns. The target variable is binary and the columns are mostly numeric with some categorical being one hot encoded.

    Appreciate any wisdom you can pass along!

    • Jason Brownlee April 18, 2020 at 6:09 am #

      Thanks, I’m happy to hear that.

      Experiment to discover what works best.

      I would do PCA or feature selection, not both. I would probably scale, sample then select. But also try scale, select, and sample.

  13. Mbonu Chinedu April 25, 2020 at 7:09 am #

    Thanks Jason for this information.

  14. Deeksha April 29, 2020 at 11:17 pm #

    Hi Jason,

I am running the decision tree regressor to identify the most important predictor. The output I got is in the same format as given. However, I am not able to understand what is meant by “Feature 1” and what is the significance of the number given.

I ran the random forest regressor as well but am not able to compare the results due to the unavailability of labels. Please do provide the Python code to map appropriate fields and plot.

    Thanks

    • Jason Brownlee April 30, 2020 at 6:44 am #

      If you have a list of string names for each column, then the feature index will be the same as the column name index.

      Does that help?

  15. Swapnil Bendale May 3, 2020 at 3:47 pm #

    Sir,

How about using SelectKBest from sklearn to identify the best features?
    How does it differ in calculations from the above method?

Thanks in advance

    • Jason Brownlee May 3, 2020 at 5:10 pm #

      Yes, it allows you to use feature importance as a feature selection method.

  16. Alex May 8, 2020 at 7:36 am #

    Hi Jason,

Great post and nice coding examples. I am quite new to the field of machine learning. I was playing with my own dataset and fitted a simple decision tree (classifier 0,1). Model accuracy was 0.65. I was very surprised when checking the feature importance. They were all 0.0 (7 features, of which 6 are numerical). How is that even possible?

    Best
    Alex

    • Jason Brownlee May 8, 2020 at 8:02 am #

      Thanks.

      65% is low, near random. Perhaps the feature importance does not provide insight on your dataset. It is not absolute importance, more of a suggestion.

  17. Alex May 8, 2020 at 9:02 am #

Ok thanks, and yes it’s really almost random. But still, I would have expected at least some very small numbers around 0.01 or so, rather than all features being exactly 0.0… anyway, I will check and use your great blog and comments for further education. Thanks.

  18. Grzegorz Kępisty May 22, 2020 at 11:06 pm #

I can see that many readers link the article “Beware Default Random Forest Importances”, which compares default RF Gini importances in sklearn with the permutation importance approach. I believe it is worth mentioning the other trending approach called SHAP:
    https://www.kaggle.com/wrosinski/shap-feature-importance-with-feature-engineering
    https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d
Recently I have used it as one of a few parallel methods for feature selection. It seems to be worth our attention, because it uses an independent method to calculate importance (in comparison to Gini or permutation methods). It is also helpful for visualizing how variables influence model output. Do you have any experience or remarks on it?
    Regards!

  19. Montse May 31, 2020 at 6:37 am #

    Hi Jason,

    Thanks so much for these useful posts as well as books!
    Would you mind sharing your thoughts about the differences between getting feature importance of our XGBoost model by retrieving the coeffs or directly with the built-in plot function?

    • Jason Brownlee May 31, 2020 at 8:51 am #

      Both provide the same importance scores I believe. What do you mean exactly?

      • Montse June 1, 2020 at 7:31 am #

        I am currently using feature importance scores to rank the inputs of the dataset I am working on. I obtained different scores (and a different importance order) depending on if retrieving the coeffs via model.feature_importances_ or with the built-in plot function plot_importance(model). The specific model used is XGBRegressor(learning_rate=0.01,n_estimators=100, subsample=0.5, max_depth=7 )

        So that, I was wondering if each of them use different strategies to interpret the relative importance of the features on the model …and what would be the best approach to decide which one of them select and when.

        Thanks again Jason, for all your great work.

        • Jason Brownlee June 1, 2020 at 1:37 pm #

          It is possible that different metrics are being used in the plot.

          I believe I have seen this before, look at the arguments to the function used to create the plot.

  20. kejingrio June 2, 2020 at 12:00 pm #

Hi Jason, thanks, it is very useful.
I have some difficulty with Permutation Feature Importance for Regression. I feel puzzled by the scoring “MSE”: the closer “MSE” is to 0, the better the model performs.
According to the “Outline of the permutation importance algorithm”, importance is the difference between the original “MSE” and the new “MSE”. That is to say, the larger the difference, the less important the original feature is. But the meaning of the article is that the greater the difference, the more important the feature is.
