Feature Selection with the Caret R Package

Selecting the right features in your data can mean the difference between mediocre performance with long training times and great performance with short training times.

The caret R package provides tools to automatically report on the relevance and importance of attributes in your data, and can even select the most important features for you.

In this post you will discover the feature selection tools in the caret R package, with standalone recipes in R.

After reading this post you will know:

  • How to remove redundant features from your dataset.
  • How to rank features in your dataset by their importance.
  • How to select features from your dataset using the Recursive Feature Elimination method.

Let’s get started.

Remove Redundant Features

Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed.

The caret R package provides the findCorrelation function, which will analyze a correlation matrix of your data's attributes and report on attributes that can be removed.

The following example loads the Pima Indians Diabetes dataset, which contains a number of biological attributes from medical reports. A correlation matrix is created from these attributes and highly correlated attributes are identified; in this case, the age attribute is removed as it correlates highly with the pregnant attribute.
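
The recipe below is a minimal sketch of that example. It assumes the PimaIndiansDiabetes data frame from the mlbench package (the same data readers load in the comments below); the 0.5 cutoff is an assumption, chosen so that the age/pregnant relationship described above is flagged:

# load the libraries
library(mlbench)
library(caret)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# calculate the correlation matrix of the 8 input attributes
correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])
# summarize the correlation matrix
print(correlationMatrix)
# find attributes with an absolute pairwise correlation above the cutoff
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
# print the indexes of the attributes flagged for removal
print(highlyCorrelated)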

Generally, you want to remove attributes with an absolute correlation of 0.75 or higher.

Rank Features By Importance

The importance of features can be estimated from data by building a model. Some methods, like decision trees, have a built-in mechanism to report on variable importance. For other algorithms, the importance can be estimated using a ROC curve analysis conducted for each attribute.
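
For instance, caret's filterVarImp function performs this per-attribute ROC analysis directly, without training a full predictive model. A quick sketch, again assuming the PimaIndiansDiabetes data:

# load the libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# compute the area under the ROC curve for each attribute against the class
rocImportance <- filterVarImp(x=PimaIndiansDiabetes[,1:8], y=PimaIndiansDiabetes[,9])
# summarize the per-attribute importance
print(rocImportance)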

The example below loads the Pima Indians Diabetes dataset and constructs a Learning Vector Quantization (LVQ) model. The varImp function is then used to estimate the variable importance, which is printed and plotted. It shows that the glucose, mass and age attributes are the top 3 most important attributes in the dataset and that the insulin attribute is the least important.
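
A sketch of that recipe, reconstructed along the lines of the snippets readers quote in the comments below (the repeated 10-fold cross-validation training scheme is an assumption):

# load the libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare the training scheme: 10-fold cross validation, repeated 3 times
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train an LVQ model, scaling the attributes first
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance from the trained model
importance <- varImp(model, scale=FALSE)
# summarize and plot the importance
print(importance)
plot(importance)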

Figure: Rank of Features by Importance using the Caret R Package

Feature Selection

Automatic feature selection methods can be used to build many models with different subsets of a dataset and identify those attributes that are and are not required to build an accurate model.

A popular automatic method for feature selection provided by the caret R package is called Recursive Feature Elimination or RFE.

The example below applies the RFE method to the Pima Indians Diabetes dataset. A Random Forest algorithm is used on each iteration to evaluate the model. The algorithm is configured to explore all possible subsets of the attributes. All 8 attributes are selected in this example, although the plot of accuracy over the different attribute subset sizes shows that a subset of just 4 attributes gives almost comparable results.
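
A sketch of that recipe (essentially the code a reader reproduces in the comments below); note one reader's tip that the randomForest package must also be installed for rfFuncs to work:

# load the libraries
library(mlbench)
library(caret)
library(randomForest) # required by rfFuncs
# load the dataset
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm over subset sizes 1 through 8
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot accuracy for each subset size
plot(results, type=c("g", "o"))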

Figure: Feature Selection Using the Caret R Package

Summary

In this post you discovered 3 feature selection methods provided by the caret R package: searching for and removing redundant features, ranking features by importance, and automatically selecting a subset of the most predictive features.

Three standalone recipes in R were provided that you can copy-and-paste into your own project and adapt for your specific problems.


75 Responses to Feature Selection with the Caret R Package

  1. Ajay October 21, 2014 at 6:31 pm #

    Thanks for an awesome post. Your writings always talk to the point of practical knowledge, which I love, and are easy to understand!! I think one thing you missed out in Recursive Feature Elimination (RFE) is to include "library(randomForest)", because when I tried to replicate it, it was unable to find the randomForest package.

    Keep writing, it's helping a lot!!! 🙂

    Thanks.

  2. Shan November 23, 2014 at 7:27 pm #

    Hi,
    I was trying to run the importance of features from the above post but I couldn't.
    I am receiving an error for package 'e1071'. The command I used to install it is:
    install.packages("e1071", dep=TRUE, type="source")

    the error is:
    ERROR: compilation failed for package 'e1071'

    I am using the latest version of R, 3.1.1.

    Can anyone please help me with it?

  3. Jordan February 27, 2015 at 5:37 am #

    This was a great read! Thank you! I do have one question. With my data set I performed the last two options (ranking by importance and then feature selection); however, the top features selected by the two methods were not the same. Why would this be? Is one better than the other?

    Thanks!

    • Jason Brownlee February 27, 2015 at 5:41 am #

      Thanks Jordan.
      This is a good point. Different methods will select different subsets of features. There is likely no “best set” of features just like there is no best model. My advice is to model each subset of features and see what works best for your problem and your needs.

      Applied machine learning is a process of empirical hypothesis testing – lots of trial and error.

  4. Ahmad February 27, 2015 at 12:35 pm #

    Thanks for such a useful post.
    I have one question about the RFE method. I am not sure if I've got the point correctly: which method is used to build the model in each step, and is it possible to build the model using SVM in each iteration?

  5. Manuel April 3, 2015 at 9:55 pm #

    Your post is a very nice read – I’m new to caret and found it useful to get things running quickly. I am working on a p>>n classification problem, in particular I am not interested in a blackbox predictive model, but rather a more explanatory model, therefore I’m trying to extract sets of important features that can help to explain the outcome (I have additional data to validate the relationship between the extracted features).

    I am wondering if it makes sense to run RFE with different models, extract the subset of important variables for fitting each respective model, and then find the intersection of the sets of features. I'd expect that each model might produce different sets of features (with the most important ones shared), and that the set of features for one model might not work as well for another type of model, so I'm not sure though.

    What’s the best way to approach a problem like this? Thanks!

  6. Conor April 12, 2015 at 3:48 am #

    Brilliant post and very well laid out. Thanks a million.

  7. Rishabh April 13, 2015 at 4:36 pm #

    can we do feature extraction using caret package?

  8. Kaiyu May 4, 2015 at 8:26 am #

    Great pics, but when running the code I found that the RF feature selection would recommend 5 features; how do I change the default setting? Many thanks!!!

  9. Sangram May 4, 2015 at 4:31 pm #

    Hi. Great post! I have a doubt about using varImp for feature selection. I have already written an algorithm that runs randomForest to build a model on the training set. Now I want to use varImp() to select only important features before applying randomForest. But varImp seems to build yet another model on the data to extract feature importance, which seems a bit counter-intuitive. Would you please explain to me the significance of varImp before using randomForest to train a model on training data? How can I use varImp in such a case?

  10. Amit Nagar June 3, 2015 at 7:26 am #

    Thanks for the informative post.

    I am working with the data from Lending Club that is made available to public on their website. The dataset has a significant number of non numerical columns (grade, loan status etc). I am assuming before I get to feature selection methods described above, I will have to map these non numeric data to numeric values. My question: is there a prescribed way for handling such a situation or is it okay to follow an ad hoc mapping scheme. My intent, of course, is to be able to get to the point where I can do an intelligent feature selection. Thanks
    Amit

  11. Courage July 2, 2015 at 11:11 am #

    Hello James,

    I follow your newsletter. They have saved me at the right times 2x! Thanks. Also, do any of these algorithms take into consideration normalization of the data?

  12. Sam August 21, 2015 at 12:54 am #

    Thanks for the useful information.
    How can it be applied for svm?

  13. ARCHANA September 12, 2015 at 2:30 am #

    thanks a lot for the wonderful information!

  14. Saeed October 3, 2015 at 1:18 am #

    It seems feature importance works with LVQ method only for classification problems, but not for regression problems, doesn’t it?

  15. Raj October 4, 2015 at 6:45 am #

    The most awesome ML post I've come across, Jason!!

    I tried the code and I am trying to model only the features that are selected by the RFE algorithm. I get an error and couldn't figure out why:

    'Error in eval(expr, envir, enclos) : object 'diabetes' not found' at the model(….) statement.

    #Load the libraries
    rm(list=ls())
    library(mlbench)
    library(caret)

    #load the data
    data(PimaIndiansDiabetes)

    #define the control using a random forest selection function
    control <- rfeControl(functions=rfFuncs, method='cv', number=10)

    #run the RFE algorithm
    results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl = control)

    #summarize the results
    print(results)
    names(PimaIndiansDiabetes)

    #list the chosen features
    predictors(results)

    #Plot the results
    plot(results, type=c('g','o'))

    PimaIndiansDiabetes$diabetes <- as.character(PimaIndiansDiabetes$diabetes)
    PimaIndiansDiabetes$diabetes[PimaIndiansDiabetes$diabetes=='neg'] <- 0
    PimaIndiansDiabetes$diabetes[PimaIndiansDiabetes$diabetes=='pos'] <- 1
    PimaIndiansDiabetes$diabetes <- as.factor(PimaIndiansDiabetes$diabetes)
    View(PimaIndiansDiabetes)

    #Split data
    inTrain = createDataPartition(y=PimaIndiansDiabetes$diabetes, p=0.7, list=FALSE)
    train <- iris[inTrain,]
    test <- iris[-inTrain,]
    attach(train)

    #train the model
    model <- train(diabetes ~ glucose + mass + age + pregnant + pedigree, data=train, trControl = train_control, method='lvq', tuneLength=5)

    #Summary of model
    summary(model)

    • Raghavendra October 16, 2015 at 4:55 pm #

      You need to work with the "PimaIndiansDiabetes" dataset that comes with the mlbench package in R, not the one (pima.indians.diabetes.data) given on this page.

  16. Ben October 5, 2015 at 6:12 pm #

    Hey, I’m trying to plot the importance of the different variables, but calling

    importance <- varImp(model, scale=FALSE)

    gives the error

    Error: (list) object cannot be coerced to type 'double'

    Any advice on what I'm doing wrong or how to debug this issue?

    Thanks in advance!

  17. deepak November 4, 2015 at 1:24 am #

    Awesome post Jason. I have been following your blogs like a university course to get up to speed in ML.

  18. Vinh November 27, 2015 at 9:13 pm #

    In the Importance part, for instance in your case, the variable 'Age' ranked at 3rd place, but can I know whether it acts in a positive or negative way? I.e. will higher age lead to more chance of diabetes, or vice versa?

  19. jake February 15, 2016 at 1:51 pm #

    So I did the feature elimination like so:
    control <- rfeControl(functions=caretFuncs, method="cv", number=10)
    results <- rfe(mydata.train[,1:23], mydata.train[,24], sizes=c(2,5,8,13,19), rfeControl=control , method="svmRadial")
    print(results)

    Recursive feature selection

    Outer resampling method: Cross-Validated (10 fold)

    Resampling performance over subset size:

    Variables Accuracy Kappa AccuracySD KappaSD Selected
    2 0.5100 -0.02879 0.05230 0.08438
    5 0.5371 0.02703 0.05953 0.12621
    8 0.5371 0.03630 0.07200 0.15233
    13 0.5207 0.01543 0.05248 0.11149
    19 0.5850 0.15647 0.07122 0.14019
    23 0.6447 0.27620 0.06088 0.12219 *

    This is a good result for me, almost 65%. But when I try now to do a training with:

    svm.model <- train(OUTPUT~., data = mydata.train, method = "svmRadial", trControl = trainControl(method = "cv", number = 10), tuneLength = 8, metric = "Accuracy")

    I can't get that accuracy. What model and parameters is the recursive elimination using?

  20. Prashanth March 5, 2016 at 7:36 am #

    Hi!
    The blog is really informative. Thanks!
    Would be nice to see some recommender system examples in R, like supermarket basket analysis and recommendations, and how they work.
    Could you please add a page for this?

    Thanks!

  21. Grace March 20, 2016 at 10:16 am #

    Great posting! I tried to use my own dataset, named R_feature_selection_test.csv,
    but it didn't work. I got the error message below. Could you please give me advice?

    data(R_feature_selection_test)
    Warning message:
    In data(R_feature_selection_test) :
    data set ‘R_feature_selection_test’ not found

    • Nitika July 14, 2016 at 6:57 pm #

      I am getting the same problem. Have you found any solution?

  22. jibs April 13, 2016 at 5:17 am #

    Hi, thanks for the great blog. My outcome is coded disease absent = 0, disease present = 1. The input contains both binary and continuous variables. Binary variables are coded as 0 and 1.

    I am getting the following error messages when I try lvq:

    Error in train.default(x, y, weights = w, …) :
    wrong model type for regression

    and later when I try rfe, I get the following warning:
    In randomForest.default(x, y, importance = first, …) :
    The response has five or fewer unique values. Are you sure you want to do regression?

    Any idea what might be causing this?

  23. jibs April 13, 2016 at 5:45 pm #

    I guess the question is: do lvq and rfe in caret work for categorical data? Has anyone applied these models to datasets containing categorical variables?

  24. jibs April 14, 2016 at 2:08 am #

    Both lines work now that I recoded the output as "yes"/"no" instead of 1/0. Make sure to do as.factor in case the recoded output is stored in character format.

    • Kg July 11, 2017 at 1:23 am #

      Great ! Thanks

  25. Zain May 2, 2016 at 6:40 pm #

    Hi Jason

    Excellent work.

    I would like to ask how we can use a t-test for feature selection. I have 12 attributes (variables) and one class variable (labels). Please guide me on how I can use a t-test for variable selection?

    Thanks,

  26. K May 13, 2016 at 1:52 pm #

    Just want to thank you for this accessible post!

  27. Manasi Shah June 30, 2016 at 6:36 am #

    Hi Jason,

    Thank you for the extremely reproducible code. It worked well for me and I could adapt it to my current problem as well. Since you have recently responded to a post, I was hoping you could address this very basic and general query I had regarding RFE.
    My belief so far was that RFE is an additional tool to supplement the findings from trained models (using the train function in caret or the randomForest function in the randomForest package), until I recently read a paper which did not explicitly say, but hinted, that feature selection is done prior to training the random forest model. So which scenario would be the appropriate use case for RFE?

    Also my accuracy using the RFE function is different than the accuracy I get by tuning the model for ROC. What might be the reason for this?

    Thanks a lot for your time,

    Best,
    Manasi

    • Jason Brownlee June 30, 2016 at 8:42 am #

      Normally you could perform feature selection then build your models.

      As part of feature selection, you can build models, but only to inform you as to what features to select. These are not the final models.

      • Manasi Shah June 30, 2016 at 3:59 pm #

        Thanks, that confirms my doubt and will be helpful in further incorporating this method in the analysis,

        Best,
        Manasi

      • Krishna prasad January 4, 2017 at 10:16 pm #

        Dear Jason,

        could you please build a model with feature selection using SVM-RFE followed by Genetic algorithm followed by permutation test and then any other model using R code

        • Jason Brownlee January 5, 2017 at 9:19 am #

          Sorry Krishna, I don’t have the capacity to write this code for you.

          Best of luck with your project.

  28. Yogesh July 7, 2016 at 3:36 pm #

    Hi Jason

    Thanks for the detailed note. It is very helpful for a fresher like me. I have a doubt: do we need to remove outliers before using the above techniques?


  30. Juhi September 14, 2016 at 6:33 am #

    Hi Jason, Great Posting!!!

    My original dataset has missing values in some of the columns, but to use rfe() I need to treat those missing values. If I treat missing values, my feature selection would be based on that; but if in the final model I am not treating missing values for those columns, wouldn't my results be skewed?

    • Jason Brownlee September 14, 2016 at 10:09 am #

      It may be.

      Try feature selection with the data with imputed missing values, then try feature selection with all records with missing data removed.

      See which set of features result in the model with the best performance.

      I hope that helps.

  31. Juhi September 17, 2016 at 6:06 am #

    Alright, Thanks!

    Also, is there a way to decide the number of iterations in the algorithm, or do we just try various numbers and then come up with an optimum?

    • Jason Brownlee September 17, 2016 at 9:37 am #

      Try various training lengths and see what works best.

  32. Brittany October 9, 2016 at 3:25 pm #

    Hi!
    I am new to this whole programming thing.

    I am trying to use the rank features by importance and I keep getting the error: Error in na.fail.default(list(SalePrice = c(208500L, 181500L, 223500L, :
    missing values in object

    My code looks like this:
    model <- train(SalePrice~., data=train_data, method="lvq", preProcess="scale", trControl=control)

    With train_data being the name of my data frame that has all my data. Would you know why I keep getting this error?

    • Jason Brownlee October 10, 2016 at 7:41 am #

      I've not seen this specific error Brittany. I wonder if it is an issue with your data.

      Perhaps try to cut back your data either rows or columns until your code begins to work – it might help unearth the cause.

  33. Ciara December 12, 2016 at 8:17 pm #

    Hi Jason

    This is a great post, thanks.

    I am trying to run the RFE on a dataset with approx 1000 data entries and 17 variables where several of the variables are categorical. I am not getting an error, however, the process just seems to keep running without stopping or coming to any conclusion.

    Do you know if this could be because of the size of my dataset or the type of data?

    Many thanks
    Ciara

  34. Gaurav Pandey January 12, 2017 at 5:20 am #

    Hi Jason, thanks for an amazing explanation. I have few queries regarding the cross validation.

    1. Why do we use cross validation exactly? Is it only to select the important feature? Or it has some other reason too?

    2. When do we do the cross validation? is it before the model building or after the model building?

    Hope to get a reply.

    Thanks
    Gaurav

    • Jason Brownlee January 12, 2017 at 9:39 am #

      Hi Gaurav,

      Cross-validation allows us to make decisions (choose models or choose features) by estimating the performance of the result of the choice on unseen data. This is more robust than reviewing the performance on the entire training dataset alone.

      It is used beforehand, to choose which model to build. Once chosen, the model can be constructed using all available data.

      I hope that helps.

  35. kanika March 9, 2017 at 8:47 pm #

    Hi,
    Can you please explain how to perform Feature selection using genetic algorithm on Pima Indians Diabetes Dataset in R? If possible in SAS also.

    • Jason Brownlee March 10, 2017 at 9:23 am #

      Sorry, I don’t have an example of feature selection with genetic algorithms. At least not yet.

  36. Roberto Garcia March 18, 2017 at 4:40 am #

    Thanks for sharing your knowledge, Jason. Excellent explanation.

    • Jason Brownlee March 18, 2017 at 7:50 am #

      You’re welcome Roberto, I’m glad you found it useful.

  37. Asad Khan March 23, 2017 at 5:40 pm #

    Hi sir, is there any tutorial for the GA algorithm for feature selection in binary classification? I am working on DNA and RNA datasets.

    • Jason Brownlee March 24, 2017 at 7:52 am #

      Sorry Asad, no tutorial on GA for feature selection yet, soon hopefully.

  38. pier March 25, 2017 at 12:36 am #

    great post Jason.

    I tried to fix ntree and find different mtry values.
    The code works well, but the resamples have the same ranges of accuracy for different mtry values.. not possible, something does not work!!!
    Thanks for the response.

    metric <- "Accuracy"
    control <- trainControl(method="cv", number=10, search="grid")
    tunegrid <- expand.grid(mtry=c(10, 20, 30, 40, 50))
    modellist2 <- list()
    for (mtry in c(10, 20,30, 40, 50)) {
    set.seed(seed)
    custom2 <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control, ntree=500)
    key2 <- toString(mtry)
    modellist2[[key2]] <- custom2
    }

    custom2
    results2 <- resamples(modellist2)
    summary(results2$values)

  39. Seo Young Jae April 10, 2017 at 3:26 am #

    Good information!
    I have two questions.

    1. In the varImp selection, when I implement that code, I can see an "Overall" column.
    What is the overall?? (like a p-value, t-test…. what is it?)

    2. And in the varImp() result, which variables have to be selected or removed?? What's the criteria?

  40. khawla April 18, 2017 at 7:10 pm #

    Your post is really helpful, thank you so much for the information. It works only for quantitative variables; I need to know how to calculate the matrix with qualitative variables.

    • Jason Brownlee April 19, 2017 at 7:50 am #

      Good question, sorry I do not have an example at the moment.

  41. Svetlana April 21, 2017 at 10:42 pm #

    Thank you for the post, Jason! Are there any packages (algorithms) for feature selection on nominal data?

    • Jason Brownlee April 22, 2017 at 9:26 am #

      I expect there must be, but I don’t recall off hand sorry.

  42. Phil June 30, 2017 at 3:44 am #

    Is there a way to make findCorrelation() flag attributes that are strictly above the cutoff? I see from my use case that the absolute correlation value is compared against the cutoff, as in the verbose output snippet below (cutoff=0.9):

    Combination row 12474 and column 12484 is above the cut-off, value = 0.922
    Flagging column 12474
    Combination row 12476 and column 12484 is above the cut-off, value = -0.913
    Flagging column 12476

    Is there a way to not flag negative correlation?

    thanks for your help!

  43. Lee Zee July 2, 2017 at 3:02 am #

    Hi Jason,
    I am assuming that before I get to using the above feature selection methods, I'll have to convert the non-numeric values to numeric values (e.g., converting "W" for wins to "1"s and "L" for losses to "0"s). Is that right? Or can I leave those columns with non-numeric values as is? My intent, of course, is to be able to get to the point where I can do an intelligent feature selection.

    Thanks in advance.

    Lee

    • Jason Brownlee July 2, 2017 at 6:34 am #

      Generally yes, most methods expect to work with numeric values.

  44. Mike K July 17, 2017 at 9:35 am #

    Hi Jason,

    Thank you for a great post! I am wondering how to apply the same technique to a large data set in order to keep all features and extract the required rows (as a subset or sample) that are highly correlated. Say we start with a matrix of 1000000 rows and 15 variables, I want to extract 20 rows that are most or least correlated. What technique (or algorithm) would be best? Thank you in advance!

    • Jason Brownlee July 18, 2017 at 8:39 am #

      Interesting idea Mike. You may need to prepare some custom code for this task.

  45. Jean-Sebastien P. July 25, 2017 at 4:52 am #

    Hi Jason,
    Thank you very much for the explanation. Now I’m trying to use it on another dataset, but I’m running into a bit of an issue. My dataset had 962 features and 144 observations. My response is a factor with 4 levels and all other variables are either numeric or integer. So far, I’m able to run the first part, but I’m getting an error when I build my model:

    library(mlbench)
    library(caret)
    set.seed(233)
    correlationMatrix <- cor(dataset[,3:962])
    highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
    control <- trainControl(method = "repeatedcv", number=10, repeats=3)
    model <- …
    Error in seeds[[num_rs + 1L]] : subscript out of bounds

    It’s the first time I’ve encountered this error and I didn’t find any information that could help me so far. I was wondering if you had a hint for me.
    Thank you!

    • Jason Brownlee July 25, 2017 at 9:48 am #

      Sorry to hear that. I have not seen this error.

      Perhaps try simplifying your code/run to help expose it.

      Perhaps try searching stackoverflow/crossvalidated or even posting the issue there?

  46. Murillo August 9, 2017 at 6:08 pm #

    Hi Jason! I am using the Feature Selection recipe, copying and adapting the code to my data, and instead of giving me the results according to ACCURACY, it gives them according to RMSE. If I use the selected variables in a multiple linear regression this results in a different RMSE value. Could you help me change this plot to accuracy instead of RMSE? Thanks!

    • Jason Brownlee August 10, 2017 at 6:54 am #

      If your problem is a regression problem (predicting a real value), then you cannot calculate accuracy, you must calculate a prediction error, like RMSE.

  47. aquaq August 15, 2017 at 1:13 am #

    Hi Jason, thanks for these wonderful posts.

    I have a question: are there any limitations on the number of features vs. the number of observations for machine learning algorithms? E.g. can I run SVM or random forest with more features than observations? For example, for linear regression I have read that (as a rule of thumb) the number of features should not exceed 1/5 of the number of observations, to avoid overfitting.

    Thank you!
