# Feature Selection with the Caret R Package

Last Updated on August 22, 2019

Selecting the right features in your data can mean the difference between mediocre performance with long training times and great performance with short training times.

The caret R package provides tools to automatically report on the relevance and importance of attributes in your data and even select the most important features for you.

In this post you will discover the feature selection tools in the caret R package, with standalone recipes in R.

After reading this post you will know:

• How to remove redundant features from your dataset.
• How to rank features in your dataset by their importance.
• How to select features from your dataset using the Recursive Feature Elimination method.


Let’s get started.



## Remove Redundant Features

Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed.

The caret R package provides the findCorrelation function, which analyzes a correlation matrix of your data’s attributes and reports the attributes that can be removed.

The following example loads the Pima Indians Diabetes dataset, which contains a number of biological attributes from medical reports. A correlation matrix is created from these attributes and highly correlated attributes are identified. In this case, the age attribute is removed because it correlates highly with the pregnant attribute.

Generally, you want to remove attributes with an absolute correlation of 0.75 or higher.
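The recipe described above can be sketched as follows. This is a minimal sketch, assuming the mlbench package (which ships the PimaIndiansDiabetes data frame) and caret are installed. The 0.5 cutoff is an assumption chosen for illustration so that a moderately correlated pair such as age and pregnant can be flagged; swap in 0.75 or whatever threshold suits your data.

```r
# Load the libraries (assumes mlbench and caret are installed)
library(mlbench)
library(caret)

# Load the Pima Indians Diabetes data shipped with mlbench
data(PimaIndiansDiabetes)

# Correlation matrix of the 8 input attributes (column 9 is the class label)
correlationMatrix <- cor(PimaIndiansDiabetes[, 1:8])
print(round(correlationMatrix, 2))

# Column indices of the attributes that findCorrelation suggests removing.
# cutoff = 0.5 is an illustrative assumption, not a recommendation.
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
print(names(PimaIndiansDiabetes)[highlyCorrelated])

# Drop the flagged attributes, guarding against the case where none are flagged
reduced <- if (length(highlyCorrelated) > 0) {
  PimaIndiansDiabetes[, -highlyCorrelated]
} else {
  PimaIndiansDiabetes
}
print(names(reduced))
```

findCorrelation compares absolute correlations against the cutoff, so strongly negative pairs are flagged as well as strongly positive ones.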

## Rank Features By Importance

The importance of features can be estimated from data by building a model. Some methods, like decision trees, have a built-in mechanism to report on variable importance. For other algorithms, the importance can be estimated using a ROC curve analysis conducted for each attribute.

The example below loads the Pima Indians Diabetes dataset and constructs a Learning Vector Quantization (LVQ) model. The varImp function is then used to estimate the variable importance, which is printed and plotted. It shows that the glucose, mass and age attributes are the top 3 most important attributes in the dataset and the insulin attribute is the least important.
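A sketch of this recipe is given here, assembled from the calls quoted in the comments below (the train, trainControl and varImp lines). The seed value is an arbitrary assumption for reproducibility, and the code assumes mlbench and caret are installed; some caret versions may also ask for the e1071 package when training classifiers.

```r
# Fix the random seed so resampling is repeatable (the value 7 is arbitrary)
set.seed(7)

# Load the libraries
library(mlbench)
library(caret)

# Load the dataset
data(PimaIndiansDiabetes)

# Resampling scheme: 10-fold cross-validation repeated 3 times
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Train an LVQ model on all attributes, scaling the inputs first
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "lvq",
               preProcess = "scale", trControl = control)

# Estimate variable importance on its raw (unscaled) score
importance <- varImp(model, scale = FALSE)

# Summarize and plot the importance scores
print(importance)
plot(importance)
```

Because LVQ has no importance measure of its own, the scores reported here come from the per-attribute ROC analysis rather than from the fitted model itself.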

Rank of Features by Importance using Caret R Package
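The per-attribute ROC analysis mentioned above can also be run directly, without training a model first. The sketch below uses caret's filterVarImp function for this; treating it as a stand-in for what varImp does with a model that lacks its own importance measure is an assumption on my part, so compare the two outputs on your own data.

```r
# Load the libraries (assumes mlbench and caret are installed)
library(mlbench)
library(caret)

# Load the dataset
data(PimaIndiansDiabetes)

# Model-free importance: one ROC-based score per input attribute
rocImportance <- filterVarImp(x = PimaIndiansDiabetes[, 1:8],
                              y = PimaIndiansDiabetes$diabetes)

# Rank the attributes from most to least important using the first score column
ranked <- rocImportance[order(rocImportance[[1]], decreasing = TRUE), , drop = FALSE]
print(ranked)
```

Scores near 0.5 suggest an attribute separates the two classes little better than chance, while scores closer to 1 suggest a more useful attribute.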

## Feature Selection

Automatic feature selection methods can be used to build many models with different subsets of a dataset and identify those attributes that are and are not required to build an accurate model.

A popular automatic method for feature selection provided by the caret R package is called Recursive Feature Elimination or RFE.

The example below demonstrates the RFE method on the Pima Indians Diabetes dataset. A Random Forest algorithm is used on each iteration to evaluate the model. The algorithm is configured to explore all possible subset sizes of the attributes. All 8 attributes are selected in this example, although the plot of accuracy against attribute subset size shows that just 4 attributes give almost comparable results.
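A sketch of the RFE recipe follows, based on the rfeControl and rfe calls quoted in the comments below. It assumes the mlbench, caret and randomForest packages are installed (a commenter notes that randomForest has to be loaded explicitly), and the seed value is an arbitrary assumption.

```r
# Fix the random seed so the cross-validation folds are repeatable (arbitrary value)
set.seed(7)

# Load the libraries; randomForest provides the model used inside rfFuncs
library(mlbench)
library(caret)
library(randomForest)

# Load the dataset
data(PimaIndiansDiabetes)

# Define the control: random forest helper functions with 10-fold cross-validation
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# Run RFE over subset sizes 1 to 8; columns 1:8 are inputs, column 9 is the class
results <- rfe(PimaIndiansDiabetes[, 1:8], PimaIndiansDiabetes[, 9],
               sizes = c(1:8), rfeControl = control)

# Summarize the results
print(results)

# List the chosen features
print(predictors(results))

# Plot accuracy against the number of attributes
plot(results, type = c("g", "o"))
```

The sizes argument controls which subset sizes are evaluated, so a shorter vector such as c(2, 4, 6) is a quick way to cut run time on wider datasets.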

Feature Selection Using the Caret R Package

## Summary

In this post you discovered 3 feature selection methods provided by the caret R package. Specifically, you learned how to search for and remove redundant features, rank features by importance, and automatically select a subset of the most predictive features.

Three standalone recipes in R were provided that you can copy-and-paste into your own project and adapt for your specific problems.


### 177 Responses to Feature Selection with the Caret R Package

1. Ajay October 21, 2014 at 6:31 pm #

Thanks for an awesome post. Your writings always talk to the point of practical knowledge which i love, easy to understand !!. I think one thing you missed out in Recursive Feature Elimination or RFE. is to include a library “library(randomForest)”, bcoz when i tried to replicate it was unable to find the Random Forest package.

Keep writing its helping lot !!! 🙂

Thanks.

2. Shan November 23, 2014 at 7:27 pm #

Hi,
I was trying to run importance of features from above posts but I couldnt.
I am receiving an error for package ‘e1071′. The command I used to install it is:
install.packages("e1071", dep=TRUE, type="source")

the error is :
ERROR: compilation failed for package ‘e1071’

I am using latest version of R 3.1.1.

Can anyone please help me in it.

• Arun Anoop M October 14, 2019 at 2:46 am #

> install.packages(‘e1071’)
trying URL ‘https://cran.hafro.is/bin/windows/contrib/3.6/e1071_1.7-2.zip’
Content type ‘application/zip’ length 1021699 bytes (997 KB)
downloaded 997 KB

package ‘e1071’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in

3. Jordan February 27, 2015 at 5:37 am #

This was a great read! Thank you! I do have one question. With my data set I performed the last two options (ranking by importance and then feature selection), however, the top features selected by the methods were not the same. Why would this be? Is one better than the other?

Thanks!

• Jason Brownlee February 27, 2015 at 5:41 am #

Thanks Jordan.
This is a good point. Different methods will select different subsets of features. There is likely no “best set” of features just like there is no best model. My advice is to model each subset of features and see what works best for your problem and your needs.

Applied machine learning is a process of empirical hypothesis testing – lots of trial and error.

4. Ahmad February 27, 2015 at 12:35 pm #

Thanks for such a useful post.
I have one question about RFE method. I am not sure if I’ve got the point correctly because I wonder which method is used to build the model in each step and if it is possible to build the model using SVM in each iteration?

5. Manuel April 3, 2015 at 9:55 pm #

Your post is a very nice read – I’m new to caret and found it useful to get things running quickly. I am working on a p>>n classification problem, in particular I am not interested in a blackbox predictive model, but rather a more explanatory model, therefore I’m trying to extract sets of important features that can help to explain the outcome (I have additional data to validate the relationship between the extracted features).

I am wondering if it makes sense to run RFE with different models and extract the subset of important variables for fitting each respective model, and then find the intersection of the set of features. I’d expect that each model might produce different sets of features (with the most important ones shared), and that the sets of features for one model might not work as good for another type of model, so I’m not sure though.

What’s the best way to approach a problem like this? Thanks!

6. Conor April 12, 2015 at 3:48 am #

Brilliant post and very well laid out. Thanks a million.

7. Rishabh April 13, 2015 at 4:36 pm #

can we do feature extraction using caret package?

8. Kaiyu May 4, 2015 at 8:26 am #

Great pics, but when running the code i found that rf feature selection would recommend 5 features as recommended, how to change the default setting? many thanks!!!

• Subhasis February 5, 2021 at 3:49 am #

I have the same question. How can we plot all the variables instead of top 5?

9. Sangram May 4, 2015 at 4:31 pm #

Hi. Great post! I have a doubt about using varImp for feature selection. I have already written an algorithm that runs randomForest for building a model on training set. Now I want to use varImp() to select only important features before applying randomForest. But varImp seems to build yet another model on the data to extract feature importance, which seems to be a bit counter-intuitive. Would you please explain me the significance of varImp before using randomForest to train a model on training data? How can I use varImp in such case?

10. Amit Nagar June 3, 2015 at 7:26 am #

Thanks for the informative post.

I am working with the data from Lending Club that is made available to public on their website. The dataset has a significant number of non numerical columns (grade, loan status etc). I am assuming before I get to feature selection methods described above, I will have to map these non numeric data to numeric values. My question: is there a prescribed way for handling such a situation or is it okay to follow an ad hoc mapping scheme. My intent, of course, is to be able to get to the point where I can do an intelligent feature selection. Thanks
Amit

11. Courage July 2, 2015 at 11:11 am #

Hello James,

I follow your newsletter. They have saved me at the right times 2x! Thanks. Also, do any of these algorithms take into consideration normalization of the data?

12. Sam August 21, 2015 at 12:54 am #

Thanks for the useful information.
How can it be applied for svm?

13. ARCHANA September 12, 2015 at 2:30 am #

thanks a lot for the wonderful information!

14. Saeed October 3, 2015 at 1:18 am #

It seems feature importance works with LVQ method only for classification problems, but not for regression problems, doesn’t it?

15. Raj October 4, 2015 at 6:45 am #

Awesome ML post i’ve come across Jason !!

I tried the code and i am trying to model only the features that are selected by the RFE algorithm. I get an error when i tried to use. Couldn’t figure out why it is giving an error

‘Error in eval(expr, envir, enclos) : object ‘diabetes’ not found’ at model (….) statement.

#Load the libraries
rm(list=ls())
library(mlbench)
library(caret)

#load the data
data(PimaIndiansDiabetes)

#define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method='cv', number=10)

#run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl = control)

#summarize the results
print(results)
names(PimaIndiansDiabetes)

#list the chosen features
predictors(results)

#Plot the results
plot(results, type=c('g','o'))

PimaIndiansDiabetes$diabetes <- as.character(PimaIndiansDiabetes$diabetes)
PimaIndiansDiabetes$diabetes[PimaIndiansDiabetes$diabetes=='neg'] <- 0
PimaIndiansDiabetes$diabetes[PimaIndiansDiabetes$diabetes=='pos'] <- 1
PimaIndiansDiabetes$diabetes <- as.factor(PimaIndiansDiabetes$diabetes)
View(PimaIndiansDiabetes)

#Split data
inTrain = createDataPartition(y=PimaIndiansDiabetes$diabetes, p=0.7, list=FALSE)
train <- iris[inTrain,]
test <- iris[-inTrain,]
attach(train)

#train the model
model <- train(diabetes ~ glucose + mass + age + pregnant + pedigree, data=train, trControl = train_control, method='lvq', tuneLength=5)

#Summary of model
summary(model)

• Raghavendra October 16, 2015 at 4:55 pm #

You need to work with “PimaIndiansDiabetes” dataset which is within Rstudio. Not the one (pima.indians.diabetes.data) given in this page.

16. Ben October 5, 2015 at 6:12 pm #

Hey,

I’m trying to plot the importance of the different variables, but calling

importance <- varImp(model, scale=FALSE)

gives the error

Error: (list) object cannot be coerced to type 'double'

Any advice what I'm doing wrong or how to debug this issue?

Thanks in advance!

17. deepak November 4, 2015 at 1:24 am #

Awesome post Jason. I have been following your blogs like a university course to get up to speed in ML.

18. Vinh November 27, 2015 at 9:13 pm #

In Importance part, for instance in your case, the variable ‘Age’ ranked at 3rd place, but can I know it is positive or negative way? I.e. higher age will lead to more chance of diabete, or vice versa?

19. jake February 15, 2016 at 1:51 pm #

So I did the feature elimination as so :

control <- rfeControl(functions=caretFuncs, method="cv", number=10)
results <- rfe(mydata.train[,1:23], mydata.train[,24], sizes=c(2,5,8,13,19), rfeControl=control , method="svmRadial")
print(results)

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold)

Resampling performance over subset size:

Variables Accuracy Kappa AccuracySD KappaSD Selected
2 0.5100 -0.02879 0.05230 0.08438
5 0.5371 0.02703 0.05953 0.12621
8 0.5371 0.03630 0.07200 0.15233
13 0.5207 0.01543 0.05248 0.11149
19 0.5850 0.15647 0.07122 0.14019
23 0.6447 0.27620 0.06088 0.12219 *

This is a good result for me , almost 65% .
But when I try now to do a training with :

svm.model <- train(OUTPUT~.,data = mydata.train,method = "svmRadial",trControl = trainControl(method = "cv",number = 10),tuneLength = 8,metric="Accuracy")

, I can't get that accuracy . What model and parameters is the recursive elimination using ?

20. Prashanth March 5, 2016 at 7:36 am #

Hi! The blog is really informative.Thanks! Would be nice to see some recommender system examples like supermarket basket analysis and recommendations to them in R. How it works. Could you please add a page for this? Thanks!

21. Grace March 20, 2016 at 10:16 am #

great posting! I tried to use my dataset named R_feature_selection_test.csv but it didn’t work. I got an error message as below. Could you please give me advice?

data(R_feature_selection_test)
Warning message:
In data(R_feature_selection_test) :
data set ‘R_feature_selection_test’ not found

• Nitika July 14, 2016 at 6:57 pm #

I am getting the same problem. Have you found any solution?

• R May 17, 2020 at 5:23 am #

This is an old post, but just in case someone needs this:
You need to first import your data into R.

mydata <- read.csv("R_feature_selection_test.csv")

22. jibs April 13, 2016 at 5:17 am #

Hi

Thanks for the great blog. My outcome is coded disease absent = 0, disease present = 0. Input contains both binary and continous variables. Binary variables are coded in 0 and 1. I am getting the following error messages when I try lvq:

Error in train.default(x, y, weights = w, …) :
wrong model type for regression

and later when I try rfe, I get the folllowing warning:

In randomForest.default(x, y, importance = first, …) :
The response has five or fewer unique values. Are you sure you want to do regression?

Any idea what might be causing this?

23. jibs April 13, 2016 at 5:45 pm #

i guess the question is: does lvq and ref in caret work for categorical data? Has anyone applied these models to datasets containing catogerical variables?

24. jibs April 14, 2016 at 2:08 am #

both lines work now when i recoded the ouput as “yes” , “no” instead of 1 /0 . Make sure to do as.factor in case the recoded output is stored as character format

• Kg July 11, 2017 at 1:23 am #

Great ! Thanks

• Jack Swen May 6, 2020 at 7:03 am #

Hi I also met this kind of problem however, my dependent variables is not “yes” ‘no’. It is some number. Can I also use this to do the feature selection?

• Jason Brownlee May 6, 2020 at 1:35 pm #

Yes. You might need to encode the factor as 0 and 1.

25. Zain May 2, 2016 at 6:40 pm #

Hi Jason

Excellent work. I would like to ask how we can using t-test for feature selection. I have 12 attributes (variables) and one class variable(labels). Please guide me , how I can use t-test for variable selection?

Thanks,

26. K May 13, 2016 at 1:52 pm #

Just want to thank you for this accessible post!

27. Manasi Shah June 30, 2016 at 6:36 am #

Hi Jason,

Thank you for the extremely reproducible code. It worked well for me and I could adapt it to my current problem as well. Since you have recently responded to a post, I was hoping you could address this very basic and general query I had regarding RFE.

My belief so far was that RFE is an additional tool to supplement the findings from trained models using the train function in caret or the randomForest function in the random forest package until I read a paper recently which did not explicitly say but hinted that feature selection is done prior to training the random forest model. So which scenario would be the appropriate use case for RFE?

Also my accuracy using the RFE function is different than the accuracy I get by tuning the model for ROC. What might be the reason for this?

Thanks a lot for your time,

Best,
Manasi

• Jason Brownlee June 30, 2016 at 8:42 am #

Normally you could perform feature selection then build your models.

As part of feature selection, you can build models, but only to inform you as to what features to select. These are not the final models.

• Manasi Shah June 30, 2016 at 3:59 pm #

Thanks, that confirms my doubt and will be helpful in further incorporating this method in the analysis,

Best,
Manasi

• Krishna prasad January 4, 2017 at 10:16 pm #

Dear Jason, could you please build a model with feature selection using SVM-RFE followed by Genetic algorithm followed by permutation test and then any other model using R code

• Jason Brownlee January 5, 2017 at 9:19 am #

Sorry Krishna, I don’t have the capacity to write this code for you. Best of luck with your project.

28. Yogesh July 7, 2016 at 3:36 pm #

Hi Jason

Thanks for the detailed note It is very helpful for fresher like me.I have a doubt that do we need to remove outlier before using above techniques.

• Jason Brownlee July 8, 2016 at 7:16 am #

Yes, often data cleaning is a good first step.

29. Nitika July 17, 2016 at 5:42 pm #

great posting! I tried to use my dataset named R_feature_selection_test.csv but it didn’t work. I got an error message as below. Could you please give me advice?

data(R_feature_selection_test)
Warning message:
In data(R_feature_selection_test) :
data set ‘R_feature_selection_test’ not found

30. Juhi September 14, 2016 at 6:33 am #

Hi Jason,

Great Posting!!!

My original dataset has missing values in some of the columns but to use rfe() I need to treat those missing values, If I treat missing values my feature selection would be based on this but in the final model I am not treating missing values for those columns, wouldn’t my results be skewed?

• Jason Brownlee September 14, 2016 at 10:09 am #

It may be.

Try feature selection with the data with imputed missing values, then try feature selection with all records with missing data removed. See which set of features result in the model with the best performance.

I hope that helps.

31. Juhi September 17, 2016 at 6:06 am #

Alright, Thanks!
Also is there a way to decide number of iteration in the algorithm or we just try it for various numbers and then try to come up with an optimum number

• Jason Brownlee September 17, 2016 at 9:37 am #

Try various training lengths and see what works best.

32. Brittany October 9, 2016 at 3:25 pm #

Hi! I am new to this whole programming thing. I am trying to use the rank features by importance and I keep getting the error:

Error in na.fail.default(list(SalePrice = c(208500L, 181500L, 223500L, :
missing values in object

My code looks like this:

model <- train(SalePrice~., data=train_data, method="lvq", preProcess="scale", trControl=control)

With train_date being the name of my data frame that has all my data. Would you know why I keep getting this error?

• Jason Brownlee October 10, 2016 at 7:41 am #

I’ve not seen this specific error Brittany. I wonder if it an issue with your data. Perhaps try to cut back your data either rows or columns until your code begins to work – it might help unearth the cause.

• Suren February 21, 2018 at 4:04 am #

Hi Brittany/ Jason,

Did you end up finding the solution to this?

Apologies for any trouble.

33. Ciara December 12, 2016 at 8:17 pm #

Hi Jason

This is a great post, thanks. I am trying to run the RFE on a dataset with approx 1000 data entries and 17 variables where several of the variables are categorical. I am not getting an error, however, the process just seems to keep running without stopping or coming to any conclusion. Do you know if this could be because of the size of my dataset or the type of data?

Many thanks

Ciara

• Jason Brownlee December 13, 2016 at 8:06 am #

Hi Ciara,

RFE might not be the best method for categorical inputs. Try wrapping a tree method and see how that goes.

More on feature selection in general here:
http://machinelearningmastery.com/an-introduction-to-feature-selection/

• OD July 5, 2018 at 8:26 pm #

Jason Hi

Thanks for the spot on post.
However, I’m getting the same problem (processing won’t end) and my variable are not categorical. What might be the problem?

• Jason Brownlee July 6, 2018 at 6:40 am #

Perhaps cut your data back to a small set inline to help figure out the cause of the fault?

34. Gaurav Pandey January 12, 2017 at 5:20 am #

Hi Jason, thanks for an amazing explanation. I have few queries regarding the cross validation.

1. Why do we use cross validation exactly? Is it only to select the important feature? Or it has some other reason too?

2. When do we do the cross validation? is it before the model building or after the model building?

Hope to get a reply.

Thanks
Gaurav

• Jason Brownlee January 12, 2017 at 9:39 am #

Hi Gaurav,

Cross-validation allows us to make decisions (choose models or choose features) by estimating the performance of the result of the choice on unseen data. This is more robust than reviewing the performance on the entire training dataset alone.

It is used to choose which model to build, prior. Once chosen, the model can be constructed using all available data.

I hope that helps.

35. kanika March 9, 2017 at 8:47 pm #

Hi,

Can you please explain how to perform Feature selection using genetic algorithm on Pima Indians Diabetes Dataset in R? If possible in SAS also.

• Jason Brownlee March 10, 2017 at 9:23 am #

Sorry, I don’t have an example of feature selection with genetic algorithms. At least not yet.

36. Roberto Garcia March 18, 2017 at 4:40 am #

Thanks for sharing your knowledge, Jason. Excelent explanation.

• Jason Brownlee March 18, 2017 at 7:50 am #

You’re welcome Roberto, I’m glad you found it useful.

37. Asad Khan March 23, 2017 at 5:40 pm #

hi sir, is any tutorial for GA algorithm for feature selection in binary classification i am working on DNA and RNA datasets.

• Jason Brownlee March 24, 2017 at 7:52 am #

Sorry Asad, no tutorial on GA for feature selection yet, soon hopefully.

38. pier March 25, 2017 at 12:36 am #

great post Jason.
I tried to fix ntree and find different mtry. The code works well, but resamples have same ranges of accuracy by different mtry values..not possible. .something doesnot work!!! Thanks for response

metric <- "Accuracy"
control <- trainControl(method="cv", number=10, search="grid")
tunegrid <- expand.grid(mtry=c(10, 20, 30, 40, 50))
modellist2 <- list()
for (mtry in c(10, 20,30, 40, 50)) {
set.seed(seed)
custom2 <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control, ntree=500)
key2 <- toString(mtry)
modellist2[[key2]] <- custom2
}
custom2
results2 <- resamples(modellist2)
summary(results2$values)

39. Seo Young Jae April 10, 2017 at 3:26 am #

Good information!
I have two questions.

1. In varImp selection, when I implement that code, I can see an overall.
What is the overall??(like p-value, t-test …. what is it?)

2. And in the varImp() result, what variables have to be selected or removed?? what’s the criteria?

40. khawla April 18, 2017 at 7:10 pm #

your post is really helpful,thank you so much for those information, it works only for quantitative variables i need to know how to calculate the matrix with qualitative variables.

• Jason Brownlee April 19, 2017 at 7:50 am #

Good question, sorry I do not have an example at the moment.

41. Svetlana April 21, 2017 at 10:42 pm #

Thank you for the post, Jason! Is there any packages (algorithms) for feature selection on nominal data?

• Jason Brownlee April 22, 2017 at 9:26 am #

I expect there must be, but I don’t recall off hand sorry.

42. Phil June 30, 2017 at 3:44 am #

Is there a way to make findCorrelation() flag attributes that are strictly above cutoff? I see from my use case that the absolute correlation value is compared against cutoff, as in the verbose output snippet below (cutoff=0.9):

Combination row 12474 and column 12484 is above the cut-off, value = 0.922
Flagging column 12474
Combination row 12476 and column 12484 is above the cut-off, value = -0.913
Flagging column 12476

Is there a way to not flag negative correlation?

thanks for your help!

• Jason Brownlee June 30, 2017 at 8:14 am #

It may work on the absolute abs() value.

43. Lee Zee July 2, 2017 at 3:02 am #

Hi Jason,
I am assuming that before I get to using the above feature selection methods, I’ll have to convert these non numeric values to numeric values (e.g., converting “W” for wins to “1”s and “L” for losses to “0”s). Is that right? Or, can I leave those columns with non-numeric values as is. My intent, of course, is to be able to get to the point where I can do an intelligent feature selection.

Thanks in advance.

Lee

• Jason Brownlee July 2, 2017 at 6:34 am #

Generally yes, most methods expect to work with numeric values.

44. Mike K July 17, 2017 at 9:35 am #

Hi Jason,

Thank you for a great post! I am wondering how to apply the same technique to a large data set in order to keep all features and extract the required rows (as a subset or sample) that are highly correlated. Say we start with a matrix of 1000000 rows and 15 variables, I want to extract 20 rows that are most or least correlated. What technique (or algorithm) would be best? Thank you in advance!

• Jason Brownlee July 18, 2017 at 8:39 am #

Interesting idea Mike. You may need to prepare some custom code for this task.

45. Jean-Sebastien P. July 25, 2017 at 4:52 am #

Hi Jason,
Thank you very much for the explanation. Now I’m trying to use it on another dataset, but I’m running into a bit of an issue. My dataset had 962 features and 144 observations. My response is a factor with 4 levels and all other variables are either numeric or integer. So far, I’m able to run the first part, but I’m getting an error when I build my model:

library(mlbench)
library(caret)
set.seed(233)
correlationMatrix <- cor(dataset[,3:962])
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
control <- trainControl(method = "repeatedcv", number=10, repeats=3)
model Error in seeds[[num_rs + 1L]] : subscript out of bounds

It’s the first time I’ve encountered this error and I didn’t find any information that could help me so far. I was wondering if you had a hint for me.
Thank you!

• Jason Brownlee July 25, 2017 at 9:48 am #

Sorry to hear that. I have not seen this error.

Perhaps try simplifying your code/run to help expose it.

Perhaps try searching stackoverflow/crossvalidated or even posting the issue there?

• vizzy June 3, 2018 at 3:21 am #

getting the same issue? did you solve it?

• Lopamudra Chakraborty August 18, 2018 at 4:02 pm #

I have the same problm. did you solve it?

46. Murillo August 9, 2017 at 6:08 pm #

Hi Jason! I am using the Feature Selection, copy and adapting the code to my data and instead of giving me the results according to the ACCURACY, it gives me acording to the RMSE. If I use the selected variables in a multiple linear regression this results in a different RMSE value. Could you help me to change this plot to accuracy instead of RMSE ? Thanks!

• Jason Brownlee August 10, 2017 at 6:54 am #

If your problem is a regression problem (predicting a real value), then you cannot calculate accuracy, you must calculate a prediction error, like RMSE.

47. aquaq August 15, 2017 at 1:13 am #

Hi Jason, thanks for these wonderful posts.

I have a question: are there any limitations for the number of features vs. number of observations for machine learning algorithms? E.g can I run SVM or random forest with more feature than observations? For example, for linear regression, I have read that (as a rule of thumb), the number of features better not exceed the 1/5 of the number of observations to avoid overfitting.

Thank you!

48. Piroska October 10, 2017 at 6:22 am #

Thank you for your great posts! I like simple solutions with simple codes like these. These post was very useful for my project. Thanks a lot! Piroska

• Jason Brownlee October 10, 2017 at 7:53 am #

Thanks Piroska, I’m glad to hear that!

49. Noor October 19, 2017 at 10:36 pm #

Hi Jason

Thank you for your post.

in your post, you have used a numeric data set.

Is it possible to apply the mentioned methods on mixed data set such as heart, and
hepatitis datasets from UCI.

Thanks

• Jason Brownlee October 20, 2017 at 5:35 am #

Perhaps, but you will need to encoder categorical variables to integer values or binary vectors.

50. omesad October 19, 2017 at 10:54 pm #

how can I obtain the importance for each attribute using a ROC curve analysis?

51. Jon November 2, 2017 at 5:32 am #

Why does findCorrelation only report on the first row? How do I get it to report on all correlations?

• Jason Brownlee November 2, 2017 at 3:54 pm #

You can learn more about the cor function by typing: ?cor

52. rawia November 6, 2017 at 7:08 am #

Help i got this error results <- rfe(data[,1:71], data[,72], sizes=c(1:71), rfeControl=control)
Error in { : task 9 failed – "Can't have empty classes in y."

• Jason Brownlee November 7, 2017 at 9:44 am #

Double check your data, e.g. assign to a variable and summarize it to ensure it is as you expect.

53. monizza syed December 7, 2017 at 7:46 am #

model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trCo

can you explain me what diabetes refers to ? or where it come from?

• Jason Brownlee December 7, 2017 at 8:09 am #

The outcome variable (column) in the dataset (PimaIndiansDiabetes).

54. monizza syed December 9, 2017 at 11:04 am #

Hi ,

I want to understand you code could you help me

model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)

in above statement what diabetes is referred to?

• Jason Brownlee December 10, 2017 at 5:16 am #

Sure, ‘diabetes’ is the target column in the ‘PimaIndiansDiabetes’ dataset.

55. monizza syed December 9, 2017 at 11:06 am #

sorry omit my last post

56. monizza syed December 10, 2017 at 2:21 am #

hi,

I am trying to running your code by replacing my dataset but it is giving me error for wrong model for regression in feature selection by Varimp

any idea what it could cause ?

• Jason Brownlee December 10, 2017 at 5:23 am #

Perhaps post your error to stackoverflow?

• Monizza Syed December 10, 2017 at 10:17 am #

i give it try but didn’t got any response

57. Sonal January 22, 2018 at 5:34 pm #

hi, i am trying but it says there is no package called caret and no package called mlbench.pls help

• Jason Brownlee January 23, 2018 at 7:49 am #

I’m sorry to hear that. Perhaps try posting to stackoverflow with your error messages?

• kagr February 13, 2018 at 11:33 am #

If the error message is “Error in library(mlbench)”, you first have to install the package:
> install.packages("mlbench")

58. poonam February 20, 2018 at 4:12 am #

> I am getting problem while installing Fselector command, i am not getting whats the matter
plz help
install.packages("FSelector")
Error in fetch(key) :
lazy-load database ‘C:/Users/ux305/Documents/R/win-library/3.4/FSelector/help/FSelector.rdb’ is corrupt

> library(FSelector)
Error: package or namespace load failed for ‘FSelector’:
.onLoad failed in loadNamespace() for ‘rJava’, details:
call: fun(libname, pkgname)
error: JAVA_HOME cannot be determined from the Registry

• Jason Brownlee February 21, 2018 at 6:31 am #

I’m sorry to hear that. Perhaps try posting the error to stackoverflow?

59. rawia February 22, 2018 at 9:32 am #

I want to ask which performance evaluation metric is best for feature selection in the case of classification (ROC, MSRE, accuracy, …)? Thank you.

• Jason Brownlee February 22, 2018 at 11:23 am #

It really depends on your project, your goals, and your specific dataset.

60. Kingsley Nwosu March 28, 2018 at 12:44 pm #

Hi Jason, this is a very good post and I am a huge fan, because all your work makes ML very easy to handle. I am just wondering if any of your posts go further and do the main classification task, i.e. training a model on the attributes selected by RFE?

• Jason Brownlee March 28, 2018 at 2:34 pm #

Sorry, I don’t have more detail on RFE at this stage.

61. Prakhar Khandelwal March 30, 2018 at 9:08 am #

It was really nice, but when I used
model <- train(quality~., data=wine, method="lvq", trControl=control)

then

error: wrong model type for regression.
Does this model not work with purely numeric data?
I was trying this model to the Red Wine dataset.

• Jason Brownlee March 31, 2018 at 6:30 am #

Perhaps take the wine quality scores as 10 classes?
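
As a minimal sketch of that suggestion: caret decides between regression and classification from the type of the outcome column, and lvq only supports classification, so a numeric outcome leads to the "wrong model type for regression" error. Converting the outcome to a factor is one way around it. The `wine` data frame below is a made-up stand-in, not the real Red Wine dataset.

```r
# Hypothetical stand-in for the wine data: two predictors and a numeric quality score.
set.seed(7)
wine <- data.frame(
  alcohol = runif(30, 8, 14),
  pH      = runif(30, 2.9, 3.6),
  quality = sample(3:8, 30, replace = TRUE)
)

# A numeric outcome makes caret treat the problem as regression.
print(is.numeric(wine$quality))

# Treat each distinct quality score as a class label instead.
wine$quality <- as.factor(wine$quality)
print(levels(wine$quality))

# With a factor outcome, the call from the comment above is a classification problem:
# model <- train(quality~., data=wine, method="lvq", trControl=control)
```

Whether treating the scores as unordered classes is sensible depends on your goal; it throws away the ordering of the quality scores.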

62. fabio April 23, 2018 at 11:49 pm #

Hi Jason, thank you so much for this post.

63. Behrouz April 25, 2018 at 3:26 am #

Hi Jason;

In the feature selection section, where you used plot(results, type=c("g", "o")) to make a plot of accuracy against the number of variables, is there any way to get the actual names of the variables instead of the variable numbers?
Thank you in advance for your help.

64. Msilvy May 18, 2018 at 2:50 pm #

Hi Jason,
I am having a problem with RFE using random forest in the caret package.

set.seed(42)
index <- createDataPartition(RFTXModel$outcome, p = 0.7, list = FALSE)
train_data <- RFTXModel[index, ]
test_data <- RFTXModel[-index, ]
x <- dplyr::select(train_data, -outcome)
y <- as.factor(train_data$outcome)

set.seed(42)

control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)

Result <- rfe(x,y,metric = "Kappa",
rfeControl = control)
print(Result)

I have two questions:
1. If I only need the list of predictors to be used in another classifier, do I split the data into train and test sets for RFE?

2. How do I get the OA and Kappa values for each variable, as in this table?

| Variable names | OA | Kappa | SD (OA) | SD (Kappa) |
|---|---|---|---|---|
| PC2Tex1 | 0.63 | 0.57 | 0.05 | 0.06 |
| PC2 | 0.70 | 0.65 | 0.03 | 0.04 |

Thanks in advance

65. clinton K fernandes May 30, 2018 at 11:43 pm #

When removing redundant features, shouldn’t we also look at negative correlation?

• Jason Brownlee May 31, 2018 at 6:19 am #

Yes, we should. Or just take the absolute value of the coefficient and work in the positive domain.
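
A small sketch of why the absolute value matters, using base R only. The three variables are made up for illustration. As far as I know, caret's findCorrelation already works on the absolute values of the pairwise correlations, so a strong negative correlation is treated the same way as a strong positive one.

```r
# Made-up data: b is a noisy negative copy of a, c1 is unrelated.
set.seed(7)
a  <- rnorm(100)
b  <- -a + rnorm(100, sd = 0.1)
c1 <- rnorm(100)

m <- cor(data.frame(a, b, c1))
print(round(m, 2))

# A cutoff on the raw coefficient would miss the a/b pair,
# a cutoff on the absolute value catches it.
raw_hits <- which(m > 0.75 & upper.tri(m), arr.ind = TRUE)
abs_hits <- which(abs(m) > 0.75 & upper.tri(m), arr.ind = TRUE)
print(nrow(raw_hits))
print(abs_hits)
```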

66. Diego August 10, 2018 at 4:38 am #

Hi Jason. If I’m using recursive feature elimination, how could I obtain a ROC curve for the best model? Like if I had 10 variables and it selected 5, how do I plot the curve for this specific model?

Thanks!

• Jason Brownlee August 10, 2018 at 6:20 am #

Once you select the features, fit a model then calculate a ROC curve using a test dataset.

I don’t think I have an example in R, sorry.
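
For readers who want a starting point, here is a rough sketch of those steps in R. It is not from the tutorial: it assumes the pROC package is installed, and the 70/30 split, the 5-fold cross-validation, and the choice of random forest for the final model are all arbitrary choices made for illustration.

```r
library(caret)
library(mlbench)
library(pROC)  # assumption: installed separately, not used in the tutorial
data(PimaIndiansDiabetes)

# Hold out a test set before any feature selection.
set.seed(7)
idx <- createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.7, list = FALSE)
train_data <- PimaIndiansDiabetes[idx, ]
test_data  <- PimaIndiansDiabetes[-idx, ]

# Select features with RFE on the training data only.
control <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
results <- rfe(train_data[, 1:8], train_data[, 9], sizes = c(1:8), rfeControl = control)
selected <- predictors(results)
print(selected)

# Fit a model on the selected features.
fit <- train(x = train_data[, selected, drop = FALSE], y = train_data$diabetes,
             method = "rf",
             trControl = trainControl(method = "cv", number = 5, classProbs = TRUE))

# Predict class probabilities on the test set and build the ROC curve.
probs <- predict(fit, newdata = test_data[, selected, drop = FALSE], type = "prob")
roc_obj <- roc(response = test_data$diabetes, predictor = probs$pos,
               levels = c("neg", "pos"), direction = "<")
plot(roc_obj)
print(auc(roc_obj))
```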

67. LOpa September 5, 2018 at 9:10 pm #

Hi Jason,
I have one question. I have a dataset with 1600 features. I have applied lvq.
It gives the 20 most important features, but I want a minimum of 60 features. How do I get that using varImp? The plot(importance) output is so cluttered that I cannot read the names of the features.

• Jason Brownlee September 6, 2018 at 5:34 am #

I’m not sure off hand, sorry. Perhaps try posting to stackoverflow?

68. Elizabeth November 17, 2018 at 6:26 am #

Hi Jason, I may have posted this on a different thread by accident, but I was curious about the difference between the caret package variable importance plot and the regular random forest variable importance plot. In regular RF, the variable importance is determined by Gini decrease, whereas it seems that what you are saying here is that caret uses a different method, not a Gini decrease. The code for running both is the same in both packages (varImp), so I’m a bit confused.

• Jason Brownlee November 18, 2018 at 6:33 am #

Caret doesn’t actually implement the algorithms, it is just a wrapper to use algorithms from other packages, like the random forest package.

69. Elizabeth November 18, 2018 at 12:17 pm #

Oh I see, thank you…. but the plot we produce from caret random forest is an importance plot based on 1-100, whereas using random forest alone gives us a mean accuracy decrease and mean gini decrease for importance. I guess that’s where I was confused because I had assumed that caret was using essentially the RF package. But when I run the importance plots in both, they don’t seem to give me the same statistics (even if the order of variables is the same).

• Jason Brownlee November 19, 2018 at 6:42 am #

I see. Perhaps post to stackoverflow or the R users list?
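
One likely explanation, offered as a sketch rather than a definitive answer: as far as I know, varImp() on a caret model rescales the importance scores to the range 0 to 100 by default, and passing scale=FALSE returns the values reported by the underlying package. The example below assumes a random forest fitted through caret without importance=TRUE, in which case the raw score should be the mean decrease in Gini.

```r
library(caret)
library(mlbench)
library(randomForest)
data(PimaIndiansDiabetes)

set.seed(7)
control <- trainControl(method = "cv", number = 5)
model <- train(diabetes~., data = PimaIndiansDiabetes, method = "rf", trControl = control)

# Default: scores rescaled so the least important is 0 and the most important is 100.
scaled <- varImp(model)$importance

# Unscaled: the scores as computed by the wrapped model.
raw <- varImp(model, scale = FALSE)$importance

# Importance taken directly from the fitted randomForest object.
gini <- importance(model$finalModel)

print(cbind(scaled = scaled$Overall,
            raw    = raw$Overall,
            gini   = gini[rownames(raw), "MeanDecreaseGini"]))
```

The ranking of the variables is the same in every column; only the units differ.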

70. Alex November 26, 2018 at 10:00 pm #

Thank you for this post! It helps a lot. I ran the model as you instructed. It shows that my 21 variables can be narrowed down to 8. It lists variables 1-7 and 21. Would those numbers correspond to the order in which they appear in the data frame? Thanks in advance!

• Jason Brownlee November 27, 2018 at 6:34 am #

Yes, the numbers represent the column index for each selected feature.

71. Chris December 5, 2018 at 8:35 pm #

What is the use of removing features that are correlated with each other? And if I find 2 features that are highly correlated, do I remove only one of them or both? And why is nobody asking this question?

• Jason Brownlee December 6, 2018 at 5:54 am #

It simplifies the model and removes redundancy. This means it is less complex, learns faster and may even make better predictions.

You remove one of them.

• Eddy January 27, 2019 at 3:30 pm #

You remove the one that is less correlated with your dependent variable

72. Lydia February 9, 2019 at 12:57 am #

Thanks for the great post. I have a question if for example I have a classifier problem and let’s say I want to choose an algorithm.

Do I use RFE first, I get the important features and then based on them I run all possible algorithms (e.g. logistic regression, kNN, Decision Trees…) and choose the best one?

OR Do I choose features within each algorithm (e.g. forward selection in logistic regression, RFE in Decision trees etc) and then at the end compare the accuracy of all algorithms and choose the best one?

• Jason Brownlee February 9, 2019 at 5:59 am #

Perhaps go back and forth between the two processes until you find a well performing combination of features and model.

73. Marc February 12, 2019 at 8:07 am #

Can you provide a reference or set of references for using a cutoff of .75 (or other)?
Thank you

• Jason Brownlee February 12, 2019 at 8:09 am #

No, sorry.

Perhaps run a sensitivity analysis of different cut off values and see what works best for your dataset.
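
A minimal sketch of the first half of such a sensitivity analysis, using the same dataset as the tutorial. The list of cutoffs is arbitrary. It only shows which attributes findCorrelation flags at each cutoff; a full analysis would also refit your model on each reduced dataset and compare the resulting accuracy.

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

correlationMatrix <- cor(PimaIndiansDiabetes[, 1:8])

# Arbitrary grid of cutoffs to compare.
cutoffs <- c(0.3, 0.4, 0.5, 0.75, 0.9)

# For each cutoff, collect the names of the attributes flagged for removal.
flagged <- lapply(cutoffs, function(cutoff) {
  findCorrelation(correlationMatrix, cutoff = cutoff, names = TRUE)
})
names(flagged) <- cutoffs

for (cutoff in names(flagged)) {
  cat("cutoff", cutoff, "->", length(flagged[[cutoff]]), "flagged:",
      paste(flagged[[cutoff]], collapse = ", "), "\n")
}
```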

74. yinshao February 25, 2019 at 8:55 pm #

Thank you for sharing a good method. I have a question. If I want to know the exact combination of variables chosen for each number of variables, how can I do that?

• Jason Brownlee February 26, 2019 at 6:22 am #

Sorry, I don’t follow, can you elaborate please?

• yinshao February 26, 2019 at 12:11 pm #

For example, I have 11 variables. When the number of variables is 7, the model is optimal and I can see which variables are selected. I want to know which variables are selected when the number of variables is 5.

75. Steven June 20, 2019 at 10:12 pm #

Hi Jason,

I have a question with regards to the correlation – the highly correlated features i find with this function…

# Find attributes that are highly correlated (ideally >0.75)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)

… do I need to take out all of them, or should I only take out half, because then they no longer correlate with anything? Or are these the ones that correlate highly with ALL the variables in my dataset? Thank you for the help.

Best,
Steven

• Jason Brownlee June 21, 2019 at 6:38 am #

Half.

The idea is to remove redundant features. Those features that correlate strongly with existing features are redundant, they don’t add new information.
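
In practice you do not have to pick the half by hand. As far as I understand it, findCorrelation returns only the columns it suggests removing (one from each highly correlated pair), not both members of each pair, so you can drop everything it returns. A sketch with the tutorial's dataset, using a cutoff of 0.5 purely for illustration:

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

predictors_df <- PimaIndiansDiabetes[, 1:8]
correlationMatrix <- cor(predictors_df)

# Indices of the columns suggested for removal.
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
print(names(predictors_df)[highlyCorrelated])

# Guard against an empty result: df[, -integer(0)] would drop every column.
if (length(highlyCorrelated) > 0) {
  reduced <- predictors_df[, -highlyCorrelated, drop = FALSE]
} else {
  reduced <- predictors_df
}

# Largest remaining absolute pairwise correlation.
remaining <- cor(reduced)
print(max(abs(remaining[upper.tri(remaining)])))
```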

76. Mano Vishal June 29, 2019 at 4:46 am #

Hi Jason!

Do you have any topic regarding variable normalization?

77. Marc July 1, 2019 at 2:22 am #

Hi Jason,
Great post, thank you very much.
I had one question regarding the “Recursive Feature Elimination”. Here it shows that with 4 features we get almost the same performance as with 8, is there any way to see which 4 variables those are?

Thank you very much

• Jason Brownlee July 1, 2019 at 6:36 am #

Yes, it reports the index or column of each selected variable. You can then lookup the names of these columns.
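
A sketch of how to look up the names directly. predictors() returns the names of the variables in the final selected subset. For a subset size that was not the final choice, the variables element of the rfe result records which variables were kept at each size in each resample; the 5-fold cross-validation here is just to keep the example fast.

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

set.seed(7)
control <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
results <- rfe(PimaIndiansDiabetes[, 1:8], PimaIndiansDiabetes[, 9],
               sizes = c(1:8), rfeControl = control)

# Names of the variables in the final selected subset.
print(predictors(results))

# Variables retained when the subset size was 4, across all resamples.
size4 <- subset(results$variables, Variables == 4)

# How often each variable appeared in a size-4 subset.
print(sort(table(size4$var), decreasing = TRUE))
```

Because the size-4 subsets come from different resamples, they may not all contain the same four variables; the table shows which ones were chosen most consistently.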

78. Pamela Ochungo September 12, 2019 at 1:49 pm #

This is very useful. It has worked well for my data

• Jason Brownlee September 12, 2019 at 1:54 pm #

Thanks, I’m glad it helped.

79. PiAI December 27, 2019 at 3:46 pm #

What is the difference between variable importance and feature selection?

• Jason Brownlee December 28, 2019 at 7:42 am #

Importance can be used for selection but is itself not selection.

80. Rajeev Bhattarai February 20, 2020 at 11:39 am #

Could you please tell in what sense the importance of variables is measured? Are they mean decrease in accuracy or decrease in Gini?

Thank you.

81. xinyue March 30, 2020 at 6:58 pm #

Hi Jason,

if I use RFE based on Random forest, can the selected features set be used to build other kind of model like SVM?

Thank you!

82. Suraj April 1, 2020 at 10:40 pm #

Could you share the advantages and disadvantages of RFE, please? I have used it, and I have also used other methods such as xgboost and GA to compare the capacity of each. In some cases RFE performed well, and in others the models with features from xgboost and GA got higher accuracy.

P.S. I used regression data.

best regards

• Jason Brownlee April 2, 2020 at 5:54 am #

If it works well on your dataset, use it, if not, don’t. Other than that – I cannot help you.

83. sdm May 4, 2020 at 9:33 pm #

Hi
Thank you for your nice and explicit explanation.
What if I want to do recursive feature selection with other ML algorithms in the caret package, such as SVM, ANN, or KNN?
I saw a function called “caretFuncs”. Can I use this function for all machine learning methods that are included in the caret package, including the random forest classifier? Thank you in advance.

• Jason Brownlee May 5, 2020 at 6:27 am #

Yes, you can use any algorithm you like in RFE I believe.
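
A hedged sketch of how that can look. With functions=caretFuncs, rfe fits each candidate subset with train(), and extra arguments such as method are passed through to train(). The choice of knn, the subset sizes, and the small resampling settings below are arbitrary and only keep the example quick.

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

set.seed(7)
# Outer resampling loop for RFE, with caretFuncs so that train() does the fitting.
control <- rfeControl(functions = caretFuncs, method = "cv", number = 5)

# method, preProcess and trControl are passed on to train() for each subset.
results <- rfe(PimaIndiansDiabetes[, 1:8], PimaIndiansDiabetes[, 9],
               sizes = c(2, 4, 6),
               rfeControl = control,
               method = "knn",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 3))

print(results)
print(predictors(results))
```

Swapping method="knn" for another caret method name (for example "svmRadial") should work the same way, provided the package behind that method is installed. Note that this nests one resampling loop inside another, so it can be slow on larger datasets.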

84. Dominique May 10, 2020 at 12:17 am #

Hi Jason,

Thanks for this nice post.

I observed that the correlation matrix is in fact a Pearson correlation computation.

So I am wondering: after computing the correlation matrix, suppose we observe 4 features which are highly correlated. Do we remove 3 out of 4 of these features before starting a new prediction model?

Thanks,
Regards,
Dominique

• Jason Brownlee May 10, 2020 at 6:11 am #

Yes.

I believe there is a procedure for this, for removing redundant inputs. From memory, I think each feature is assigned a score based on how correlated it is with all other features, and a subset of the most correlated inputs is removed.

85. Dhineshwaran Rajendran May 30, 2020 at 4:33 pm #

Hi, I’m using the code below for recursive feature elimination. It’s a 140:396 dataset.

# Recursive Elimination

control <- rfeControl(functions=rfFuncs, method="cv", number=10)

RFE <- rfe(dat5[,2:396], dat5[,1],
sizes=c(1:8), rfeControl=control)

But I keep getting the error below.

Error in summary.connection(connection) : invalid connection

• Jason Brownlee May 31, 2020 at 6:19 am #

That is an odd error.

Perhaps try posting your code and error message to the R user group or stackoverflow?

• Dhineshwaran Rajendran May 31, 2020 at 8:23 am #

Thanks for Replying Jason,

But the error is due to parallel computing.

After running the code below, I was able to solve it:

library(doParallel); library(doSNOW);
{ # doParallel
cl <- makeCluster(detectCores()); registerDoParallel(cl);
getDoParWorkers(); stopCluster(cl);

# let it snow (doSNOW)
cl <- makeCluster(32,type="SOCK")
stopCluster(cl)
}

86. Grace June 2, 2020 at 8:48 am #

Dear Jason,
Thank you for your work. I have a question related to feature selection part after varImp(). After seeing the most important features for my prediction, suppose I want to select first 4 of them as my predictors and continue with the random forest. Could you help me with that?

• Jason Brownlee June 2, 2020 at 1:18 pm #

Excellent question. I hope to demonstrate this with an example in the future.
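
In the meantime, here is one way it could be done, as a sketch rather than a recommendation. It ranks the attributes with an lvq model as in the tutorial, takes the four highest-ranked names, and trains a random forest on just those columns. Averaging the per-class importance columns with rowMeans is my own choice for turning the importance table into a single ranking.

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

set.seed(7)
control <- trainControl(method = "cv", number = 5)

# Rank the attributes with an lvq model, as in the tutorial.
model <- train(diabetes~., data = PimaIndiansDiabetes, method = "lvq",
               preProcess = "scale", trControl = control)
imp <- varImp(model, scale = FALSE)$importance

# Collapse the importance table to one score per attribute and keep the top 4 names.
scores <- rowMeans(imp)
top4 <- names(sort(scores, decreasing = TRUE))[1:4]
print(top4)

# Train a random forest using only the selected attributes.
rf_model <- train(x = PimaIndiansDiabetes[, top4],
                  y = PimaIndiansDiabetes$diabetes,
                  method = "rf", trControl = control)
print(rf_model)
```

Selecting features on the full dataset and then cross-validating on the same data tends to give optimistic accuracy estimates, so a held-out test set is worth considering.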

87. safalabolo October 16, 2020 at 9:22 pm #

Hi Jason
A question: on what basis do I define
sizes = c(1:8)

for

results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)

I guess it depends on the dataset, but is there a general rule to rely on?
Thanks in advance for your reply

• Jason Brownlee October 17, 2020 at 6:02 am #

Not really, it might be a good idea to test 1 to n for n input features.
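
As a small illustration in base R: for a handful of features, testing every subset size is cheap. For a wide dataset that may be too slow, and a coarser grid is one option. The powers-of-two grid and the feature count of 395 below are my own illustrative choices, not a rule from caret.

```r
# Small dataset: test every subset size from 1 to n.
n_features <- 8
sizes_small <- seq_len(n_features)
print(sizes_small)

# Wide dataset (hypothetical 395 predictors): a coarser grid keeps the run time down.
p <- 395
sizes_wide <- unique(c(2^(0:floor(log2(p))), p))
print(sizes_wide)

# Either vector can then be passed as the sizes argument, e.g.
# results <- rfe(x, y, sizes=sizes_wide, rfeControl=control)
```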

88. Tommy November 2, 2020 at 5:02 pm #

Hi Jason,
I have a nonlinear time series dataset which contains numerical data, and I just passed model <- train(EGT~., data=df[1:10], method="lvq", preProcess="scale", trControl=control)
and it gives the error: wrong model type for regression.
I have no idea why. Also, how do I calculate a non-linear correlation?

Thanks.

• Jason Brownlee November 3, 2020 at 6:51 am #

This form of feature selection is probably not appropriate for time series.

Perhaps try nonparametric correlation, like Spearman’s?
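
A quick sketch of the difference using base R and made-up data. Spearman's correlation works on ranks, so it captures a monotonic but non-linear relationship that Pearson's correlation understates.

```r
# Made-up data with a monotonic, non-linear relationship.
set.seed(7)
x <- runif(200, 0, 5)
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
print(c(pearson = pearson, spearman = spearman))

# A Spearman correlation matrix for a whole data frame is built the same way.
df <- data.frame(x = x, y = y, z = rnorm(200))
spearmanMatrix <- cor(df, method = "spearman")
print(round(spearmanMatrix, 2))
```

Since findCorrelation takes a correlation matrix as input, a matrix built this way can be used in place of the default Pearson one.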

89. Umair Rasool May 20, 2021 at 1:04 am #

Thanks for your valuable information. I am doing prediction analysis using ANN, RF, SVM, KNN, NB and XGB. I used the caret package to calculate the feature importance for SVM, KNN and NB, while for ANN, RF and XGB I used the neuralnetwork, randomForest and xgboost packages, respectively. My question is: how do these packages rank the different parameters? The weight of each parameter is not the same across all 4 packages. Thanks

• Jason Brownlee May 20, 2021 at 5:48 am #

You’re welcome.

I believe each model uses default hyperparameters. I would recommend tuning each algorithm if you have the time.

90. Gernel June 5, 2021 at 5:46 am #

The plot I get from the rfe function shows RMSE. How did you come up with the accuracy plot?

• Jason Brownlee June 6, 2021 at 5:35 am #

The code used to create the plots is in the above tutorial.