Tune Machine Learning Algorithms in R (random forest case study)

It is difficult to find a good machine learning algorithm for your problem. But once you do, how do you get the best performance out of it?

In this post you will discover three ways that you can tune the parameters of a machine learning algorithm in R.

Walk through a real example step-by-step with working code in R. Use the code as a template to tune machine learning algorithms on your current or next machine learning project.

Tune Random Forest in R.
Photo by Susanne Nilsson, some rights reserved.

Get Better Accuracy From Top Algorithms

It is difficult to find a good or even a well-performing machine learning algorithm for your dataset.

Through a process of trial and error, you can settle on a shortlist of algorithms that show promise, but how do you know which is the best?

You could use the default parameters for each algorithm. These are the parameters set by rules of thumb or suggestions in books and research papers. But how do you know the algorithms that you are settling on are showing their best performance?

Use Algorithm Tuning To Search For Algorithm Parameters

The answer is to search for good or even best combinations of algorithm parameters for your problem.

You need a process to tune each machine learning algorithm to know that you are getting the most out of it. Once tuned, you can make an objective comparison between the algorithms on your shortlist.

Searching for algorithm parameters can be difficult; there are many options, such as:

  • What parameters to tune?
  • What search method to use to locate good algorithm parameters?
  • What test options to use to limit overfitting the training data?


Tune Machine Learning Algorithms in R

You can tune your machine learning algorithm parameters in R.

Generally, the approaches in this section assume that you already have a short list of well-performing machine learning algorithms for your problem from which you are looking to get better performance.

An excellent way to create your shortlist of well-performing algorithms is to use the caret package.

For more on how to use the caret package, see:

In this section we will look at three methods that you can use in R to tune algorithm parameters:

  1. Using the caret R package.
  2. Using tools that come with the algorithm.
  3. Designing your own parameter search.

Before we start tuning, let’s set up our environment and test data.

Test Setup

Let’s take a quick look at the data and the algorithm we will use in this case study.

Test Dataset

In this case study, we will use the Sonar test problem.

This is a dataset from the UCI Machine Learning Repository that describes sonar chirp returns as bouncing off either a metal cylinder or rocks.

It is a binary classification problem with 60 numerical input features that describe the properties of the sonar return. You can learn more about this problem here: Sonar Dataset. You can see world-class published results for this dataset here: Accuracy on the Sonar Dataset.

This is not a particularly difficult dataset, but it is non-trivial and interesting for this example.

Let’s load the required libraries and load the dataset from the mlbench package.
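The following is a minimal setup sketch, assuming the randomForest, mlbench, and caret packages are installed:

# load the packages
library(randomForest)
library(mlbench)
library(caret)

# load the dataset and split out the 60 inputs and the class output
data(Sonar)
dataset <- Sonar
x <- dataset[,1:60]
y <- dataset[,61]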

Test Algorithm

We will use the popular Random Forest algorithm as the subject of our algorithm tuning.

Random Forest is not necessarily the best algorithm for this dataset, but it is a very popular algorithm and no doubt you will find tuning it a useful exercise in your own machine learning work.

When tuning an algorithm, it is important to have a good understanding of your algorithm so that you know what effect the parameters have on the model you are creating.

In this case study, we will stick to tuning two parameters, namely the mtry and ntree parameters, which have the following effects on our random forest model. There are many other parameters, but these two are perhaps the most likely to have the biggest effect on your final accuracy.

Direct from the help page for the randomForest() function in R:

  • mtry: Number of variables randomly sampled as candidates at each split.
  • ntree: Number of trees to grow.

Let’s create a baseline for comparison by using the recommended defaults for each parameter: mtry=floor(sqrt(ncol(x))), which is mtry=7 for this dataset, and ntree=500.
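A sketch of this baseline follows, reusing the dataset, x, and y objects from the setup above and evaluating the model with caret's repeated cross-validation harness (ntree is left at the randomForest default of 500):

# create a model with default parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
set.seed(seed)
mtry <- floor(sqrt(ncol(x)))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)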

We can see our estimated accuracy is 81.3%.

1. Tune Using Caret

The caret package in R provides an excellent facility to tune machine learning algorithm parameters.

Not all machine learning algorithms are available in caret for tuning. The choice of parameters is left to the developer of the package, namely Max Kuhn. Only those algorithm parameters that have a large effect (i.e. really require tuning, in Kuhn’s opinion) are available for tuning in caret.

As such, only the mtry parameter is available in caret for tuning. The reason is its effect on the final accuracy, and the fact that it must be found empirically for a dataset.

The ntree parameter is different in that it can be as large as you like, and continues to increase accuracy up to some point. It is less difficult or critical to tune, and could be limited more by the compute time available than anything else.

Random Search

One search strategy that we can use is to try random values within a range.

This can be good if we are unsure of what the value might be and we want to overcome any biases we may have for setting the parameter (like the suggested equation above).

Let’s try a random search for mtry using caret:
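A sketch of this approach, reusing the seed and metric objects defined above; setting search="random" in trainControl and tuneLength=15 asks caret to evaluate 15 randomly drawn mtry values:

# random search for mtry
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(seed)
rf_random <- train(Class~., data=dataset, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
plot(rf_random)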

Note that we are using a test harness similar to the one we would use to spot-check algorithms. 10-fold cross-validation with 3 repeats slows down the search process, but is intended to limit and reduce overfitting on the training set. It won’t remove overfitting entirely. Holding back a validation set for final checking is a great idea if you can spare the data.

We can see that the most accurate value for mtry was 11 with an accuracy of 82.1%.

Tune Random Forest Parameters in R Using Random Search

Grid Search

Another search strategy is to define a grid of algorithm parameters to try.

Each axis of the grid is an algorithm parameter, and points in the grid are specific combinations of parameters. Because we are only tuning one parameter, the grid search is a linear search through a vector of candidate values.
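A sketch of the grid search, again reusing the objects defined above and trying each mtry value from 1 to 15:

# grid search for mtry
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
set.seed(seed)
tunegrid <- expand.grid(.mtry=c(1:15))
rf_gridsearch <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
plot(rf_gridsearch)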

We can see that the most accurate value for mtry was 2 with an accuracy of 83.78%.

Tune Random Forest Parameters in R Using Grid Search

2. Tune Using Algorithm Tools

Some algorithms provide tools for tuning the parameters of the algorithm.

For example, the random forest algorithm implementation in the randomForest package provides the tuneRF() function that searches for optimal mtry values given your data.
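A sketch of a tuneRF() call is below; the stepFactor, improve, and ntreeTry values are illustrative, not prescriptive. Starting from the default mtry, tuneRF() grows or shrinks mtry by stepFactor for as long as the out-of-bag (OOB) error keeps improving by at least the improve threshold:

# tune mtry using the tool shipped with the randomForest package
set.seed(seed)
bestmtry <- tuneRF(x, y, stepFactor=1.5, improve=1e-5, ntreeTry=500)
print(bestmtry)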

You can see that the most accurate value for mtry was 10 with an OOBError of 0.1442308.

This does not quite match up with what we saw in the caret repeated cross-validation experiment above, where mtry=10 gave an accuracy of 82.04%. Nevertheless, it is an alternative way to tune the algorithm.

Tune Random Forest Parameters in R Using tuneRF

3. Craft Your Own Parameter Search

Often you want to search both the parameters that must be tuned (handled by caret) and those that need to be scaled or adapted more generally for your dataset.

You have to craft your own parameter search.

Two popular options that I recommend are:

  1. Tune Manually: Write R code to create lots of models and compare their accuracy using caret.
  2. Extend Caret: Create an extension to caret that adds additional parameters to caret for the algorithm you want to tune.

Tune Manually

We want to keep using caret because it provides a direct point of comparison to our previous models (apples to apples, even the same data splits) and because of the repeated cross-validation test harness that we like, as it reduces the severity of overfitting.

One approach is to create many caret models for our algorithm and pass different parameters directly to the algorithm manually. Let’s look at an example doing this to evaluate different values for ntree while holding mtry constant.
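A sketch of this manual loop, reusing the earlier objects; the candidate ntree values are illustrative:

# manual search over ntree, holding mtry at the default
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry=floor(sqrt(ncol(x))))
modellist <- list()
for (ntree in c(1000, 1500, 2000, 2500)) {
    set.seed(seed)
    fit <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control, ntree=ntree)
    key <- toString(ntree)
    modellist[[key]] <- fit
}
# collect and compare the resampled accuracy of each model
results <- resamples(modellist)
summary(results)
dotplot(results)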

You can see that the most accurate value for ntree was perhaps 2000 with a mean accuracy of 82.02% (a lift over our very first experiment using the default mtry value).

The results perhaps suggest an optimal value for ntree between 2000 and 2500. Also note that we held mtry constant at the default value. We could repeat the experiment with the possibly better mtry=2 from the experiment above, or try combinations of ntree and mtry in case they have interaction effects.

Tune Random Forest Parameters in R Manually

Extend Caret

Another approach is to create a “new” algorithm for caret to support.

This is the same random forest algorithm you are using, only modified so that it supports the tuning of multiple parameters.

A risk with this approach is that caret’s native support for the algorithm has additional or fancy code wrapping it that subtly but importantly changes its behavior. You may need to repeat prior experiments with your custom algorithm support.

We can define our own algorithm to use in caret by defining a list that contains a number of custom named elements that the caret package looks for, such as how to fit and how to predict. See below for a definition of a custom random forest algorithm for use with caret that takes both mtry and ntree parameters.
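The sketch below follows caret's documented custom model interface; the list elements (parameters, grid, fit, predict, prob, sort, levels) are the names caret looks up. The grid function is left empty because we will always supply tuneGrid ourselves:

customRF <- list(type="Classification", library="randomForest", loop=NULL)
# declare the two tunable parameters
customRF$parameters <- data.frame(parameter=c("mtry", "ntree"), class=rep("numeric", 2), label=c("mtry", "ntree"))
customRF$grid <- function(x, y, len=NULL, search="grid") {}
# fit passes both parameters through to randomForest()
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
    randomForest(x, y, mtry=param$mtry, ntree=param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc=NULL, submodels=NULL)
    predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc=NULL, submodels=NULL)
    predict(modelFit, newdata, type="prob")
customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes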

Now, let’s make use of this custom list in our call to the caret train function, and try tuning different values for ntree and mtry.

This may take a minute or two to run.
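A sketch of the call, reusing the seed and metric from above; note that it evaluates 15 x 4 = 60 parameter combinations under repeated cross-validation:

# train the custom model over a grid of both mtry and ntree
control <- trainControl(method="repeatedcv", number=10, repeats=3)
tunegrid <- expand.grid(.mtry=c(1:15), .ntree=c(1000, 1500, 2000, 2500))
set.seed(seed)
custom <- train(Class~., data=dataset, method=customRF, metric=metric, tuneGrid=tunegrid, trControl=control)
summary(custom)
plot(custom)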

You can see that the most accurate values for ntree and mtry were 2000 and 2 with an accuracy of 84.43%.

We do perhaps see some interaction effects between ntree and mtry. Nevertheless, if we had chosen the best value for mtry of 2 found using the grid search (above) and the best value for ntree of 2000 found using the manual search (above), in this case we would have achieved the same level of tuning found in this combined search. This is a nice confirmation.

Custom Tuning of Random Forest Parameters in R

For more information on defining custom algorithms in caret, see:

To see the actual wrapper for random forest used by caret that you can use as a starting point, see:

Summary

In this post you discovered the importance of tuning well-performing machine learning algorithms in order to get the best performance from them.

You worked through an example of tuning the Random Forest algorithm in R and discovered three ways that you can tune a well-performing algorithm.

  1. Using the caret R package.
  2. Using tools that come with the algorithm.
  3. Designing your own parameter search.

You now have a worked example and template that you can use to tune machine learning algorithms in R on your current or next project.

Next Step

Work through the example in this post.

  1. Open your R interactive environment.
  2. Type or copy-paste the sample code above.
  3. Take your time and understand what is going on; use the R help to read up on functions.

Do you have any questions? Ask in the comments and I will do my best to answer them.


40 Responses to Tune Machine Learning Algorithms in R (random forest case study)

  1. Harshith August 17, 2016 at 10:55 pm #

Though I try tuning the random forest model with the number of trees and mtry parameters, the result is the same. The table looks like this and I have to predict y11.

    x11 x12 x13 x14 x15 x16 x17 x18 x19 y11
    0 0 0 2 0 2 2 4 0.000000000 ?
    1 1 0 0 0 3 3 18 0.000025700 ?
    0 1 0 0 1 2 2 2 0.000000000 ?
    5 2 1 0 1 12 12 14 0.000128479
    0 0 2 0 1 3 3 3 0.000000000
    4 0 2 0 1 7 8 104 0.000102783

All of x11 to x19 are important features for predicting the y11 value. I have given the model like this but am still not able to achieve high efficiency. Can you please help me?

rf_model <- randomForest(y11~x11+x12+x13+x14+x15+x16+x17+x18+x19, data=new_dataframe_train, ntree=1001, importance=TRUE, keep.forest=TRUE, mtry=3)

    rf_pred<-predict(rf_model,new_dataframe_test)

    • Jason Brownlee August 18, 2016 at 8:00 am #

      Generally it is a good idea to tune mtry and keep increasing the number of trees until you no longer see an improvement in performance.

No need to perform feature selection with random forest; it will do so automatically, ignoring features that make bad split points.

  2. Bing October 6, 2016 at 9:06 pm #

Thanks for your code. I am trying to run it and was not able to get the same output as you did for the last part:
    > summary(custom)
    > plot(custom)

    I got something like this instead for “summary(custom)”:
    Length Class Mode
    call 5 -none- call
    type 1 -none- character
    predicted 10537 factor numeric
    err.rate 7500 -none- numeric
    confusion 6 -none- numeric
    votes 21074 matrix numeric
    oob.times 10537 -none- numeric
    classes 2 -none- character
    importance 51 -none- numeric
    importanceSD 0 -none- NULL
    localImportance 0 -none- NULL
    proximity 0 -none- NULL
    ntree 1 -none- numeric
    mtry 1 -none- numeric
    forest 14 -none- list
    y 10537 factor numeric
    test 0 -none- NULL
    inbag 0 -none- NULL
    xNames 51 -none- character
    problemType 1 -none- character
    tuneValue 2 data.frame list
    obsLevels 2 -none- character

Can you advise what went wrong? The code took 4 days to run and I’m sort of disappointed I wasn’t getting the output I was expecting. Thanks!

    • Beth November 8, 2016 at 1:32 am #

      Try:
> custom$results

  3. Aditya November 17, 2016 at 9:59 am #

    Hi Jason,
    I love your website and work.

    Is it a general practice to normalize and reduce the number of features using PCA before running the random forest algorithm? I understand that RF picks the best features while splitting the nodes, but my only motivation for PCA is to reduce the computation time.

    • Jason Brownlee November 18, 2016 at 8:18 am #

      Thanks Aditya.

      No, I would suggest you let RF figure it out.

      If you have time, you could later compare a PCA projected version of your dataset to see if it can out-perform the raw data.

      • Aditya November 20, 2016 at 3:09 am #

        Great. Thanks for the tip!

  4. Harini Devulapalli November 24, 2016 at 1:07 pm #

    Great Work! Thanks for the post! 🙂

    • Harini Devulapalli November 25, 2016 at 4:53 am #

What will be the metric in case of regression?

    • Jason Brownlee November 25, 2016 at 9:30 am #

      Thanks Harini, I’m glad you found it useful.

  5. Siddhartha Peri November 29, 2016 at 12:53 pm #

    When you write the line for the model:

    rf_default <- train(Class~., . . . )

My R session says that 'Class' wasn't found. What does the Class~. represent, and how does one go about resolving the issue?

  6. Na Ja February 6, 2017 at 6:54 pm #

Very informative blog, thank you! I am experimenting with the custom RF function for a regression model. I changed type="regression" and metric="RMSE". But for custom <- train(x, y, method=customRF, metric="RMSE", tuneGrid=tunegrid, trControl=control) I get the error: "Error in train.default(x, y, method = customRF, metric = "RMSE", tuneGrid = tunegrid, : wrong model type for regression". What could be wrong?

    • Na Ja February 6, 2017 at 11:50 pm #

      meanwhile, I did as coded below:

      control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
      tunegrid <- expand.grid(.mtry=c(1:15))
      modellist <- list()
      for (ntree in c(1000, 1500, 2000, 2500,3000)) {

      fit <- train(x,y, method="rf", metric="RMSE", tuneGrid=tunegrid, trControl=control, ntree=ntree)
      key <- toString(ntree)
      modellist[[key]] <- fit

      }

      But this takes too much time!

    • Jason Brownlee February 7, 2017 at 10:12 am #

      Strange, it looks like it does not want to use random forest for regression.

      Try using caret train() with ‘rf’ on your dataset without grid search and ensure it works.

      • Na Ja February 7, 2017 at 7:01 pm #

Thanks Jason, I tried as per your suggestion and I get NULL output for fit$predicted!

        In function help, it is mentioned that “train{caret} function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model and calculates a resampling based performance measure.”

Are the tuning parameters I obtained from the above-mentioned code relevant, given the situation where we don’t go for grid search?

        • Jason Brownlee February 8, 2017 at 9:34 am #

          You can call the predict() function to make predictions.

  7. DF February 17, 2017 at 1:23 pm #

Is there a theoretical upper bound for mtry? I see in most examples the upper bound is 15, but why not, for example, 35? Any reference will be greatly appreciated 🙂

    • Jason Brownlee February 18, 2017 at 8:33 am #

      Hi DF,

      mtry is the number of features to consider at each split point.

      The universe of possible values is 1 to the number of features in your dataset.

  8. MEHMET March 12, 2017 at 2:08 pm #

    Hi Jason,
In order to improve a model’s accuracy, I go back and search for misclassified training points and correct or remove them. I have thousands of training points so it is not easy to locate them in the training dataset. I was wondering, since the confusion matrix in the Random Forest package in R shows how many training points are inaccurately classified in each class, is there a way to easily pinpoint those inaccurately classified training points? I have IDs for my training samples but I don’t know if it is possible to locate them.

    • Jason Brownlee March 13, 2017 at 7:37 am #

      Interesting tactic of removing misclassified points. This is generally not an approach I use or recommend.

      Why are you tuning your model this way?

  9. Mehmet March 14, 2017 at 1:54 pm #

Some of my training points are collected using Google Earth so, as you can guess, they are not always the best validation data. That is the reason I tune my training points. Can you give me any suggestions regarding the problem? Normally it is possible to locate those points manually, but it is time consuming. I am sure there should be a way to reveal the misclassified points, but I don’t know how.

    • Jason Brownlee March 15, 2017 at 8:07 am #

Identifying misclassified points would require another source of information.

      This sounds like a data cleaning exercise, prior to modeling, unrelated to the choice or use of machine learning algorithms.

  10. M March 18, 2017 at 2:24 pm #

Right, that is what I am talking about, “data cleaning”, but not prior to use of the random forest model. I use the uncleaned data in a pre-analysis and get a confusion matrix. The confusion matrix reveals how many points are misclassified, but I need to know exactly which points are misclassified so I can clean them. I need code in R that reveals the misclassified points. I have the order numbers of each point (3000 in total) if it helps.

    • Jason Brownlee March 19, 2017 at 6:09 am #

      I would recommend comparing predictions to actual points one at a time to discover which points were misclassified.

      I do not have an example at hand, sorry.

  11. Seo Young Jae March 24, 2017 at 4:06 pm #

    Thank you for good information!
But I have one question.

    >control set.seed(seed)
    >tunegrid rf_gridsearch control set.seed(seed)
    >mtry rf_random <- train(Class~., data=dataset, method="rf", metric=metric, tuneLength=15, trControl=control)

    Have a nice day !!

  12. Seo Young Jae March 24, 2017 at 4:36 pm #

Thank you for the nice information. But I want to tune more parameters, like nodesize. How would I write the R code for that? 🙁

  13. Seo Young Jae March 26, 2017 at 1:52 am #

My question seems to have been deleted, maybe there was an error, so I am asking it again now.

In this grid search code, you use set.seed(seed). You also use the same function in the random search code. I know what set.seed is, but I don’t know why you use this function in the grid search code. I understand that set.seed is used in the random search because it has to select each parameter’s value. If I have this wrong, please explain the role of the set.seed function in this code. Have a nice day!

    # train model
    control <- trainControl(method="repeatedcv", number=10, repeats=3)
    tunegrid <- expand.grid(.mtry=c(1:15), .ntree=c(1000, 1500, 2000, 2500))
    set.seed(seed)
    custom <- train(Class~., data=dataset, method=customRF, metric=metric, tuneGrid=tunegrid, trControl=control)
    summary(custom)
    plot(custom)

    • Jason Brownlee March 26, 2017 at 6:14 am #

I set the seed to initialize the random number generator so I get the same results each time the code is run. This is useful for tutorials.

      You can learn more about the stochastic nature of applied machine learning here:
      http://machinelearningmastery.com/randomness-in-machine-learning/

      • Seo Young Jae March 26, 2017 at 10:39 pm #

        Oh. Thank you for your reply.

At first, I understood that the set.seed function affects the selection of random values for ntree and mtry in the random search.

But you mean that the seed affects the trainControl result and the train function result, right?

        • Jason Brownlee March 27, 2017 at 7:55 am #

          Correct. The whole algorithm evaluation procedure including fitting the model.

  14. Seo Young Jae April 7, 2017 at 12:02 pm #

Hi Jason, you introduced two good methods (grid and random search).

Is there another method to select the best parameters?

I want to know whether, even if I do not set the range of parameters, I can find the optimal parameters for the model. The random search ultimately just uses the number of values set via the tuneLength argument.

Is there a way to find the parameters I want?

    • Jason Brownlee April 9, 2017 at 2:51 pm #

      You can use an exhaustive search.

      We often do not have the resources to find the “best” parameters and are happy to use “good” or “best found”.

  15. Dennis April 19, 2017 at 1:50 pm #

Hi Jason. Thanks for your work. I am trying to run the ‘Extend Caret’ part and am unable to get the same output as you did, especially this:
    > summary(custom)
    > plot(custom)

Something is wrong in “summary(custom)”, almost the same as Bing mentioned. I used custom$results as Beth suggested, but the result is still wrong.

Could you please give further advice about this? Thank you!

  16. Howard May 31, 2017 at 4:18 am #

    Hi Jason.

I have never tried caret before, and I am assuming the following line of code performs cross-validation.

    control <- trainControl(method="repeatedcv", number=10, repeats=3)

My question is, why is cross-validation necessary, since random forest, in its algorithmic philosophy, is already cross-validated?

    • Jason Brownlee June 2, 2017 at 12:41 pm #

      We need a scheme to evaluate the skill of the model on unseen data.

We could use a train/test split or k-fold CV.
