Last Updated on August 22, 2019
Spot checking machine learning algorithms is how you find the best algorithm for your dataset.
But what algorithms should you spot check?
In this post you discover the 8 machine learning algorithms you should spot check on your data.
You also get recipes for each algorithm that you can copy and paste into your current or next machine learning project in R.
Kick-start your project with my new book Machine Learning Mastery With R, including step-by-step tutorials and the R source code files for all examples.
Let’s get started.

Spot Check Machine Learning Algorithms in R
Photo by Nuclear Regulatory Commission, some rights reserved.
Best Algorithm For Your Dataset
You cannot know which algorithm will work best on your dataset beforehand.
You must use trial and error to discover a short list of algorithms that do well on your problem that you can then double down on and tune further. I call this process spot checking.
The question is not:
What algorithm should I use on my dataset?
Instead it is:
What algorithms should I spot check on my dataset?
Which Algorithms To Spot Check
You can guess at what algorithms might do well on your dataset, and this can be a good starting point.
I recommend trying a mixture of algorithms and seeing what is good at picking out the structure in your data.
- Try a mixture of algorithm representations (e.g. instances and trees).
- Try a mixture of learning algorithms (e.g. different algorithms for learning the same type of representation).
- Try a mixture of modeling types (e.g. linear and non-linear functions or parametric and non-parametric).
Let’s get specific. In the next section, we will look at algorithms that you can use to spot check on your next machine learning project in R.
Need More Help with R for Machine Learning?
Take my free 14-day email course and discover how to use R on your project (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Algorithms To Spot Check in R
There are hundreds of machine learning algorithms available in R.
I would recommend exploring many of them, especially if making accurate predictions on your dataset is important and you have the time.
Often you don’t have the time, so you need to know the few algorithms that you absolutely must test on your problem.
In this section you will discover the linear and nonlinear algorithms you should spot check on your problem in R. This excludes ensemble algorithms such as boosting and bagging, which can come later once you have a baseline.
Each algorithm will be presented from two perspectives:
- The package and function used to train and make predictions for the algorithm.
- The caret wrapper for the algorithm.
You need to know which package and function to use for a given algorithm. This is needed when:
- You are researching the algorithm parameters and how to get the most from the algorithm.
- You have discovered the best algorithm to use and need to prepare a final model.
You need to know how to use each algorithm with caret, so that you can efficiently evaluate the accuracy of the algorithm on unseen data using the preprocessing, algorithm evaluation and tuning capabilities of caret.
Two standard datasets are used to demonstrate the algorithms:
- Boston Housing dataset for regression (BostonHousing from the mlbench library).
- Pima Indians Diabetes dataset for classification (PimaIndiansDiabetes from the mlbench library).
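Both datasets ship with the mlbench package. As a quick sanity check that they load correctly (a minimal sketch, assuming mlbench is installed):

```r
# confirm both demonstration datasets load from mlbench
library(mlbench)
data(BostonHousing)
dim(BostonHousing)       # 506 rows, 14 columns (medv is the output)
data(PimaIndiansDiabetes)
dim(PimaIndiansDiabetes) # 768 rows, 9 columns (diabetes is the output)
```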
Algorithms are presented in two groups:
- Linear Algorithms: simpler methods that have a strong bias but are fast to train.
- Nonlinear Algorithms: more complex methods that have a large variance but are often more accurate.
Each recipe presented in this section is complete and will produce a result, so that you can copy and paste it into your current or next machine learning project.
Let’s get started.
Linear Algorithms
These are methods that make large assumptions about the form of the function being modeled. As such, they have a high bias but are often fast to train.
The final models are also often easy (or easier) to interpret, making them desirable as final models. If the results of a linear algorithm are suitably accurate, you may not need to move on to non-linear methods at all.
1. Linear Regression
The lm() function is in the stats library and creates a linear regression model using ordinary least squares.
```r
# load the library
library(mlbench)
# load data
data(BostonHousing)
# fit model
fit <- lm(medv~., BostonHousing)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, BostonHousing)
# summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print(mse)
```
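Because a linear regression model is directly interpretable, you can inspect what it learned; a minimal sketch, assuming the fit object from the recipe above:

```r
# inspect the fitted linear model (assumes fit from the recipe above)
coef(fit)    # one learned weight per input attribute, plus the intercept
confint(fit) # 95% confidence intervals for each weight
```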
The lm implementation can be used in caret as follows:
```r
# load libraries
library(caret)
library(mlbench)
# load dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.lm <- train(medv~., data=BostonHousing, method="lm", metric="RMSE", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.lm)
```
2. Logistic Regression
The glm function is in the stats library and creates a generalized linear model. It can be configured to perform a logistic regression suitable for binary classification problems.
```r
# load the library
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- glm(diabetes~., data=PimaIndiansDiabetes, family=binomial(link='logit'))
# summarize the fit
print(fit)
# make predictions
probabilities <- predict(fit, PimaIndiansDiabetes[,1:8], type='response')
predictions <- ifelse(probabilities > 0.5,'pos','neg')
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
```
The glm algorithm can be used in caret as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.glm <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.glm)
```
3. Linear Discriminant Analysis
The lda function is in the MASS library and creates a linear discriminant model for classification problems.
```r
# load the libraries
library(MASS)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- lda(diabetes~., data=PimaIndiansDiabetes)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8])$class
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
```
The lda algorithm can be used in caret as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.lda)
```
4. Regularized Regression
The glmnet function is in the glmnet library and can be used for classification or regression.
Classification Example:
```r
# load the libraries
library(glmnet)
library(mlbench)
# load data
data(PimaIndiansDiabetes)
x <- as.matrix(PimaIndiansDiabetes[,1:8])
y <- as.matrix(PimaIndiansDiabetes[,9])
# fit model
fit <- glmnet(x, y, family="binomial", alpha=0.5, lambda=0.001)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, x, type="class")
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
```
Regression Example:
```r
# load the libraries
library(glmnet)
library(mlbench)
# load data
data(BostonHousing)
# convert the chas factor to numeric so the data can be used as a matrix
BostonHousing$chas <- as.numeric(as.character(BostonHousing$chas))
x <- as.matrix(BostonHousing[,1:13])
y <- as.matrix(BostonHousing[,14])
# fit model
fit <- glmnet(x, y, family="gaussian", alpha=0.5, lambda=0.001)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, x, type="link")
# summarize accuracy
mse <- mean((y - predictions)^2)
print(mse)
```
The glmnet function can also be configured to perform three important types of regularization: lasso, ridge and elastic net, by setting the alpha parameter to 1, 0 or a value in [0,1] respectively.
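As a minimal sketch of those three configurations, reusing the x and y matrices prepared in the regression example above:

```r
# three regularization types selected via the alpha parameter (x, y as above)
fit.lasso <- glmnet(x, y, family="gaussian", alpha=1.0, lambda=0.001) # lasso (L1 penalty)
fit.ridge <- glmnet(x, y, family="gaussian", alpha=0.0, lambda=0.001) # ridge (L2 penalty)
fit.enet  <- glmnet(x, y, family="gaussian", alpha=0.5, lambda=0.001) # elastic net (mixture of both)
```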
The glmnet implementation can be used in caret for classification as follows:
```r
# load libraries
library(caret)
library(mlbench)
library(glmnet)
# Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.glmnet <- train(diabetes~., data=PimaIndiansDiabetes, method="glmnet", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.glmnet)
```
The glmnet implementation can be used in caret for regression as follows:
```r
# load libraries
library(caret)
library(mlbench)
library(glmnet)
# Load the dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.glmnet <- train(medv~., data=BostonHousing, method="glmnet", metric="RMSE", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.glmnet)
```
Nonlinear Algorithms
These are machine learning algorithms that make fewer assumptions about the function being modeled. As such, they have a higher variance but often result in higher accuracy. Their increased flexibility can also make them slower to train or increase their memory requirements.
1. k-Nearest Neighbors
The knn3 function is in the caret library and does not create a model; rather, it makes predictions from the training set directly. It can be used for classification or regression.
Classification Example:
```r
# knn direct classification
# load the libraries
library(caret)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- knn3(diabetes~., data=PimaIndiansDiabetes, k=3)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="class")
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
```
Regression Example:
```r
# load the libraries
library(caret)
library(mlbench)
# load data
data(BostonHousing)
# convert the chas factor to numeric so the data can be used as a matrix
BostonHousing$chas <- as.numeric(as.character(BostonHousing$chas))
x <- as.matrix(BostonHousing[,1:13])
y <- as.matrix(BostonHousing[,14])
# fit model
fit <- knnreg(x, y, k=3)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, x)
# summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print(mse)
```
The knn implementation can be used within the caret train() function for classification as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.knn <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.knn)
```
The knn implementation can be used within the caret train() function for regression as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.knn <- train(medv~., data=BostonHousing, method="knn", metric="RMSE", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.knn)
```
2. Naive Bayes
The naiveBayes function is in the e1071 library and models the probabilistic relationship of each attribute to the outcome variable independently. It can be used for classification problems.
```r
# load the libraries
library(e1071)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- naiveBayes(diabetes~., data=PimaIndiansDiabetes)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8])
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
```
A very similar naive Bayes implementation (NaiveBayes from the klaR library) can be used with caret as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.nb <- train(diabetes~., data=PimaIndiansDiabetes, method="nb", metric="Accuracy", trControl=control)
# summarize fit
print(fit.nb)
```
3. Support Vector Machine
The ksvm function is in the kernlab package and can be used for classification or regression. It is a wrapper for the LIBSVM library and provides a suite of kernel types and configuration options.
These examples use a Radial Basis kernel.
Classification Example:
```r
# load the libraries
library(kernlab)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- ksvm(diabetes~., data=PimaIndiansDiabetes, kernel="rbfdot")
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="response")
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
```
Regression Example:
```r
# load the libraries
library(kernlab)
library(mlbench)
# load data
data(BostonHousing)
# fit model
fit <- ksvm(medv~., BostonHousing, kernel="rbfdot")
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, BostonHousing)
# summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print(mse)
```
The SVM with Radial Basis kernel implementation can be used with caret for classification as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.svmRadial <- train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", metric="Accuracy", trControl=control)
# summarize fit
print(fit.svmRadial)
```
The SVM with Radial Basis kernel implementation can be used with caret for regression as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.svmRadial <- train(medv~., data=BostonHousing, method="svmRadial", metric="RMSE", trControl=control)
# summarize fit
print(fit.svmRadial)
```
4. Classification and Regression Trees
The rpart function in the rpart library provides an implementation of CART for classification and regression.
Classification Example:
```r
# load the libraries
library(rpart)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- rpart(diabetes~., data=PimaIndiansDiabetes)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="class")
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
```
Regression Example:
```r
# load the libraries
library(rpart)
library(mlbench)
# load data
data(BostonHousing)
# fit model
fit <- rpart(medv~., data=BostonHousing, control=rpart.control(minsplit=5))
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, BostonHousing[,1:13])
# summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print(mse)
```
The rpart implementation can be used with caret for classification as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.rpart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", metric="Accuracy", trControl=control)
# summarize fit
print(fit.rpart)
```
The rpart implementation can be used with caret for regression as follows:
```r
# load libraries
library(caret)
library(mlbench)
# Load the dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=2)
fit.rpart <- train(medv~., data=BostonHousing, method="rpart", metric="RMSE", trControl=control)
# summarize fit
print(fit.rpart)
```
Other Algorithms
There are many other algorithms provided by R and available in caret.
I would advise you to explore them and add more algorithms to your own short list of must try algorithms on your next machine learning project.
You can find a mapping of machine learning functions and packages to their names in the caret package on the caret documentation’s list of available models.
This page is useful if you are using an algorithm in caret and want to know which package it belongs to so that you can read up on the parameters and get more out of it.
This page is also useful if you are using a machine learning algorithm directly in R and want to know how it can be used in caret.
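You can also query this mapping from within R via caret’s getModelInfo() function; a minimal sketch:

```r
# look up a method's details in caret's model registry
library(caret)
info <- getModelInfo("glmnet", regex=FALSE)
info$glmnet$library    # packages the method depends on
info$glmnet$parameters # tuning parameters caret exposes for the method
```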
Summary
In this post you discovered a diverse set of 8 algorithms that you can use to spot check on your datasets. Specifically:
- Linear Regression
- Logistic Regression
- Linear Discriminant Analysis
- Regularized Regression
- k-Nearest Neighbors
- Naive Bayes
- Support Vector Machine
- Classification and Regression Trees
You learned which packages and functions to use for each algorithm. You also learned how you can use each algorithm with the caret package that provides algorithm evaluation and tuning capabilities.
You can use these algorithms as a template for spot checking on your current or next machine learning project in R.
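One natural next step after spot checking with caret is to compare the models on the same cross-validation folds. A minimal sketch using caret’s resamples() function, assuming three of the classification recipes above are run with a shared trainControl and the same seed:

```r
# spot check three algorithms and compare their cross-validation results
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
control <- trainControl(method="cv", number=5)
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", metric="Accuracy", trControl=control)
set.seed(7)
fit.cart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", metric="Accuracy", trControl=control)
set.seed(7)
fit.knn <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", metric="Accuracy", trControl=control)
# collect the resampling distributions and summarize
results <- resamples(list(LDA=fit.lda, CART=fit.cart, KNN=fit.knn))
summary(results)
dotplot(results)
```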
Your Next Step
Did you try out these recipes?
- Start your R interactive environment.
- Type or copy-paste the recipes above and try them out.
- Use the built-in help in R to learn more about the functions used.
Do you have a question? Ask it in the comments and I will do my best to answer it.
Hi Jason,
This is a nice overview. Two minor suggestions:
1. You might want to add a section/additional post on how to easily compare between the various models, to determine which is best.
2. For simplicity of your readers, you might attach a single file with all of the calls (rather than needing to copy from each of the little windows.)
Thanks very much. — dbs
Great suggestions, thanks David.
Hi Jason,
Thanks for this overview article. It certainly helps for a versatile package like caret.
I was looking for a way to perform multi-class logistic regression using caret. glm is for 2-class only. Any idea how one can perform this using caret?
Thanks – Mustafa
Not off-hand, sorry.
Hi dear Jason,
For a dataset, I got better results with linear models than non-linear ones. I wanted to check whether I could improve the result with kernel methods, so I tried kernel logistic regression and ksvm; however, the AUC gets worse!
I would appreciate any explanation for that.
Best regards,
Sometimes a linear method is the best approach for a dataset, because the underlying structure of the data is linear.
Test many methods and use what works best.
Thanks for this.
So, I just need to pick the algorithm with the lowest RMSE.
Yes.