Your First Machine Learning Project in R Step-By-Step (tutorial and template for future projects)

Do you want to do machine learning using R, but you’re having trouble getting started?

In this post you will complete your first machine learning project using R.

In this step-by-step tutorial you will:

  1. Download and install R and get the most useful package for machine learning in R.
  2. Load a dataset and understand its structure using statistical summaries and data visualization.
  3. Create 5 machine learning models, pick the best and build confidence that the accuracy is reliable.

If you are a machine learning beginner and looking to finally get started using R, this tutorial was designed for you.

Let’s get started!

Your First Machine Learning Project in R Step-by-Step
Photo by Henry Burrows, some rights reserved.

How Do You Start Machine Learning in R?

The best way to learn machine learning is by designing and completing small projects.

R Can Be Intimidating When Getting Started

R provides a scripting language with an odd syntax. There are also hundreds of packages and thousands of functions to choose from, providing multiple ways to do each task. It can feel overwhelming.

The best way to get started using R for machine learning is to complete a project.

  • It will force you to install and start R (at the very least).
  • It will give you a bird’s eye view of how to step through a small project.
  • It will give you confidence, maybe to go on to your own small projects.

Beginners Need A Small End-to-End Project

Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.

When you are applying machine learning to your own datasets, you are working on a project.

The process of a machine learning project may not be linear, but there are a number of well-known steps:

  1. Define Problem.
  2. Prepare Data.
  3. Evaluate Algorithms.
  4. Improve Results.
  5. Present Results.

For more information on the steps in a machine learning project see this checklist and more on the process.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing the data, evaluating algorithms and making some predictions.

If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps such as further data preparation and improving result tasks later, once you have more confidence.

Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).

This is a good project because it is so well understood.

  • Attributes are numeric so you have to figure out how to load and handle data.
  • It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
  • It is a multi-class classification problem (multinomial) that may require some specialized handling.
  • It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
  • All of the numeric attributes are in the same units and the same scale, so no special scaling or transforms are required to get started.

Let’s get started with your hello world machine learning project in R.


Machine Learning in R: Step-By-Step Tutorial (start here)

In this section we are going to work through a small machine learning project end-to-end.

Here is an overview of what we are going to cover:

  1. Installing the R platform.
  2. Loading the dataset.
  3. Summarizing the dataset.
  4. Visualizing the dataset.
  5. Evaluating some algorithms.
  6. Making some predictions.

Take your time. Work through each step.

Try to type in the commands yourself or copy-and-paste the commands to speed things up.

Any questions, please leave a comment at the bottom of the post.

1. Downloading, Installing and Starting R

Get the R platform installed on your system if it is not already.

UPDATE: This tutorial was written and tested with R version 3.2.3. It is recommended that you use this version of R or higher.

I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.

Here is what we are going to cover in this step:

  1. Download R.
  2. Install R.
  3. Start R.
  4. Install R Packages.

1.1 Download R

You can download R from The R Project webpage.

When you click the download link, you will have to choose a mirror. You can then choose R for your operating system, such as Windows, OS X or Linux.

1.2 Install R

R is easy to install and I’m sure you can handle it. There are no special requirements. If you have questions or need help installing, see R Installation and Administration.

1.3 Start R

You can start R from whatever menu system you use on your operating system.

For me, I prefer the command line.

Open your command line, change to (or create) your project directory and start R by typing R at the prompt.

You should see something like the screenshot below either in a new window or in your terminal.

R Interactive Environment

1.4 Install Packages

Install the packages we are going to use today. Packages are third party add-ons or libraries that we can use in R.

UPDATE: We may need other packages, but caret should ask us if we want to load them. If you are having problems with packages, you can install the caret package and all packages that you might need by calling install.packages with dependencies = c("Depends", "Suggests").

Now, let’s load the package that we are going to use in this tutorial, the caret package.
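A minimal sketch of the install-and-load step described above. The requireNamespace guard and the repos URL are illustrative additions (any CRAN mirror should do); the dependencies argument in the comment is the one mentioned in the UPDATE note.

```r
# Install caret only if it is missing (the guard is an illustrative addition).
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret", repos = "https://cloud.r-project.org")
  # If packages are still reported missing, the fuller install is:
  # install.packages("caret", dependencies = c("Depends", "Suggests"))
}

# Load the package for this session.
library(caret)
```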

The caret package provides a consistent interface into hundreds of machine learning algorithms and provides useful convenience methods for data visualization, data resampling, model tuning and model comparison, among other features. It’s a must have tool for machine learning projects in R.

For more information about the caret R package see the caret package homepage.

2. Load The Data

We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.

The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.
You can learn more about this dataset on Wikipedia.

Here is what we are going to do in this step:

  1. Load the iris data the easy way.
  2. Load the iris data from CSV (optional, for purists).
  3. Separate the data into a training dataset and a validation dataset.

Choose your preferred way to load data or try both methods.

2.1 Load Data The Easy Way

Fortunately, the R platform provides the iris dataset for us. Load the dataset as follows:
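A short sketch of this step, using only base R. The variable name dataset follows the naming convention described in the text.

```r
# Attach the built-in iris data frame to the environment.
data(iris)

# Copy it to the variable name used throughout the tutorial.
dataset <- iris
```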

You now have the iris data loaded in R and accessible via the dataset variable.

I like to name the loaded data “dataset”. This is helpful if you want to copy-paste code between projects and the dataset always has the same name.

2.2 Load From CSV

Maybe you’re a purist and you want to load the data just like you would on your own machine learning project, from a CSV file.

  1. Download the iris dataset from the UCI Machine Learning Repository (here is the direct link).
  2. Save the file as iris.csv in your project directory.

Load the dataset from the CSV file as follows:
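A sketch of the CSV route, assuming the downloaded file has no header row (so the column names are set by hand). The file-creation guard is a stand-in added here so the sketch runs even without the download; it is not part of the tutorial. stringsAsFactors = TRUE is passed explicitly because recent R versions no longer convert strings to factors by default.

```r
# Name of the file saved in the project directory.
filename <- "iris.csv"

# Stand-in only: if the download is missing, write the built-in data
# without a header so the rest of the sketch still runs.
if (!file.exists(filename)) {
  write.table(iris, filename, sep = ",", row.names = FALSE,
              col.names = FALSE, quote = FALSE)
}

# Load the CSV file; the species column becomes a factor.
dataset <- read.csv(filename, header = FALSE, stringsAsFactors = TRUE)

# Give the columns the same names as the built-in dataset.
colnames(dataset) <- c("Sepal.Length", "Sepal.Width",
                       "Petal.Length", "Petal.Width", "Species")
```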

You now have the iris data loaded in R and accessible via the dataset variable.

2.3 Create a Validation Dataset

We need to know that the model we created is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.
That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
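The 80/20 split can be sketched with caret's createDataPartition function. The seed value is an assumption added here for repeatability; because the row selection is random, your exact rows (and later accuracy figures) may differ from anyone else's.

```r
library(caret)

data(iris)
dataset <- iris

# Assumption: any fixed seed makes the random split repeatable.
set.seed(7)

# Row indices for 80% of the data, balanced across the species.
validation_index <- createDataPartition(dataset$Species, p = 0.80, list = FALSE)

# Hold back the remaining 20% for validation.
validation <- dataset[-validation_index, ]

# Keep the 80% for training and evaluating models.
dataset <- dataset[validation_index, ]
```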

You now have training data in the dataset variable and a validation set we will use later in the validation variable.

Note that we replaced our dataset variable with the 80% sample of the dataset. This was an attempt to keep the rest of the code simpler and readable.

3. Summarize Dataset

Now it is time to take a look at the data.

In this step we are going to take a look at the data a few different ways:

  1. Dimensions of the dataset.
  2. Types of the attributes.
  3. Peek at the data itself.
  4. Levels of the class attribute.
  5. Breakdown of the instances in each class.
  6. Statistical summary of all attributes.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the dim function.
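A sketch of the call, with the training split from section 2.3 recreated in one line so the snippet runs on its own.

```r
# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[caret::createDataPartition(iris$Species, p = 0.80, list = FALSE), ]

# Rows first, then columns.
dim(dataset)
```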

You should see 120 instances and 5 attributes.

3.2 Types of Attributes

It is a good idea to check the types of the attributes. They could be doubles, integers, strings, factors and other types.

Knowing the types is important as it will give you an idea of how to better summarize the data you have and the types of transforms you might need to use to prepare the data before you model it.
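One way to list the types is to apply class to every column, as sketched here. Note that class reports double-precision columns as "numeric". The variable name types is illustrative.

```r
# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[caret::createDataPartition(iris$Species, p = 0.80, list = FALSE), ]

# Class of each column, returned as a named character vector.
types <- sapply(dataset, class)
print(types)
```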

You should see that all of the inputs are double and that the class value is a factor.

3.3 Peek at the Data

It is also always a good idea to actually eyeball your data.
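A sketch using head. By default head returns 6 rows, so n = 5 is passed here to match the 5 rows mentioned in the text; the variable name first_rows is illustrative.

```r
# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[caret::createDataPartition(iris$Species, p = 0.80, list = FALSE), ]

# Show the first 5 rows.
first_rows <- head(dataset, n = 5)
print(first_rows)
```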

You should see the first 5 rows of the data.

3.4 Levels of the Class

The class variable is a factor. A factor is a categorical type whose possible values are a fixed set of labels, called levels. Let’s look at the levels:
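A sketch of the call, referring to the Species column by name with the $ operator.

```r
# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[caret::createDataPartition(iris$Species, p = 0.80, list = FALSE), ]

# List the labels of the class attribute.
levels(dataset$Species)
```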

Notice above how we can refer to an attribute by name as a property of the dataset. In the results we can see that the class has 3 different labels (setosa, versicolor and virginica).

This is a multi-class or a multinomial classification problem. If there were two levels, it would be a binary classification problem.

3.5 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count and as a percentage.
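A sketch that builds both views with table and prop.table and shows them side by side. The variable names counts, percentage and distribution are illustrative.

```r
# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[caret::createDataPartition(iris$Species, p = 0.80, list = FALSE), ]

# Absolute count of rows per class.
counts <- table(dataset$Species)

# The same counts as a percentage of all rows.
percentage <- prop.table(counts) * 100

# Combine both into one small table.
distribution <- cbind(freq = counts, percentage = percentage)
print(distribution)
```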

We can see that each class has the same number of instances (40 or 33% of the dataset).

3.6 Statistical Summary

Now finally, we can take a look at a summary of each attribute.
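A sketch using the base summary function, which reports quartiles and the mean for numeric columns and counts for factor columns. The variable name stats is illustrative.

```r
# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[caret::createDataPartition(iris$Species, p = 0.80, list = FALSE), ]

# Per-attribute statistical summary.
stats <- summary(dataset)
print(stats)
```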

This includes the mean, the min and max values as well as some percentiles (25th, 50th or median, and 75th; that is, the values at these points if we ordered all the values for an attribute).

We can see that all of the numerical values have the same scale (centimeters) and similar ranges [0,8] centimeters.

4. Visualize Dataset

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

  1. Univariate plots to better understand each attribute.
  2. Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

It is helpful with visualization to have a way to refer to just the input attributes and just the output attribute. Let’s set that up and call the input attributes x and the output attribute (or class) y.

Given that the input variables are numeric, we can create box and whisker plots of each.
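A sketch of both steps: splitting the columns into x and y, then drawing one box and whisker plot per input attribute using base graphics. The side-by-side layout is an illustrative choice.

```r
# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[caret::createDataPartition(iris$Species, p = 0.80, list = FALSE), ]

# Inputs are the four measurement columns, output is the species column.
x <- dataset[, 1:4]
y <- dataset[, 5]

# One boxplot per input attribute, laid out in a single row.
par(mfrow = c(1, 4))
for (i in 1:4) {
  boxplot(x[, i], main = names(x)[i])
}
```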

This gives us a much clearer idea of the distribution of the input attributes:

Box and Whisker Plots in R

We can also create a barplot of the Species class variable to get a graphical representation of the class distribution (generally uninteresting in this case because they’re even).
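A sketch of the bar plot. Calling the base plot function on a factor draws a bar per level, so no extra package is needed.

```r
# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[caret::createDataPartition(iris$Species, p = 0.80, list = FALSE), ]

# The class attribute.
y <- dataset[, 5]

# Plotting a factor gives a bar plot of the class counts.
plot(y)
```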

This confirms what we learned in the last section, that the instances are evenly distributed across the three classes:

Bar Plot of Iris Flower Species

4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First let’s look at scatterplots of all pairs of attributes and color the points by class. In addition, because the scatterplots show that points for each class are generally separate, we can draw ellipses around them.
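A sketch using caret's featurePlot. Readers in the comments report that the ellipse option fails unless the separate ellipse package is installed, so the sketch checks for it first; that guard and the auto.key legend are additions for illustration.

```r
library(caret)

# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[createDataPartition(iris$Species, p = 0.80, list = FALSE), ]
x <- dataset[, 1:4]
y <- dataset[, 5]

# Scatterplot matrix with an ellipse around each class.
if (requireNamespace("ellipse", quietly = TRUE)) {
  # auto.key adds a legend mapping colors to classes (optional).
  print(featurePlot(x = x, y = y, plot = "ellipse",
                    auto.key = list(columns = 3)))
} else {
  message("Install the 'ellipse' package to draw this plot.")
}
```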

We can see some clear relationships between the input attributes (trends) and between attributes and the class values (ellipses):

Scatterplot Matrix of Iris Data in R

We can also look at box and whisker plots of each input variable again, but this time broken down into separate plots for each class. This can help to tease out obvious linear separations between the classes.
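A sketch of the per-class box plots with featurePlot. The free axis scales are an assumption made here so each attribute gets its own range; the variable names scales and p are illustrative.

```r
library(caret)

# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[createDataPartition(iris$Species, p = 0.80, list = FALSE), ]
x <- dataset[, 1:4]
y <- dataset[, 5]

# Let each panel use its own axis range.
scales <- list(x = list(relation = "free"), y = list(relation = "free"))

# Box and whisker plots for each attribute, one box per class.
p <- featurePlot(x = x, y = y, plot = "box", scales = scales)
print(p)
```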

This is useful to see that there are clearly different distributions of the attributes for each class value.

Box and Whisker Plot of Iris data by Class Value

Next we can get an idea of the distribution of each attribute, again like the box and whisker plots, broken down by class value. Sometimes histograms are good for this, but in this case we will use some probability density plots to give nice smooth lines for each distribution.
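A sketch of the density plots, again with featurePlot and free axis scales (an assumption, as in the box plot case). The vertical axis is a probability density rather than a probability, so values above 1 are possible when an attribute spans a narrow range.

```r
library(caret)

# Recreate the 80% training split from section 2.3 (row selection is random).
data(iris)
dataset <- iris[createDataPartition(iris$Species, p = 0.80, list = FALSE), ]
x <- dataset[, 1:4]
y <- dataset[, 5]

# Let each panel use its own axis range.
scales <- list(x = list(relation = "free"), y = list(relation = "free"))

# Density curves for each attribute, one curve per class.
p <- featurePlot(x = x, y = y, plot = "density", scales = scales)
print(p)
```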

Like the boxplots, we can see the difference in distribution of each attribute by class value. We can also see the Gaussian-like distribution (bell curve) of each attribute.

Density Plots of Iris Data By Class Value

5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

  1. Set up the test harness to use 10-fold cross validation.
  2. Build 5 different models to predict species from flower measurements.
  3. Select the best model.

5.1 Test Harness

We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits. We will also repeat the process 3 times for each algorithm with different splits of the data into 10 groups, in an effort to get a more accurate estimate.

We are using the metric of “Accuracy” to evaluate models. This is the ratio of the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the metric variable when we build and evaluate each model next.
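A sketch of the test harness using caret's trainControl. As a reader points out in the comments, method = "cv" runs a single pass of 10-fold cross validation; repeating it 3 times would need method = "repeatedcv" with repeats = 3, shown here as a commented alternative. The variable names control and metric are assumptions.

```r
library(caret)

# A single pass of 10-fold cross validation.
control <- trainControl(method = "cv", number = 10)

# Alternative: 10-fold cross validation repeated 3 times.
# control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Metric used to compare models.
metric <- "Accuracy"
```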

5.2 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 5 different algorithms:

  • Linear Discriminant Analysis (LDA)
  • Classification and Regression Trees (CART).
  • k-Nearest Neighbors (kNN).
  • Support Vector Machines (SVM) with a linear kernel.
  • Random Forest (RF)

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. It ensures the results are directly comparable.

Let’s build our five models:
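A sketch of the five fits with caret's train function. The method strings are caret's identifiers; "svmLinear" is chosen here to match the linear kernel named in the list above ("svmRadial" would give a radial kernel instead). The SVM and random forest fits need the kernlab and randomForest packages installed. The seed value is an assumption.

```r
library(caret)

# Recreate the training split and test harness from the earlier steps.
data(iris)
set.seed(7)
dataset <- iris[createDataPartition(iris$Species, p = 0.80, list = FALSE), ]
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"

# Reset the seed before each fit so every model sees the same folds.
set.seed(7)
fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                 metric = metric, trControl = control)
set.seed(7)
fit.cart <- train(Species ~ ., data = dataset, method = "rpart",
                  metric = metric, trControl = control)
set.seed(7)
fit.knn <- train(Species ~ ., data = dataset, method = "knn",
                 metric = metric, trControl = control)
set.seed(7)
fit.svm <- train(Species ~ ., data = dataset, method = "svmLinear",
                 metric = metric, trControl = control)
set.seed(7)
fit.rf <- train(Species ~ ., data = dataset, method = "rf",
                metric = metric, trControl = control)
```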

Caret does support configuring and tuning each model, but we are not going to cover that in this tutorial.

5.3 Select Best Model

We now have 5 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

We can report on the accuracy of each model by first creating a list of the created models and using the summary function.
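A sketch of the comparison using caret's resamples function. To keep the snippet short and self-contained it refits only three of the five models; in practice you would pass all five fitted models in the list.

```r
library(caret)

# Recreate the training split and test harness from the earlier steps.
data(iris)
set.seed(7)
dataset <- iris[createDataPartition(iris$Species, p = 0.80, list = FALSE), ]
control <- trainControl(method = "cv", number = 10)

# Fit a few models on identical folds (seed reset before each fit).
set.seed(7)
fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                 metric = "Accuracy", trControl = control)
set.seed(7)
fit.cart <- train(Species ~ ., data = dataset, method = "rpart",
                  metric = "Accuracy", trControl = control)
set.seed(7)
fit.knn <- train(Species ~ ., data = dataset, method = "knn",
                 metric = "Accuracy", trControl = control)

# Collect the per-fold results and summarize Accuracy and Kappa.
results <- resamples(list(lda = fit.lda, cart = fit.cart, knn = fit.knn))
summary(results)
```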

We can see the accuracy of each classifier and also other metrics like Kappa.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
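A sketch of the comparison plot, calling dotplot on a resamples object. Only two models are refit here to keep the snippet short; the variable name p is illustrative.

```r
library(caret)

# Recreate the training split and test harness from the earlier steps.
data(iris)
set.seed(7)
dataset <- iris[createDataPartition(iris$Species, p = 0.80, list = FALSE), ]
control <- trainControl(method = "cv", number = 10)

set.seed(7)
fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                 metric = "Accuracy", trControl = control)
set.seed(7)
fit.knn <- train(Species ~ ., data = dataset, method = "knn",
                 metric = "Accuracy", trControl = control)

# Plot the spread and mean of the per-fold metrics for each model.
results <- resamples(list(lda = fit.lda, knn = fit.knn))
p <- dotplot(results)
print(p)
```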

We can see that the most accurate model in this case was LDA:

Comparison of Machine Learning Algorithms on Iris Dataset in R

The results for just the LDA model can be summarized.
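A sketch of the summary step: printing a fitted caret model shows the training setup and the resampled accuracy. The exact figures depend on the random split, so yours may differ from those quoted in the text.

```r
library(caret)

# Recreate the training split and test harness from the earlier steps.
data(iris)
set.seed(7)
dataset <- iris[createDataPartition(iris$Species, p = 0.80, list = FALSE), ]
control <- trainControl(method = "cv", number = 10)

set.seed(7)
fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                 metric = "Accuracy", trControl = control)

# Summarize the model: samples, predictors, resampling and accuracy.
print(fit.lda)
```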

This gives a nice summary of what was used to train the model and the mean and standard deviation (SD) accuracy achieved, specifically 97.5% accuracy +/- 4%.

6. Make Predictions

The LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.
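A sketch of the final check using predict and caret's confusionMatrix. The same predict call works for any new data frame that has the four measurement columns. The seed is an assumption, and your accuracy may differ because the split is random.

```r
library(caret)

# Recreate the training and validation split from section 2.3.
data(iris)
set.seed(7)
validation_index <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
validation <- iris[-validation_index, ]
dataset <- iris[validation_index, ]

# Fit LDA with 10-fold cross validation on the training data.
control <- trainControl(method = "cv", number = 10)
set.seed(7)
fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                 metric = "Accuracy", trControl = control)

# Predict the species for the held-back rows.
predictions <- predict(fit.lda, validation)

# Compare predictions with the true labels.
cm <- confusionMatrix(predictions, validation$Species)
print(cm)
```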

We can see that the accuracy is 100%. It was a small validation dataset (20%), but this result is within our expected margin of 97% +/-4% suggesting we may have an accurate and a reliably accurate model.

You Can Do Machine Learning in R

Work through the tutorial above. It will take you 5-to-10 minutes, max!

You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result. You do not need to understand everything on the first pass. List down your questions as you go. Make heavy use of the ?FunctionName help syntax in R to learn about all of the functions that you’re using.

You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.

You do not need to be an R programmer. The syntax of the R language can be confusing. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a <- "b"). This will get you most of the way. You are a developer, you know how to pick up the basics of a language real fast. Just get started and dive into the details later.

You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.

What about other steps in a machine learning project? We did not cover all of the steps in a machine learning project because this is your first project and we need to focus on the key steps. Namely, loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we can look at other data preparation and result improvement tasks.


In this post you discovered step-by-step how to complete your first machine learning project in R.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.

Your Next Step

Did you work through the tutorial?

  1. Work through the above tutorial.
  2. List any questions you have.
  3. Search or research the answers.

Remember, you can use the ?FunctionName in R to get help on any function.

Do you have a question? Post it in the comments below.


91 Responses to Your First Machine Learning Project in R Step-By-Step (tutorial and template for future projects)

  1. Leszek Pawlowicz February 3, 2016 at 6:28 am #

    This is what I can’t stand about open-source packages like R (and Python, and LibreOffice): Nobody puts in the effort required to make sure things work properly, it’s almost impossible to duplicate working environments, and the error messages are cryptically impossible. Trying to generate the scatterplot matrix above, cutting and pasting the command into R, I got the following error message:

    Error in, name$name, strict) :
    Viewport ‘’ was not found

    Google Search provided no help. After getting featurePlot to work with all options other than “ellipse”, finally stumbled across the solution that you needed to have the “ellipse” package installed on your system. I’m guessing that you have that as a default library on your system, so you didn’t specify it was required to use that function. But how many people reading this post will be able to figure that out?

    • Jason Brownlee February 3, 2016 at 8:21 am #

      Thanks for pointing that out Leszek. To be honest I’ve not heard of that package before. Perhaps it is installed automatically with the “caret” or “lattice” packages?

      • Rajendra December 11, 2016 at 1:07 am #

        Thanks for the post… worked after installed ellipse package. not installed with caret.

    • The R Enthusiast February 10, 2016 at 6:14 pm #

      He did not have the “ellipse” package as default on his system. What he did was that he installed the “caret” package using the code he provided above:

      install.packages("caret", dependencies = c("Depends", "Suggests"))

      The result was that ALL the packages that were likely to be used by the “caret” package were also installed… including the “ellipse” package. You could have avoided your frustration by simply following the instructions in the tutorial.

      • Jonathan March 30, 2017 at 11:35 pm #

        Wrong. I followed the instructions exactly as listed and it didn’t work for me either. When I explicitly installed the ellipse library it worked fine.

  2. Leszek Pawlowicz February 3, 2016 at 6:39 am #

    And, moving on, found that there were additional packages that needed to be installed and loaded, and then wound up with an Accuracy table that didn’t get the same results as you did, despite copying and pasting all the commands exactly as written. This doesn’t give me a lot of confidence about reproducibility in R.

    • Jason Brownlee February 3, 2016 at 8:20 am #

      Post your results!

      It is true that strictly reproducible results can be difficult in R. I find you need to sprinkle a lot of set.seed(…) calls around the place, and even then it’s difficult.

    • The R Enthusiast February 10, 2016 at 6:18 pm #

      The reason why your accuracy table is not the same mainly comes from the fact that the “createDataPartition()” function chooses observations in the dataset randomly. This means that the training and validation datasets are essentially different for everybody. Consequently, the end results will be slightly different.

  3. Mohan Raj February 4, 2016 at 10:02 pm #

    Hi Jason,

    When i loaded the caret package using below query,

    > require(caret)

    Loading required package: caret
    Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
    there is no package called ‘pbkrtest’
    In addition: Warning message:
    package ‘caret’ was built under R version 3.2.3

    I have assigned the iris dataset to dataset2. then i executed the below query
    createDataPartition(dataset2$Species, p=0.80, list=FALSE) is not working. I am getting the error message when i execute the above query.

    Error Message:
    Error: could not find function "createDataPartition"

    Let me know as what went wrong.


    • Jason Brownlee February 5, 2016 at 5:21 am #

      This looks like a problem specific to your environment. Consider re-installing the caret package with all dependencies:

      install.packages("caret", dependencies = c("Depends", "Suggests"))

      • Jason Brownlee February 7, 2016 at 6:43 am #

        I’ve added this command to the install packages section, just in case others find it useful.

  4. Rajesh February 6, 2016 at 5:52 am #

    If the R version is 3.2.1 or below the caret package may turn incompatible. I faced similar issue. After uninstalling the old version I installed R 3.2.3 which fixed the error.

    Hope it helps.


    • Jason Brownlee February 7, 2016 at 6:40 am #

      Thanks Rajesh, I updated the post and added a note to use R 3.2.3 or higher.

  5. CW February 7, 2016 at 3:32 am #

    Thanks for sharing this. I had to grab another package (kernlab) to run the SVM fit, but everything rolled smoothly, otherwise. Great 15min introduction!

  6. Ajit Jaokar February 16, 2016 at 3:17 am #

    Hello Jason, this is an interesting tutorial and getting to grips with Caret still. Qs is: in the sctarrerplot matix(which is used from caret I think) how do we know what colours corespond to which class Rgds Ajit

    • Jason Brownlee February 16, 2016 at 5:49 am #

      That is a good question. It is a good idea to add a legend to your graphs.

      I did not add a legend in this case because we were not interested in which class was which only in the general separation of the classes. Type ?featurePlot to learn more about adding a legend.

  7. Ajit Jaokar February 16, 2016 at 9:25 pm #

    Thanks Jason. But more to the point .. where in the code do you assign the legend(or does the legend get picked up automatically ie which colour to which class. If it does so implicitly, how do I know what colour coresponds to what class?

    Also another question
    It says “We will 10-fold cross validation to estimate accuracy. This will split our dataset into 10 parts, train in 9 and test on 1 and release for all combinations of train-test splits. We will also repeat the process 3 times for each algorithm with different splits of the data into 10 groups, in an effort to get a more accurate estimate.” Hence I should expect to see 15 steps(3 times per algorithm with different splits) but we see here 5 steps(once) where do we try the other two times? kind rgds Ajit

    • Roberto Ulloa December 8, 2016 at 12:26 am #

      Hi Ajit,

      The repetitions should be indicated in the trainControl function. When I was reading, I though 3 was the default, but this didn’t seem to be the case according to the documentation ?trainControl

      According to this (, the method parameter should have been “repeatedcv” and not just “cv”, and then the parameters repeats should have been 3.

      trainControl(method="repeatedcv", number=10, repeats=3)

      You can verify that the training takes longer and the confidence intervals of the plots are smaller, so I might be right. However, I am not absolutely sure if this is correct, because I don’t know how to visually check the folds. Also, I don’t know how to get each individual result of each cv and repetition from the fits, e.g. fit.lda.


  8. Kiri Kurukkal February 27, 2016 at 8:58 pm #

    Thanks Jason for the great tutorial. I was able to reproduce the same results by following your instructions carefully. BTW, I reviewed some of the other posts above and most of the dependencies could have been resolved by loading the library(caret) at the beginning. I did encounter one issue prior to loading the library(caret) with the Error: could not find function “createDataPartition”. This error was resolved by loading the required library(caret). This loaded other required packages.

    > library(caret)
    Loading required package: lattice
    Loading required package: ggplot2

    Happy coding / earning!

  9. Rick Pack February 29, 2016 at 11:29 am #

    If anyone wants more practice, I did my best to recall the code Chad Hines and I added to the tutorial so one can examine the mismatches for LDA on the training set. Thank you to Jason Brownlee for this tutorial and to Kevin Feasel and Jamie Dixon for coordinating the .NET Triangle “Introduction to R” dojo last week.

  10. Phuong May 24, 2016 at 5:14 pm #

    Hi Jason Brownlee,
    Thanks for your tutorial. However, I have a question about featurePlot function with plot = “density ” option. I couldn’t figure out the meaning of vertical axis in these plots for each features. Why the vertical axes have values that are greater than 1 (in the case of density)

  11. Johnny September 7, 2016 at 11:53 pm #

    Hi Jason, very thorough and great practice for a newbie like myself. I will definitely be referring back to this one often.

  12. Mrinal September 21, 2016 at 3:21 pm #

    Thanks, Brownlee. I would like to know of selecting best model. Is it guaranteed that a model giving highest accuracy can give the result of highest accuracy for test data? In this example, you have selected lda as the best model comparing the accuracies of the used models. But it may not predict best during testing. So my question is: when should we select our model? After training models or testing models?

    • Jason Brownlee September 22, 2016 at 8:07 am #

      We cannot be sure we have picked the best model.

      We must gather evidence to support a given decision.

      More testing with k-fold cross validation and hold-out validation datasets can increase our confidence.

  13. Justin Nunez October 18, 2016 at 10:34 am #

    So, what is next?

    What can one do to get better at this? Any practice? Anything that builds on this?

    What happens when there is “noise” in the data, how do we clean it and apply it to ML properly?

  14. Jerry October 27, 2016 at 11:41 am #

    Tested in rstudio-ide. Works fine! Two small changes required:
    # Install Packages
    install.packages(‘caret’, repos=’’)
    # e1071
    install.packages(‘e1071’, dependencies=TRUE)

  15. Kenneth November 24, 2016 at 5:20 am #

    Is this a typo on this page?

    “A machine learning project may not be linear, but (it has a has) a number of well known steps:”

    • Jason Brownlee November 24, 2016 at 10:43 am #

      Yes, I intended to talk about the process of a machine learning project not being linear.

      I’ve fixed the typo, thanks.

  16. Rajendra December 11, 2016 at 12:59 am #

    Indeed it is good post, but as it is framed in the mind for ML Learners, would have explained in details of each section much more clear, for ex, 4.1 barplot section, would have explained understand number of diagram. mere walk-through would not help anything

  17. Stef January 13, 2017 at 6:59 pm #

    Excellent, thank you, managed to do this with my own dataset but struggling to plot an ROC curve after. Please can you help by posting the code to plot the ROC curve? thanks

  18. Lewis Walker February 26, 2017 at 10:30 pm #

    I have just started learning R and trying to use this Tutorial to fit my Dataset into it, and had a few problems like missing packages, I did however notice that when you library(caret) it will say what is missing so it’s a simple case of install.packages(missing package displayed).

    I would however like to split my dataset up a bit more, this tutorial uses
    “validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)"

    My dataset is pretty large and I would like to split it into 3 or 4, like rather than an 80/20 split I would like a 50/25/25 or a 40/30/30. As I said I'm new to R so if my way of splitting it isn't the way it should be done just tell me :).

    • Jason Brownlee February 27, 2017 at 5:52 am #

      Hi Lewis,

      I believe createDataPartition() is for creating train/test splits.

      You could use it to create one split, then re-split one of the halves if you like.

      I hope that helps.
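
      The re-split idea above can be sketched as follows. This is a minimal example on the iris data, assuming the caret package is installed; the names train_index, test_index, training, remainder and testing are made up for illustration, and the 50/25/25 proportions are taken from the question.

```r
# Sketch: a three-way split made by calling createDataPartition() twice.
library(caret)

data(iris)
dataset <- iris
set.seed(7)

# First split: keep about 50% of the rows for training.
train_index <- createDataPartition(dataset$Species, p=0.50, list=FALSE)
training <- dataset[train_index, ]
remainder <- dataset[-train_index, ]

# Second split: divide the remaining rows roughly in half,
# giving about 25% of the original data to each of test and validation.
test_index <- createDataPartition(remainder$Species, p=0.50, list=FALSE)
testing <- remainder[test_index, ]
validation <- remainder[-test_index, ]

# Row counts of the three partitions.
print(c(train=nrow(training), test=nrow(testing), validation=nrow(validation)))
```

      Because createDataPartition() samples within each class, the partition sizes are approximately, not exactly, 50/25/25. For a 40/30/30 split, the first call would use p=0.40 and the second would stay at p=0.50.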

  19. Sai March 25, 2017 at 6:34 am #

    Very Nice article. Thank you. It was a very good starter for me as a new R programmer.
    But one question I have is in section 6 (“Make Predictions”).

    I understand that we are predicting the accuracy of our model in that section. But can you please elaborate on how to make prediction for some new data set ? I am not clear in that prediction part.

  20. Muriel March 29, 2017 at 11:45 pm #

    This is very helpful. Thanks. But I have a question. In a case where I have two datasets, say trainingdata.csv and testdata.csv, how do I load them into R, train my algorithm on the training data and test it on the test data set?


  21. Muriel April 1, 2017 at 7:05 pm #

    I know how to load this data. My question is: if I have two data sets, the training data and the test data, what functions must I use for R to recognise my training data to build the models on and my test data to validate them? This is unlike the Iris project, where there is one dataset that is split 80%/20%.

    • Jason Brownlee April 2, 2017 at 6:25 am #

      Sorry, I don’t understand your question. Perhaps you can rephrase it?

  22. Muriel April 1, 2017 at 7:07 pm #

    on the iris project, am getting an error for the function to partition data. See below commands.

    > #attach the iris dataset to the environment
    > data(iris)
    > #rename the dataset
    > dataset <- iris
    > # create a list of 80% of the rows in the original dataset we can use for training
    > validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
    Error: could not find function "createDataPartition"

  23. Mike April 10, 2017 at 5:19 am #

    Great post, thanks. I got it working. But when I replaced iris with my own data, I got an error:
    “Metric Accuracy not applicable for regression models” for all non-linear models. Here is my data:

    Any suggestions?

    • Jason Brownlee April 10, 2017 at 7:41 am #

      It sounds like your output variable is a real value (regression) and not a label (classification).

      You may want to convert your problem to classification, or use a regression algorithm and a regression evaluation measure.
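
      Both options can be sketched roughly as below. The data frame mydata and its columns x1, x2, label and y are hypothetical stand-ins, since the commenter's data is not shown; the example assumes the caret and rpart packages are installed.

```r
library(caret)

# Hypothetical data: the outcome 'label' is stored as the numbers 0 and 1.
set.seed(7)
mydata <- data.frame(x1=rnorm(100), x2=rnorm(100))
mydata$label <- ifelse(mydata$x1 + mydata$x2 > 0, 1, 0)

# A numeric outcome makes caret treat the problem as regression,
# which is when the Accuracy metric is rejected.
print(class(mydata$label))

control <- trainControl(method="cv", number=10)

# Option 1: the values are really class labels, so convert them to a factor.
mydata$label <- factor(mydata$label, labels=c("no", "yes"))
fit.cart <- train(label~., data=mydata, method="rpart", metric="Accuracy", trControl=control)
print(fit.cart$modelType)

# Option 2: the output really is a real value, so keep it numeric
# and ask for a regression metric such as RMSE instead.
mydata$y <- 2 * mydata$x1 + rnorm(100)
fit.lm <- train(y~x1+x2, data=mydata, method="lm", metric="RMSE", trControl=control)
print(fit.lm$modelType)
```

      Checking class() on the output column before training is a quick way to see which of the two cases applies.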

  24. Jonatan April 11, 2017 at 6:01 pm #

    Thanks for an excellent post Jason, great help!

  25. Yenework A Mola April 26, 2017 at 1:11 pm #

    Thank you sir ! This is really the best tutorial . I learned a lot from it and i applied it to a different dataset . But how about comparing the models using ROC curve using the caret package ?

  26. Virendra May 9, 2017 at 7:35 pm #

    In Multivariate Plots, while trying to scatterplot matrix I am getting following error:-

    Error in, name$name, strict) :
    Viewport ‘’ was not found

    I am using R x64 3.4.0.

    • Jason Brownlee May 10, 2017 at 8:47 am #

      Sorry, I have not seen that error before. Perhaps post it as a question on stackoverflow?

  27. Andre Yakana May 11, 2017 at 12:49 pm #

    this is very interesting sir, but i will like help on how to better explain the plots and what each mean especially the scatterplot. i am saying in a situation where i would have to explain to an audience

  28. dds May 13, 2017 at 11:45 am #

    for print(fit.lda), I don’t see Accuracy SD or Kappa SD printed/displayed … any hints? Thanks.

    • Jason Brownlee May 14, 2017 at 7:24 am #

      I think the caret API has changed since I posted the example.
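
      If the standard deviations are no longer printed, they may still be available on the fitted object. The following is a small sketch, assuming a reasonably recent caret version in which the resampling summary is stored in the results data frame; fit.lda is rebuilt here on iris so the example is self-contained.

```r
library(caret)

data(iris)
set.seed(7)
control <- trainControl(method="cv", number=10)
fit.lda <- train(Species~., data=iris, method="lda", metric="Accuracy", trControl=control)

# The resampling summary is stored as a data frame on the fitted object.
print(fit.lda$results)

# Standard deviation of Accuracy and Kappa across the cross-validation folds.
print(fit.lda$results[, c("AccuracySD", "KappaSD")])
```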

  29. tahir May 20, 2017 at 10:23 pm #

    good job Jason , but can I plot the SVM results in R? So I get the hyperplane and support vector points

    • Jason Brownlee May 21, 2017 at 5:59 am #

      You can, but I have not done this myself in a long time. I do not recall the function name offhand, sorry.

  30. Hari May 25, 2017 at 12:05 pm #

    Hello Jason! Thanks for making this ML tutorial. I am having trouble in the model building part. I am getting error in “rpart”, “knn”. When I run the code for rpart, the error is “Something is wrong: all the accuracy metric values are missing:” “Error: Stopping” “In addition: There were 26 warnings (use warnings() to see them)” , however for “knn”, the last error line I am getting 50 warnings.

    Also, when I run “svmRadial”, it seems to run without any problem; however, when I run the code for “rf”, I get this:

    Loading required package: randomForest
    randomForest 4.6-12
    Type rfNews() to see new features/changes/bug fixes.

    Attaching package: ‘randomForest’

    The following object is masked from ‘package:dplyr’:


    The following object is masked from ‘package:ggplot2’:


    I have a Version 1.0.136 of RStudio. Your help is much appreciated!

    • Jason Brownlee June 2, 2017 at 11:41 am #

      I’m sorry to hear that. Ensure you have the latest version of R and the caret package installed. You may also want to install all recommended dependencies.

  31. Hans June 6, 2017 at 10:47 am #

    Do you have such an R tutorial for regression problems too?

  32. Sunny June 7, 2017 at 4:48 am #

    This post is exactly what I was looking for. Very well put together and I’m excited about it. I will share it with some students over at UCSF. For those reading the comments, I typed everything in manually directly from Dr. Brownlee’s scripts. You learn more that way because you’re likely to make a mistake when typing at some point.

  33. sudi June 8, 2017 at 3:48 pm #

    this post helps a lot but need little more clarification about boxplot and barchart becoz i am new for ml and r.could u plz explain me…it would be more helpful for me
    what does this code tell us.cant understand plz help me

    for(i in 1:4) {
      boxplot(x[,i], main=names(iris)[i])
    }

    • Jason Brownlee June 9, 2017 at 6:18 am #

      It creates a composite plot of 4 boxplots side by side.
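
      For context, here is a self-contained sketch of how such a loop produces side-by-side plots. It assumes that x holds the four numeric iris columns and that par(mfrow=...) is used to set up the plotting grid; neither line appears in the comment above, so treat both as assumptions.

```r
data(iris)

# x holds the four numeric input attributes (assumed definition).
x <- iris[, 1:4]

# Ask for a grid of 1 row and 4 columns, so each new plot fills the next cell.
par(mfrow=c(1, 4))

# Draw one boxplot per attribute, titled with the column name.
for (i in 1:4) {
  boxplot(x[, i], main=names(iris)[i])
}

# Restore the default single-plot layout.
par(mfrow=c(1, 1))
```

      In each boxplot the thick line is the median, the box spans the middle 50% of the values, and points drawn beyond the whiskers are possible outliers.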

  34. Aman June 20, 2017 at 6:36 pm #

    Thank so much sir. This is very helpful. Sir, I have a question.
    When I run LDA, SVM, RF, CART model always shows that Loading required package: MASS for LDA and so on for all methods that you mention. Although I get the results without loading specific package for each methods,but is it any problem if load the specific package or not?

    And if I load the package for each methods then function will be change such as for random forest we need to call the model:- randomForest(…) with package “randomForest”.

    • Jason Brownlee June 21, 2017 at 8:10 am #

      It is normal for caret to load the packages it needs to make predictions.
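
      To see which package caret relies on for each method name, one option is to query its model registry. This is a hedged sketch, assuming caret is installed; method_names is a made-up variable listing the five methods from the tutorial.

```r
library(caret)

# The caret method names used for the five models in the tutorial.
method_names <- c("lda", "rpart", "knn", "svmRadial", "rf")

# Print the package(s), if any, that caret loads for each method.
for (m in method_names) {
  info <- getModelInfo(m, regex=FALSE)[[1]]
  cat(m, "->", paste(info$library, collapse=", "), "\n")
}
```

      When a model is trained through train(), caret calls the underlying package for you, so there is no need to call library(randomForest) or randomForest(…) yourself unless you want to use that package directly.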

  35. Manish Chakraborty July 2, 2017 at 1:16 am #

    Hi, I have installed the “caret” package. But after this when i am loading through library(caret), I am getting the below error:

    Loading required package: ggplot2
    Error: package or namespace load failed for ‘ggplot2’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
    there is no package called ‘munsell’
    Error: package ‘ggplot2’ could not be loaded

    • Jason Brownlee July 2, 2017 at 6:31 am #

      I’m sorry, I have not seen this error. Perhaps check on stackoverflow if anyone has had this fault or consider posting the error there.

      • Manish Chakraborty July 2, 2017 at 1:39 pm #

        Hi Jason,
        Post some R&D was able to resolve it. Below are the actions i did.

  36. Manish Chakraborty July 2, 2017 at 2:06 pm #

    Hi Jason,

    Need one help again. Thanks in advance.

    Since this is my first Data Science Project, so the question.

    What and how to interpret from the result of BoxPlot. It will be of help if you can kindly explain a bit of the outcome of the BoxPlot.

  37. Eesha July 6, 2017 at 2:33 pm #

    Hello Dr Brownlee,

    I am new to machine learning and attempting to go through your tutorial.
    I keep getting an error saying that the accuracy metric values are missing for this line:

    results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))

    The accuracy metric for lda works; however, cart, knn, svm and rf do not work.
    Do you have any suggestions for how to fix this?


    • Jason Brownlee July 9, 2017 at 10:23 am #

      I’m sorry to hear that. Confirm your packages are up to date.

  38. jui July 6, 2017 at 6:54 pm #

    sir, how could i plot this confusionMatrix “confusionMatrix(predictions, validation$Species)”?

  39. Sankalp July 28, 2017 at 4:17 pm #

    > predictions confusionMatrix(predictions, validation$Species)
    Error in confusionMatrix(predictions, validation$Species) :
    object ‘predictions’ not found

    Could anyone clarify this error ?

    • Sankalp July 28, 2017 at 4:18 pm #

      predictions confusionMatrix(predictions, validation$Species)
      Error in confusionMatrix(predictions, validation$Species) :
      object ‘predictions’ not found

      Could anyone clarify this error? Earlier I posted something wrong.

      • Jason Brownlee July 29, 2017 at 8:06 am #

        Perhaps double check that you have all of the code from the post?

  40. Saurabh August 4, 2017 at 11:00 pm #


    I am a beginner in this, so maybe the question I am going to ask won’t make sense, but I would request you to please answer:
    When we say let’s predict something, what exactly are we predicting here?
    In the case of machine (motor, pump, etc.) data (current, RPM, vibration), what is it that can be predicted?


    • Jason Brownlee August 5, 2017 at 5:47 am #

      In this tutorial, given the measurements of iris flowers, we use a model to predict the species.
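
      As a rough illustration of that idea, the sketch below trains a model on iris and predicts the species for new measurements. The new_flowers values are invented for the example, and the choice of LDA is an assumption; any of the fitted models could be used the same way.

```r
library(caret)

data(iris)
set.seed(7)
control <- trainControl(method="cv", number=10)
fit.lda <- train(Species~., data=iris, method="lda", metric="Accuracy", trControl=control)

# Hypothetical measurements (in cm) for two flowers the model has not seen.
# The column names must match the input columns used for training.
new_flowers <- data.frame(
  Sepal.Length=c(5.0, 6.5),
  Sepal.Width=c(3.4, 3.0),
  Petal.Length=c(1.5, 5.5),
  Petal.Width=c(0.2, 2.0)
)

# The prediction is one species label per row of new data.
predictions <- predict(fit.lda, newdata=new_flowers)
print(predictions)
```

      For machine data such as current, RPM and vibration, the same pattern would apply once you decide which column is the output to predict, for example a hypothetical fault/no-fault label.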
