R Machine Learning Mini-Course

By Jason Brownlee on August 22, 2019 in R Machine Learning

From Developer to Machine Learning Practitioner in 14 Days

In this mini-course you will discover how you can get started, build accurate models and confidently complete predictive modeling machine learning projects using R in 14 days.

This is a big and important post. You might want to bookmark it.

Kick-start your project with my new book Machine Learning Mastery With R, including step-by-step tutorials and the R source code files for all examples.

Let’s get started.

Who Is This Mini-Course For?

Before we get started, let’s make sure you are in the right place. The list below provides some general guidelines as to who this course was designed for.

Don’t panic if you don’t match these points exactly; you might just need to brush up in one area or another to keep up.

Developers that know how to write a little code. This means that it is not a big deal for you to pick up a new programming language like R once you know the basic syntax. It does not mean you are a wizard coder, just that you can follow a basic C-like language with little effort.
Developers that know a little machine learning. This means you know about the basics of machine learning like cross validation, some algorithms and bias-variance trade-off. It does not mean that you are a machine learning PhD, just that you know the landmarks or know where to look them up.

This mini-course is neither a textbook on R nor a textbook on machine learning.

It will take you from a developer who knows a little machine learning to a developer who can get results using R, one of the most powerful and popular platforms for machine learning.

Mini-Course Overview (what to expect)

This mini-course is broken down into 14 lessons that I call “days”.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hard core!). It really depends on the time you have available and your level of enthusiasm.

Below are 14 lessons that will get you started and productive with machine learning in R:

Day 1: Download and Install R.
Day 2: Get Around In R with Basic Syntax.
Day 3: Load Data and Standard Machine Learning Datasets.
Day 4: Understand Data with Descriptive Statistics.
Day 5: Understand Data with Visualization.
Day 6: Prepare For Modeling by Pre-Processing Data.
Day 7: Algorithm Evaluation With Resampling Methods.
Day 8: Algorithm Evaluation Metrics.
Day 9: Spot-Check Algorithms.
Day 10: Model Comparison and Selection.
Day 11: Improve Accuracy with Algorithm Tuning.
Day 12: Improve Accuracy with Ensemble Predictions.
Day 13: Finalize And Save Your Model.
Day 14: Hello World End-to-End Project.

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to look for help on and about the R platform (hint: I have all of the answers directly on this blog; use the search).

I do provide more help in the early lessons because I want you to build up some confidence and inertia. Hang in there, don’t give up!

Day 1: Download and Install R

You cannot get started with machine learning in R until you have access to the platform.

Today’s lesson is easy: you must download and install the R platform on your computer.

Visit the R homepage and download R for your operating system (Linux, OS X or Windows).
Install R on your computer.
Start R for the first time from command line by typing “R”.
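
Once R starts, a quick way to confirm the installation is to ask R to describe itself. The lines below are a minimal sketch that uses only base R; the output depends on your version and platform.

```r
# print the version string of the running R installation
print(R.version.string)
# show the platform, locale and currently attached packages
print(sessionInfo())
```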

If you need help installing R, check out the post:

Use R For Machine Learning

Day 2: Get Around In R with Basic Syntax

You need to be able to read and write basic R scripts.

As a developer, you can pick up new programming languages pretty quickly. R is case sensitive, uses the hash (#) for comments and conventionally uses the arrow operator (<-) for assignment rather than the single equals (=), although = also works for assignment at the top level.

Today’s task is to practice the basic syntax of the R programming language in the R interactive environment.

Practice assignment in the language using the arrow operator (<-).
Practice using basic data structures like vectors, lists and data frames.
Practice using flow control structures like If-Then-Else and loops.
Practice calling functions, installing and loading packages.

For example, the snippet below creates a vector of numbers and calculates the mean.

numbers <- c(1, 2, 3, 4, 5, 6)
mean(numbers)
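
The other practice items (data structures, flow control and functions) could be sketched along the same lines. The names total, person, df and square are made up for illustration and are not part of the course.

```r
# assignment with the arrow operator
total <- 0
# a vector holds values of a single type
numbers <- c(1, 2, 3, 4, 5, 6)
# a list can mix types
person <- list(name="Ada", age=36)
# a data frame is a table with named columns
df <- data.frame(x=numbers, y=numbers^2)
# loop over the vector: add even values, subtract odd values
for (n in numbers) {
  if (n %% 2 == 0) {
    total <- total + n
  } else {
    total <- total - n
  }
}
print(total)
# define a function and call it
square <- function(v) {
  v * v
}
print(square(4))
# access a list element and check the data frame dimensions
print(person$name)
print(dim(df))
```

Packages are installed once with install.packages() and loaded in each session with library(), as the later lessons show.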

If you need help with basic R syntax, see the post:

Super Fast Crash Course in R.

Day 3: Load Data and Standard Machine Learning Datasets

Machine learning algorithms need data. You can load your own data from CSV files but when you are getting started with machine learning in R you should practice on standard machine learning datasets.

Your task for today’s lesson is to get comfortable loading data into R and to find and load standard machine learning datasets.

The datasets package that comes with R has many standard datasets including the famous iris flowers dataset. The mlbench package also contains many standard machine learning datasets.

Practice loading CSV files into R using the read.csv() function.
Practice loading standard machine learning datasets from the datasets and mlbench packages.

Help: You can get help about a function by typing ?FunctionName or by calling the help() function and passing the function name that you need help with as an argument.

To get you started, the below snippet will install and load the mlbench package, list all of the datasets it offers and attach the PimaIndiansDiabetes dataset to your environment for you to play with.

install.packages("mlbench")
library(mlbench)
data(package="mlbench")
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)
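
For the CSV practice item, a minimal sketch is shown below. It assumes you have no CSV file at hand, so it first writes the built-in iris dataset to a temporary file as a stand-in for your own data; csvPath and dataset are illustrative names.

```r
# write the built-in iris dataset to a temporary CSV file (a stand-in for your own file)
data(iris)
csvPath <- tempfile(fileext=".csv")
write.csv(iris, csvPath, row.names=FALSE)
# load the CSV file back into a data frame, converting text columns to factors
dataset <- read.csv(csvPath, header=TRUE, stringsAsFactors=TRUE)
# confirm the shape and peek at the first rows
print(dim(dataset))
print(head(dataset))
# to read the help page for the function, type: ?read.csv
```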

Well done for making it this far! Hang in there.

Any questions so far? Ask in the comments.

Day 4: Understand Data with Descriptive Statistics

Once you have loaded your data into R you need to be able to understand it.

The better you can understand your data, the better and more accurate the models that you can build. The first step to understanding your data is to use descriptive statistics.

Today your lesson is to learn how to use descriptive statistics to understand your data.

Understand your data using the head() function to look at the first few rows.
Review the dimensions of your data with the dim() function.
Review the distribution of your data with the summary() function.
Calculate pair-wise correlation between your variables using the cor() function.

The below example loads the iris dataset and summarizes the distribution of each attribute.

data(iris)
summary(iris)
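
The remaining functions from the list can be tried on the same dataset. This is a small sketch using base R only; the correlation is restricted to the four numeric columns because cor() does not accept the Species factor.

```r
# load the dataset
data(iris)
# look at the first five rows
print(head(iris, n=5))
# number of rows and columns
print(dim(iris))
# type of each attribute
print(sapply(iris, class))
# pair-wise Pearson correlation between the numeric attributes
correlations <- cor(iris[,1:4])
print(round(correlations, 2))
# number of instances in each class
print(table(iris$Species))
```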

Try it out!

Day 5: Understand Data with Visualization

Continuing on from yesterday’s lesson, you must spend time to better understand your data.

A second way to improve your understanding of your data is by using data visualization techniques (e.g. plotting).

Today, your lesson is to learn how to use plotting in R to understand attributes alone and their interactions.

Use the hist() function to create a histogram of each attribute.
Use the boxplot() function to create box and whisker plots of each attribute.
Use the pairs() function to create pair-wise scatterplots of all attributes.

For example, the snippet below will load the iris dataset and create a scatterplot matrix of the dataset.

data(iris)
pairs(iris)
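
Histograms and box plots of each attribute could be produced in a similar way. The sketch below uses base graphics only; the 2x2 layout and the choice to group the box plots by species are illustrative assumptions.

```r
# load the dataset
data(iris)
# arrange plots in a 2x2 grid and remember the previous settings
oldPar <- par(mfrow=c(2,2))
# histogram of each numeric attribute
for (i in 1:4) {
  hist(iris[,i], main=names(iris)[i], xlab=names(iris)[i])
}
# box and whisker plot of each numeric attribute, grouped by class
for (i in 1:4) {
  boxplot(iris[,i] ~ iris$Species, main=names(iris)[i], xlab="Species", ylab=names(iris)[i])
}
# restore the previous plot layout
par(oldPar)
```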

Day 6: Prepare For Modeling by Pre-Processing Data

Your raw data may not be in the best shape for modeling.

Sometimes you need to pre-process your data in order to best present the inherent structure of the problem in your data to the modeling algorithms. In today’s lesson, you will use the pre-processing capabilities provided by the caret package.

The caret package provides the preProcess() function that takes a method argument to indicate the type of pre-processing to perform. Once the pre-processing parameters have been prepared from a dataset, the same pre-processing step can be applied to each dataset that you may have.

Remember, you can install and load the caret package as follows:

install.packages("caret")
library(caret)

Standardize numerical data (e.g. mean of 0 and standard deviation of 1) using the scale and center options.
Normalize numerical data (e.g. to a range of 0-1) using the range option.
Explore more advanced power transforms like the Box-Cox power transform with the BoxCox option.

For example, the snippet below loads the iris dataset, calculates the parameters needed to normalize the data, then creates a normalized copy of the data.

# load caret package
library(caret)
# load the dataset
data(iris)
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# transform the dataset using the pre-processing parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
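
Standardization follows the same two-step pattern with a different method argument. The sketch below assumes the caret package is installed; the variable name standardized is illustrative.

```r
# load caret package
library(caret)
# load the dataset
data(iris)
# calculate the centering and scaling parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center", "scale"))
print(preprocessParams)
# apply the parameters to create a standardized copy of the data
standardized <- predict(preprocessParams, iris[,1:4])
# each attribute should now have a mean near 0 and a standard deviation near 1
print(round(colMeans(standardized), 6))
print(apply(standardized, 2, sd))
```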

Day 7: Algorithm Evaluation With Resampling Methods

The dataset used to train a machine learning algorithm is called a training dataset. The dataset used to train an algorithm cannot be used to give you reliable estimates of the accuracy of the model on new data. This is a big problem because the whole idea of creating the model is to make predictions on new data.

You can use statistical methods called resampling methods to split your training dataset up into subsets: some are used to train the model and others are held back and used to estimate the accuracy of the model on unseen data.

Your goal with today’s lesson is to practice using the different resampling methods available in the caret package. Look up the help on the createDataPartition(), trainControl() and train() functions in R.

Split a dataset into training and test sets.
Estimate the accuracy of an algorithm using k-fold cross validation.
Estimate the accuracy of an algorithm using repeated k-fold cross validation.

The snippet below uses the caret package to estimate the accuracy of the Naive Bayes algorithm on the iris dataset using 10-fold cross validation.

# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="cv", number=10)
# estimate the accuracy of Naive Bayes on the dataset
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")
# summarize the estimated accuracy
print(fit)
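
The train/test split and repeated cross validation items could be sketched as follows. This is an illustration rather than a prescribed recipe: it swaps in linear discriminant analysis (method="lda") for Naive Bayes so that no extra modeling package is needed, and the 80/20 split, the seed and the 3 repeats are arbitrary choices.

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# split the data 80/20, stratified by class
set.seed(7)
trainIndex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
dataTrain <- iris[trainIndex,]
dataTest <- iris[-trainIndex,]
# define training control: 10-fold cross validation repeated 3 times
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# estimate the accuracy of LDA on the training portion
fit <- train(Species~., data=dataTrain, method="lda", trControl=control)
print(fit)
# check the accuracy on the held-back test portion
predictions <- predict(fit, dataTest)
testAccuracy <- mean(predictions == dataTest$Species)
print(testAccuracy)
```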

Need more help on this step?

Take a look at the post on resampling methods:

How To Estimate Model Accuracy in R Using The Caret Package.

Did you realize that this is the half-way point? Well done!

Day 8: Algorithm Evaluation Metrics

There are many different metrics that you can use to evaluate the skill of a machine learning algorithm on a dataset.

You can specify the metric used for your test harness in caret in the train() function, and defaults can be used for regression and classification problems. Your goal with today’s lesson is to practice using the different algorithm performance metrics available in the caret package.

Practice using the Accuracy and Kappa metrics on a classification problem (e.g. iris dataset).
Practice using the RMSE and Rsquared metrics on a regression problem (e.g. longley dataset).
Practice using the ROC metrics on a binary classification problem (e.g. PimaIndiansDiabetes dataset from the mlbench package).

The snippet below demonstrates calculating the LogLoss metric on the iris dataset.

# load caret library
library(caret)
# load the iris dataset
data(iris)
# prepare 5-fold cross validation and keep the class probabilities
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=mnLogLoss)
# estimate accuracy using LogLoss of the CART algorithm
fit <- train(Species~., data=iris, method="rpart", metric="logLoss", trControl=control)
# display results
print(fit)
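
The regression and ROC items follow the same pattern. The sketch below assumes the caret and mlbench packages are installed; the choice of linear regression and logistic regression as the algorithms, and of 5 folds, is only for illustration.

```r
# load libraries
library(caret)
library(mlbench)
# regression: RMSE and Rsquared on the longley dataset
data(longley)
controlReg <- trainControl(method="cv", number=5)
set.seed(7)
fitReg <- train(Employed~., data=longley, method="lm", metric="RMSE", trControl=controlReg)
print(fitReg)
# binary classification: area under the ROC curve on the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
controlRoc <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=twoClassSummary)
set.seed(7)
fitRoc <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="ROC", trControl=controlRoc)
print(fitRoc)
```

The ROC metric needs class probabilities, which is why classProbs=TRUE is set together with the twoClassSummary summary function.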

Day 9: Spot-Check Algorithms

You cannot possibly know which algorithm will perform best on your data beforehand.

You have to discover it using a process of trial and error. I call this spot-checking algorithms. The caret package provides an interface to many machine learning algorithms and tools to compare the estimated accuracy of those algorithms.

In this lesson you must practice spot checking different machine learning algorithms.

Spot-check linear algorithms on a dataset (e.g. linear regression, logistic regression and linear discriminant analysis).
Spot-check some non-linear algorithms on a dataset (e.g. KNN, SVM and CART).
Spot-check some sophisticated ensemble algorithms on a dataset (e.g. random forest and stochastic gradient boosting).

Help: You can get a list of models that you can use in caret by typing: names(getModelInfo())

For example, the snippet below spot-checks two linear algorithms on the Pima Indians Diabetes dataset from the mlbench package.

# load libraries
library(caret)
library(mlbench)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# prepare 10-fold cross validation
trainControl <- trainControl(method="cv", number=10)
# estimate accuracy of logistic regression
set.seed(7)
fit.lr <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=trainControl)
# estimate accuracy of linear discriminant analysis
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=trainControl)
# collect resampling statistics
results <- resamples(list(LR=fit.lr, LDA=fit.lda))
# summarize results
summary(results)
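
Non-linear algorithms can be spot-checked with the same harness by changing the method argument. The sketch below tries KNN and CART; the variable names and the seed are illustrative, and it assumes the caret, mlbench and rpart packages are installed.

```r
# load libraries
library(caret)
library(mlbench)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# prepare 10-fold cross validation
control <- trainControl(method="cv", number=10)
# estimate accuracy of k-nearest neighbors
set.seed(7)
fit.knn <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", trControl=control)
# estimate accuracy of CART
set.seed(7)
fit.cart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", trControl=control)
# collect resampling statistics
nonlinearResults <- resamples(list(KNN=fit.knn, CART=fit.cart))
# summarize results
print(summary(nonlinearResults))
```

Resetting the seed before each call to train() gives every algorithm the same cross validation folds, which keeps the comparison fair.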

Day 10: Model Comparison and Selection

Now that you know how to spot check machine learning algorithms on your dataset, you need to know how to compare the estimated performance of different algorithms and select the best model.

Thankfully the caret package provides a suite of tools to plot and summarize the differences in performance between models.

In today’s lesson you will practice comparing the accuracy of machine learning algorithms in R.

Use the summary() caret function to create a table of results (hint: there is an example in the previous lesson).
Use the dotplot() caret function to compare results.
Use the bwplot() caret function to compare results.
Use the diff() caret function to calculate the statistical significance between results.

The snippet below extends yesterday’s example and creates a plot of the spot-check results.

# load libraries
library(caret)
library(mlbench)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# prepare 10-fold cross validation
trainControl <- trainControl(method="cv", number=10)
# estimate accuracy of logistic regression
set.seed(7)
fit.lr <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=trainControl)
# estimate accuracy of linear discriminant analysis
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=trainControl)
# collect resampling statistics
results <- resamples(list(LR=fit.lr, LDA=fit.lda))
# plot the results
dotplot(results)
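
The box plot and significance items could be tried on the same pair of models. The sketch below repeats the setup so that it runs on its own; the variable name differences is illustrative.

```r
# load libraries
library(caret)
library(mlbench)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# prepare 10-fold cross validation
control <- trainControl(method="cv", number=10)
# train the two models on the same folds
set.seed(7)
fit.lr <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=control)
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=control)
# collect resampling statistics
results <- resamples(list(LR=fit.lr, LDA=fit.lda))
# table of results
print(summary(results))
# box and whisker plots of the results (print is needed when run as a script)
print(bwplot(results))
# pair-wise differences between the models with p-values
differences <- diff(results)
print(summary(differences))
```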

Day 11: Improve Accuracy with Algorithm Tuning

Once you have found one or two algorithms that perform well on your dataset, you may want to improve the performance of those models.

One way to increase the performance of an algorithm is to tune its parameters to your specific dataset.

The caret package provides three ways to search for combinations of parameters for a machine learning algorithm. Your goal in today’s lesson is to practice each.

Tune the parameters of an algorithm automatically (e.g. see the tuneLength argument to train()).
Tune the parameters of an algorithm using a grid search that you specify.
Tune the parameters of an algorithm using a random search.

Take a look at the help for the trainControl() and train() functions and take note of the method and the tuneGrid arguments.

The snippet below is an example of using a grid search for the random forest algorithm on the iris dataset.

# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="cv", number=10)
# define a grid of parameters to search for random forest (iris has 4 predictors, so mtry ranges from 1 to 4)
grid <- expand.grid(.mtry=c(1,2,3,4))
# estimate the accuracy of Random Forest on the dataset
fit <- train(Species~., data=iris, trControl=trainControl, tuneGrid=grid, method="rf")
# summarize the estimated accuracy
print(fit)
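
The automatic and random search items could look like the sketch below. It uses KNN instead of random forest only to keep the runs short, which is a substitution made for illustration; the tuneLength of 5 and the seed are arbitrary choices.

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# automatic tuning: let caret pick 5 candidate values of k
controlAuto <- trainControl(method="cv", number=10)
set.seed(7)
fitAuto <- train(Species~., data=iris, method="knn", tuneLength=5, trControl=controlAuto)
print(fitAuto)
# random search: sample up to 5 candidate values of k at random
controlRandom <- trainControl(method="cv", number=10, search="random")
set.seed(7)
fitRandom <- train(Species~., data=iris, method="knn", tuneLength=5, trControl=controlRandom)
print(fitRandom)
# the selected parameter value for each search
print(fitAuto$bestTune)
print(fitRandom$bestTune)
```

With search="random", the tuneLength argument sets the number of random parameter combinations to try rather than the size of a regular grid.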

You’re nearly at the end! Just a few more lessons.

Day 12: Improve Accuracy with Ensemble Predictions

Another way that you can improve the performance of your models is to combine the predictions from multiple models.

Some models provide this capability built-in such as random forest for bagging and stochastic gradient boosting for boosting. Another type of ensembling called stacking (or blending) can learn how to best combine the predictions from multiple models and is provided in the package caretEnsemble.

In today’s lesson you will practice using ensemble methods.

Practice bagging ensembles with the random forest and bagged CART algorithms in caret.
Practice boosting ensembles with the gradient boosting machine and C5.0 algorithms in caret.
Practice stacking ensembles using the caretEnsemble package and the caretStack() function.

The snippet below demonstrates how you can combine the predictions from multiple models using stacking.

# Load packages
library(mlbench)
library(caret)
library(caretEnsemble)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# create sub-models
trainControl <- trainControl(method="cv", number=5, savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('knn', 'glm')
set.seed(7)
models <- caretList(diabetes~., data=PimaIndiansDiabetes, trControl=trainControl, methodList=algorithmList)
print(models)
# learn how to best combine the predictions
stackControl <- trainControl(method="cv", number=5, savePredictions=TRUE, classProbs=TRUE)
set.seed(7)
stack.glm <- caretStack(models, method="glm", trControl=stackControl)
print(stack.glm)
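
The bagging item could be practiced with the same resampling harness used for spot-checking. The sketch below compares bagged CART and random forest; it assumes the ipred and randomForest packages that caret calls for these two methods are installed, and the names and seed are illustrative.

```r
# load libraries
library(caret)
library(mlbench)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# prepare 5-fold cross validation
control <- trainControl(method="cv", number=5)
# bagged CART
set.seed(7)
fit.treebag <- train(diabetes~., data=PimaIndiansDiabetes, method="treebag", trControl=control)
# random forest
set.seed(7)
fit.rf <- train(diabetes~., data=PimaIndiansDiabetes, method="rf", trControl=control)
# collect and summarize the resampling statistics
baggingResults <- resamples(list(BAG=fit.treebag, RF=fit.rf))
print(summary(baggingResults))
```

Boosting methods can be tried in the same way by changing the method argument, for example to "gbm" or "C5.0", provided the corresponding packages are installed.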

Day 13: Finalize And Save Your Model

Once you have found a well-performing model on your machine learning problem, you need to finalize it.

In today’s lesson you will practice the tasks related to finalizing your model.

Practice using the predict() function to make predictions with a model trained using caret.
Practice training standalone versions of well-performing models.
Practice saving trained models to file and loading them up again using the saveRDS() and readRDS() functions.

For example, the snippet below shows how you can create a random forest model trained on your entire dataset ready for general use.

# load package
library(randomForest)
# load iris data
data(iris)
# train random forest model
finalModel <- randomForest(Species~., iris, mtry=2, ntree=2000)
# display the details of the final model
print(finalModel)
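
Saving, reloading and predicting with the final model could be sketched as below. The temporary file path and the three rows used as stand-in new data are assumptions made for illustration.

```r
# load package
library(randomForest)
# load iris data
data(iris)
# train random forest model on the entire dataset
set.seed(7)
finalModel <- randomForest(Species~., iris, mtry=2, ntree=2000)
# save the model to a file
modelPath <- tempfile(fileext=".rds")
saveRDS(finalModel, modelPath)
# load the model back from the file
loadedModel <- readRDS(modelPath)
# make predictions on a few rows that stand in for new, unlabeled data
newData <- iris[c(1, 51, 101), 1:4]
predictions <- predict(loadedModel, newData)
print(predictions)
```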

Day 14: Hello World End-to-End Project

You now know how to complete each task of a predictive modeling machine learning problem.

In today’s lesson you need to practice putting the pieces together and working through a standard machine learning dataset end-to-end.

Work through the iris dataset end-to-end (the hello world of machine learning).

This includes the steps:

Understanding your data using descriptive statistics and visualization.
Pre-Processing the data to best expose the structure of the problem.
Spot-checking a number of algorithms using your own test harness.
Improving results using algorithm parameter tuning.
Improving results using ensemble methods.
Finalizing the model ready for future use.
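
One possible skeleton for the project is sketched below. It is a starting point under stated assumptions, not a reference solution: the 80/20 split, the three algorithms, the seed and the decision to tune KNN are all illustrative, and the ensemble step is left out.

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# hold back 20% of the data as a validation set
set.seed(7)
idx <- createDataPartition(iris$Species, p=0.80, list=FALSE)
training <- iris[idx,]
validation <- iris[-idx,]
# understand the data
print(dim(training))
print(summary(training))
# test harness: 10-fold cross validation with standardized inputs
control <- trainControl(method="cv", number=10)
prep <- c("center", "scale")
# spot-check a few algorithms
set.seed(7)
fit.lda <- train(Species~., data=training, method="lda", preProcess=prep, trControl=control)
set.seed(7)
fit.knn <- train(Species~., data=training, method="knn", preProcess=prep, trControl=control)
set.seed(7)
fit.cart <- train(Species~., data=training, method="rpart", preProcess=prep, trControl=control)
# compare the algorithms
results <- resamples(list(LDA=fit.lda, KNN=fit.knn, CART=fit.cart))
print(summary(results))
# tune one of the candidates over more parameter values
set.seed(7)
fit.tuned <- train(Species~., data=training, method="knn", preProcess=prep, tuneLength=10, trControl=control)
print(fit.tuned)
# estimate the skill of the tuned model on the validation set
predictions <- predict(fit.tuned, validation)
accuracy <- mean(predictions == validation$Species)
print(accuracy)
```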

The End! (Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You started off with an interest in machine learning and a strong desire to be able to practice and apply machine learning using R.
You downloaded, installed and started R, perhaps for the first time and started to get familiar with the syntax of the language.
Slowly and steadily over the course of a number of lessons you learned how the standard tasks of a predictive modeling machine learning project map onto the R platform.
Building upon the recipes for common machine learning tasks you worked through your first machine learning problems end-to-end using R.
Using a standard template, the recipes and experience you have gathered you are now capable of working through new and different predictive modeling machine learning problems on your own.

Don’t make light of this. You have come a long way in a short amount of time.

This is just the beginning of your machine learning journey with R. Keep practicing and developing your skills.

How Did You Go With The Mini-Course?

Did you enjoy this mini-course?

Do you have any questions? Were there any sticking points?

Let me know. Leave a comment below.

42 Responses to R Machine Learning Mini-Course

PAUL April 18, 2016 at 8:41 pm #

Hi, I have bought the Machine Learning Mastery With R book and just want to know where I can get the dataset files and scripts, because I can’t find them in the zip file.

- Jason Brownlee April 19, 2016 at 5:43 am #
  
  Hi Paul,
  
  After purchasing you will receive a link to download “machine_learning_mastery_with_r.zip”. Inside that ZIP, you will have two more files, the book: “machine_learning_mastery_with_r.pdf” and the code examples: “machine_learning_mastery_with_r_code.zip”. I hope that is clearer.
  
  Any more questions at all, please ask.
  
  Jason.
  
Elvin January 30, 2020 at 7:45 pm #

caretStack() still doesn’t support multiclass problems?

https://github.com/zachmayer/caretEnsemble/pull/191

We can’t use the iris dataset for this as enumerated in Day 14 project.

Any tips?

Thanks!

- Jason Brownlee January 31, 2020 at 7:43 am #
  
  Use caret for the project instead of caretStack.
  
Dominique April 13, 2020 at 9:43 pm #

Hi Jason,

I did a boxplot of glucose versus age (boxplot(glucose ~ age, data=PimaIndiansDiabetes ) and it is interesting to notice that glucose is increasing with age.

regards,
Dominique

- Jason Brownlee April 14, 2020 at 6:16 am #
  
  Nice work!
  
Mark April 26, 2020 at 5:54 pm #

Is it a good idea to use RStudio Cloud?

- Jason Brownlee April 27, 2020 at 5:31 am #
  
  I have never used rstudio, I cannot give you advice on it.
  
Mridhu Sharma April 29, 2020 at 4:22 am #

Thanks for helping beginners.

- Jason Brownlee April 29, 2020 at 6:33 am #
  
  You’re welcome.
  
Skylar April 29, 2020 at 6:30 am #

Hi Jason,

I want to ask about the preprocessing step with the preProcess() function in the caret package: what is the exact difference between method = c("range") and method = c("scale", "center")? I learned from your mini-course that you standardize numerical data (e.g. mean of 0 and standard deviation of 1) using the scale and center options, and normalize numerical data (e.g. to a range of 0-1) using the range option. But what does it exactly mean? Also, a more practical question: if my inputs (predictors) are gene expression or metabolite intensity, which method should I use? Thank you very much and I look forward to your answers!

- Jason Brownlee April 29, 2020 at 6:35 am #
  
  The first normalizes to a range, the second standardized to a “standard gaussian”.
  
  More here:
  https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-standardization-and-normalization
  
skylar April 30, 2020 at 4:15 am #

Hi Jason,

Thank you for your nice posts and all learning materials, I indeed learn a lot! In your mini-course, you wrote we can use k-fold cross validation and repeated k-fold cross validation to estimate the accuracy of an algorithm. I want to ask you what is the exact differences between k-fold cross validation and repeated k-fold cross validation? In my understanding, repeated k-fold cross validation is that we can define how many times we repeat the k-fold cross validation, and then get the median value for the accuracy, right? If so, does it mean the results from the repeated k-fold cross validation is more robust and normally we should use this method? Then what is the advantage of k-fold cross validation without repeat? Thank you!

- Jason Brownlee April 30, 2020 at 6:53 am #
  
  You’re welcome.
  
  The difference is repeated k-fold cv repeats the process many times with different splits.
  
  It is a more robust estimate, but it is more computationally expensive.
  
  - Skylar April 30, 2020 at 7:33 am #
    
    Got it, thank you Jason!
    
    - Jason Brownlee April 30, 2020 at 11:35 am #
      
      You’re welcome.
      
Skylar May 1, 2020 at 5:42 am #

Hi Jason,

Thank you for sending the mini-course email every day, making me to keep learning every day, I like this feeling!

I have a question about ML on classification: let’s assume we need to use ML for classification of two groups, I found there are two ways for the prediction output, one way is to predict either group 1 or group 2; another way is to give the prediction probability. When should we use which one? which one is normally used? In my understanding, for the first one, the default metrics are Accuracy and Kappa, for the second one, we usually use “LogLoss”, right?

Thank you very much in advance!

- Jason Brownlee May 1, 2020 at 6:47 am #
  
  You’re welcome!
  
  Use the method that best solves your task, e.g. meets the requirements of your project as defined by you and project stakeholders.
  
  Choose the metric very carefully based on your project goals:
  https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/
  
  - Skylar May 1, 2020 at 7:23 am #
    
    The link you sent looks very informative, I will definitely learn it, thank you for always good help!
    
    - Jason Brownlee May 1, 2020 at 2:01 pm #
      
      Thanks.
      
Si Wu May 1, 2020 at 9:05 am #

Hi Jason,

I found your several very helpful post for multiple group classification, especially for unbalanced case, e.g. https://machinelearningmastery.com/imbalanced-multiclass-classification-with-the-e-coli-dataset/. But they are written mainly in Python, I am more familiar with R (though I am also learning Python), do you have any similar posts but written in R? Thank you!

- Jason Brownlee May 1, 2020 at 2:02 pm #
  
  Thanks.
  
  Not at this stage.
  
Skylar May 4, 2020 at 3:12 pm #

Hi Jason,

We usually divide our data into a training dataset and a test dataset (e.g. with the ratio 80% and 20%). In my understanding, the model and the parameters in the model are tuned and optimized based on the training dataset, so I understand that the accuracy for this model on the training dataset should be higher than that on the test dataset, right? If so, what could be the reasons for the opposite case, where the accuracy is higher on the test dataset? I already used repeated cross validation. Thank you very much in advance!

- Jason Brownlee May 5, 2020 at 6:18 am #
  
  Good question.
  
  This can happen if your test set is small and not representative of the broader problem. E.g. the model performs well on it, but the result is misleading.
  
  - Skylar May 5, 2020 at 6:59 am #
    
    Thank you Jason for your answers! what can we do if we meet this kind of problem? Does it mean it is wrong?
    
    - Jason Brownlee May 5, 2020 at 7:46 am #
      
      Some ideas:
      
      Perhaps use a different split of the data.
      Perhaps get more data.
      Perhaps use the mean of a resampling method like cross-validation.
      
      - Skylar May 5, 2020 at 3:14 pm #
        
        Yes, your ideas make sense, thank you!
      - Jason Brownlee May 6, 2020 at 6:21 am #
        
        I’m happy to hear that.
Abhay V May 15, 2020 at 10:19 am #

Dear Jason,
First of all, thank you for such a nice course! I hope to see many more in the future.

I tried installing R 3.6.x, but it showed an error.
So I went to YouTube, found a video on installing R with RStudio, and was able to install it by following along.
I hope that will also serve the purpose of this tutorial and future learning.

Regards
Abhay V

Reply
- Jason Brownlee May 15, 2020 at 1:28 pm #
  
  Thanks!
  
  Well done.
  
  Reply
Mubbasher Munir September 29, 2020 at 4:51 pm #

Thank you

Reply
- Jason Brownlee September 30, 2020 at 6:23 am #
  
  You’re welcome.
  
  Reply
Khaled October 6, 2020 at 10:20 pm #

Great course, thank you very much.

Reply
- Jason Brownlee October 7, 2020 at 6:46 am #
  
  Thanks!
  
  Reply
Jcc February 26, 2021 at 2:15 pm #

Thanks for sharing. Does anyone prefer R over Python? Why?

Reply
- Jason Brownlee February 27, 2021 at 6:00 am #
  
  Yes, more methods in R, larger technical community, and historical reasons.
  
  Reply
M Thackray October 28, 2021 at 10:06 pm #

Hi,
I am trying to run a very basic model as follows:
# make predictions
x_test1 <- data_test1[,1]
y_test1 <- data_test1[,2]
predictions <- predict(model1, x_test1)

However, I keep getting this error: Error in eval(predvars, data, env) : object 'PARM' not found

The data that I imported very clearly has PARM as the heading of column 1. I only know the very basics of RStudio. Could you please help me with a step-by-step solution?

Reply
- Adrian Tam October 29, 2021 at 2:07 am #
  
  Which sample code are you running here?
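  
  Without the full code it is hard to be sure, but one common cause of this error is that the object passed to predict() has no column named exactly PARM. The sketch below uses hypothetical data and assumes model1 was fit with a formula such as lm(RESULT ~ PARM, ...); the RESULT column, the training data, and the model type are all assumptions, since they are not shown in the question:
  
  ```r
  # Hypothetical data standing in for the real files (assumption)
  set.seed(1)
  data_train1 <- data.frame(PARM = 1:20,  RESULT = 2 * (1:20)  + rnorm(20))
  data_test1  <- data.frame(PARM = 21:25, RESULT = 2 * (21:25) + rnorm(5))

  # Assumed model: a formula-based fit that refers to PARM by name
  model1 <- lm(RESULT ~ PARM, data = data_train1)

  # Check the exact column names first (stray spaces or characters matter)
  print(names(data_test1))

  # Keep the input as a data frame with its column name intact
  x_test1 <- data_test1[, "PARM", drop = FALSE]
  predictions <- predict(model1, newdata = x_test1)
  print(predictions)

  # A data frame without a PARM column reproduces the error message
  bad_input <- data.frame(x = 21:25)
  msg <- tryCatch(predict(model1, newdata = bad_input),
                  error = function(e) conditionMessage(e))
  print(msg)
  ```
  
  If names(data_test1) shows something other than PARM for the first column, renaming that column to match the training data is the likely fix.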
  
  Reply
Uju Mbadiwe July 12, 2023 at 12:04 am #

Hi, thanks. I installed R version 4.3.1, but when I typed R in the command prompt, it showed an error. I was only able to open R with the R console.

Reply
- James Carmichael July 12, 2023 at 11:40 am #
  
  Hi Uju…You are very welcome! What is the exact error you are receiving? That will enable us to better assist you.
  
  Reply
Lamri June 11, 2024 at 8:51 pm #

Hi, thank you. I am trying to run the hist() function to create a histogram, but it doesn’t work for me:

> hist(iris)
Error in hist.default(iris) : ‘x’ must be numeric
> hist(Sepal.Length)
Error: object ‘Sepal.Length’ not found

Where is the problem here?

Reply
- James Carmichael June 12, 2024 at 7:43 am #
  
  Hi Lamri…Did you copy and paste the code or type it in? Here are some other thoughts:
  
  The issue you’re encountering is due to how the data is being referenced in R. When you use the hist() function, it expects a numeric vector. The error you’re seeing suggests that you’re not providing the function with a proper numeric vector.
  
  Here’s how to correctly create a histogram using the iris dataset in R:
  
  ### Step-by-Step Solution
  
  1. **Loading the Dataset**:
  Ensure that the iris dataset is loaded. It’s a built-in dataset, so you can directly use it.
  
  2. **Referencing Columns in a Data Frame**:
  To reference a specific column in a data frame like iris, you need to use the $ operator.
  
  ### Correct Usage
  
  # Ensure the iris dataset is loaded (it's built-in, so this should be fine)
  data(iris)
  # Create a histogram of the Sepal.Length column
  hist(iris$Sepal.Length, main="Histogram of Sepal Length", xlab="Sepal Length", col="lightblue")
  
  ### Explanation
  
  1. **data(iris)**: This ensures the iris dataset is loaded, though it’s typically not necessary for built-in datasets.
  
  2. **iris$Sepal.Length**: This references the Sepal.Length column in the iris dataset.
  
  3. **hist() function**: The hist() function takes a numeric vector (in this case, iris$Sepal.Length) and creates a histogram.
  
  ### Additional Tips
  
  – **Checking Column Names**: If you’re unsure of the column names, you can use the names() function to list them.
  names(iris)
  
  – **Subsetting Data**: If you want to create a histogram for a subset of data, use indexing or logical conditions.
  hist(iris[iris$Species == "setosa", "Sepal.Length"], main="Histogram of Sepal Length for Setosa", xlab="Sepal Length", col="lightgreen")
  
  ### Full Example with Additional Parameters
  
  # Load the iris dataset
  data(iris)
  # Check the structure of the dataset to confirm column names and types
  str(iris)
  # Create a histogram for Sepal.Length with additional parameters
  # (you can adjust the number of breaks)
  hist(iris$Sepal.Length, main="Histogram of Sepal Length", xlab="Sepal Length", col="lightblue", border="black", breaks=20)
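  
  If the goal behind hist(iris) was a histogram for every numeric column, one possible approach is to loop over the numeric columns. This is a minimal sketch using only base R; the 2x2 panel layout is an assumption that fits the four numeric columns of iris:
  
  ```r
  data(iris)

  # Keep only the numeric columns (Species is a factor, so hist() rejects it)
  numeric_cols <- names(iris)[sapply(iris, is.numeric)]

  # Draw one histogram per numeric column in a 2x2 grid
  old_par <- par(mfrow = c(2, 2))
  for (col in numeric_cols) {
    hist(iris[[col]], main = col, xlab = col)
  }
  par(old_par)  # restore the previous plot layout

  # with() lets you refer to a column by its bare name
  with(iris, hist(Sepal.Length))
  ```
  
  The last line also shows why hist(Sepal.Length) fails on its own: the column only exists inside the data frame, so R needs to be told where to look for it.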
  
  This should resolve your issue and allow you to create the desired histogram. If you have any further questions or run into other issues, feel free to ask!
  
  Reply
