The post R Machine Learning Mini-Course appeared first on Machine Learning Mastery.

In this mini-course you will discover how you can get started, build accurate models and confidently complete predictive modeling machine learning projects using R in 14 days.

This is a big and important post. You might want to bookmark it.

Before we get started, let’s make sure you are in the right place. The list below provides some general guidelines as to who this course was designed for.

Don’t panic if you don’t match these points exactly, you might just need to brush up in one area or another to keep up.

- **Developers that know how to write a little code**. This means that it is not a big deal for you to pick up a new programming language like R once you know the basic syntax. It does not mean you're a wizard coder, just that you can follow a basic C-like language with little effort.
- **Developers that know a little machine learning**. This means you know the basics of machine learning like cross validation, some algorithms and the bias-variance trade-off. It does not mean that you are a machine learning PhD, just that you know the landmarks or know where to look them up.

This mini-course is neither a textbook on R nor a textbook on machine learning.

It will take you from a developer that knows a little machine learning to a developer who can get results using R, the most powerful and most popular platform for machine learning.

This mini-course is broken down into 14 lessons that I call “days”.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hard core!). It really depends on the time you have available and your level of enthusiasm.

Below are 14 lessons that will get you started and productive with machine learning in R:

- **Day 1**: Download and Install R.
- **Day 2**: Get Around In R with Basic Syntax.
- **Day 3**: Load Data and Standard Machine Learning Datasets.
- **Day 4**: Understand Data with Descriptive Statistics.
- **Day 5**: Understand Data with Visualization.
- **Day 6**: Prepare For Modeling by Pre-Processing Data.
- **Day 7**: Algorithm Evaluation With Resampling Methods.
- **Day 8**: Algorithm Evaluation Metrics.
- **Day 9**: Spot-Check Algorithms.
- **Day 10**: Model Comparison and Selection.
- **Day 11**: Improve Accuracy with Algorithm Tuning.
- **Day 12**: Improve Accuracy with Ensemble Predictions.
- **Day 13**: Finalize And Save Your Model.
- **Day 14**: Hello World End-to-End Project.

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the R platform (hint, I have all of the answers directly on this blog, use the search).

I do provide more help in the early lessons because I want you to build up some confidence and momentum. Hang in there, don’t give up!

Take my free 14-day email course and discover how to use R on your project (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

You cannot get started with machine learning in R until you have access to the platform.

Today’s lesson is easy: you must download and install the R platform on your computer.

- Visit the R homepage and download R for your operating system (Linux, OS X or Windows).
- Install R on your computer.
- Start R for the first time from command line by typing “R”.
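Once R starts, you can confirm the installation worked from within the interactive environment. A minimal sketch using two base R functions:

```r
# print the version of R that is installed
print(R.version.string)
# print the location of the R home directory
print(R.home())
```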

If you need help installing R, check out the post:

You need to be able to read and write basic R scripts.

As a developer you can pick up new programming languages pretty quickly. R is case sensitive, uses hash (#) for comments and uses the arrow operator (<-) for assignments instead of the single equals (=).

Today’s task is to practice the basic syntax of the R programming language in the R interactive environment.

- Practice assignment in the language using the arrow operator (<-).
- Practice using basic data structures like vectors, lists and data frames.
- Practice using flow control structures like If-Then-Else and loops.
- Practice calling functions, installing and loading packages.

For example, the snippet below creates a vector of numbers and calculates the mean.

numbers <- c(1, 2, 3, 4, 5, 6)
mean(numbers)
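To practice the other tasks in the list, the sketch below builds a small data frame and exercises a loop and an if-then-else statement; all of the functions used are base R.

```r
# create a data frame from two vectors
df <- data.frame(x=c(1, 2, 3), y=c("a", "b", "c"))
# loop over the rows and print each x value
for (i in 1:nrow(df)) {
  print(df$x[i])
}
# basic if-then-else flow control
if (mean(df$x) > 1) {
  print("mean is greater than 1")
} else {
  print("mean is 1 or less")
}
```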

If you need help with basic R syntax, see the post:

Machine learning algorithms need data. You can load your own data from CSV files but when you are getting started with machine learning in R you should practice on standard machine learning datasets.

Your task for today’s lesson is to get comfortable loading data into R and to find and load standard machine learning datasets.

The *datasets* package that comes with R has many standard datasets including the famous iris flowers dataset. The *mlbench* package also contains many standard machine learning datasets.

- Practice loading CSV files into R using the *read.csv()* function.
- Practice loading standard machine learning datasets from the *datasets* and *mlbench* packages.

**Help**: You can get help about a function by typing *?FunctionName* or by calling the *help()* function and passing the function name that you need help with as an argument.
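As a sketch of the *read.csv()* task, the snippet below writes the built-in iris dataset to a temporary CSV file and loads it back; the file path comes from *tempfile()* so the example is self-contained rather than assuming a file on your machine.

```r
# write the iris dataset out to a temporary CSV file
filename <- tempfile(fileext=".csv")
write.csv(iris, filename, row.names=FALSE)
# load the CSV file back into a data frame
dataset <- read.csv(filename)
# confirm the dimensions of the loaded data
dim(dataset)
```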

To get you started, the below snippet will install and load the *mlbench* package, list all of the datasets it offers and attach the PimaIndiansDiabetes dataset to your environment for you to play with.

install.packages("mlbench")
library(mlbench)
data(package="mlbench")
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)

Well done for making it this far! **Hang in there**.

Any questions so far? Ask in the comments.

Once you have loaded your data into R you need to be able to understand it.

The better you can understand your data, the better and more accurate the models that you can build. The first step to understanding your data is to use descriptive statistics.

Today your lesson is to learn how to use descriptive statistics to understand your data.

- Understand your data using the *head()* function to look at the first few rows.
- Review the dimensions of your data with the *dim()* function.
- Review the distribution of your data with the *summary()* function.
- Calculate pair-wise correlation between your variables using the *cor()* function.

The below example loads the iris dataset and summarizes the distribution of each attribute.

data(iris)
summary(iris)
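The other functions in the list follow the same pattern. The sketch below applies *head()*, *dim()* and *cor()* to the iris dataset, restricting the correlation to the four numeric attributes.

```r
# load the iris dataset
data(iris)
# peek at the first few rows
head(iris)
# report the number of rows and columns
dim(iris)
# pair-wise correlation between the numeric attributes
cor(iris[,1:4])
```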

Try it out!

Continuing on from yesterday’s lesson, you must spend time to better understand your data.

A second way to improve your understanding of your data is by using data visualization techniques (e.g. plotting).

Today, your lesson is to learn how to use plotting in R to understand attributes alone and their interactions.

- Use the *hist()* function to create a histogram of each attribute.
- Use the *boxplot()* function to create box and whisker plots of each attribute.
- Use the *pairs()* function to create pair-wise scatterplots of all attributes.

For example, the snippet below will load the iris dataset and create a scatterplot matrix of the dataset.

data(iris)
pairs(iris)
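The univariate plots from the list can be sketched the same way, shown here for the first attribute of the iris dataset only; you would repeat this for each attribute.

```r
# load the iris dataset
data(iris)
# histogram of the first attribute
hist(iris$Sepal.Length)
# box and whisker plot of the first attribute
boxplot(iris$Sepal.Length)
```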

Your raw data may not be in the best shape for modeling.

Sometimes you need to pre-process your data in order to best present the inherent structure of the problem in your data to the modeling algorithms. In today’s lesson, you will use the pre-processing capabilities provided by the caret package.

The caret package provides the *preProcess()* function that takes a method argument to indicate the type of pre-processing to perform. Once the pre-processing parameters have been prepared from a dataset, the same pre-processing step can be applied to each dataset that you may have.

Remember, you can install and load the caret package as follows:

install.packages("caret")
library(caret)

- Standardize numerical data (e.g. mean of 0 and standard deviation of 1) using the *scale* and *center* options.
- Normalize numerical data (e.g. to a range of 0-1) using the *range* option.
- Explore more advanced power transforms like the Box-Cox power transform with the *BoxCox* option.

For example, the snippet below loads the iris dataset, calculates the parameters needed to normalize the data, then creates a normalized copy of the data.

# load caret package
library(caret)
# load the dataset
data(iris)
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# transform the dataset using the pre-processing parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

The dataset used to train a machine learning algorithm is called a training dataset. The dataset used to train an algorithm cannot be used to give you reliable estimates of the accuracy of the model on new data. This is a big problem because the whole idea of creating the model is to make predictions on new data.

You can use statistical methods called resampling methods to split your training dataset up into subsets: some are used to train the model and others are held back and used to estimate the accuracy of the model on unseen data.

Your goal with today’s lesson is to practice using the different resampling methods available in the caret package. Look up the help on the *createDataPartition()*, *trainControl()* and *train()* functions in R.

- Split a dataset into training and test sets.
- Estimate the accuracy of an algorithm using k-fold cross validation.
- Estimate the accuracy of an algorithm using repeated k-fold cross validation.

The snippet below uses the caret package to estimate the accuracy of the Naive Bayes algorithm on the iris dataset using 10-fold cross validation.

# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="cv", number=10)
# estimate the accuracy of Naive Bayes on the dataset
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")
# summarize the estimated accuracy
print(fit)
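For the first task in the list, a minimal sketch of a train/test split using *createDataPartition()*; the 80/20 proportion is an arbitrary choice for illustration.

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# select 80% of the rows for training, stratified by class
set.seed(7)
trainIndex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
dataTrain <- iris[trainIndex,]
dataTest <- iris[-trainIndex,]
# confirm the sizes of the two sets
dim(dataTrain)
dim(dataTest)
```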

Need more help on this step?

Take a look at the post on resampling methods:

Did you realize that this is the half-way point? Well done!

There are many different metrics that you can use to evaluate the skill of a machine learning algorithm on a dataset.

You can specify the metric used for your test harness in caret in the *train()* function, and defaults can be used for regression and classification problems. Your goal with today’s lesson is to practice using the different algorithm performance metrics available in the caret package.

- Practice using the *Accuracy* and *Kappa* metrics on a classification problem (e.g. iris dataset).
- Practice using *RMSE* and *RSquared* metrics on a regression problem (e.g. longley dataset).
- Practice using the *ROC* metrics on a binary classification problem (e.g. PimaIndiansDiabetes dataset from the *mlbench* package).

The snippet below demonstrates calculating the LogLoss metric on the iris dataset.

# load caret library
library(caret)
# load the iris dataset
data(iris)
# prepare 5-fold cross validation and keep the class probabilities
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=mnLogLoss)
# estimate accuracy using LogLoss of the CART algorithm
fit <- train(Species~., data=iris, method="rpart", metric="logLoss", trControl=control)
# display results
print(fit)
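As a sketch of the regression metrics task, the snippet below estimates RMSE and RSquared for a model on the longley dataset; the choice of a linear model (*lm*) as the method is arbitrary here.

```r
# load caret library
library(caret)
# load the longley regression dataset
data(longley)
# prepare 5-fold cross validation
control <- trainControl(method="cv", number=5)
# estimate RMSE and RSquared of a linear model on the dataset
set.seed(7)
fit <- train(Employed~., data=longley, method="lm", metric="RMSE", trControl=control)
# display results, including RMSE and RSquared
print(fit)
```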

You cannot possibly know which algorithm will perform best on your data beforehand.

You have to discover it using a process of trial and error. I call this spot-checking algorithms. The caret package provides an interface to many machine learning algorithms and tools to compare the estimated accuracy of those algorithms.

In this lesson you must practice spot checking different machine learning algorithms.

- Spot check linear algorithms on a dataset (e.g. linear regression, logistic regression and linear discriminant analysis).
- Spot check some non-linear algorithms on a dataset (e.g. KNN, SVM and CART).
- Spot-check some sophisticated ensemble algorithms on a dataset (e.g. random forest and stochastic gradient boosting).

**Help**: You can get a list of models that you can use in caret by typing: *names(getModelInfo())*

For example, the snippet below spot-checks two linear algorithms on the Pima Indians Diabetes dataset from the *mlbench* package.

# load libraries
library(caret)
library(mlbench)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# prepare 10-fold cross validation
trainControl <- trainControl(method="cv", number=10)
# estimate accuracy of logistic regression
set.seed(7)
fit.lr <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=trainControl)
# estimate accuracy of linear discriminant analysis
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=trainControl)
# collect resampling statistics
results <- resamples(list(LR=fit.lr, LDA=fit.lda))
# summarize results
summary(results)

Now that you know how to spot check machine learning algorithms on your dataset, you need to know how to compare the estimated performance of different algorithms and select the best model.

Thankfully the caret package provides a suite of tools to plot and summarize the differences in performance between models.

In today’s lesson you will practice comparing the accuracy of machine learning algorithms in R.

- Use the *summary()* caret function to create a table of results (hint: there is an example in the previous lesson).
- Use the *dotplot()* caret function to compare results.
- Use the *bwplot()* caret function to compare results.
- Use the *diff()* caret function to calculate the statistical significance between results.

The snippet below extends yesterday’s example and creates a plot of the spot-check results.

# load libraries
library(caret)
library(mlbench)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# prepare 10-fold cross validation
trainControl <- trainControl(method="cv", number=10)
# estimate accuracy of logistic regression
set.seed(7)
fit.lr <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=trainControl)
# estimate accuracy of linear discriminant analysis
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=trainControl)
# collect resampling statistics
results <- resamples(list(LR=fit.lr, LDA=fit.lda))
# plot the results
dotplot(results)
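The *diff()* task from the list can be sketched by extending the same comparison; calling *diff()* on a *resamples* object computes pair-wise differences in performance along with p-values.

```r
# load libraries
library(caret)
library(mlbench)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# prepare 10-fold cross validation
trainControl <- trainControl(method="cv", number=10)
# estimate accuracy of two algorithms
set.seed(7)
fit.lr <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=trainControl)
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=trainControl)
# collect resampling statistics
results <- resamples(list(LR=fit.lr, LDA=fit.lda))
# calculate pair-wise differences and their statistical significance
diffs <- diff(results)
summary(diffs)
```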

Once you have found one or two algorithms that perform well on your dataset, you may want to improve the performance of those models.

One way to increase the performance of an algorithm is to tune its parameters to your specific dataset.

The caret package provides three ways to search for combinations of parameters for a machine learning algorithm. Your goal in today’s lesson is to practice each.

- Tune the parameters of an algorithm automatically (e.g. see the *tuneLength* argument to *train()*).
- Tune the parameters of an algorithm using a grid search that you specify.
- Tune the parameters of an algorithm using a random search.

Take a look at the help for the *trainControl()* and *train()* functions and take note of the *method* and the *tuneGrid* arguments.

The snippet below is an example of using a grid search for the random forest algorithm on the iris dataset.

# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="cv", number=10)
# define a grid of parameters to search for random forest
grid <- expand.grid(.mtry=c(1,2,3,4,5,6,7,8,10))
# estimate the accuracy of Random Forest on the dataset
fit <- train(Species~., data=iris, trControl=trainControl, tuneGrid=grid, method="rf")
# summarize the estimated accuracy
print(fit)
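The random search task can be sketched by setting *search="random"* in *trainControl()* and using *tuneLength* to control how many candidate parameter values are drawn; drawing 5 candidates is an arbitrary choice here.

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control with a random parameter search
trainControl <- trainControl(method="cv", number=10, search="random")
# estimate accuracy of random forest over 5 randomly drawn mtry values
set.seed(7)
fit <- train(Species~., data=iris, trControl=trainControl, tuneLength=5, method="rf")
# summarize the estimated accuracy
print(fit)
```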

You’re nearly at the end! Just a few more lessons.

Another way that you can improve the performance of your models is to combine the predictions from multiple models.

Some models provide this capability built-in such as random forest for bagging and stochastic gradient boosting for boosting. Another type of ensembling called stacking (or blending) can learn how to best combine the predictions from multiple models and is provided in the package *caretEnsemble*.

In today’s lesson you will practice using ensemble methods.

- Practice bagging ensembles with the random forest and bagged CART algorithms in caret.
- Practice boosting ensembles with the gradient boosting machine and C5.0 algorithms in caret.
- Practice stacking ensembles using the *caretEnsemble* package and the *caretStack()* function.

The snippet below demonstrates how you can combine the predictions from multiple models using stacking.

# Load packages
library(mlbench)
library(caret)
library(caretEnsemble)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# create sub-models
trainControl <- trainControl(method="cv", number=5, savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('knn', 'glm')
set.seed(7)
models <- caretList(diabetes~., data=PimaIndiansDiabetes, trControl=trainControl, methodList=algorithmList)
print(models)
# learn how to best combine the predictions
stackControl <- trainControl(method="cv", number=5, savePredictions=TRUE, classProbs=TRUE)
set.seed(7)
stack.glm <- caretStack(models, method="glm", trControl=stackControl)
print(stack.glm)

Once you have found a well performing model on your machine learning problem, you need to finalize it.

In today’s lesson you will practice the tasks related to finalizing your model.

- Practice using the *predict()* function to make predictions with a model trained using caret.
- Practice training standalone versions of well performing models.
- Practice saving trained models to file and loading them up again using the *saveRDS()* and *readRDS()* functions.

For example, the snippet below shows how you can create a random forest algorithm trained on your entire dataset ready for general use.

# load package
library(randomForest)
# load iris data
data(iris)
# train random forest model
finalModel <- randomForest(Species~., iris, mtry=2, ntree=2000)
# display the details of the final model
print(finalModel)
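The save/load task from the list can be sketched with the base R *saveRDS()* and *readRDS()* functions. A simple linear model is used here for illustration (and the filename is an arbitrary choice), but the same pattern applies to any trained model, such as the random forest above.

```r
# train a simple standalone model on the iris dataset
data(iris)
finalModel <- lm(Sepal.Length~., data=iris)
# save the trained model to file
saveRDS(finalModel, "final_model.rds")
# ... later, load the model and make predictions on new data
loadedModel <- readRDS("final_model.rds")
predictions <- predict(loadedModel, iris[1:5,])
print(predictions)
```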

You now know how to complete each task of a predictive modeling machine learning problem.

In today’s lesson you need to practice putting the pieces together and working through a standard machine learning dataset end-to-end.

- Work through the iris dataset end-to-end (the hello world of machine learning)

This includes the steps:

- Understanding your data using descriptive statistics and visualization.
- Pre-Processing the data to best expose the structure of the problem.
- Spot-checking a number of algorithms using your own test harness.
- Improving results using algorithm parameter tuning.
- Improving results using ensemble methods.
- Finalizing the model ready for future use.

You made it. Well done!

Take a moment and look back at how far you have come.

- You started off with an interest in machine learning and a strong desire to be able to practice and apply machine learning using R.
- You downloaded, installed and started R, perhaps for the first time and started to get familiar with the syntax of the language.
- Slowly and steadily over the course of a number of lessons you learned how the standard tasks of a predictive modeling machine learning project map onto the R platform.
- Building upon the recipes for common machine learning tasks you worked through your first machine learning problems end-to-end using R.
- Using a standard template, the recipes and experience you have gathered you are now capable of working through new and different predictive modeling machine learning problems on your own.

Don’t make light of this, you have come a long way in a short amount of time.

This is just the beginning of your machine learning journey with R. Keep practicing and developing your skills.

Did you enjoy this mini-course?

Do you have any questions? Were there any sticking points?

Let me know. Leave a comment below.


The post How To Get Started With Machine Learning in R (get results in one weekend) appeared first on Machine Learning Mastery.

R is a large and complex platform. It is also the most popular platform for the best data scientists in the world.

In this post you will discover the step-by-step process that you can use to get started using machine learning for predictive modeling on the R platform.

The steps are practical and so simple that you could build accurate predictive models after one weekend.

The process does assume that you are a developer, know a little machine learning and will actually do the work, but the process does deliver results.

Let’s get started.

Here is how I DON’T think you should study machine learning in R.

- **Step 1**: Get really good at R programming and R syntax.
- **Step 2**: Know the deep theory of every possible algorithm you could use in R.
- **Step 3**: Study in great detail how to use each machine learning algorithm in R.
- **Step 4**: Only lightly touch on how to evaluate models.

I think this is the wrong way.

- It teaches you that you need to spend all your time learning how to use individual machine learning algorithms.
- It does not teach you the process of building predictive machine learning models in R that you can actually use in practice to make predictions.

Sadly, this is the approach used to teach machine learning in R that I see in almost all books and online courses on the topic.

You don’t want to be a badass at R or even at machine learning algorithms in R. You want to be a badass at building accurate predictive models using R. This is the context.

You can take time to learn individual machine learning algorithms in great detail, so long as it aids you in building more accurate predictive models, more reliably.


You can just dive into R. Go for it.

In my opinion though, I think you will get a lot more out of it if you have some background.

R is an advanced platform and you can get a lot out of it as a beginner. But, if you have a little machine learning and a little programming as a foundation, R will become a superpower for building accurate predictive models very quickly.

Here are some suggestions for getting the most out of getting started with machine learning in R. I think these are reasonable for a modern developer interested in machine learning.

**A developer who knows how to program**. This helps because it won’t be a big deal to pick up the syntax of R, which at times can be a little odd. It is also helpful to know how to whip up scripts or script-lets (mini scripts) to do this or that task. R is a programming language after all.

**Interested in predictive modeling machine learning**. Machine learning is a big field that covers a variety of interesting algorithms. Predictive modeling is a subset that is only concerned with building models that make predictions on new data, not with explaining the relationships between data, nor with learning from data in general. I think predictive modeling is where R really shines as a platform for machine learning.

**Familiar with machine learning basics**. You understand machine learning as an induction problem where all algorithms are really just trying to estimate an underlying mapping function from an input space to an output space. All predictive machine learning makes sense through this lens, as do strategies of searching for good and best machine learning algorithms, algorithm parameters and data transforms.

The approach I layout in the next section also makes some assumptions about your background.

**You are not an absolute beginner in machine learning**. You could be, and the approach may work for you, but you will get a lot more out of it if you have the additional suggested background.

**You want to use a top-down approach to studying machine learning**. This is the approach I teach: rather than starting with theory and principles and eventually touching on practical machine learning if there is time, you start with the goal of working through a project end-to-end and research details as you need them in order to deliver better results.

**You are familiar with the steps in a predictive modeling machine learning project**. Specifically:

- Define Problem
- Prepare Data
- Evaluate Algorithms
- Improve Results
- Present Results

You can learn more about this process and these steps here:

- How to Use a Machine Learning Checklist to Get Accurate Predictions, Reliably (even if you are a beginner)
- Process for working through Machine Learning Problems

**You are at least familiar with some machine learning algorithms**. Or you may know how to pick them up quickly, for example using the algorithm description template method. I think learning the details of how and why machine learning algorithms work is a separate task from learning how to use those algorithms on a machine learning platform like R. They are often conflated in books and courses to the detriment of learning.

You can learn more about how to learn any machine learning algorithm using the template method here:

- How to Learn a Machine Learning Algorithm
- 5 Techniques To Understand Machine Learning Algorithms Without the Background in Mathematics

This section lays out a process that you can use to get started with building machine learning predictive models on the R platform.

It is divided into two parts:

- Map the tasks of a machine learning project onto the R platform.
- Work through predictive modeling projects using standard datasets.

You need to know how to do the specific tasks of a machine learning project on the R platform. Once you know how to complete a discrete task using the platform and get a result reliably, you can do it again and again on project after project.

This process is straightforward:

- List out all of the discrete tasks of a predictive modeling machine learning project.
- Create recipes to complete the task reliably that you can copy-paste as a starting point on future projects.
- Add to and maintain the recipes as your understanding of the platform and machine learning improves.

Below is a minimum list of predictive modeling tasks you may want to map onto the R platform and create recipes for. This list is not complete, but it does cover the broad strokes of the platform:

- Overview of R syntax
- Prepare Data
  - Loading Data
  - Working With Data
  - Data Summarization
  - Data Visualization
  - Data Cleaning
  - Feature Selection
  - Data Transforms
- Evaluate Algorithms
  - Resampling Methods
  - Evaluation Metrics
  - Spot-Check Algorithms
  - Model Selection
- Improve Results
  - Algorithm Tuning
  - Ensemble Methods
- Present Results
  - Finalize Model
  - Make New Predictions

You will notice the first task is an overview of R syntax. As a developer, you need to know the basics of the language before you can do anything, such as assignment, data structures, flow control and creating and calling functions.
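For example, a first recipe for the syntax task might look like the sketch below: a complete, standalone program covering assignment, a data structure, flow control and a function definition (the *rescale* function is just an illustrative example, not a prescribed recipe).

```r
# recipe: basic R syntax
# assignment with the arrow operator into a vector
x <- c(4, 8, 15, 16, 23, 42)
# define and call a function that rescales values to 0-1
rescale <- function(v) {
  (v - min(v)) / (max(v) - min(v))
}
scaled <- rescale(x)
# flow control: loop over the data structure and test each value
for (value in scaled) {
  if (value > 0.5) {
    print(value)
  }
}
```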

I recommend creating recipes that are standalone. That means that each recipe is a complete program that has everything it needs to achieve the task and produce an output. This means that you can copy it directly into a future predictive modeling project.

You can store the recipes in a directory or on GitHub.

Recipes for common predictive modeling tasks with machine learning are not enough.

Again, this is where most books and courses stop. They leave it to you to piece together the recipes into end-to-end projects.

You need to piece the recipes together into end-to-end projects. This will teach and show you how to actually deliver a result using the platform. I recommend only using small, well understood machine learning datasets from the UCI Machine Learning Repository.

These datasets are available for free as CSV downloads, and most are available directly in R by loading third party libraries. These datasets are excellent for practicing because:

- They are small, meaning they fit into memory and algorithms can model them in reasonable time.
- They are well behaved, meaning you often don’t need to do a lot of feature engineering to get a good result.
- They are standards, meaning that many people have used them before and you can get ideas of good algorithms to try and good results you should expect.

I recommend at least three projects:

- **Hello World Project (iris flowers)**. This is a quick pass through the project steps without much tuning or optimizing on a dataset that is widely used as the hello world of machine learning (more on the iris flowers dataset).
- **Binary Classification end-to-end**. Work through each step on a binary classification problem (e.g. the Pima Indians diabetes dataset).
- **Regression end-to-end**. Work through each step of the process with a regression problem (e.g. the Boston housing dataset).

Machine learning with R does not stop at working through a few small standard datasets. You need to take on more and different challenges.

- **Standard Datasets**: You could practice on additional standard datasets from the UCI Machine Learning Repository, overcoming the challenges of different problem types.
- **Competition Datasets**: You could try working through some more challenging datasets, such as those from past Kaggle competitions or those from past KDDCup challenges.
- **Your Own Projects**: Ideally, you need to start working through your own projects.

All the while you will be dipping into help, adapting your scripts and learning how to get more out of machine learning on R.

It is important that you fold this knowledge back into your catalog of machine learning recipes. This will let you leverage this knowledge quickly on new projects and contribute greatly to your skill and speed at developing predictive models.

You could work through this process in one weekend. By the end of that weekend, you will have the recipes and project templates that you can use to start modeling your own problems using machine learning in R.

You will go from a developer that is interested in machine learning on R to a developer who has the resources and capability to work through a new dataset end-to-end using R and develop a predictive model to be presented and deployed.

Specifically, you will know:

- How to achieve the subtasks of a predictive modeling problem in R.
- How to learn new and different sub tasks in R.
- How to get help with R.
- How to work through a small to medium sized dataset end-to-end.
- How to deliver a model that can make predictions on new unseen data.

From here you can start to dive into the specifics of the functions, techniques and algorithms used with the goal of learning how to use them better in order to deliver more accurate predictive models, more reliably in less time.

In this post you discovered a step-by-step process that you can use to study and get started with machine learning in R.

The three high-level steps of the process are:

- Map the steps of a predictive modeling process onto the R platform with recipes that you can reuse.
- Work through small standard machine learning datasets to piece the recipes together into projects.
- Work through more and different datasets, ideally your own, and add to your library of recipes.

You also discovered the philosophy behind the process and the reasons why this process is the best process for you.

Do you want to get started in machine learning with R?

- Download and install R right now.
- Use the process outlined above, limit yourself to one weekend and go as far as you can.
- Report back. Leave a comment. I would love to hear how you went.

Do you have a question about this process? Leave a comment, I’ll do my best to answer it.


The post Machine Learning Evaluation Metrics in R appeared first on Machine Learning Mastery.

In this post you will discover how you can evaluate your machine learning algorithms in R using a number of standard evaluation metrics.

Let’s get started.

There are many different metrics that you can use to evaluate your machine learning algorithms in R.

When you use caret to evaluate your models, the default metrics used are **accuracy** for classification problems and **RMSE** for regression. But caret supports a range of other popular evaluation metrics.

In the next section you will step through each of the evaluation metrics provided by caret. Each example provides a complete case study that you can copy-and-paste into your project and adapt to your problem.

Note that this post does assume you already know how to interpret these metrics. Don’t fret if they are new to you, I’ve provided some links for further reading where you can learn more.

Take my free 14-day email course and discover how to use R on your project (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this section you will discover how you can evaluate machine learning algorithms using a number of different common evaluation metrics.

Specifically, this section will show you how to use the following evaluation metrics with the caret package in R:

- Accuracy and Kappa
- RMSE and R^2
- ROC (AUC, Sensitivity and Specificity)
- LogLoss

These are the default metrics used to evaluate algorithms on binary and multi-class classification datasets in caret.

**Accuracy** is the percentage of correctly classified instances out of all instances. It is more useful on binary classification than on multi-class classification problems, where it can be less clear exactly how the accuracy breaks down across the classes (e.g. you need to go deeper with a confusion matrix). Learn more about Accuracy here.

**Kappa** or Cohen’s Kappa is like classification accuracy, except that it is normalized at the baseline of random chance on your dataset. It is a more useful measure on problems that have an imbalance in the classes (e.g. a 70-30 split for classes 0 and 1, where you can achieve 70% accuracy by predicting that all instances belong to class 0). Learn more about Kappa here.
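
To make that normalization concrete, Kappa can be computed by hand from a confusion matrix. The sketch below uses made-up counts (not from the Pima dataset) purely for illustration:

```r
# Hypothetical 2x2 confusion matrix (rows = predicted, cols = actual);
# the counts are illustrative only
confusion <- matrix(c(60, 10,
                      10, 20), nrow = 2, byrow = TRUE)
n <- sum(confusion)
observed <- sum(diag(confusion)) / n  # plain classification accuracy
# expected agreement under chance: product of marginal proportions
expected <- sum(rowSums(confusion) * colSums(confusion)) / n^2
kappa <- (observed - expected) / (1 - expected)
```

Here the raw accuracy is 0.8, but the chance-corrected Kappa is about 0.52, showing how the class balance deflates the score.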

In the example below the Pima Indians diabetes dataset is used. It has a class breakdown of 65% to 35% for negative and positive outcomes.

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
control <- trainControl(method="cv", number=5)
set.seed(7)
fit <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="Accuracy", trControl=control)
# display results
print(fit)

Running this example, we can see tables of Accuracy and Kappa for each machine learning algorithm evaluated. This includes the mean values (left) and the standard deviations (marked as SD) for each metric, taken over the population of cross validation folds and trials.

You can see that the accuracy of the model is approximately 76%, which is 11 percentage points above the baseline accuracy of 65% and not really that impressive. The Kappa, on the other hand, is approximately 46%, which is more interesting.

Generalized Linear Model

768 samples
  8 predictor
  2 classes: 'neg', 'pos'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 614, 614, 615, 615, 614
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD
  0.7695442  0.4656824  0.02692468   0.0616666

These are the default metrics used to evaluate algorithms on regression datasets in caret.

**RMSE** or Root Mean Squared Error is the average deviation of the predictions from the observations. It is useful to get a gross idea of how well (or not) an algorithm is doing, in the units of the output variable. Learn more about RMSE here.

**R^2** spoken as R Squared or also called the coefficient of determination provides a “goodness of fit” measure for the predictions to the observations. This is a value between 0 and 1 for no-fit and perfect fit respectively. Learn more about R^2 here.
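
To make the two definitions concrete, here is a sketch computing both metrics by hand on made-up vectors (note that caret's Rsquared is typically the squared correlation between predictions and observations, which can differ slightly from the residual-based formulation below):

```r
# Illustrative observed and predicted values (not from any caret model)
observed  <- c(3.0, -0.5, 2.0, 7.0)
predicted <- c(2.5,  0.0, 2.0, 8.0)

# root mean squared error, in the units of the output variable
rmse <- sqrt(mean((observed - predicted)^2))
# coefficient of determination: 1 - residual SS / total SS
r2 <- 1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
```

On these vectors RMSE is about 0.61 and R^2 about 0.95; a perfect model would give RMSE 0 and R^2 1.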

In this example the longley economic dataset is used. The output variable is “number employed”. It is not clear whether this is an actual count (e.g. in millions) or a percentage.

# load libraries
library(caret)
# load data
data(longley)
# prepare resampling method
control <- trainControl(method="cv", number=5)
set.seed(7)
fit <- train(Employed~., data=longley, method="lm", metric="RMSE", trControl=control)
# display results
print(fit)

Running this example, we can see tables of RMSE and R Squared for each machine learning algorithm evaluated. Again, you can see the mean and standard deviations of both metrics are provided.

You can see that the RMSE was 0.38 in the units of Employed (whatever those units are). The R^2 value shows a good fit for the data, with a value very close to 1 (0.988).

Linear Regression

16 samples
 6 predictor

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 12, 12, 14, 13, 13
Resampling results

  RMSE       Rsquared   RMSE SD    Rsquared SD
  0.3868618  0.9883114  0.1025042  0.01581824

ROC metrics are only suitable for binary classification problems (e.g. two classes).

To calculate ROC information, you must change the summaryFunction in your trainControl to be twoClassSummary. This will calculate the Area Under the ROC Curve (AUROC), also called just the Area Under the Curve (AUC), as well as sensitivity and specificity.

**ROC** here is actually the area under the ROC curve, or AUC. The AUC represents a model’s ability to discriminate between the positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model that is as good as random. Learn more about ROC here.

ROC can be broken down into sensitivity and specificity. A binary classification problem is really a trade-off between sensitivity and specificity.

**Sensitivity** is the true positive rate, also called the recall. It is the proportion of instances from the positive (first) class that were predicted correctly.

**Specificity** is also called the true negative rate. It is the proportion of instances from the negative (second) class that were predicted correctly. Learn more about sensitivity and specificity here.
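
From a confusion matrix, these two rates are simple ratios. The counts below are hypothetical, with the first factor level treated as the positive class (as caret does):

```r
# Hypothetical counts for a binary problem; first level is the "positive" class
true_pos  <- 45; false_neg <- 5    # 50 actual positives
true_neg  <- 30; false_pos <- 20   # 50 actual negatives

sensitivity <- true_pos / (true_pos + false_neg)  # true positive rate (recall)
specificity <- true_neg / (true_neg + false_pos)  # true negative rate
```

With these counts, sensitivity is 0.9 and specificity is 0.6, matching the trade-off described above: a model can favor one rate at the expense of the other.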

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=twoClassSummary)
set.seed(7)
fit <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="ROC", trControl=control)
# display results
print(fit)

Here, you can see the “good” but not “excellent” AUC score of 0.833. The first level is taken as the positive class, in this case “neg” (no onset of diabetes).

Generalized Linear Model

768 samples
  8 predictor
  2 classes: 'neg', 'pos'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 614, 614, 615, 615, 614
Resampling results

  ROC        Sens   Spec       ROC SD      Sens SD     Spec SD
  0.8336003  0.882  0.5600978  0.02111279  0.03563706  0.0560184

Logarithmic Loss or LogLoss is used to evaluate binary classification but it is more common for multi-class classification algorithms. Specifically, it evaluates the probabilities estimated by the algorithms. Learn more about log loss here.

In this case we see logloss calculated for the iris flower multi-class classification problem.

# load libraries
library(caret)
# load the dataset
data(iris)
# prepare resampling method
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(7)
fit <- train(Species~., data=iris, method="rpart", metric="logLoss", trControl=control)
# display results
print(fit)

Logloss is minimized and we can see the optimal CART model had a cp of 0.

CART

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 120, 120, 120, 120, 120
Resampling results across tuning parameters:

  cp    logLoss    logLoss SD
  0.00  0.4105613  0.6491893
  0.44  0.6840517  0.4963032
  0.50  1.0986123  0.0000000

logLoss was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.

In this post you discovered different metrics that you can use to evaluate the performance of your machine learning algorithms in R using caret. Specifically:

- Accuracy and Kappa
- RMSE and R^2
- ROC (AUC, Sensitivity and Specificity)
- LogLoss

You can use the recipes in this post to evaluate machine learning algorithms on your current or next machine learning project.

Work through the example in this post.

- Open your R interactive environment.
- Type or copy-paste the sample code above.
- Take your time and understand what is going on, use R help to read-up on functions.

Do you have any questions? Leave a comment and I will do my best.

The post Machine Learning Evaluation Metrics in R appeared first on Machine Learning Mastery.

The post Compare The Performance of Machine Learning Algorithms in R appeared first on Machine Learning Mastery.

In this post you will discover 8 techniques that you can use to compare machine learning algorithms in R.

You can use these techniques to choose the most accurate model, and be able to comment on the statistical significance and the absolute amount it beat out other algorithms.

Let’s get started.

How do you choose the best model for your problem?

When you work on a machine learning project, you often end up with multiple good models to choose from. Each model will have different performance characteristics.

Using resampling methods like cross validation, you can get an estimate for how accurate each model may be on unseen data. You need to be able to use these estimates to choose one or two best models from the suite of models that you have created.


When you have a new dataset it is a good idea to visualize the data using a number of different graphing techniques in order to look at the data from different perspectives.

The same idea applies to model selection. You should use a number of different ways of looking at the estimated accuracy of your machine learning algorithms in order to choose the one or two to finalize.

The way that you can do that is to use different visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies.

In the next section you will discover exactly how you can do that in R.

In this section you will discover how you can objectively compare machine learning models in R.

Through the case study in this section you will create a number of machine learning models for the Pima Indians diabetes dataset. You will then use a suite of different visualization techniques to compare the estimated accuracy of the models.

This case study is split up into three sections:

**Prepare Dataset**. Load the libraries and dataset ready to train the models.**Train Models**. Train standard machine learning models on the dataset ready for evaluation.**Compare Models**. Compare the trained models using 8 different techniques.

The dataset used in this case study is the Pima Indians diabetes dataset, available on the UCI Machine Learning Repository. It is also available in the mlbench package in R.

It is a binary classification problem as to whether a patient will have an onset of diabetes within the next 5 years. The input attributes are numeric and describe medical details for female patients.

Let’s load the libraries and dataset for this case study.

# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)

In this section we will train the 5 machine learning models that we will compare in the next section.

We will use repeated cross validation with 10 folds and 3 repeats, a common standard configuration for comparing models. The evaluation metric is accuracy and kappa because they are easy to interpret.

The algorithms were chosen semi-randomly for their diversity of representation and learning style. They include:

- Classification and Regression Trees
- Linear Discriminant Analysis
- Support Vector Machine with Radial Basis Function
- k-Nearest Neighbors
- Random forest

After the models are trained, they are added to a list and resamples() is called on the list of models. This function checks that the models are comparable and that they used the same training scheme (trainControl configuration). This object contains the evaluation metrics for each fold and each repeat for each algorithm to be evaluated.

The functions that we use in the next section all expect an object with this data.

# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# CART
set.seed(7)
fit.cart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", trControl=control)
# LDA
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=control)
# SVM
set.seed(7)
fit.svm <- train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", trControl=control)
# kNN
set.seed(7)
fit.knn <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(diabetes~., data=PimaIndiansDiabetes, method="rf", trControl=control)
# collect resamples
results <- resamples(list(CART=fit.cart, LDA=fit.lda, SVM=fit.svm, KNN=fit.knn, RF=fit.rf))

In this section we will look at 8 different techniques for comparing the estimated accuracy of the constructed models.

This is the easiest comparison that you can do: simply call the summary() function and pass it the resamples result. It will create a table with one algorithm per row and summary statistics for each evaluation metric in the columns.

# summarize differences between models
summary(results)

I find it useful to look at the mean and the max columns.

Accuracy
       Min.  1st Qu.  Median    Mean  3rd Qu.    Max.  NA's
CART 0.6234   0.7115  0.7403  0.7382   0.7760  0.8442     0
LDA  0.6711   0.7532  0.7662  0.7759   0.8052  0.8701     0
SVM  0.6711   0.7403  0.7582  0.7651   0.7890  0.8961     0
KNN  0.6184   0.6984  0.7321  0.7299   0.7532  0.8182     0
RF   0.6711   0.7273  0.7516  0.7617   0.7890  0.8571     0

Kappa
       Min.  1st Qu.  Median    Mean  3rd Qu.    Max.  NA's
CART 0.1585   0.3296  0.3765  0.3934   0.4685  0.6393     0
LDA  0.2484   0.4196  0.4516  0.4801   0.5512  0.7048     0
SVM  0.2187   0.3889  0.4167  0.4520   0.5003  0.7638     0
KNN  0.1113   0.3228  0.3867  0.3819   0.4382  0.5867     0
RF   0.2624   0.3787  0.4516  0.4588   0.5193  0.6781     0

This is a useful way to look at the spread of the estimated accuracies for different methods and how they relate.

# box and whisker plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales)

Note that the boxes are ordered from highest to lowest mean accuracy. I find it useful to look at the mean values (dots) and the overlaps of the boxes (middle 50% of results).

You can show the distribution of model accuracy as density plots. This is a useful way to evaluate the overlap in the estimated behavior of algorithms.

# density plots of accuracy
scales <- list(x=list(relation="free"), y=list(relation="free"))
densityplot(results, scales=scales, pch = "|")

I like to look at the differences in the peaks as well as the spread or base of the distributions.

These are useful plots as they show both the mean estimated accuracy as well as the 95% confidence interval (e.g. the range in which 95% of observed scores fell).

# dot plots of accuracy
scales <- list(x=list(relation="free"), y=list(relation="free"))
dotplot(results, scales=scales)

I find it useful to compare the means and eye-ball the overlap of the spreads between algorithms.

This is another way to look at the data. It shows how each trial of each cross validation fold behaved for each of the algorithms tested. It can help you see how the hold-out subsets that were difficult for one algorithm fared for the other algorithms.

# parallel plots to compare models
parallelplot(results)

This can be a tricky one to interpret. I like to think that it can be helpful in thinking about how different methods could be combined in an ensemble prediction (e.g. stacking) at a later time, especially if you see correlated movements in opposite directions.

This creates a scatterplot matrix of all fold-trial results for an algorithm compared to the same fold-trial results for all other algorithms. All pairs are compared.

# pair-wise scatterplots of predictions to compare models
splom(results)

This is invaluable when considering whether the predictions from two different algorithms are correlated. If weakly correlated, they are good candidates for being combined in an ensemble prediction.

For example, eye-balling the graphs, it looks like LDA and SVM are strongly correlated, as are SVM and RF. SVM and CART look weakly correlated.

You can zoom in on one pair-wise comparison of the accuracy of trial-folds for two machine learning algorithms with an xyplot.

# xyplot plots to compare models
xyplot(results, models=c("LDA", "SVM"))

In this case we can see the seemingly correlated accuracy of the LDA and SVM models.

You can calculate the significance of the differences between the metric distributions of different machine learning algorithms. We can summarize the results directly by calling the summary() function.

# difference in model predictions
diffs <- diff(results)
# summarize p-values for pair-wise comparisons
summary(diffs)

We can see a table of pair-wise statistical significance scores. The lower diagonal of the table shows p-values for the null hypothesis (distributions are the same), smaller is better. We can see no difference between CART and kNN, we can also see little difference between the distributions for LDA and SVM.

The upper diagonal of the table shows the estimated difference between the distributions. If we think that LDA is the most accurate model from looking at the previous graphs, we can get an estimate of how much better it is than specific other models in terms of absolute accuracy.

These scores can help with any accuracy claims you might want to make between specific algorithms.

p-value adjustment: bonferroni
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Accuracy
      CART       LDA        SVM        KNN        RF
CART             -0.037759  -0.026908   0.008248  -0.023473
LDA   0.0050068             0.010851    0.046007   0.014286
SVM   0.0919580  0.3390336             0.035156    0.003435
KNN   1.0000000  1.218e-05  0.0007092             -0.031721
RF    0.1722106  0.1349151  1.0000000  0.0034441

A good tip is to increase the number of trials to increase the size of the populations and perhaps get more precise p-values. You can also plot the differences, but I find the plots a lot less useful than the summary table above.

In this post you discovered 8 different techniques that you can use to compare the estimated accuracy of your machine learning models in R.

The 8 techniques you discovered were:

- Table Summary
- Box and Whisker Plots
- Density Plots
- Dot Plots
- Parallel Plots
- Scatterplot Matrix
- Pairwise xyPlots
- Statistical Significance Tests

Did I miss one of your favorite ways to compare the estimated accuracy of machine learning algorithms in R? Leave a comment, I’d love to hear about it!

Did you try out these recipes?

- Start your R interactive environment.
- Type or copy-paste the recipes above and try them out.
- Use the built-in help in R to learn more about the functions used.

Do you have a question? Ask it in the comments and I will do my best to answer it.

The post Compare The Performance of Machine Learning Algorithms in R appeared first on Machine Learning Mastery.

The post Review of Machine Learning With R appeared first on Machine Learning Mastery.

In this post you will discover the book Machine Learning with R by Brett Lantz that has the goal of telling you exactly how to get started practicing machine learning in R.

We cover the audience for the book, a nice deep breakdown of the contents and a summary of the good and bad points.

Let’s get started.

Note: We are talking about the second edition in this review.

There are two types of people who should read this book:

- Machine Learning Practitioner. You already know some machine learning and you want to learn how to practice machine learning using R.
- R Practitioner. You are a user of R and you want to learn enough machine learning to practice with R.

From the preface:

It would be helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is required.


This section steps you through the topics covered in the book.

When picking up a new book, I like to step through each chapter and see the steps or journey it takes me on. The journey of this book is as follows:

- What is machine learning.
- Handle data.
- Lots of machine learning algorithms
- Evaluate model accuracy.
- Improve accuracy of models.

This covers many of the tasks you need for a machine learning project, but it does miss some.

Let’s step through each chapter and see what the book offers:

Provides an introduction to machine learning, terminology and (very) high-level learning theory.

Topics covered include:

- Uses and abuses of machine learning
- How machines learn
- Machine learning in practice
- Machine learning with R

Interestingly, the topic of machine learning ethics is covered, a topic you don’t often see addressed.

Covers R basics but really focuses on how to load, summarize and visualize data.

Topics include:

- R data structures
- Managing data with R
- Exploring and understanding data

A lot of time is spent on different graph types, which I generally like. It is good to know about and use more than one or two graphs.

This chapter introduces and demonstrates the k-nearest neighbors (kNN) algorithm.

Topics covered include:

- Understanding nearest neighbor classification
- Example – diagnosing breast cancer with k-NN algorithm

I like that good time is spent on data transforms, which are so critical to the accuracy of kNN.

This chapter introduces and demonstrates the Naive Bayes algorithm for classification.

Topics covered include:

- Understanding Naive Bayes
- Example – filtering mobile phone spam with the Naive Bayes Algorithm

I like the interesting case study problem used.

This chapter introduces decision trees and rule systems with the algorithms C5.0, 1R and RIPPER.

Topics covered include:

- Understanding decision trees
- Example – identifying risky bank loans using C5.0 decision trees
- Understanding classification rules
- Example – identifying poisonous mushrooms with rule learners

I like that C5.0 is covered, as it was proprietary for a long time and has only recently been released as open source and made available in R. I am surprised that CART was not covered, the hello world of decision tree algorithms.

This chapter is all about regression, with demonstrations of linear regression, CART and M5P.

Topics covered include:

- Understanding Regression
- Example – predicting medical expenses using linear regression
- Understanding regression trees and model trees
- Example – estimating the quality of wines with regression trees and model trees

It is good to see the classics linear regression and CART covered here. M5P is also a nice touch.

This chapter introduces artificial neural networks and support vector machines.

Topics covered include:

- Understanding neural networks
- Example – Modeling the strength of concrete with ANNs
- Understanding support vector machines
- Example – performing OCR with SVMs

It is good to see these algorithms covered and the example problems are interesting.

This chapter introduces and demonstrates association rule algorithms, typically used for market basket analysis.

Topics covered include:

- Understanding association rules.
- Example – identifying frequently purchased groceries with association rules

It’s not a topic I like much nor an algorithm I have ever had to use on a project. I’d drop this chapter.

This chapter introduces the k-means clustering algorithm and demonstrates it on data.

Topics covered include:

- Understanding clustering
- Example – finding teen market segments using k-means clustering

Another esoteric topic that I would probably drop. Clustering is interesting, but unsupervised learning algorithms are often really hard to use well in practice. Here are some clusters, now what?

This chapter presents methods for evaluating model skill.

Topics covered include:

- Measuring performance for classification
- Evaluating future performance

I like that performance measures and resampling methods are covered. Many texts skip it. I like that a lot of time is spent on the more detailed concerns of classification accuracy (e.g. touching on Kappa and F1 scores).

This chapter introduces techniques that you can use to improve the accuracy of your models, namely algorithm tuning and ensembles.

Topics covered include:

- Tuning stock models for better performance
- Improving model performance with meta-learning

Good but too brief. Algorithm tuning and ensembles are a big part of building accurate models in modern machine learning. Length could be suitable given that it is an introductory text, but more time should be given to the caret package.

If you’re not using caret for machine learning in R, you’re doing it wrong.

This chapter contains a mess of other topics, including:

- Working with proprietary files and databases
- Working with online data and services
- Working with domain-specific data
- Improving performance of R

The topics are very specialized. Perhaps only the last on “improving performance of R” is really actionable for your machine learning projects.

The book covers a number of different machine learning algorithms. This section lists all of the algorithms covered and in which chapter they can be found.

I note that page 21 of the book does provide a look-up table of algorithms to chapters, but it is too high-level and glosses over the actual names of the algorithms used.

- k-nearest neighbors (chapter 3)
- Naive Bayes (chapter 4)
- C5.0 (chapter 5)
- 1R (chapter 5)
- RIPPER (chapter 5)
- Linear Regression (chapter 6)
- Classification and Regression Trees (chapter 6)
- M5P (chapter 6)
- Artificial Neural Networks (chapter 7)
- Support Vector Machines (chapter 7)
- Apriori (chapter 8)
- k-means (chapter 9)
- Bagged CART (chapter 10)
- AdaBoost (chapter 10)
- Random Forest (chapter 10)

I like the book as an introduction for how to do machine learning on the R platform.

You must know how to program. You must know a little bit of R. You must have some sense of how to drive a machine learning project from beginning to end. This book will not cover these topics, but it will show you how to complete common machine learning tasks using R.

Set your expectations accordingly:

- This is a practical book with worked examples and high-level algorithm descriptions.
- This is not a machine learning textbook with theory, proof and lots of equations.

- I like the structured examples, where each algorithm is demonstrated with a different dataset.
- I like that the datasets are small, in-memory examples, perhaps all taken from the UCI Machine Learning Repository.
- I like that references to research papers are provided where appropriate for further reading.
- I like the boxes that summarize usage information for algorithms and other key techniques.
- I like that it is practically focused, the how of machine learning not the deep why.

- I don’t like that it is so algorithm-focused. It follows the general structure of most “applied” books and dumps a lot of algorithms on you, rather than covering the extended project lifecycle.
- I don’t like that there are no end-to-end examples (problem definition, through to model selection, through to presentation of results). The formal structure of the examples is good, but I would have liked a deep case study chapter.
- I cannot download the code and datasets from a GitHub repository or as a zip. I have to sign up and go through the publisher’s process.
- There are chapters that feel like they are only there because similar chapters exist in other machine learning books (clustering and association rules). These may be machine learning methods, but they are not used nearly as often as core predictive modeling methods (IMHO).
- Perhaps a little too much filler. I like less talk and more action. If I wanted long algorithm descriptions, I would read an algorithms textbook. Tell me the broad strokes and let’s get to it.

If you are looking for a good applied book for machine learning with R, this is it. I like it for beginners who know a little machine learning and/or a little R and want to practice machine learning on the R platform.

Even though I think O’Reilly books are generally better applied books than Packt, I don’t see an offering from O’Reilly that can compete.

If you want to go one step deeper and get some more theory and more explanations I would advise checking out: Applied Predictive Modeling. If you want more math I would suggest An Introduction to Statistical Learning: with Applications in R.

Both books have examples in R, but less focus on R and more focus on the details of machine learning algorithms.

Have you read this book? Let me know what you think in the comments.

Are you thinking of buying this book? Have any questions? Let me know in the comments and I’ll do my best to answer them.

The post Review of Machine Learning With R appeared first on Machine Learning Mastery.

The post Machine Learning Project Template in R appeared first on Machine Learning Mastery.

You cannot get better at machine learning by reading books and blog posts. You have to practice.

In this post, you will discover the simple 6-step machine learning project template that you can use to jump-start your project in R.

Let’s get started.

Working through machine learning problems from end-to-end is critically important.

You can read about machine learning. You can also try out small one-off recipes. But applied machine learning will not come alive for you until you work through a dataset from beginning to end.

Working through a project forces you to think about how the model will be used, to challenge your assumptions and to get good at all parts of a project, not just your favorite.

The best way to practice predictive modeling machine learning projects is to use standardized datasets from the UCI Machine Learning Repository. For more on this approach to practicing, see the post:

Once you have a practice dataset and a bunch of R recipes, how do you put it all together and work through the problem end-to-end?


Any predictive modeling machine learning project can be broken down into about 6 common tasks:

- Define Problem
- Summarize Data
- Prepare Data
- Evaluate Algorithms
- Improve Results
- Present Results

Tasks can be combined or broken down further, but this is the general structure. For more on this structure see the post:

To work through predictive modeling machine learning problems in R, you need to map R onto this process. The tasks may need to be adapted or renamed slightly to suit the “R way of doing things” (e.g. the caret package).

The next section provides exactly this mapping, and elaborates on each task and the types of subtasks and caret packages that you can use.

This section presents a project template that you can use to work through machine learning problems in R end-to-end.

Below is the project template that you can use in your machine learning projects in R.

```r
# R Project Template

# 1. Prepare Problem
# a) Load libraries
# b) Load dataset
# c) Split-out validation dataset

# 2. Summarize Data
# a) Descriptive statistics
# b) Data visualizations

# 3. Prepare Data
# a) Data Cleaning
# b) Feature Selection
# c) Data Transforms

# 4. Evaluate Algorithms
# a) Test options and evaluation metric
# b) Spot Check Algorithms
# c) Compare Algorithms

# 5. Improve Accuracy
# a) Algorithm Tuning
# b) Ensembles

# 6. Finalize Model
# a) Predictions on validation dataset
# b) Create standalone model on entire training dataset
# c) Save model for later use
```

- Create a new file for your project (e.g. project_name.R).
- Copy the project template.
- Paste it into your empty project file.
- Start to fill it in, using recipes from blog posts on this site and others.

This section gives you additional details on each of the steps of the template.

This step is about loading everything you need to start working on your problem. This includes:

- R libraries you will use like caret.
- Loading your dataset from CSV.
- Using caret to create separate training and validation datasets.

This is also the home of any global configuration you might need to do, like setting up any parallel libraries and functions for using multiple cores.
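As a minimal sketch of such global configuration, the base `parallel` package can report the cores available; a parallel backend such as the `doMC` package (used in later recipes on this site) can then be registered for caret to use. The choice of backend is an assumption here and depends on your platform:

```r
# Global configuration sketch: check how many cores are available.
library(parallel)
cores <- detectCores()
cat("Cores available:", cores, "\n")

# With the doMC package installed (Linux/macOS), you could then run:
# library(doMC)
# registerDoMC(cores=cores)
```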

It is also the place where you might need to make a reduced sample of your dataset if it is too large to work with. Ideally, your dataset should be small enough to build a model or create a visualization within a minute, ideally 30 seconds. You can always scale up well-performing models later.
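A reduced sample can be drawn with base R. A sketch using the built-in iris dataset, where the 10% fraction is an arbitrary choice for illustration:

```r
# Take a 10% random sample of rows to speed up early experimentation
data(iris)
set.seed(7)
sample_index <- sample(nrow(iris), size = floor(0.1 * nrow(iris)))
iris_small <- iris[sample_index, ]
dim(iris_small)  # 15 rows, 5 columns
```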

This step is about better understanding the data that you have available.

This includes understanding your data using:

- Descriptive statistics such as summaries.
- Data visualizations such as plots from the graphics and lattice packages.

Take your time and use the results to prompt a lot of questions, assumptions and hypotheses that you can investigate later with specialized models.

This step is about preparing the data in such a way that it best exposes the structure of the problem and the relationships between your input attributes with the output variable.

This includes tasks such as:

- Cleaning data by removing duplicates, marking missing values and even imputing missing values.
- Feature selection where redundant features may be removed.
- Data transforms where attributes are scaled or redistributed in order to best expose the structure of the problem later to learning algorithms.
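As a rough sketch of the cleaning tasks above in base R, on a small hypothetical data frame (the column names and values are invented for illustration):

```r
# Hypothetical messy data: one duplicate row and one missing value
df <- data.frame(x = c(1, 2, 2, NA, 5), y = c(10, 20, 20, 40, 50))
# remove duplicate rows
df <- df[!duplicated(df), ]
# mark missing values and impute them with the column mean
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)
df
```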

Start simple. Revisit this step often and cycle with the next step until you converge on a subset of algorithms and a presentation of the data that results in models accurate enough to proceed.

This step is about finding a subset of machine learning algorithms that are good at exploiting the structure of your data (e.g. have better than average skill).

This involves steps such as:

- Defining test options using caret such as cross validation and the evaluation metric to use.
- Spot checking a suite of linear and nonlinear machine learning algorithms.
- Comparing the estimated accuracy of algorithms.

Practically, on a given problem you will likely spend most of your time on this and the previous step until you converge on a set of 3-to-5 well-performing machine learning algorithms.

Once you have a short list of machine learning algorithms, you need to get the most out of them. There are two different ways to improve the accuracy of your models:

- Search for a combination of parameters for each algorithm using caret that yields the best results.
- Combine the prediction of multiple models into an ensemble prediction using standalone algorithms or the caretEnsemble package.

The line between this and the previous step can blur when a project becomes concrete. There may be a little algorithm tuning in the previous step. And in the case of ensembles, you may bring more than a short list of algorithms forward to combine their predictions.

Once you have found a model that you believe can make accurate predictions on unseen data, you are ready to finalize it.

Finalizing a model may involve sub-tasks such as:

- Using an optimal model tuned by caret to make predictions on unseen data.
- Creating a standalone model using the parameters tuned by caret.
- Saving an optimal model to file for later use.

Once you make it this far you are ready to present results to stakeholders and/or deploy your model to start making predictions on unseen data.

Below are tips that you can use to make the most of the machine learning project template in R.

- Fast First Pass. Make a first-pass through the project steps as fast as possible. This will give you confidence that you have all the parts that you need and a base line from which to improve.
- Cycles. The process is not linear but cyclic. You will loop between steps, and probably spend most of your time in tight loops between steps 3-4 or 3-4-5 until you achieve a level of accuracy that is sufficient or you run out of time.
- Attempt Every Step. It is easy to skip steps, especially if you are not confident or familiar with the tasks of that step. Try to do something at each step in the process, even if it does not improve accuracy. You can always build upon it later. Don’t skip steps, just reduce their contribution.
- Ratchet Accuracy. The goal of the project is model accuracy. Every step contributes towards this goal. Treat changes that you make and experiments that increase accuracy as the golden path in the process and reorganize other steps around them. Accuracy is a ratchet that can only move in one direction (better, not worse).
- Adapt As Needed. Modify the steps as you need on a project, especially as you become more experienced with the template. Blur the edges of tasks, such as 4-5 to best serve model accuracy.

**You do not need to be an R programmer**. You can use the recipes on this blog and others to jump-start your machine learning project, and lean on the R help system to understand the functions and arguments used.

**You do not need to be a machine learning expert**. Machine learning is an empirical skill that you can only improve at by practicing. Start practicing now on small in-memory datasets.

**You do not need to be a master of machine learning algorithms**. There are far too many machine learning algorithms to be an expert in them all. It is much easier to focus on the goal of getting good at applying machine learning algorithms to data. Start practicing now using the template above.

In this post you discovered a machine learning project template in R.

It lays out the steps of a predictive modeling machine learning project with the goal of maximizing model accuracy.

You can copy-and-paste the template and use it to jump-start your current or next machine learning project in R.

Use the template on a project.

- If you are currently working or about to start working on a machine learning project, use the template. Report back on how you went using the template and any modifications you needed to make.
- Don’t have a project? Use a standard small in-memory machine learning dataset from the UCI Machine Learning Repository and start practicing today. Right now if possible.

Do you have a question about the machine learning project template? Ask in the comments, I’ll do my best to answer.


The post Save And Finalize Your Machine Learning Model in R appeared first on Machine Learning Mastery.

In this post you will discover how to finalize your machine learning model in R including: making predictions on unseen data, re-building the model from scratch and saving your model for later use.

Let’s get started.

Once you have an accurate model on your test harness you are nearly done. But not quite yet.

There are still a number of tasks to do to finalize your model. The whole idea of creating an accurate model for your dataset was to make predictions on unseen data.

There are three tasks you may be concerned with:

- Making new predictions on unseen data.
- Creating a standalone model using all training data.
- Saving your model to file for later loading and making predictions on new data.

Once you have finalized your model you are ready to make use of it. You could use the R model directly. You could also discover the key internal representation found by the learning algorithm (like the coefficients in a linear model) and use them in a new implementation of the prediction algorithm on another platform.

In the next section, you will look at how you can finalize your machine learning model in R.


Caret is an excellent tool that you can use to find good or even the best machine learning algorithms and parameters for those algorithms.

But what do you do after you have discovered a model that is accurate enough to use?

Once you have found a good model in R, you have three main concerns:

- Making new predictions using your tuned caret model.
- Creating a standalone model using the entire training dataset.
- Saving/Loading a standalone model to file.

This section will step you through how to achieve each of these tasks in R.

You can make new predictions using a model you have tuned using caret using the *predict.train()* function.

In the recipe below, the dataset is split into a validation dataset and a training dataset. The validation dataset could just as easily be a new dataset stored in a separate file and loaded as a data frame.

A good model of the data is found using LDA. We can see that caret provides access to the best model from a training run in the finalModel variable.

We can use that model to make predictions by calling predict on the fit from train, which will automatically use the final model. We must specify the data on which to make predictions via the *newdata* argument.

```r
# load libraries
library(caret)
library(mlbench)
# load dataset
data(PimaIndiansDiabetes)
# create 80%/20% for training and validation datasets
set.seed(9)
validation_index <- createDataPartition(PimaIndiansDiabetes$diabetes, p=0.80, list=FALSE)
validation <- PimaIndiansDiabetes[-validation_index,]
training <- PimaIndiansDiabetes[validation_index,]
# train a model and summarize model
set.seed(9)
control <- trainControl(method="cv", number=10)
fit.lda <- train(diabetes~., data=training, method="lda", metric="Accuracy", trControl=control)
print(fit.lda)
print(fit.lda$finalModel)
# estimate skill on validation dataset
set.seed(9)
predictions <- predict(fit.lda, newdata=validation)
confusionMatrix(predictions, validation$diabetes)
```

Running the example, we can see that the estimated accuracy on the training dataset was 76.91%. Using the finalModel in the fit, we can see that the accuracy on the hold out validation dataset was 77.78%, very similar to our estimate.

```
Resampling results

  Accuracy   Kappa    Accuracy SD  Kappa SD
  0.7691169  0.45993  0.06210884   0.1537133

...

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg  85  19
       pos  15  34

               Accuracy : 0.7778
                 95% CI : (0.7036, 0.8409)
    No Information Rate : 0.6536
    P-Value [Acc > NIR] : 0.000586

                  Kappa : 0.5004
 Mcnemar's Test P-Value : 0.606905

            Sensitivity : 0.8500
            Specificity : 0.6415
         Pos Pred Value : 0.8173
         Neg Pred Value : 0.6939
             Prevalence : 0.6536
         Detection Rate : 0.5556
   Detection Prevalence : 0.6797
      Balanced Accuracy : 0.7458

       'Positive' Class : neg
```

In this example, we have tuned a random forest with 3 different values for *mtry*, with *ntree* set to 2000. By printing the fit and the finalModel, we can see that the most accurate value for *mtry* was 2.

Now that we know a good algorithm (random forest) and a good configuration (*mtry*=2, *ntree*=2000) we can create the final model directly using all of the training data. We can look up the “*rf*” random forest implementation used by caret in the Caret List of Models and note that it uses the *randomForest* package and in turn the *randomForest()* function.

The example creates a new model directly and uses it to make predictions on the new data, in this case simulated as the validation dataset.

```r
# load libraries
library(caret)
library(mlbench)
library(randomForest)
# load dataset
data(Sonar)
set.seed(7)
# create 80%/20% for training and validation datasets
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
validation <- Sonar[-validation_index,]
training <- Sonar[validation_index,]
# train a model and summarize model
set.seed(7)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
fit.rf <- train(Class~., data=training, method="rf", metric="Accuracy", trControl=control, ntree=2000)
print(fit.rf)
print(fit.rf$finalModel)
# create standalone model using all training data
set.seed(7)
finalModel <- randomForest(Class~., training, mtry=2, ntree=2000)
# make predictions on "new data" using the final model
final_predictions <- predict(finalModel, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)
```

We can see that the estimated accuracy of the optimal configuration was 85.07%. We can see that the accuracy of the final standalone model trained on all of the training dataset and predicting for the validation dataset was 82.93%.

```
Random Forest

167 samples
 60 predictor
  2 classes: 'M', 'R'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 151, 150, 150, 150, 151, 150, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
   2    0.8507353  0.6968343  0.07745360   0.1579125
  31    0.8064951  0.6085348  0.09373438   0.1904946
  60    0.7927696  0.5813335  0.08768147   0.1780100

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

...

Call:
 randomForest(x = x, y = y, ntree = 2000, mtry = param$mtry)
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 2

        OOB estimate of error rate: 14.37%
Confusion matrix:
   M  R class.error
M 83  6  0.06741573
R 18 60  0.23076923

...

Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 20  5
         R  2 14

               Accuracy : 0.8293
                 95% CI : (0.6794, 0.9285)
    No Information Rate : 0.5366
    P-Value [Acc > NIR] : 8.511e-05

                  Kappa : 0.653
 Mcnemar's Test P-Value : 0.4497

            Sensitivity : 0.9091
            Specificity : 0.7368
         Pos Pred Value : 0.8000
         Neg Pred Value : 0.8750
             Prevalence : 0.5366
         Detection Rate : 0.4878
   Detection Prevalence : 0.6098
      Balanced Accuracy : 0.8230

       'Positive' Class : M
```

Some simpler models, such as linear models, can output their coefficients. This is useful because, from these, you can implement the simple prediction procedure in your language of choice and use the coefficients to get the same accuracy. This gets more difficult as the complexity of the representation increases.
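As a sketch with base R, using the built-in cars dataset: the coefficients of a linear model can be extracted and used to reproduce its predictions by hand, which is the same procedure you would port to another platform.

```r
# Fit a simple linear model on the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(fit)  # intercept and slope
# Re-implement the prediction procedure manually from the coefficients
manual <- coefs[1] + coefs[2] * cars$speed
# The manual predictions match predict() on the fitted model
all.equal(unname(manual), unname(predict(fit)))  # TRUE
```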

You can save your best models to a file so that you can load them up later and make predictions.

In this example we split the Sonar dataset into a training dataset and a validation dataset. We take our validation dataset as new data to test our final model. We train the final model using the training dataset and our optimal parameters, then save it to a file called final_model.rds in the local working directory.

The model is serialized. It can be loaded at a later time by calling readRDS() and assigning the object that is loaded (in this case a random forest fit) to a variable name. The loaded random forest is then used to make predictions on new data, in this case the validation dataset.

```r
# load libraries
library(caret)
library(mlbench)
library(randomForest)
library(doMC)
registerDoMC(cores=8)
# load dataset
data(Sonar)
set.seed(7)
# create 80%/20% for training and validation datasets
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
validation <- Sonar[-validation_index,]
training <- Sonar[validation_index,]
# create final standalone model using all training data
set.seed(7)
final_model <- randomForest(Class~., training, mtry=2, ntree=2000)
# save the model to disk
saveRDS(final_model, "./final_model.rds")

# later...

# load the model
super_model <- readRDS("./final_model.rds")
print(super_model)
# make predictions on "new data" using the final model
final_predictions <- predict(super_model, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)
```

We can see that the accuracy on the validation dataset was 82.93%.

```
Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 20  5
         R  2 14

               Accuracy : 0.8293
                 95% CI : (0.6794, 0.9285)
    No Information Rate : 0.5366
    P-Value [Acc > NIR] : 8.511e-05

                  Kappa : 0.653
 Mcnemar's Test P-Value : 0.4497

            Sensitivity : 0.9091
            Specificity : 0.7368
         Pos Pred Value : 0.8000
         Neg Pred Value : 0.8750
             Prevalence : 0.5366
         Detection Rate : 0.4878
   Detection Prevalence : 0.6098
      Balanced Accuracy : 0.8230

       'Positive' Class : M
```

In this post you discovered three recipes for working with final predictive models:

- How to make predictions using the best model from caret tuning.
- How to create a standalone model using the parameters found during caret tuning.
- How to save and later load a standalone model and use it to make predictions.

You can work through these recipes to understand them better. You can also use them as a template and copy-and-paste them into your current or next machine learning project.

Did you try out these recipes?

- Start your R interactive environment.
- Type or copy-paste the recipes above and try them out.
- Use the built-in help in R to learn more about the functions used.

Do you have a question? Ask it in the comments and I will do my best to answer it.


The post Get Your Data Ready For Machine Learning in R with Pre-Processing appeared first on Machine Learning Mastery.

In this post you will discover how to transform your data in order to best expose its structure to machine learning algorithms in R using the caret package.

You will work through 8 popular and powerful data transforms with recipes that you can study or copy and paste into your current or next machine learning project.

Let’s get started.

You want to get the best accuracy from machine learning algorithms on your datasets.

Some machine learning algorithms require the data to be in a specific form. Other algorithms can perform better if the data is prepared in a specific way, but not always. Finally, your raw data may not be in the best format to expose the underlying structure and relationships to the predicted variable.

It is important to prepare your data in such a way that it gives various different machine learning algorithms the best chance on your problem.

You need to pre-process your raw data as part of your machine learning project.

It is hard to know which data-preprocessing methods to use.

You can use rules of thumb such as:

- Instance-based methods are more effective if the input attributes have the same scale.
- Regression methods can work better if the input attributes are standardized.

These are heuristics, but not hard and fast laws of machine learning, because sometimes you can get better results if you ignore them.

You should try a range of data transforms with a range of different machine learning algorithms. This will help you discover both good representations for your data and algorithms that are better at exploiting the structure that those representations expose.

It is a good idea to spot check a number of transforms both in isolation as well as combinations of transforms.

In the next section you will discover how you can apply data transforms in order to prepare your data in R using the caret package.


The caret package in R provides a number of useful data transforms.

These transforms can be used in two ways.

- **Standalone**: Transforms can be modeled from training data and applied to multiple datasets. The model of the transform is prepared using the *preProcess()* function and applied to a dataset using the *predict()* function.
- **Training**: Transforms can be prepared and applied automatically during model evaluation. Transforms applied during training are prepared using *preProcess()* and passed to the *train()* function via the *preProcess* argument.

A number of data preprocessing examples are presented in this section. They are presented using the standalone method, but you can just as easily use the prepared preprocessed model during model training.

All of the preprocessing examples in this section are for numerical data. Note that the preprocessing functions will skip over non-numeric data without raising an error.

You can learn more about the data transforms provided by the caret package by reading the help for the preProcess function by typing ?preProcess and by reading the Caret Pre-Processing page.

The data transforms presented are more likely to be useful for algorithms such as regression algorithms, instance-based methods (like kNN and LVQ), support vector machines and neural networks. They are less likely to be useful for tree and rule based methods.

Below is a quick summary of all of the transform methods supported in the method argument of the *preProcess()* function in caret.

- “*BoxCox*“: apply a Box-Cox transform, values must be non-zero and positive.
- “*YeoJohnson*“: apply a Yeo-Johnson transform, like a BoxCox, but values can be negative.
- “*expoTrans*“: apply a power transform like BoxCox and YeoJohnson.
- “*zv*“: remove attributes with a zero variance (all the same value).
- “*nzv*“: remove attributes with a near zero variance (close to the same value).
- “*center*“: subtract mean from values.
- “*scale*“: divide values by standard deviation.
- “*range*“: normalize values.
- “*pca*“: transform data to the principal components.
- “*ica*“: transform data to the independent components.
- “*spatialSign*“: project data onto a unit circle.

The following sections will demonstrate some of the more popular methods.

The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
```

Running the recipe, you will see:

```
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Created from 150 samples and 4 variables

Pre-processing:
  - ignored (0)
  - scaled (4)

 Sepal.Length    Sepal.Width      Petal.Length     Petal.Width
 Min.   :5.193   Min.   : 4.589   Min.   :0.5665   Min.   :0.1312
 1st Qu.:6.159   1st Qu.: 6.424   1st Qu.:0.9064   1st Qu.:0.3936
 Median :7.004   Median : 6.883   Median :2.4642   Median :1.7055
 Mean   :7.057   Mean   : 7.014   Mean   :2.1288   Mean   :1.5734
 3rd Qu.:7.729   3rd Qu.: 7.571   3rd Qu.:2.8890   3rd Qu.:2.3615
 Max.   :9.540   Max.   :10.095   Max.   :3.9087   Max.   :3.2798
```

The center transform calculates the mean for an attribute and subtracts it from each value.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
```

Running the recipe, you will see:

```
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)

 Sepal.Length       Sepal.Width        Petal.Length     Petal.Width
 Min.   :-1.54333   Min.   :-1.05733   Min.   :-2.758   Min.   :-1.0993
 1st Qu.:-0.74333   1st Qu.:-0.25733   1st Qu.:-2.158   1st Qu.:-0.8993
 Median :-0.04333   Median :-0.05733   Median : 0.592   Median : 0.1007
 Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000   Mean   : 0.0000
 3rd Qu.: 0.55667   3rd Qu.: 0.24267   3rd Qu.: 1.342   3rd Qu.: 0.6007
 Max.   : 2.05667   Max.   : 1.34267   Max.   : 3.142   Max.   : 1.3007
```

Combining the scale and center transforms will standardize your data. Attributes will have a mean value of 0 and a standard deviation of 1.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center", "scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
```

Notice how we can list multiple methods in a vector when defining the preProcess procedure in caret. Running the recipe, you will see:

```
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)
  - scaled (4)

 Sepal.Length       Sepal.Width       Petal.Length      Petal.Width
 Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422
 1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799
 Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000
 3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880
 Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064
```

Data values can be scaled into the range of [0, 1] which is called normalization.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
```

Running the recipe, you will see:

```
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Created from 150 samples and 4 variables

Pre-processing:
  - ignored (0)
  - re-scaling to [0, 1] (4)

 Sepal.Length     Sepal.Width      Petal.Length     Petal.Width
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000
 1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333
 Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000
 Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806
 3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
```

When an attribute has a Gaussian-like distribution but is shifted or squashed to one side, it is skewed. The distribution of an attribute can be transformed to reduce the skew and make it more Gaussian. The Box-Cox transform can perform this operation (it assumes all values are positive).

```r
# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("BoxCox"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)
```

Notice, we applied the transform to only two attributes that appear to have a skew. Running the recipe, you will see:

```
    pedigree           age
 Min.   :0.0780   Min.   :21.00
 1st Qu.:0.2437   1st Qu.:24.00
 Median :0.3725   Median :29.00
 Mean   :0.4719   Mean   :33.24
 3rd Qu.:0.6262   3rd Qu.:41.00
 Max.   :2.4200   Max.   :81.00

Created from 768 samples and 2 variables

Pre-processing:
  - Box-Cox transformation (2)
  - ignored (0)

Lambda estimates for Box-Cox transformation:
-0.1, -1.1

    pedigree            age
 Min.   :-2.5510   Min.   :0.8772
 1st Qu.:-1.4116   1st Qu.:0.8815
 Median :-0.9875   Median :0.8867
 Mean   :-0.9599   Mean   :0.8874
 3rd Qu.:-0.4680   3rd Qu.:0.8938
 Max.   : 0.8838   Max.   :0.9019
```

For more on this transform see the Box-Cox transform page on Wikipedia.

The Yeo-Johnson transform is another power transform like the Box-Cox transform, but it supports raw values that are equal to zero or negative.

```r
# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("YeoJohnson"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)
```

Running the recipe, you will see:

```
    pedigree           age
 Min.   :0.0780   Min.   :21.00
 1st Qu.:0.2437   1st Qu.:24.00
 Median :0.3725   Median :29.00
 Mean   :0.4719   Mean   :33.24
 3rd Qu.:0.6262   3rd Qu.:41.00
 Max.   :2.4200   Max.   :81.00

Created from 768 samples and 2 variables

Pre-processing:
  - ignored (0)
  - Yeo-Johnson transformation (2)

Lambda estimates for Yeo-Johnson transformation:
-2.25, -1.15

    pedigree           age
 Min.   :0.0691   Min.   :0.8450
 1st Qu.:0.1724   1st Qu.:0.8484
 Median :0.2265   Median :0.8524
 Mean   :0.2317   Mean   :0.8530
 3rd Qu.:0.2956   3rd Qu.:0.8580
 Max.   :0.4164   Max.   :0.8644
```

Transform the data to the principal components. The transform keeps components above the variance threshold (default=0.95) or the number of components can be specified (pcaComp). The result is attributes that are uncorrelated, useful for algorithms like linear and generalized linear regression.

```r
# load libraries
library(caret)
# load the dataset
data(iris)
# summarize dataset
summary(iris)
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris, method=c("center", "scale", "pca"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris)
# summarize the transformed dataset
summary(transformed)
```

Notice that when we run the recipe that only two principal components are selected.

```
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Created from 150 samples and 5 variables

Pre-processing:
  - centered (4)
  - ignored (1)
  - principal component signal extraction (4)
  - scaled (4)

PCA needed 2 components to capture 95 percent of the variance

       Species        PC1               PC2
 setosa    :50   Min.   :-2.7651   Min.   :-2.67732
 versicolor:50   1st Qu.:-2.0957   1st Qu.:-0.59205
 virginica :50   Median : 0.4169   Median :-0.01744
                 Mean   : 0.0000   Mean   : 0.00000
                 3rd Qu.: 1.3385   3rd Qu.: 0.59649
                 Max.   : 3.2996   Max.   : 2.64521
```

Transform the data to the independent components. Unlike PCA, ICA retains the components that are statistically independent. You must specify the number of desired independent components with the *n.comp* argument. Note that this transform requires the fastICA package to be installed. It can be useful for algorithms such as Naive Bayes.

# load libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize dataset
summary(PimaIndiansDiabetes[,1:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,1:8], method=c("center", "scale", "ica"), n.comp=5)
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,1:8])
# summarize the transformed dataset
summary(transformed)

Running the recipe, you will see:

    pregnant         glucose         pressure         triceps         insulin           mass          pedigree     
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0   Min.   : 0.00   Min.   :0.0780  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00   1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00   Median : 30.5   Median :32.00   Median :0.3725  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54   Mean   : 79.8   Mean   :31.99   Mean   :0.4719  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00   3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.0   Max.   :67.10   Max.   :2.4200  
      age       
 Min.   :21.00  
 1st Qu.:24.00  
 Median :29.00  
 Mean   :33.24  
 3rd Qu.:41.00  
 Max.   :81.00  

Created from 768 samples and 8 variables
Pre-processing:
  - centered (8)
  - independent component signal extraction (8)
  - ignored (0)
  - scaled (8)

ICA used 5 components

      ICA1              ICA2               ICA3              ICA4                ICA5        
 Min.   :-5.7213   Min.   :-4.89818   Min.   :-6.0289   Min.   :-2.573436   Min.   :-1.8815  
 1st Qu.:-0.4873   1st Qu.:-0.48188   1st Qu.:-0.4693   1st Qu.:-0.640601   1st Qu.:-0.8279  
 Median : 0.1813   Median : 0.05071   Median : 0.2987   Median : 0.007582   Median :-0.2416  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000  
 3rd Qu.: 0.6839   3rd Qu.: 0.56462   3rd Qu.: 0.6941   3rd Qu.: 0.638238   3rd Qu.: 0.7048  
 Max.   : 2.1819   Max.   : 4.25611   Max.   : 1.3726   Max.   : 3.761017   Max.   : 2.9622  

Below are some tips for getting the most out of data transforms.

- **Actually Use Them**. You are a step ahead if you are thinking about and using data transforms to prepare your data. It is an easy step to forget or skip over and often has a huge impact on the accuracy of your final models.
- **Use a Variety**. Try a number of different data transforms on your data with a suite of different machine learning algorithms.
- **Review a Summary**. It is a good idea to summarize your data before and after a transform to understand the effect it had. The *summary()* function can be very useful.
- **Visualize Data**. It is also a good idea to visualize the distribution of your data before and after to get a spatial intuition for the effect of the transform.
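The last two tips can be sketched with base R alone. The example below uses *scale()* to stand in for any of the caret pre-processing methods; the before/after variable names are illustrative only.

```r
# Review a summary and visualize a distribution before and after a
# transform. scale() stands in for any caret pre-processing method.
data(iris)
before <- iris$Sepal.Length
after <- as.vector(scale(before))  # center to mean 0, scale to sd 1

# review a summary before and after the transform
print(summary(before))
print(summary(after))

# visualize the distributions side by side
par(mfrow=c(1, 2))
hist(before, main="Before transform")
hist(after, main="After transform")
```

Centering and scaling do not change the shape of the distribution, only its location and spread, which the side-by-side histograms make obvious.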

In this section you discovered 8 data preprocessing methods that you can use on your data in R via the caret package:

- Data scaling
- Data centering
- Data standardization
- Data normalization
- The Box-Cox Transform
- The Yeo-Johnson Transform
- PCA Transform
- ICA Transform

You can practice with the recipes presented in this section or apply them on your current or next machine learning project.

Did you try out these recipes?

- Start your R interactive environment.
- Type or copy-paste the recipes above and try them out.
- Use the built-in help in R to learn more about the functions used.
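As a minimal sketch of the last step (the function name here is just an example; substitute any function from the recipes):

```r
# look up the built-in documentation for a function used in the recipes
h <- help("summary")            # at the interactive prompt: ?summary
print(length(h) >= 1)           # at least one matching help topic found
# search all installed help pages when you only know a keyword
res <- help.search("linear model")
```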

Do you have a question? Ask it in the comments and I will do my best to answer it.

The post Get Your Data Ready For Machine Learning in R with Pre-Processing appeared first on Machine Learning Mastery.

The post Spot Check Machine Learning Algorithms in R (algorithms to try on your next project) appeared first on Machine Learning Mastery.

But what algorithms should you spot check?

In this post you discover the 8 machine learning algorithms you should spot check on your data.

You also get recipes of each algorithm that you can copy and paste into your current or next machine learning project in R.

Let’s get started.

You cannot know which algorithm will work best on your dataset beforehand.

You must use trial and error to discover a short list of algorithms that do well on your problem that you can then double down on and tune further. I call this process spot checking.

The question is not:

What algorithm should I use on my dataset?

Instead it is:

What algorithms should I spot check on my dataset?

You can guess at what algorithms might do well on your dataset, and this can be a good starting point.

I recommend trying a mixture of algorithms and seeing what is good at picking out the structure in your data.

- Try a mixture of algorithm representations (e.g. instances and trees).
- Try a mixture of learning algorithms (e.g. different algorithms for learning the same type of representation).
- Try a mixture of modeling types (e.g. linear and non-linear functions or parametric and non-parametric).

Let’s get specific. In the next section, we will look at algorithms that you can use to spot check on your next machine learning project in R.

Take my free 14-day email course and discover how to use R on your project (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

There are hundreds of machine learning algorithms available in R.

I would recommend exploring many of them, especially if making accurate predictions on your dataset is important and you have the time.

Often you don’t have the time, so you need to know the few algorithms that you absolutely must test on your problem.

In this section you will discover the linear and nonlinear algorithms you should spot check on your problem in R. This excludes ensemble algorithms such as boosting and bagging, which can come later once you have a baseline.

Each algorithm will be presented from two perspectives:

- The package and function used to train and make predictions for the algorithm.
- The caret wrapper for the algorithm.

You need to know which package and function to use for a given algorithm. This is needed when:

- You are researching the algorithm parameters and how to get the most from the algorithm.
- You have discovered the best algorithm to use and need to prepare a final model.

You need to know how to use each algorithm with caret, so that you can efficiently evaluate the accuracy of the algorithm on unseen data using the preprocessing, algorithm evaluation and tuning capabilities of caret.

Two standard datasets are used to demonstrate the algorithms:

- **Boston Housing dataset** for regression (BostonHousing from the *mlbench* library).
- **Pima Indians Diabetes dataset** for classification (PimaIndiansDiabetes from the *mlbench* library).

Algorithms are presented in two groups:

- **Linear Algorithms**: simpler methods that have a strong bias but are fast to train.
- **Nonlinear Algorithms**: more complex methods that have a large variance but are often more accurate.

Each recipe presented in this section is complete and will produce a result, so that you can copy and paste it into your current or next machine learning project.

Let’s get started.

These are methods that make large assumptions about the form of the function being modeled. As such they have a high bias but are often fast to train.

The final models are also often easy (or easier) to interpret, making them desirable to deploy. If the results of a linear algorithm are suitably accurate, you may not need to move on to nonlinear methods at all.

The *lm()* function is in the *stats* library and creates a linear regression model using ordinary least squares.

# load the library
library(mlbench)
# load data
data(BostonHousing)
# fit model
fit <- lm(medv~., BostonHousing)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, BostonHousing)
# summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print(mse)

The lm implementation can be used in caret as follows:

# load libraries
library(caret)
library(mlbench)
# load dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.lm <- train(medv~., data=BostonHousing, method="lm", metric="RMSE", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.lm)

The glm function is in the stats library and creates a generalized linear model. It can be configured to perform a logistic regression suitable for binary classification problems.

# load the library
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- glm(diabetes~., data=PimaIndiansDiabetes, family=binomial(link='logit'))
# summarize the fit
print(fit)
# make predictions
probabilities <- predict(fit, PimaIndiansDiabetes[,1:8], type='response')
predictions <- ifelse(probabilities > 0.5, 'pos', 'neg')
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)

The glm algorithm can be used in caret as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.glm <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.glm)

The lda function is in the MASS library and creates a linear model of a classification problem.

# load the libraries
library(MASS)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- lda(diabetes~., data=PimaIndiansDiabetes)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8])$class
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)

The lda algorithm can be used in caret as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.lda)

The glmnet function is in the glmnet library and can be used for classification or regression.

Classification Example:

# load the libraries
library(glmnet)
library(mlbench)
# load data
data(PimaIndiansDiabetes)
x <- as.matrix(PimaIndiansDiabetes[,1:8])
y <- as.matrix(PimaIndiansDiabetes[,9])
# fit model
fit <- glmnet(x, y, family="binomial", alpha=0.5, lambda=0.001)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, x, type="class")
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)

Regression Example:

# load the libraries
library(glmnet)
library(mlbench)
# load data
data(BostonHousing)
BostonHousing$chas <- as.numeric(as.character(BostonHousing$chas))
x <- as.matrix(BostonHousing[,1:13])
y <- as.matrix(BostonHousing[,14])
# fit model
fit <- glmnet(x, y, family="gaussian", alpha=0.5, lambda=0.001)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, x, type="link")
# summarize accuracy
mse <- mean((y - predictions)^2)
print(mse)

It can also be configured to perform three important types of regularization: lasso, ridge and elastic net, by setting the alpha parameter to 1, 0 or a value in between, respectively.
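As a sketch of the effect of alpha (assuming the glmnet and mlbench packages are installed; the variable names are illustrative):

```r
# fit the three regularization types by varying alpha
library(glmnet)
library(mlbench)
data(PimaIndiansDiabetes)
x <- as.matrix(PimaIndiansDiabetes[,1:8])
y <- PimaIndiansDiabetes$diabetes
lasso   <- glmnet(x, y, family="binomial", alpha=1)    # L1 penalty only
ridge   <- glmnet(x, y, family="binomial", alpha=0)    # L2 penalty only
elastic <- glmnet(x, y, family="binomial", alpha=0.5)  # mix of L1 and L2
# the lasso drives some coefficients exactly to zero along its path
print(coef(lasso, s=0.1))
```

The lasso is useful when you suspect only a few attributes are predictive, because it eliminates attributes; ridge shrinks all coefficients without eliminating any.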

The glmnet implementation can be used in caret for classification as follows:

# load libraries
library(caret)
library(mlbench)
library(glmnet)
# load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.glmnet <- train(diabetes~., data=PimaIndiansDiabetes, method="glmnet", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.glmnet)

The glmnet implementation can be used in caret for regression as follows:

# load libraries
library(caret)
library(mlbench)
library(glmnet)
# load the dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.glmnet <- train(medv~., data=BostonHousing, method="glmnet", metric="RMSE", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.glmnet)

These are machine learning algorithms that make fewer assumptions about the function being modeled. As such, they have a higher variance but often result in higher accuracy. Their increased flexibility can also make them slower to train or increase their memory requirements.

The knn3 function is in the caret library and does not create a model; rather, it makes predictions from the training set directly. It can be used for classification or regression.

Classification Example:

# knn direct classification
# load the libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- knn3(diabetes~., data=PimaIndiansDiabetes, k=3)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="class")
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)

Regression Example:

# load the libraries
library(caret)
library(mlbench)
# load data
data(BostonHousing)
BostonHousing$chas <- as.numeric(as.character(BostonHousing$chas))
x <- as.matrix(BostonHousing[,1:13])
y <- as.matrix(BostonHousing[,14])
# fit model
fit <- knnreg(x, y, k=3)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, x)
# summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print(mse)

The knn implementation can be used within the caret train() function for classification as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.knn <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.knn)

The knn implementation can be used within the caret train() function for regression as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.knn <- train(medv~., data=BostonHousing, method="knn", metric="RMSE", preProc=c("center", "scale"), trControl=control)
# summarize fit
print(fit.knn)

The naiveBayes function is in the e1071 library and models the probabilistic relationship of each attribute to the outcome variable independently. It can be used for classification problems.

# load the libraries
library(e1071)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- naiveBayes(diabetes~., data=PimaIndiansDiabetes)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8])
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)

A very similar naive bayes implementation (NaiveBayes from the klaR library) can be used with caret as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.nb <- train(diabetes~., data=PimaIndiansDiabetes, method="nb", metric="Accuracy", trControl=control)
# summarize fit
print(fit.nb)

The ksvm function is in the kernlab package and can be used for classification or regression. It is a wrapper for the LIBSVM library and provides a suite of kernel types and configuration options.

These examples use a Radial Basis Function kernel.

Classification Example:

# load the libraries
library(kernlab)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- ksvm(diabetes~., data=PimaIndiansDiabetes, kernel="rbfdot")
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="response")
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)

Regression Example:

# load the libraries
library(kernlab)
library(mlbench)
# load data
data(BostonHousing)
# fit model
fit <- ksvm(medv~., BostonHousing, kernel="rbfdot")
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, BostonHousing)
# summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print(mse)

The SVM with Radial Basis kernel implementation can be used with caret for classification as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.svmRadial <- train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", metric="Accuracy", trControl=control)
# summarize fit
print(fit.svmRadial)

The SVM with Radial Basis kernel implementation can be used with caret for regression as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.svmRadial <- train(medv~., data=BostonHousing, method="svmRadial", metric="RMSE", trControl=control)
# summarize fit
print(fit.svmRadial)

The rpart function in the rpart library provides an implementation of CART for classification and regression.

Classification Example:

# load the libraries
library(rpart)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# fit model
fit <- rpart(diabetes~., data=PimaIndiansDiabetes)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="class")
# summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)

Regression Example:

# load the libraries
library(rpart)
library(mlbench)
# load data
data(BostonHousing)
# fit model
fit <- rpart(medv~., data=BostonHousing, control=rpart.control(minsplit=5))
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, BostonHousing[,1:13])
# summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print(mse)

The rpart implementation can be used with caret for classification as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.rpart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", metric="Accuracy", trControl=control)
# summarize fit
print(fit.rpart)

The rpart implementation can be used with caret for regression as follows:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(BostonHousing)
# train
set.seed(7)
control <- trainControl(method="cv", number=2)
fit.rpart <- train(medv~., data=BostonHousing, method="rpart", metric="RMSE", trControl=control)
# summarize fit
print(fit.rpart)

There are many other algorithms provided by R and available in caret.

I would advise you to explore them and add more algorithms to your own short list of must try algorithms on your next machine learning project.

You can find a mapping of machine learning functions and packages to their name in the caret package on this page:

This page is useful if you are using an algorithm in caret and want to know which package it belongs to so that you can read up on the parameters and get more out of it.

This page is also useful if you are using a machine learning algorithm directly in R and want to know how it can be used in caret.

In this post you discovered a diverse set of 8 algorithms that you can use to spot check on your datasets. Specifically:

- Linear Regression
- Logistic Regression
- Linear Discriminant Analysis
- Regularized Regression
- k-Nearest Neighbors
- Naive Bayes
- Support Vector Machine
- Classification and Regression Trees

You learned which packages and functions to use for each algorithm. You also learned how you can use each algorithm with the caret package that provides algorithm evaluation and tuning capabilities.

You can use these algorithms as a template for spot checking on your current or next machine learning project in R.
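Once you have spot checked several algorithms with caret, their cross validation results can be compared directly. The sketch below fits two of the models from above and collects their results with the resamples() function (assuming caret, mlbench and the underlying model packages are installed):

```r
# compare two spot-checked models on the same cross validation folds
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
control <- trainControl(method="cv", number=5)
# fix the seed before each fit so both models see the same folds
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda",
                 metric="Accuracy", trControl=control)
set.seed(7)
fit.cart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart",
                  metric="Accuracy", trControl=control)
# collect and summarize the cross validation results
results <- resamples(list(LDA=fit.lda, CART=fit.cart))
summary(results)
```

Because the seed is reset before each call to train(), both algorithms are evaluated on identical folds, which makes the accuracy distributions directly comparable.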

Did you try out these recipes?

- Start your R interactive environment.
- Type or copy-paste the recipes above and try them out.
- Use the built-in help in R to learn more about the functions used.

Do you have a question? Ask it in the comments and I will do my best to answer it.

The post Spot Check Machine Learning Algorithms in R (algorithms to try on your next project) appeared first on Machine Learning Mastery.

The post Machine Learning Datasets in R (10 datasets you can use right now) appeared first on Machine Learning Mastery.

In this short post you will discover how you can load standard classification and regression datasets in R.

This post will show you 3 R libraries that you can use to load standard datasets and 10 specific datasets that you can use for machine learning in R.

It is invaluable to load standard datasets in R so that you can test, practice and experiment with machine learning techniques and improve your skill with the platform.

Let’s get started.

There are hundreds of standard test datasets that you can use to practice and get better at machine learning.

Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, they are well behaved and they are small.

This last point is critical when practicing machine learning because:

- You can download them fast.
- You can fit them into memory easily.
- You can run algorithms on them quickly.

Learn more about practicing machine learning using datasets from the UCI Machine Learning Repository in the post:

You can load the standard datasets into R as CSV files.
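As a sketch of the CSV approach using only base R (the small file written here is a stand-in for a dataset you have downloaded from the UCI Machine Learning Repository):

```r
# write a tiny stand-in CSV, then load it the same way you would a download
csv <- tempfile(fileext=".csv")
writeLines(c("sepal_length,sepal_width,species",
             "5.1,3.5,setosa",
             "4.9,3.0,setosa"), csv)
# header=TRUE treats the first row as column names
dataset <- read.csv(csv, header=TRUE)
print(dim(dataset))
```

Note that many UCI files ship without a header row, in which case you pass header=FALSE and supply the column names yourself.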

There is a more convenient approach to loading a standard dataset. Standard datasets have been packaged and are available in third-party R libraries that you can download from the Comprehensive R Archive Network (CRAN).

Which libraries should you use, and which datasets are good to start with?

Take my free 14-day email course and discover how to use R on your project (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this section you will discover the libraries that you can use to get access to standard machine learning datasets.

You will also discover specific classification and regression datasets that you can load and use to practice machine learning in R.

The datasets library comes with base R which means you do not need to explicitly load the library. It includes a large number of datasets that you can use.

You can load a dataset from this library by typing:

data(DataSetName)

For example, to load the very commonly used iris dataset:

data(iris)

To see a list of the datasets available in this library, you can type:

# list all datasets in the package
library(help = "datasets")

Some highlight datasets from this package that you could use are below.

- Description: Predict iris flower species from flower measurements.
- Type: Multi-class classification
- Dimensions: 150 instances, 5 attributes
- Inputs: Numeric
- Output: Categorical, 3 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary

# iris flowers datasets
data(iris)
dim(iris)
levels(iris$Species)
head(iris)

You will see:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

- Description: Predict number of people employed from economic variables
- Type: Regression
- Dimensions: 16 instances, 7 attributes
- Inputs: Numeric
- Output: Numeric

# Longley's Economic Regression Data
data(longley)
dim(longley)
head(longley)

You will see:

     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
1952         98.1 346.999      193.2        359.4    113.270 1952   63.639

Direct from the manual for the library:

A collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository.

You can learn more about the *mlbench* library on the mlbench CRAN page.

If not installed, you can install this library as follows:

install.packages("mlbench")

You can load the library as follows:

# load the library
library(mlbench)

To see a list of the datasets available in this library, you can type:

# list the contents of the library
library(help = "mlbench")

Some highlight datasets from this library that you could use are:

- Description: Predict the house price in Boston from house details
- Type: Regression
- Dimensions: 506 instances, 14 attributes
- Inputs: Numeric
- Output: Numeric
- UCI Machine Learning Repository: Description

# Boston Housing Data
data(BostonHousing)
dim(BostonHousing)
head(BostonHousing)

You will see:

     crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

- Description: Predict whether a cancer is malignant or benign from biopsy details.
- Type: Binary Classification

- Dimensions: 699 instances, 11 attributes
- Inputs: Integer (Nominal)
- Output: Categorical, 2 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary

# Wisconsin Breast Cancer Database
data(BreastCancer)
dim(BreastCancer)
levels(BreastCancer$Class)
head(BreastCancer)

You will see:

       Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
1 1000025            5         1          1             1            2           1           3               1       1    benign
2 1002945            5         4          4             5            7          10           3               2       1    benign
3 1015425            3         1          1             1            2           2           3               1       1    benign
4 1016277            6         8          8             1            3           4           3               7       1    benign
5 1017023            4         1          1             3            2           1           3               1       1    benign
6 1017122            8        10         10             8            7          10           9               7       1 malignant

- Description: Predict the glass type from chemical properties.
- Type: Classification
- Dimensions: 214 instances, 10 attributes
- Inputs: Numeric
- Output: Categorical, 7 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary

# Glass Identification Database
data(Glass)
dim(Glass)
levels(Glass$Type)
head(Glass)

You will see:

       RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

- Description: Predict high-energy structures in the atmosphere from antenna data.
- Type: Classification
- Dimensions: 351 instances, 35 attributes
- Inputs: Numeric
- Output: Categorical, 2 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary

# Johns Hopkins University Ionosphere database
data(Ionosphere)
dim(Ionosphere)
levels(Ionosphere$Class)
head(Ionosphere)

You will see:

  V1 V2      V3       V4      V5       V6       V7       V8      V9     V10     V11      V12     V13      V14     V15      V16     V17      V18     V19
1  1  0 0.99539 -0.05889 0.85243  0.02306 0.83398 -0.37708 1.00000 0.03760 0.85243 -0.17755 0.59755 -0.44945 0.60536 -0.38223 0.84356 -0.38542 0.58212
2  1  0 1.00000 -0.18829 0.93035 -0.36156 -0.10868 -0.93597 1.00000 -0.04549 0.50874 -0.67743 0.34432 -0.69707 -0.51685 -0.97515 0.05499 -0.62237 0.33109
3  1  0 1.00000 -0.03365 1.00000  0.00485 1.00000 -0.12062 0.88965 0.01198 0.73082 0.05346 0.85443 0.00827 0.54591 0.00299 0.83775 -0.13644 0.75535
4  1  0 1.00000 -0.45161 1.00000  1.00000 0.71216 -1.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 -1.00000 0.14516 0.54094 -0.39330 -1.00000
5  1  0 1.00000 -0.02401 0.94140  0.06531 0.92106 -0.23255 0.77152 -0.16399 0.52798 -0.20275 0.56409 -0.00712 0.34395 -0.27457 0.52940 -0.21780 0.45107
6  1  0 0.02337 -0.00592 -0.09924 -0.11949 -0.00763 -0.11824 0.14706 0.06637 0.03786 -0.06302 0.00000 0.00000 -0.04572 -0.15540 -0.00343 -0.10196 -0.11575
      V20      V21      V22      V23      V24      V25      V26      V27      V28      V29      V30      V31      V32      V33      V34 Class
1 -0.32192  0.56971 -0.29674  0.36946 -0.47357  0.56811 -0.51171  0.41078 -0.46168  0.21266 -0.34090  0.42267 -0.54487  0.18641 -0.45300  good
2 -1.00000 -0.13151 -0.45300 -0.18056 -0.35734 -0.20332 -0.26569 -0.20468 -0.18401 -0.19040 -0.11593 -0.16626 -0.06288 -0.13738 -0.02447   bad
3 -0.08540  0.70887 -0.27502  0.43385 -0.12062  0.57528 -0.40220  0.58984 -0.22145  0.43100 -0.17365  0.60436 -0.24180  0.56045 -0.38238  good
4 -0.54467 -0.69975  1.00000  0.00000  0.00000  1.00000  0.90695  0.51613  1.00000  1.00000 -0.20099  0.25682  1.00000 -0.32382  1.00000   bad
5 -0.17813  0.05982 -0.35575  0.02309 -0.52879  0.03286 -0.65158  0.13290 -0.53206  0.02431 -0.62197 -0.05707 -0.59573 -0.04608 -0.65697  good
6 -0.05414  0.01838  0.03669  0.01519  0.00888  0.03513 -0.01535 -0.03240  0.09223 -0.07859  0.00732  0.00000  0.00000 -0.00039  0.12011   bad

- Description: Predict the onset of diabetes in female Pima Indians from medical record data.
- Type: Binary Classification
- Dimensions: 768 instances, 9 attributes
- Inputs: Numeric
- Output: Categorical, 2 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary

# Pima Indians Diabetes Database
data(PimaIndiansDiabetes)
dim(PimaIndiansDiabetes)
levels(PimaIndiansDiabetes$diabetes)
head(PimaIndiansDiabetes)

You will see:

  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35       0 33.6    0.627  50      pos
2        1      85       66      29       0 26.6    0.351  31      neg
3        8     183       64       0       0 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74       0       0 25.6    0.201  30      neg

- Description: Predict metal or rock returns from sonar return data.
- Type: Binary Classification
- Dimensions: 208 instances, 61 attributes
- Inputs: Numeric
- Output: Categorical, 2 class labels
- UCI Machine Learning Repository: Description
- Published accuracy results: Summary

# Sonar, Mines vs. Rocks
data(Sonar)
dim(Sonar)
levels(Sonar$Class)
head(Sonar)

You will see:

      V1     V2     V3     V4     V5     V6     V7     V8     V9    V10    V11    V12    V13    V14    V15    V16    V17    V18    V19    V20    V21    V22
1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783 0.5071
2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212 0.4052
3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974 0.6737
4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741 0.3690
5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148 0.4292
6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074 0.3985
     V23    V24    V25    V26    V27    V28    V29    V30    V31    V32    V33    V34    V35    V36    V37    V38    V39    V40    V41    V42    V43    V44
1 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825 0.4256
2 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401 0.1628
3 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129 0.2222
4 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210 0.3202
5 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847 0.0841
6 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723 0.1238
     V45    V46    V47    V48    V49    V50    V51    V52    V53    V54    V55    V56    V57    V58    V59    V60 Class
1 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032     R
2 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044     R
3 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078     R
4 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117     R
5 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094     R
6 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062     R

- Description: Predict problems with soybean crops from crop data.
- Type: Multi-Class Classification
- Dimensions: 683 instances, 35 attributes
- Inputs: Integer (Nominal)
- Output: Categorical, 19 class labels
- UCI Machine Learning Repository: Description

# Soybean Database
data(Soybean)
dim(Soybean)
levels(Soybean$Class)
head(Soybean)

You will see:

                  Class date plant.stand precip temp hail crop.hist area.dam ...
1 diaporthe-stem-canker    6           0      2    1    0         1        1 ...
2 diaporthe-stem-canker    4           0      2    1    0         2        0 ...
3 diaporthe-stem-canker    3           0      2    1    0         1        0 ...
4 diaporthe-stem-canker    3           0      2    1    0         1        0 ...
5 diaporthe-stem-canker    6           0      2    1    0         1        0 ...
6 diaporthe-stem-canker    5           0      2    1    0         3        0 ...

(output truncated: the data frame has 35 input attributes plus the Class column)
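The Soybean data contains missing values, so before modeling it is worth checking how many each attribute has. A minimal sketch using base R (it assumes the mlbench library is installed and loaded, as above):

```r
# load the Soybean data from the mlbench library
library(mlbench)
data(Soybean)
# count the missing values in each column
na_counts <- sapply(Soybean, function(col) sum(is.na(col)))
# show only the columns that actually contain missing values
print(na_counts[na_counts > 0])
```

The same `sapply(..., sum(is.na(...)))` pattern works on any data frame, which makes it a handy first check for the other datasets in this post too.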

Many books that use R also include their own R library that provides all of the code and datasets used in the book.

The excellent book Applied Predictive Modeling has its own library called AppliedPredictiveModeling.

If not installed, you can install this library as follows:

install.packages("AppliedPredictiveModeling")

You can load the library as follows:

# load the library
library(AppliedPredictiveModeling)

To see a list of the datasets available in this library, you can type:

# list the contents of the library
library(help = "AppliedPredictiveModeling")

One highlight dataset from this library that you could use is:

- Description: Predict abalone age from abalone measurement data.
- Type: Regression or Classification
- Dimensions: 4177 instances, 9 attributes
- Inputs: Numerical and categorical
- Output: Integer
- UCI Machine Learning Repository: Description

# Abalone Data
data(abalone)
dim(abalone)
head(abalone)

You will see:

  Type LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Rings
1    M        0.455    0.365  0.095      0.5140        0.2245        0.1010       0.150    15
2    M        0.350    0.265  0.090      0.2255        0.0995        0.0485       0.070     7
3    F        0.530    0.420  0.135      0.6770        0.2565        0.1415       0.210     9
4    M        0.440    0.365  0.125      0.5160        0.2155        0.1140       0.155    10
5    I        0.330    0.255  0.080      0.2050        0.0895        0.0395       0.055     7
6    I        0.425    0.300  0.095      0.3515        0.1410        0.0775       0.120     8
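Because the output (Rings, a proxy for age) is an integer, abalone can be framed as either regression or classification. A sketch of the classification framing using base R's cut() function; note that the cut points (8 and 11) and group labels here are illustrative choices, not part of the original dataset:

```r
# frame the abalone problem as classification by binning the Rings output
library(AppliedPredictiveModeling)
data(abalone)
# bin rings into three hypothetical age groups (cut points are illustrative)
abalone$AgeGroup <- cut(abalone$Rings,
                        breaks = c(-Inf, 8, 11, Inf),
                        labels = c("young", "adult", "old"))
# show the class distribution of the new categorical output
table(abalone$AgeGroup)
```

With the AgeGroup column in place you can practice multi-class classification on the same data you would otherwise use for regression.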

In this post you discovered that you do not need to collect or load your own data in order to practice machine learning in R.

You learned about 3 different libraries that provide sample machine learning datasets that you can use:

- the *datasets* library
- the *mlbench* library
- the *AppliedPredictiveModeling* library

You also discovered 10 specific standard machine learning datasets that you can use to practice classification and regression machine learning techniques.

- Iris flowers datasets (multi-class classification)
- Longley’s Economic Regression Data (regression)
- Boston Housing Data (regression)
- Wisconsin Breast Cancer Database (binary classification)
- Glass Identification Database (multi-class classification)
- Johns Hopkins University Ionosphere database (binary classification)
- Pima Indians Diabetes Database (binary classification)
- Sonar, Mines vs. Rocks (binary classification)
- Soybean Database (multi-class classification)
- Abalone Data (regression or classification)

Did you try out these recipes?

- Start your R interactive environment.
- Type or copy-and-paste the recipes above and try them out.
- Use the built-in help in R to learn more about the functions used.
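The built-in help step above can look like this; the iris example is just one choice of dataset, and data() with a package argument lists every dataset a library ships:

```r
# open the help page for a dataset (works for any dataset in a loaded package)
?iris
# list the datasets a package ships, programmatically
datasets_info <- data(package = "datasets")
head(datasets_info$results[, "Item"])
```

The object returned by data() contains a results matrix with Package, Item and Title columns, so you can browse what is available without leaving the console.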

Do you have a question? Ask it in the comments and I will do my best to answer it.

The post Machine Learning Datasets in R (10 datasets you can use right now) appeared first on Machine Learning Mastery.
