How To Use R For Machine Learning

By Jason Brownlee on August 22, 2019 in R Machine Learning

There are a ton of packages for R. Which ones are best to use for your machine learning project?

In this post you will discover the exact R functions and packages recommended for each sub-task in a machine learning project.

This is useful. Bookmark this page. I’m sure you will be checking back time and again.

If you’re an R user and know a better way, share it in the comments and I will update the list.

Kick-start your project with my new book Machine Learning Mastery With R, including step-by-step tutorials and the R source code files for all examples.

Let’s get started.

[Image: How To Use R For Machine Learning. Photo by Neil Cummings, some rights reserved.]

What R Packages Should You Use?

There are more than 6,000 third-party packages for R. The vast number of packages available is one of the benefits of the R platform. It is also a source of frustration.

Which packages should you use?

There are specific tasks that you need to perform as part of a machine learning project, such as loading data, evaluating algorithms and improving accuracy. You can use multiple techniques for each task, and multiple packages may provide those techniques.

Given that there are so many different ways to complete a given sub-task, you need to discover the functions and packages that best meet your needs.

Map Best-Of-Breed Packages Onto Project Tasks

The way to solve this problem is to create a mapping of all of the sub-tasks you are likely to work on during a machine learning project and find the best-of-breed packages and functions that you can use.

You start by listing all of the sub-tasks in a machine learning project. You can take a close look at the process of applied machine learning and the machine learning project checklist.

Given that R is a statistical language, it provides a lot of tools that you can use for data analysis as well as predictive models that you can train and use to generate predictions.

Using your favorite search engine, you can locate the packages, and the functions within them, that you can use to complete each task. This can be exhausting, and you can end up with many different candidate solutions.

You need to reduce each list of options down to the one preferred way of completing a task. You could experiment with each and see what works for you. You could also carefully review your search results and tease out the most popular functions used by practitioners.

Next up is a mapping from R packages and functions to the tasks of a machine learning project that you can use to get started today using R for machine learning.

How To Use R For Machine Learning Projects

This section lists many of the main sub-tasks of a generic machine learning project. Each task lists the specific function and parent package that you can use in R to complete the task.

Some properties of the chosen functions are:

Minimum: the list is a bare minimum, both in the machine learning tasks it covers and in giving only the function and package name for each task. More homework is required to actually use each of the functions listed.
Simple: functions were chosen for simplicity in delivering a direct result for the task. One function was preferred over multiple function calls.
Preference: functions were chosen based on my preferences and best estimation. Other practitioners may have different alternatives (share in the comments!).

Tasks are organized into three broad groups:

Data preparation tasks for getting data ready for modeling.
Evaluate algorithm tasks for comparing and evaluating predictive modeling algorithms.
Improve results tasks for getting more out of well performing algorithms.

1. Data Preparation Tasks

Data Loading

Load a dataset from your file.

CSV: read.csv function from the utils package
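
As a minimal sketch of the loading step, the snippet below writes a tiny CSV file to a temporary location and reads it back with read.csv. The column names and values are invented for illustration; in a real project you would pass the path to your own file.

```r
# Create a small CSV file so the example is self-contained.
# The column names and values here are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("height,weight,label",
             "1.70,65,a",
             "1.82,80,b",
             "1.65,58,a"), csv_path)

# read.csv lives in the utils package, which is attached by default.
dataset <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

# Check the dimensions and column types of what was loaded.
print(dim(dataset))
str(dataset)
```

Setting stringsAsFactors explicitly avoids surprises, because its default changed in R 4.0.0.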

Data Cleaning

Clean up a dataset to ensure that the data is reasonable, consistent and ready for analysis and modeling.

Imputing: impute from the Hmisc package.
Outliers: various functions from the outliers package.
Rebalance: SMOTE function from the DMwR package.
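
The imputing step might look something like the sketch below, which assumes the Hmisc package is installed. The vector is made up, and using the median as the fill value is just one possible choice. Outlier handling and rebalancing are not shown.

```r
library(Hmisc)  # assumes the Hmisc package is installed

# A made-up numeric vector with two missing values.
x <- c(2.1, NA, 3.5, 4.0, NA, 5.2)

# impute() fills the missing entries; fun = median uses the median
# of the observed (non-missing) values as the fill value.
x_imputed <- impute(x, fun = median)

# Show the result and confirm that no missing values remain.
print(x_imputed)
print(sum(is.na(as.numeric(x_imputed))))
```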

Data Summary

Summarize a dataset using descriptive statistics.

Summarize Distributions: summary function from the base package.
Summarize Correlations: cor function from the stats package.
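
Both summary functions can be tried on the iris dataset that ships with R. This is only a sketch. The choice of iris and the rounding to two decimal places are illustrative, not part of the recommendation above.

```r
# iris is a small example dataset bundled with R.
data(iris)

# Per-column descriptive statistics (min, quartiles, mean, max, factor counts).
print(summary(iris))

# Pairwise Pearson correlations between the four numeric columns.
correlations <- cor(iris[, 1:4])
print(round(correlations, 2))
```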

Data Visualization

Summarize a dataset visually.

Scatterplot Matrix: pairs function from the graphics package.
Histogram: hist function from the graphics package.
Density Plot: densityplot function from the lattice package.
Box and Whisker Plot: boxplot function from the graphics package

Honorable mentions:

ggpairs function from the GGally package which can do it all on one plot
ggplot2 and lattice packages in general are excellent for plotting
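
The four main plot types listed above can be sketched on the iris dataset as follows. Writing to a temporary PDF file is an assumption made here so that the script also runs without a display; interactively you would drop the pdf and dev.off calls.

```r
library(lattice)  # ships with standard R distributions

data(iris)

# Send plots to a throwaway PDF so the script runs without a display.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# Scatterplot matrix of the numeric columns, colored by class.
pairs(iris[, 1:4], col = iris$Species)

# Histogram of a single attribute.
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")

# Box and whisker plot of one attribute, split by class.
boxplot(Sepal.Length ~ Species, data = iris)

# Density plot by class; lattice plots must be printed explicitly in scripts.
print(densityplot(~ Sepal.Length, data = iris, groups = Species, auto.key = TRUE))

dev.off()
```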

Feature Selection

Select those features in the dataset that are most relevant for building a predictive model.

RFE: rfe function from the caret package
Correlated: findCorrelation function from the caret package

The caret package provides a suite of feature selection methods, see Evaluate Algorithm Tasks.

Honorable mentions:

FSelector package.
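
As a rough sketch of the correlation-based approach, the snippet below uses findCorrelation on iris, assuming the caret package is installed. The 0.9 cutoff is an arbitrary choice for illustration, and rfe is not shown.

```r
library(caret)  # assumes the caret package is installed

data(iris)
features <- iris[, 1:4]

# Correlation matrix of the numeric input attributes.
correlation_matrix <- cor(features)

# Column indices that findCorrelation suggests removing because they
# are highly correlated with another attribute (cutoff chosen arbitrarily).
high_cor_idx <- findCorrelation(correlation_matrix, cutoff = 0.9)
print(names(features)[high_cor_idx])

# Drop the flagged columns, guarding against the empty case, where
# negative indexing with integer(0) would drop every column.
reduced <- if (length(high_cor_idx) > 0) {
  features[, -high_cor_idx, drop = FALSE]
} else {
  features
}
print(names(reduced))
```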

Data Transformation

Create transforms of the dataset to best expose the structure of the problem to the learning algorithms.

Normalize: custom written function
Standardize: scale function from the base package.

The caret package provides data transforms as part of the test harness, see the next section.
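
Since normalization is left to a custom function, here is one possible version next to the scale function. The normalize helper is hypothetical (any min-max rescaling would do), and iris is used only as convenient sample data.

```r
# Hypothetical helper: min-max normalization to the range [0, 1].
normalize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column would divide by zero, so map it to zeros instead.
  if (rng[1] == rng[2]) return(x - rng[1])
  (x - rng[1]) / (rng[2] - rng[1])
}

data(iris)

# Normalize each numeric column independently.
normalized <- as.data.frame(lapply(iris[, 1:4], normalize))

# Standardize each column to zero mean and unit standard deviation.
standardized <- as.data.frame(scale(iris[, 1:4]))

print(summary(normalized))
print(round(colMeans(standardized), 10))
```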

2. Evaluate Algorithm Tasks

Functions from the caret package should be used to evaluate models on your dataset.

The caret package supports various performance measures and test options such as data splits and cross-validation. Pre-processing can also be configured as part of the test harness.

Model Evaluation

Model Evaluation: train function from the caret package.
Test Options: trainControl function from the caret package.
Preprocessing Options: preProcess function from the caret package.

Note that many modern predictive models (such as flavors of advanced decision trees) provide some form of feature selection, parameter tuning and ensembling built in.
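
The three caret pieces listed above fit together roughly as in the sketch below, which assumes the caret package is installed. The choice of a decision tree (method "rpart"), 10-fold cross-validation, the seed, and center/scale pre-processing are all illustrative assumptions.

```r
library(caret)  # assumes the caret package is installed

data(iris)
set.seed(7)  # arbitrary seed so the resampling is repeatable

# Test options: 10-fold cross-validation.
control <- trainControl(method = "cv", number = 10)

# Model evaluation: train a decision tree, with pre-processing
# (centering and scaling) applied inside each resampling fold.
fit <- train(Species ~ ., data = iris,
             method = "rpart",
             metric = "Accuracy",
             preProcess = c("center", "scale"),
             trControl = control)

# Cross-validated accuracy for each candidate tuning value.
print(fit)
```

Passing preProcess to train, rather than transforming the data up front, keeps the transform parameters from being estimated on the held-out folds.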

Predictive Models

The caret package provides a common interface to a large number of best-of-breed predictive modeling algorithms.

3. Improve Result Tasks

Techniques for getting the most out of well performing models in service of making accurate predictions.

Algorithm Tuning

The caret package provides algorithm tuning as part of the test harness and includes techniques such as random, grid and adaptive search.
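
A grid search can be sketched as below, assuming the caret package is installed. The k-nearest neighbors model, the candidate values of k, and the 5-fold setup are arbitrary choices for illustration. Random search is selected with search = "random" in trainControl and is not shown here.

```r
library(caret)  # assumes the caret package is installed

data(iris)
set.seed(7)  # arbitrary seed so the resampling is repeatable

# Test harness: 5-fold cross-validation with grid search.
control <- trainControl(method = "cv", number = 5, search = "grid")

# Candidate values for the single knn tuning parameter, k.
grid <- expand.grid(k = c(1, 3, 5, 7, 9))

# Evaluate every value in the grid and keep the best one.
fit_grid <- train(Species ~ ., data = iris,
                  method = "knn",
                  tuneGrid = grid,
                  trControl = control)

print(fit_grid$results)
print(fit_grid$bestTune)
```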

Model Ensembles

Many modern predictive modeling algorithms provide ensembling built in. A suite of bagging and boosting functions is provided in the caret package.

Blend: caretEnsemble from the caretEnsemble package.
Stacking: caretStack from the caretEnsemble package.
Bagging: bagging function from the ipred package.
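
Of the three options above, bagging is the simplest to sketch. The example assumes the ipred package is installed; the number of bagged trees and the seed are arbitrary, and blending and stacking with caretEnsemble are not shown.

```r
library(ipred)  # assumes the ipred package is installed

data(iris)
set.seed(7)  # arbitrary seed so the bootstrap samples are repeatable

# Fit 25 bagged classification trees; coob = TRUE also computes
# an out-of-bag estimate of the misclassification error.
fit_bag <- bagging(Species ~ ., data = iris, nbagg = 25, coob = TRUE)
print(fit_bag)

# Predict classes; this reuses the training data purely for illustration.
predictions <- predict(fit_bag, newdata = iris)
print(table(predictions, iris$Species))
```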

Summary

In this post you discovered that the best way to use R for machine learning is to map specific R functions and packages onto the tasks of a machine learning project.

You discovered the specific packages and functions that you can use for the most common tasks of a machine learning project, including links to further documentation.

Your Next Step

Get started using R for machine learning. Use the suggestions above on your current or next machine learning project.

Did I miss an important package? Did I miss a key task in a machine learning project? Leave a comment and let me know what I missed.

Do you have a question? Email me or leave a comment.

2 Responses to How To Use R For Machine Learning

Samuel Ninsiima November 25, 2019 at 5:45 pm #

This was a nice read. So you’ve trained and tested a ML algorithm model in R. How do you put it in production?

- Jason Brownlee November 26, 2019 at 5:59 am #
  
  Thanks, perhaps this will help:
  https://machinelearningmastery.com/finalize-machine-learning-models-in-r/
