Last Updated on August 22, 2019
There are a ton of packages for R. Which ones are best to use for your machine learning project?
In this post you will discover the exact R functions and packages recommended for each sub-task in a machine learning project.
This is useful. Bookmark this page. I’m sure you will be checking back time and again.
If you’re an R user and know a better way, share it in the comments and I will update the list.
Let’s get started.
What R Packages Should You Use?
There are more than 6,000 third-party packages for R. The vast number of packages available is one of the benefits of the R platform. It is also a source of frustration.
Which packages should you use?
There are specific tasks that you need to perform as part of a machine learning project. Tasks like loading data, evaluating algorithms and improving accuracy. You can use multiple techniques for each task and multiple packages may provide those techniques.
Given that there are so many different ways to complete a given sub-task, you need to discover the functions and packages that best meet your needs.
Map Best-Of-Breed Packages Onto Project Tasks
The way to solve this problem is to list all of the sub-tasks you are likely to work on during a machine learning project and then map the best-of-breed packages and functions onto each one.
Given that R is a statistical language, it provides a lot of tools that you can use for data analysis as well as predictive models that you can train and use to generate predictions.
Using your favorite search engine, you can locate the packages and functions that you can use to complete each task. This can be exhausting, and you can end up with many different candidate solutions.
You need to reduce each list of options down to one preferred way of completing a task. You could experiment with each option and see what works for you. You could also carefully review your search results and tease out the functions most popular with practitioners.
Next up is a mapping from R packages and functions to the tasks of a machine learning project that you can use to get started today using R for machine learning.
How To Use R For Machine Learning Projects
This section lists many of the main sub-tasks of a generic machine learning project. Each task lists the specific function and parent package that you can use in R to complete the task.
Some properties of the chosen functions are:
- Minimum: the list covers only the bare minimum of machine learning tasks in a project and gives only the function and package name for each. More homework is required to actually use each of the functions listed.
- Simple: functions were chosen for simplicity in delivering a direct result for the task. One function was preferred over multiple function calls.
- Preference: functions were chosen based on my preferences and best estimation. Other practitioners may have different alternatives (share in the comments!).
Tasks are organized into three broad groups:
- Data preparation tasks for getting data ready for modeling.
- Evaluate algorithm tasks for comparing and evaluating predictive modeling algorithms.
- Improve result tasks for getting more out of well-performing algorithms.
1. Data Preparation Tasks
Load a dataset from a file.
- CSV: read.csv function from the utils package.
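As a minimal sketch of loading a CSV file, the example below first writes a few rows of the built-in iris dataset to a temporary file so that it is self-contained. The path csv_path is a stand-in for your own file, and the header and stringsAsFactors settings are assumptions about how your file is laid out.

```r
# Write a small CSV to a temporary file so the example runs on its own.
# In a real project, csv_path would point at your own data file.
csv_path <- tempfile(fileext = ".csv")
write.csv(head(iris), csv_path, row.names = FALSE)

# Load the CSV into a data frame; text columns are read as factors here.
dataset <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

# Inspect the column types and the first few values.
str(dataset)
```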
Clean up a dataset to ensure that the data is reasonable, consistent and ready for analysis and modeling.
- Imputing: impute from the Hmisc package.
- Outliers: various functions from the outliers package.
- Rebalance: SMOTE function from the DMwR package.
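To illustrate the imputing step, here is a small sketch that fills missing values with the median using impute from Hmisc. It assumes the Hmisc package is installed; the vector x is made-up example data, and the median is just one possible choice of fill value.

```r
library(Hmisc)

# A made-up numeric vector with two missing values.
x <- c(5.1, NA, 4.7, 6.3, NA, 5.8)

# Replace each NA with the median of the observed values.
x_imputed <- impute(x, fun = median)

# Imputed entries are flagged when the result is printed.
print(x_imputed)

# Convert back to a plain numeric vector for modeling.
x_clean <- as.numeric(x_imputed)
```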
Summarize a dataset using descriptive statistics.
- Summarize Distributions: summary function from the base package.
- Summarize Correlations: cor function from the stats package.
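A quick sketch of both summaries, using the built-in iris dataset as a stand-in for your own data. Note that cor only accepts numeric columns, so the class column is left out.

```r
# Load a built-in dataset to stand in for your own data.
data(iris)

# Per-column distribution summary: min, quartiles, mean, max, class counts.
print(summary(iris))

# Pairwise Pearson correlations between the four numeric attributes.
correlations <- cor(iris[, 1:4])
print(round(correlations, 2))
```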
Summarize a dataset visually.
- Scatterplot Matrix: pairs function from the graphics package.
- Histogram: hist function from the graphics package.
- Density Plot: densityplot function from the lattice package.
- Box and Whisker Plot: boxplot function from the graphics package.
- Combined Plot: ggpairs function from the GGally package, which can do it all on one plot.
- The ggplot2 and lattice packages in general are excellent for plotting.
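The base and lattice plots listed above can be sketched as follows, again using iris as a stand-in dataset. The pdf(NULL) and dev.off() calls are only there so the script runs without opening a window or writing a file; in an interactive session you can drop them and the plots will appear on screen.

```r
library(lattice)

# Send graphics to a null device so the script runs non-interactively.
pdf(NULL)

data(iris)

# Scatterplot matrix of the four numeric attributes.
pairs(iris[, 1:4])

# Histogram of a single attribute.
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")

# Box and whisker plot of one attribute, split by class.
boxplot(Sepal.Length ~ Species, data = iris)

# Density plot from lattice, one curve per class.
density_plot <- densityplot(~ Sepal.Length, data = iris,
                            groups = Species, auto.key = TRUE)
print(density_plot)

# Close the graphics device.
invisible(dev.off())
```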
Select those features in the dataset that are most relevant for building a predictive model.
- RFE: rfe function from the caret package.
- Correlated: findCorrelation function from the caret package.
- Other: the FSelector package.
The caret package provides a suite of feature selection methods; see Evaluate Algorithm Tasks.
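As a rough sketch of correlation-based feature selection, the example below flags attributes that are highly correlated with others so they can be considered for removal. It assumes the caret package is installed; the 0.75 cutoff is an arbitrary choice for illustration, not a recommendation.

```r
library(caret)

# Numeric input attributes only; iris stands in for your own data.
features <- iris[, 1:4]
correlation_matrix <- cor(features)

# Column indices of attributes suggested for removal at this cutoff.
highly_correlated <- findCorrelation(correlation_matrix, cutoff = 0.75)

# Show the names of the flagged attributes.
print(names(features)[highly_correlated])
```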
Create transforms of the dataset to best expose the structure of the problem to the learning algorithms.
- Normalize: custom written function
- Standardize: scale function from the base package.
The caret package provides data transforms as part of the test harness; see the next section.
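One way the custom normalize function might look is sketched below, alongside scale for standardizing. The function name normalize is hypothetical, and this version rescales each column to the range 0 to 1; it does not guard against constant columns, where the range is zero.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
# Note: a constant vector would cause a division by zero here.
normalize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Numeric input attributes only; iris stands in for your own data.
features <- iris[, 1:4]

# Normalize: apply the helper to every column.
normalized <- as.data.frame(lapply(features, normalize))

# Standardize: center each column to mean 0 and scale to standard deviation 1.
standardized <- scale(features)

print(summary(normalized))
print(round(colMeans(standardized), 10))
```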
2. Evaluate Algorithm Tasks
Functions from the caret package should be used to evaluate models on your dataset.
The caret package supports various performance measures and test options such as data splits and cross validation. Pre-processing can also be configured as part of the test harness.
- Model Evaluation: train function from the caret package.
- Test Options: trainControl function from the caret package.
- Preprocessing Options: preProcess function from the caret package.
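The three pieces above fit together roughly as in the sketch below. It assumes the caret package is installed (along with MASS, which ships with R and supplies the lda model). The choice of 10-fold cross validation, linear discriminant analysis and center/scale pre-processing is purely illustrative, and the pre-processing is passed here through the preProcess argument of train.

```r
library(caret)

# Fix the random seed so the cross validation folds are reproducible.
set.seed(7)

# Test options: 10-fold cross validation.
control <- trainControl(method = "cv", number = 10)

# Model evaluation: fit and evaluate LDA on iris, with the inputs
# centered and scaled inside each resampling fold.
fit <- train(Species ~ ., data = iris,
             method = "lda",
             metric = "Accuracy",
             trControl = control,
             preProcess = c("center", "scale"))

# Report the estimated accuracy across the folds.
print(fit)
```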
Note that many modern predictive models (such as flavors of advanced decision trees) provide some form of feature selection, parameter tuning and ensembling built in.
The caret package provides access to all of the best-of-breed predictive modeling algorithms.
3. Improve Result Tasks
These techniques help you get the most out of well-performing models in the service of making accurate predictions.
The caret package provides algorithm tuning as part of the test harness and includes techniques such as random, grid and adaptive search.
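A grid search can be sketched as below. It assumes the caret package is installed (along with rpart, which ships with R). The model choice and the candidate values for the cp complexity parameter are arbitrary and only meant to show the mechanics.

```r
library(caret)

# Fix the random seed so the cross validation folds are reproducible.
set.seed(7)

# 5-fold cross validation with a grid search over parameter values.
control <- trainControl(method = "cv", number = 5, search = "grid")

# Candidate values for the rpart complexity parameter (illustrative only).
grid <- expand.grid(cp = c(0.001, 0.01, 0.05, 0.1))

# Evaluate a decision tree for each candidate value.
tuned <- train(Species ~ ., data = iris,
               method = "rpart",
               trControl = control,
               tuneGrid = grid)

# Show the accuracy for each candidate and the value caret selected.
print(tuned$results)
print(tuned$bestTune)
```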
Many modern predictive modeling algorithms provide ensembling built in. A suite of bagging and boosting functions is provided in the caret package.
- Blend: caretEnsemble from the caretEnsemble package.
- Stacking: caretStack from the caretEnsemble package.
- Bagging: bagging function from the ipred package.
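As a small sketch of the bagging option, the example below fits bagged decision trees with the ipred package, which is assumed to be installed. The number of bootstrap replications (25) is an arbitrary choice, and the predictions are made on the training data only to show the calls, not as an honest estimate of accuracy.

```r
library(ipred)

# Fix the random seed so the bootstrap samples are reproducible.
set.seed(7)

# Fit 25 bagged classification trees and compute the out-of-bag error.
bagged <- bagging(Species ~ ., data = iris, nbagg = 25, coob = TRUE)

# Out-of-bag estimate of the misclassification error.
print(bagged$err)

# Predict class labels and compare them to the known classes.
predictions <- predict(bagged, newdata = iris)
print(table(predictions, iris$Species))
```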
In this post you discovered that the best way to use R for machine learning is to map specific R functions and packages onto the tasks of a machine learning project.
You discovered the specific packages and functions that you can use for the most common tasks of a machine learning project, including links to further documentation.
Your Next Step
Get started using R for machine learning. Use the suggestions above on your current or next machine learning project.
Did I miss an important package? Did I miss a key task in a machine learning project? Leave a comment and let me know what I missed.
Do you have a question? Email me or leave a comment.