You should use R for machine learning.
R is one of the most powerful machine learning platforms and is used by the top data scientists in the world.
In this post you will learn why you should use R for machine learning.
Let’s get started.
Why You Should Care About R
R is used by the best data scientists in the world. In surveys on Kaggle (the competitive machine learning platform), R is by far the most used machine learning tool. When professional machine learning practitioners were surveyed in 2015, again the most popular machine learning tool was R.
R is powerful because of the breadth of techniques it offers. Almost any technique that you can think of for data analysis, visualization, sampling, supervised learning and model evaluation is available in R. The platform has more techniques than any other that you will come across.
R is state-of-the-art because it is used by academics. One of the reasons R has so many techniques is that academics who develop new algorithms often develop them in R and release them as R packages. This means that you can often get access to state-of-the-art algorithms in R before they reach other platforms. It also means that some algorithms are only available in R until someone ports them to other platforms.
R is free because it is open source software. You can download it right now for free and it runs on any workstation platform you are likely to use.
So What Is R?
R is a language, an interpreter and a platform.
R is a computer language. It can be difficult to learn, but it will feel familiar and you will pick it up quickly if you have used other scripting languages like Python, Ruby or Bash.
R is an interpreter. You can write scripts and save them as files. Like other scripting languages, you can then use the interpreter to run those scripts any time. R also provides a REPL (read-eval-print loop) environment where you can type in commands and see the output immediately.
R is also a platform. You can use it to create and display graphics, to save and load state and to interface with other systems. You can do all of your exploration and development in the REPL environment if you so wish.
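As a rough illustration of the platform side, the sketch below writes a plot to a file and then saves and restores an object. The file names and the measurements vector are made up for this example, and tempdir() is used only to keep it self-contained.

```r
# Sketch: create a graphic and save/load state with base R.
# File names below are hypothetical; tempdir() keeps the example tidy.
out_dir <- tempdir()
plot_path <- file.path(out_dir, "cars_plot.pdf")
state_path <- file.path(out_dir, "session_state.RData")

# Create a graphic: open a PDF device, draw, then close the device.
# (In an interactive session, plot(cars) alone would display the plot.)
pdf(plot_path)
plot(cars, main = "Speed vs. stopping distance")
invisible(dev.off())

# Save state: write an object to disk, remove it, then load it back.
measurements <- c(1.5, 2.5, 4.0)
save(measurements, file = state_path)
rm(measurements)
load(state_path)
print(measurements)
```

The same save() and load() pair works for several objects at once, which is one way to pick up an exploration where you left off.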
Want more? Check out my previous post, What is R?
Power Is In The Packages
The power of R is in the packages.
R itself is very simple. It provides built-in commands for basic statistics and data handling. The machine learning features of R that you will use come from third-party packages. Packages are plug-ins to the R platform. You can search for, download and install them within the R environment.
Because packages are created by third parties, their quality can vary. It is a good idea to search for the best-of-breed packages that provide a specific technique you want to use. Packages provide documentation in the form of help for each package function and often vignettes that demonstrate how to use the package.
Before you write a line of code, always search to see if there is a package that can do what you need.
You can search for packages on the Comprehensive R Archive Network or CRAN.
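The package workflow described above can be sketched as follows. The stats package is used here only because it ships with every R installation, so the example runs offline; for a real project you would substitute the name of the CRAN package you found.

```r
# Sketch: install (if needed), load and explore a package.
# "stats" is a stand-in chosen so the example needs no network access.
pkg <- "stats"

# Install from CRAN only when the package is not already available.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package so that its functions can be called by name.
library(pkg, character.only = TRUE)

# List the objects the package makes available and peek at a few.
exported <- ls(paste0("package:", pkg))
print(length(exported))
print(head(exported))

# Things to try in the interactive environment:
# help(package = "stats")   # index of the package documentation
# ?lm                       # help page for a single function
# vignette(package = "grid")  # list the vignettes a package provides
```

Checking with requireNamespace() before calling install.packages() avoids downloading a package again each time the code runs.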
How Do You Use R For Machine Learning?
The R platform is not suitable for all types of machine learning projects. The sweet spot is to use R for exploration and for building one-off models.
Interactive Environment for Exploration
The R interactive environment is very useful for exploring and learning how to use packages and functions. You should spend a lot of time in the interactive environment when you are just starting out.
The environment is also very good if you are exploring a new problem: not working through the problem systematically, but trying out what-if scenarios.
It is also great if you want to use a systematic process and come up with a prototype model very quickly without the full rigmarole.
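A short exploratory session might look something like the sketch below. It uses the iris dataset that ships with R and a plain linear model from base R; both are chosen here purely for illustration and are not a recommendation for any particular problem.

```r
# Sketch of a quick what-if session using a built-in dataset.
data(iris)

# Get a feel for the data: structure and per-column summaries.
str(iris)
print(summary(iris))

# What-if: does average petal length differ between species?
print(aggregate(Petal.Length ~ Species, data = iris, FUN = mean))

# Quick prototype: predict petal width from petal length.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
print(coef(fit))

# Proportion of variance explained by the prototype model.
r_squared <- summary(fit)$r.squared
print(r_squared)
```

In the interactive environment the print() calls are optional, because the result of each expression is displayed automatically.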
You can start the interactive environment on the command line by typing:
R
You can get help on any function by passing its name to the help function, for example for the mean function:
help(mean)
You can close the interactive environment by calling the quit function:
quit()
Use Scripts for One-Off Models
If you have a machine learning project, I recommend that you develop scripts.
Each task in your project can be described in its own script, which can be documented, updated and tracked in revision control.
R scripts can be run from the command line, called from shell scripts and (my personal favorite) called from targets in a Makefile.
For example, here is how you can call the R executable from the command line, a shell script or a Makefile to run your script file:
R CMD BATCH your_script.R your_script.log
This runs the script your_script.R using R in batch mode (non-interactively) and saves the output of the script in the file your_script.log.
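To make this concrete, here is a minimal sketch of what a one-off modeling script such as your_script.R might contain. The dataset (mtcars, built into R), the choice of a linear model, the 24-row training split and the output file name are all assumptions made for illustration.

```r
# your_script.R: hypothetical one-off modeling script (illustrative only).

# Fix the random seed so the train/test split is reproducible.
set.seed(7)

# Load a built-in dataset and split it into training and test rows.
data(mtcars)
train_idx <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Fit a simple model on the training rows.
fit <- lm(mpg ~ wt + hp, data = train)

# Evaluate on the held-out rows using root mean squared error.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))

# Anything printed here ends up in the log file under R CMD BATCH.
print(summary(fit))
cat("Test RMSE:", rmse, "\n")

# Save the fitted model so a later script can reuse it.
model_path <- file.path(tempdir(), "mpg_model.rds")
saveRDS(fit, model_path)
```

In a real project you would likely save the model next to the script rather than in tempdir(), so that it persists after the R process exits.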
Not For Production
R is probably not the best solution for building a production model.
The techniques may be state-of-the-art, but the packages that implement them may not follow good software engineering principles, may lack tests, or may not scale to the size of datasets that you need to work with.
That being said, R may be the best solution to discover what model to actually use in production.
The landscape is changing, however: people are writing R scripts that run operationally, and services are emerging to support larger datasets.
General Tips When Using R
Below are tips for making the most of R for machine learning.
- Stick with basic R. Don’t write functions and serious code until you are comfortable with the environment. Stick to calling functions in packages.
- Learn from help and vignettes. Packages come with help in the form of documentation for each function and vignettes that give you usage information. If in doubt, search for the package in your favorite search engine to find the home page of the package on CRAN. Running examples from vignettes can teach you a lot about the expected usage of a function.
- Tabular data. Because R was built by statisticians for statisticians, it is suited for tabular data, e.g. a matrix of data as you would see in a spreadsheet.
- Small data. R is more suited to smaller datasets, e.g. tens or hundreds of thousands of rows, but not millions.
- Don’t program. Focus on packages and functions and how to use them well. I do not recommend learning “how to program in R” unless you want to create your own packages.
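As a small illustration of the tabular data and small data tips, the sketch below builds a data frame (the usual R structure for spreadsheet-like data) and checks how much memory it uses. The column names and values are invented for this example.

```r
# Sketch: tabular data in R lives in a data frame (rows x named columns).
# The values below are made up purely for illustration.
scores <- data.frame(
  student = c("a", "b", "c"),
  hours = c(2, 5, 8),
  passed = c(FALSE, TRUE, TRUE)
)

# Inspect the structure: one line per column with its type.
str(scores)

# Rows and columns, as you would count them in a spreadsheet.
print(dim(scores))

# Columns are accessed by name; rows can be filtered with a condition.
print(mean(scores$hours))
print(scores[scores$passed, ])

# Check the in-memory size before working with a larger dataset.
print(format(object.size(scores), units = "Kb"))
```

Calling object.size() on a sample of a larger dataset gives a rough idea of whether the full dataset will fit comfortably in memory.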
You Can Use R For Machine Learning
You do not need to be a good programmer. Getting good at using R is not about being a good programmer; it is about knowing which packages to use and how to use them well. Read up on the packages and practice using them. Don’t spend your time studying how to program well in R; for this purpose it is a waste of time.
You do not need to be a machine learning expert. There are hundreds of machine learning packages and thousands of techniques that you can use. Take your time, read the documentation and practice.
In this post you discovered that you should use R for machine learning.
It is one of the most widely used platforms for machine learning by professionals and the best data scientists in the world.
You discovered the sweet spot for R:
- Using R for exploration and prototyping in the interactive environment.
- Using R to develop one-off models by writing scripts.
Your Next Step
Do you want to use R for machine learning?
Get started right now!