R is perhaps one of the most powerful and most popular platforms for statistical programming and applied machine learning.
When you get serious about machine learning, you will find your way into R.
In this post, you will discover what R is, where it came from and some of its most important features.
Discover how to prepare data, fit machine learning models and evaluate their predictions in R with my new book, including 14 step-by-step tutorials, 3 projects, and full source code.
Let’s get started.
What is R?
R is an open source environment for statistical programming and visualization.
R is a number of things, which might be confusing at first.
- R is a computer language. It is an implementation of the S language with semantics influenced by Scheme (a dialect of Lisp), and you can write programs in it.
- R is an interpreter. It can parse and execute R scripts (programs) that are typed in directly or loaded from a file with a .R extension.
- R is a platform. It can create graphics to be displayed on the screen or saved to file. It can also prepare models that can be queried and updated.
You may want to write R scripts in files and run them in batch mode using the R interpreter to get results such as tables or graphics. You may want to open the R interpreter and type in commands to load data, explore and model it in an ad hoc manner.
There are graphical environments, but the simplest and most common usage of R is from the R console (like a REPL). If you are just starting out with R, I would recommend learning R on the console.
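As a rough illustration of the two ways of working described above, here is a minimal sketch of a short script. It uses only functions and the mtcars dataset that ship with base R. The file name explore_mtcars.R, the choice of variables and the output location are assumptions made for this example, not something prescribed by R.

```r
# explore_mtcars.R: a hypothetical file name, used only for illustration.
# Batch mode:   Rscript explore_mtcars.R
# Interactive:  type or paste the same lines into the R console.

# mtcars is a small example dataset that ships with R.
data(mtcars)

# Explore: print a summary table of fuel economy (miles per gallon).
print(summary(mtcars$mpg))

# Model: fit a linear regression of fuel economy on car weight.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Query the model: predict mpg for a hypothetical car with wt = 3 (3,000 lbs).
new_car <- data.frame(wt = 3)
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)

# Update the model: refit with horsepower added as a second predictor.
fit2 <- update(fit, . ~ . + hp)
print(coef(fit2))

# Graphics: save a scatter plot with the fitted line to a PDF file.
# (Assumption: the temporary directory is used so the sketch leaves no clutter.)
plot_file <- file.path(tempdir(), "mpg_vs_wt.pdf")
pdf(plot_file)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)
invisible(dev.off())
```

Run as a file, the script prints its tables and writes the plot without any interaction. Typed at the console, each line gives immediate feedback, which is why the console is a comfortable place to start.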
Where R Came From
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, as an implementation of the S programming language. Development started in 1993. A version was made available by FTP in 1995, released under the GNU GPL. The larger core group and open source project were set up in 1997.
It started as an experiment by the authors to implement a statistical test bed in Lisp using a syntax like that provided in S. As it developed, it took on more of the syntax and features of S, eventually surpassing it in capability and scope.
For an interesting and detailed treatment of the history of R, check out the technical report R: Past and Future History (PDF).
The Key Features of R
R is a tool to use when you need to analyze data, plot data or build a statistical model for data. It is ideal for one-off analyses, prototyping and academic work, but it is not suited to building models to be deployed in scalable or operational environments.
Benefits of R
There are three key benefits to R:
- Open Source: R is free and open source. You can download it right now and start using it at no cost. You can read the source code, learn from it and modify it to meet your needs. Simply amazing.
- Packages: R is popular because it has a vast number of very powerful algorithms implemented as third-party libraries called packages. It is common for academics in statistical fields to release their methods as R packages, meaning that you have direct access to some state-of-the-art methods.
- Maturity: R is inspired by the proprietary statistical language S, using and improving the idioms and metaphors useful for statistical computing, like working in matrices, vectors and data frames.
For more information about R packages, check out CRAN (the Comprehensive R Archive Network) and browse by package or by task view. The Machine Learning & Statistical Learning view, which lists packages for machine learning, will be of great interest.
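To make the vectors, matrices and data frames mentioned above a little more concrete, here is a minimal sketch in base R. The variable names and values are made up for illustration.

```r
# A numeric vector: arithmetic is applied element by element.
heights_cm <- c(150, 160, 170, 180)
heights_m <- heights_cm / 100

# A matrix: 2 rows by 2 columns, filled column by column.
m <- matrix(c(1, 2, 3, 4), nrow = 2)
m_squared <- m %*% m  # %*% is matrix multiplication, not element-wise

# A data frame: a table whose columns can hold different types.
people <- data.frame(
  name = c("a", "b", "c", "d"),
  height_cm = heights_cm
)

# Select the rows where a condition holds, then summarize a column.
tall <- people[people$height_cm > 165, ]
mean_height <- mean(people$height_cm)

print(heights_m)
print(m_squared)
print(tall)
print(mean_height)
```

Most modeling functions in R take a data frame like this as their input, so these three structures cover a large share of day-to-day use.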
Difficulties with R
Three key difficulties with the platform are:
- Inconsistency: Each algorithm is implemented with its own parameters, naming conventions and arguments. Some packages try to stick to rough conventions (like a predict function for making predictions), but even the results of functions with standard names can vary in their complexity. This can be very frustrating and requires a close reading of the documentation for each new package that you use.
- Documentation: There is a lot of documentation, but it is generally direct and terse. The built-in help is rarely helpful enough for your needs, driving you constantly to the web for complete working examples from which you must derive your use case.
- Scalability: R is intended for use on data that fits into memory on one machine. It is not intended for use with streaming data, big data or working across multiple machines.
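The inconsistency point can be seen even within base R, before any third-party package is involved. The sketch below is an illustration chosen for this post's argument, using the built-in mtcars dataset. It calls the same predict function on two kinds of model and gets results on different scales by default.

```r
data(mtcars)
new_car <- data.frame(wt = 3)  # a hypothetical car weighing 3,000 lbs

# Linear regression: predict() returns a value on the scale of the response (mpg).
lin <- lm(mpg ~ wt, data = mtcars)
lin_pred <- predict(lin, newdata = new_car)

# Logistic regression: predict() defaults to the link scale (log-odds),
# so a probability has to be requested explicitly with type = "response".
logit <- glm(am ~ wt, data = mtcars, family = binomial)
log_odds <- predict(logit, newdata = new_car)
prob <- predict(logit, newdata = new_car, type = "response")

print(lin_pred)
print(c(log_odds = unname(log_odds), probability = unname(prob)))
```

Nothing in the function name tells you which scale you will get back. You have to read the help page for each model type, which is the kind of documentation work the list above describes.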
The language is a little idiosyncratic, but as a programmer, you will have little difficulty in picking it up and adapting examples to your needs. Many packages leverage mathematical code written in C, C++, FORTRAN and Java, providing a convenient interface within the R environment.
Who is Using R?
Commercial companies now support R. For example, Revolution R is a commercially supported version of R with extensions useful for enterprise such as an IDE. Oracle, IBM, Mathematica, MATLAB, SPSS, SAS, and others provide integration with R and their platforms.
The Revolution Analytics blog also provides a long list of companies publicly declaring their adoption of the platform.
The Kaggle platform for data science competitions and the KDnuggets polls both point to R as the most popular platform among successful practicing data scientists. Learn more about this in the post The Best Programming Language for Machine Learning.
In this post you got an overview of what R is, its key features, where it came from and who is using it.
It is an exciting and powerful platform. If you are thinking of getting started using R for applied machine learning, you might like to check out this list of 7 books on R for machine learning.
For more information about R, check out the homepage at The R Project for Statistical Computing. There you will find download links, documentation and manuals, mailing lists and much more.