Best Programming Language for Machine Learning

A question I get asked a lot is:

What is the best programming language for machine learning?

I’ve replied to this question many times now, so it’s about time to explore it further in a blog post.

Ultimately, the programming language you use for machine learning should consider your own requirements and predilections. No one can meaningfully address those concerns for you.

What Languages Are Being Used

Before I give you my opinion, it is good to have a look around to see what languages and platforms are popular in self-selected communities of data analysis and machine learning professionals.

KDnuggets has had language polls forever. A recent poll is titled “What programming/statistics languages you used for an analytics / data mining / data science work in 2013”. The trends are almost identical to the previous year. The results suggest heavy use of R and Python, with SQL for data access. SAS and MATLAB rank higher than I would have expected. I’d expect SAS accounts for larger corporate (Fortune 500) data analysis and MATLAB for engineering, research and student use.

The most popular platforms for machine learning, taken from the KDnuggets 2013 poll.

Kaggle offer machine learning competitions and have polled their user base as to the tools and programming languages used by participants in competitions. They posted results in 2011 titled Kagglers’ Favorite Tools (also see the forum discussion). The results suggested abundant use of R. The results also show good use of MATLAB and SAS, with much lower Python representation. I can attest that I prefer R over Python for competition work. It just feels as though it has more on offer in terms of data analysis and algorithm selection.

The most popular tools used on Kaggle, the machine learning competition website.

Ben Hamner, Kaggle Admin and author of the blog post above on the Kaggle blog, goes into more detail on the options when it comes to programming languages for machine learning in a forum post titled “What tools do people generally use to solve problems”.

Ben comments that MATLAB/Octave is a good language for matrix operations and can be good when working with a well defined feature matrix. Python is fragmented but comprehensive and can be very slow unless you drop into C. He prefers Python when not working with a well defined feature matrix and uses Pandas and NLTK. On R, Ben comments that “As a general rule, if it’s found to be interesting for statisticians, it’s been implemented in R” (well said). He also complains about the R language itself being ugly and painful to work with. Finally, Ben comments that Julia doesn’t have much to offer in the way of libraries but is his new favorite language. He comments that it has the conciseness of languages like MATLAB and Python with the speed of C.

Anthony Goldbloom, the CEO of Kaggle, gave a presentation to the Bay Area R user group in 2011 on the popularity of R in Kaggle competitions titled Predictive modeling competitions: making data science a sport (see the powerpoint slides). The presentation slides give more detail on the use of programming languages and suggest an Other category that is close to as large as the usage of R. It would be nice to have the raw data that was collected (why didn’t they release it to their own data community, seriously!?).

Popular programming languages on Kaggle, taken from Kaggle presentation.

John Langford on his blog Hunch has an excellent article on the properties of a programming language to consider when working with machine learning algorithms titled “Programming Languages for Machine Learning Implementations”. He divides the properties into concerns of speed and concerns of programmability (programming ease). He points to powerful industry standard implementations of algorithms, all in C, and comments that he has not used R or MATLAB (the post was written 8 years ago). Take some time and read some of the comments by academics and industry specialists alike. This is a deep and nuanced problem that really comes down to the specifics of the problem you are solving and the environment in which you are solving it.

Machine Learning Languages

I think of programming languages in the context of the machine learning activities I want to perform.

MATLAB/Octave

I think MATLAB is excellent for representing and working with matrices. As such, I think it’s an excellent language or platform to use when climbing into the linear algebra of a given method. I think it’s suited to learning about algorithms, both superficially the first time around and later when you are trying to figure something out or go deep into the method. For example, it’s popular in university courses for beginners, like Andrew Ng’s Coursera Machine Learning course.

R

R is a workhorse for statistical analysis and, by extension, machine learning. Much talk is given to the learning curve, but I didn’t really see the problem. It is the platform to use to understand and explore your data using statistical methods and graphs. It has an enormous number of machine learning algorithms, including advanced implementations written by the developers of the algorithms themselves.

I think you can explore, model and prototype with R. I think it suits one-off projects with an artifact like a set of predictions, report or research paper. For example, it is the most popular platform in machine learning competitions such as those on Kaggle.

Python

Python is a popular scientific language and a rising star for machine learning. I’d be surprised if it can take the data analysis mantle from R, but matrix handling in NumPy may challenge MATLAB, and communication tools like IPython are very attractive and a step into the future of reproducibility.

I think the SciPy stack for machine learning and data analysis can be used for one-off projects (like papers), and frameworks like scikit-learn are mature enough to be used in production systems.
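
To give a flavor of the matrix handling in NumPy mentioned above, here is a minimal sketch of fitting a straight line by least squares. The data is made up for illustration (it lies exactly on y = 1 + 2x), and the variable names are my own; treat it as a sketch rather than a recommended workflow.

```python
import numpy as np

# Made-up data for illustration: the points lie exactly on y = 1 + 2*x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones for the intercept, then the feature itself.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ beta ≈ y; beta holds (intercept, slope).
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(beta)
```

The whole fit is a couple of array expressions, which is the kind of conciseness people usually credit to MATLAB.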

Java-family/C-family

Implementing a system that uses machine learning is an engineering challenge like any other. You need good design and developed requirements. Machine learning is algorithms, not magic. When it comes to serious production implementations, you need a robust library or you customize an implementation of the algorithm for your needs.

There are robust libraries; for example, Java has Weka and Mahout. Also, note that the deeper implementations of core algorithms like regression (LIBLINEAR) and SVM (LIBSVM) are written in C and leveraged by Python and other toolkits. I think if you are serious you may prototype in R or Python, but you will implement in a heavier language for reasons such as execution speed and system reliability. For example, the backend of BigML is implemented in Clojure.

Other Concerns

  • Not a Programmer: If you are not a programmer (or not a confident programmer) I recommend playing with machine learning via a GUI like Weka.
  • One Language for Research and Ops: You may want to use the same language for prototyping and for production to reduce the risk of not effectively transferring the results.
  • Pet Language: You may have a pet or favorite language and want to stick to that. You can implement algorithms yourself or leverage libraries. Most languages have some form of machine learning package, however primitive.

The question of machine learning programming language is popular on blogs and question and answer sites. A few choice discussions include:

What programming language do you use for machine learning and data analysis, and why do you recommend it?

I’m keen to hear your thoughts. Leave a comment.

15 Responses to Best Programming Language for Machine Learning

  1. jmgore75 June 6, 2014 at 11:49 pm #

    I am admittedly new to ML but have recently had the opportunity to try it with R, python, and Matlab. You can divide up the problem into different parts. In all cases, it’s a good idea to go beyond the basic installation: for R, you want RStudio as an IDE; for python, IPython notebooks and several major libraries are a must; and Matlab is much nicer to work in than Octave.

    1. Data input, output, preprocessing, and postprocessing: Python, hands down. It’s all fine and good if you are just dealing with CSVs but that is often not the case, so in the real world python is quite handy. Frankly, there are few languages better at this than python, and it is surely a big part of its popularity.

    2. Pre-built algorithms: Looks like R, although python’s scikit-learn is better organized.

    3. Novel algorithms: Still probably R.

    4. Plotting: All have multiple excellent plotting packages. R is particularly broad.

    5. Exploration: R (with RStudio) or IPython are both very good. R is probably a bit better, since it handles matrices better. IPython makes it easy to record and rerun your efforts.

    6. Teaching: Matlab/octave has the most concise expression of matrix operations, so for many algorithms it is the one of choice. I kind of wonder about tree structures though.

    7. Sharing and dissemination: IPython notebooks are pretty nice and don’t require viewers to install anything. R vignettes are good if they have R and the proper libraries installed.

    8. Performance: I can’t really say for sure, as I have not properly tested. Python is the only one of the three in which out-of-core or online processing is particularly natural to express, thanks to generators, as far as I can tell. There are many interesting code performance initiatives in place for Python. Other languages should obviously perform better (C, java; as noted Julia is particularly interesting).
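
The generator point in item 8 can be sketched as follows. This is a minimal illustration and not the commenter's code: the function names read_values and running_mean, the column name "x", and the tiny in-memory data set are all hypothetical stand-ins. The idea is that each value is read, used to update a statistic, and then discarded, so the full data set never has to fit in memory.

```python
import csv
import io
from typing import Iterable, Iterator


def read_values(lines: Iterable[str], column: str) -> Iterator[float]:
    """Yield one numeric value per CSV row without loading the whole file."""
    for row in csv.DictReader(lines):
        yield float(row[column])


def running_mean(values: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, updated one value at a time."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


# A small in-memory stand-in for a file too large to fit in memory.
data = io.StringIO("x\n1\n2\n3\n4\n")

# The two generators are chained, so rows flow through one at a time.
means = list(running_mean(read_values(data, "x")))
print(means)
```

With a real file, the StringIO object would be replaced by an open file handle, and the final list() by a loop that consumes the means as they arrive.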

    • jasonb June 7, 2014 at 7:03 am #

      Really great comments, thanks. R is my go-to platform when I’m looking to get the most out of a problem.

      I’ve explored using theano with Python on GPUs and played a lot with various parallel packages on R to get speed-ups. In the end, I’ve found rolling my own implementation the best when speed is the highest priority.

  2. Lifestyle Service Agency November 18, 2014 at 4:39 pm #

    Thanks for the article! Weka is now a part of our toolkit.

  3. Mark Szlazak March 1, 2015 at 8:25 am #

    Another language to consider is Lua. Specifically the LuaJIT implementation with Torch7. This is what Google and Facebook AI groups use, probably because they hired folks from Yann LeCun’s lab. Torch7 has been extended further with more ML stuff produced at Facebook and they have made it available to the public. Probably check out stuff on why Lua/LuaJIT over Python and LuaJIT’s interface with C code. Also, LuaJIT is used a lot by gamers and I heard LuaJIT (or was it Lua) will replace ActionScript in Adobe’s products.

  4. Frank August 26, 2015 at 1:26 pm #

    Hi Jason, thanks for your nice introduction. Do you have any good books on machine learning in C?

    • Jason Brownlee August 26, 2015 at 6:58 pm #

      Sorry, no, not off the top of my head. I can say that there are great libs written in c like libsvm that are often used via wrappers in python or R. Learning the native lib in c might be a fun experience!

  5. Portella October 8, 2015 at 12:18 am #

    Hi Jason. I’m new to machine learning. I’ve gone through the AI online course from Berkeley and plan to go through Yaser Abu-Mostafa “Learning from Data”. It is a language agnostic course however, which, according to what was stated in some reviews, demands intense effort in implementing algorithms by ourselves, without guidance. I like this approach, since it really forces one to research and deal with real challenges of implementation, not just concepts. The problem is that my language of choice, for other reasons, is C#, which I don’t see listed, here and elsewhere, among used languages for machine learning. I have limited experience with python, from the AI and linear algebra courses, which made most of the framework available.
    The question is: how far apart is C# from Python, in terms of libraries useful for machine learning? How would it compare to Java, in the same terms?
    Should I use a language like Python to develop machine learning code and make it interact with C# code, considering it will continue to be my main developing language? What about Accord.Net? Is it any good?

  6. Will Dwinnell April 2, 2016 at 3:11 pm #

    You make several good points about context. I would add that there is a dimension which runs from “scripting” (summoning existing machine learning routines) to “programming” (writing the machine learning routines oneself). Some languages lend themselves more to one of these operations more than the other. In SAS, for instance, analysts tend to call existing SAS “procs”: They are not writing logistic regression from scratch.

    If a script-writing analyst and I fit the same model form to the same data, we will get the same model parameters. The differences are that I know how and why that modeling process works (and when it won’t), and I can modify it directly when needed.

  7. Victor October 21, 2016 at 7:11 pm #

    Without a doubt – Python.

    • Jason Brownlee October 22, 2016 at 6:56 am #

      The Python ecosystem is growing fast and seeing great adoption.

      I tend to agree that Python is a force, Victor.

  8. Steeve Brechmann January 24, 2017 at 1:23 am #

    A little update to this question 😉

    Python is leading the way.

    http://www.kdnuggets.com/2017/01/most-popular-language-machine-learning-data-science.html

  9. Nandhini February 19, 2017 at 11:59 am #

    Hi Jason,
    When I started a Data Science course, I had two choices: Python or R. As I have always had a passion for programming, I chose Python and worked with it throughout the course. Though the course series preferred R for time series, I was following your blog on time series using Python.
    Some friends are suggesting Andrew Ng’s course on Coursera as a next step. But I felt that, as a newbie to the machine learning field, I would stick to one language and get used to various algorithms using it. Once comfortable, I can then explore more of R and MATLAB.

    What do you Suggest?

    • Jason Brownlee February 20, 2017 at 9:26 am #

      Sounds fine, Nandhini. Generally, you should get comfortable jumping from tool to tool or platform to platform, but not when starting out.

      Regardless of tool, the skill to focus on is working through predictive modeling problems end to end and delivering a result (model or set of predictions).