Last Updated on September 27, 2016
A question I get asked a lot is:
What is the best programming language for machine learning?
I’ve replied to this question many times now, so it’s about time to explore it further in a blog post.
Ultimately, the programming language you use for machine learning should consider your own requirements and predilections. No one can meaningfully address those concerns for you.
What Languages Are Being Used
Before I give you my opinion, it is good to have a look around to see what languages and platforms are popular in self-selected communities of data analysis and machine learning professionals.
KDnuggets has had language polls forever. A recent poll is titled “What programming/statistics languages you used for an analytics / data mining / data science work in 2013“. The trends are almost identical to the previous year. The results suggest heavy use of R and Python and SQL for data access. SAS and MATLAB rank higher than I would have expected. I’d expect SAS accounts for larger corporate (Fortune 500) data analysis and MATLAB for engineering, research and student use.
Kaggle offers machine learning competitions and has polled its user base as to the tools and programming languages used by participants in competitions. They posted results in 2011 titled Kagglers’ Favorite Tools (also see the forum discussion). The results suggested the abundant use of R. The results also show good use of MATLAB and SAS with much lower Python representation. I can attest that I prefer R over Python for competition work. It just feels as though it has more on offer in terms of data analysis and algorithm selection.
Ben Hamner, Kaggle Admin and author of the blog post above on the Kaggle blog, goes into more detail on the options when it comes to programming languages for machine learning in a forum post titled “What tools do people generally use to solve problems“.
Ben comments that MATLAB/Octave is a good language for matrix operations and can be good when working with a well defined feature matrix. Python is fragmented but comprehensive and can be very slow unless you drop into C. He prefers Python when not working with a well defined feature matrix and uses Pandas and NLTK. Ben comments that “As a general rule, if it’s found to be interesting for statisticians, it’s been implemented in R” (well said). He also complains that the R language itself is ugly and painful to work with. Finally, Ben comments that Julia doesn’t have much to offer in the way of libraries but is his new favorite language. He comments that it has the conciseness of languages like MATLAB and Python with the speed of C.
Anthony Goldbloom, the CEO of Kaggle, gave a presentation to the Bay Area R user group in 2011 on the popularity of R in Kaggle competitions titled Predictive modeling competitions: making data science a sport (see the powerpoint slides). The presentation slides give more detail on the use of programming languages and suggest an Other category that is nearly as large as the usage of R. It would be nice to have the raw data that was collected (why didn’t they release it to their own data community, seriously!?).
John Langford on his blog Hunch has an excellent article on the properties of a programming language to consider when working with machine learning algorithms titled “Programming Languages for Machine Learning Implementations“. He divides the properties into concerns of speed and concerns of programmability (programming ease). He points to powerful industry standard implementations of algorithms, all in C, and comments that he has not used R or MATLAB (the post was written 8 years ago). Take some time and read some of the comments by academics and industry specialists alike. This is a deep and nuanced problem that really comes down to the specifics of the problem you are solving and the environment in which you are solving it.
Machine Learning Languages
I think of programming languages in the context of the machine learning activities I want to perform.
I think MATLAB is excellent for representing and working with matrices. As such, I think it’s an excellent language or platform to use when climbing into the linear algebra of a given method. I think it’s suited to learning about algorithms both superficially the first time around and deeply when you are trying to figure something out or go deep into the method. For example, it’s popular in university courses for beginners, like Andrew Ng’s Coursera Machine Learning course.
R is a workhorse for statistical analysis and by extension machine learning. Much is made of the learning curve, but I didn’t really see the problem. It is the platform to use to understand and explore your data using statistical methods and graphs. It has an enormous number of machine learning algorithms, and advanced implementations too, written by the developers of the algorithms.
I think you can explore, model and prototype with R. I think it suits one-off projects with an artifact like a set of predictions, report or research paper. For example, it is the most popular platform for machine learning competitions such as those on Kaggle.
Python is a popular scientific language and a rising star for machine learning. I’d be surprised if it can take the data analysis mantle from R, but matrix handling in NumPy may challenge MATLAB, and communication tools like IPython are very attractive and a step into the future of reproducibility.
I think the SciPy stack for machine learning and data analysis can be used for one-off projects (like papers), and frameworks like scikit-learn are mature enough to be used in production systems.
Implementing a system that uses machine learning is an engineering challenge like any other. You need good design and developed requirements. Machine learning is algorithms, not magic. When it comes to serious production implementations, you need a robust library or you customize an implementation of the algorithm for your needs.
There are robust libraries, for example, Java has Weka and Mahout. Also, note that the deeper implementations of core algorithms like regression (LIBLINEAR) and SVM (LIBSVM) are written in C and leveraged by Python and other toolkits. I think if you are serious you may prototype in R or Python, but you will implement in a heavier language for reasons such as execution speed and system reliability. For example, the backend of BigML is implemented in Clojure.
- Not a Programmer: If you are not a programmer (or not a confident programmer) I recommend playing with machine learning via a GUI like Weka.
- One Language for Research and Ops: You may want to use the same language for prototyping and for production to reduce risk of not effectively transferring the results.
- Pet Language: You may have a pet or favorite language and want to stick to that. You can implement algorithms yourself or leverage libraries. Most languages have some form of machine learning package, however primitive.
The question of machine learning programming language is popular on blogs and question and answer sites. A few choice discussions include:
- Machine learning and Programming Languages, 2012
- Which programming language has the best repository of machine learning libraries? on Quora, 2012
- Which programming language has the best repository of machine learning libraries? on MetaOptimize, 2010
- What programming language do you recommend to prototype a machine learning problem?, CrossValidated, 2011
What programming language do you use for machine learning and data analysis, and why do you recommend it?
I’m keen to hear your thoughts, leave a comment.
I am admittedly new to ML but have recently had the opportunity to try it with R, python, and Matlab. You can divide up the problem into different parts. In all cases, it’s a good idea to go beyond the basic installation: for R, you want RStudio as an IDE; for python, IPython notebooks and several major libraries are a must; and Matlab is much nicer to work in than Octave.
1. Data input, output, preprocessing, and postprocessing: Python, hands down. It’s all fine and good if you are just dealing with CSVs but that is often not the case, so in the real world python is quite handy. Frankly, there are few languages better at this than python, and it is surely a big part of its popularity.
2. Pre-built algorithms: Looks like R, although python’s scikit-learn is better organized.
3. Novel algorithms: Still probably R.
4. Plotting: All have multiple excellent plotting packages. R is particularly broad.
5. Exploration: R (with RStudio) or IPython are both very good. R is probably a bit better, since it handles matrices better. IPython makes it easy to record and rerun your efforts.
6. Teaching: Matlab/octave has the most concise expression of matrix operations, so for many algorithms it is the one of choice. I kind of wonder about tree structures though.
7. Sharing and dissemination: IPython notebooks are pretty nice and don’t require viewers to install anything. R vignettes are good if they have R and the proper libraries installed.
8. Performance: I can’t really say for sure, as I have not properly tested. Python is the only one of the three in which out-of-core or online processing is particularly natural to express, thanks to generators, as far as I can tell. There are many interesting code performance initiatives in place for Python. Other languages should obviously perform better (C, java; as noted Julia is particularly interesting).
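To illustrate the point about generators in item 8, here is a minimal sketch of out-of-core processing in Python using only the standard library. The helper names (`read_rows`, `running_mean`) and the column name `x` are hypothetical, and the in-memory `sample` stands in for a file too large to load at once.

```python
import csv
import io


def read_rows(handle):
    """Yield one parsed row at a time instead of loading the whole file."""
    for row in csv.DictReader(handle):
        yield row


def running_mean(values):
    """Update a mean incrementally, so only one value is held at a time."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        mean += (value - mean) / count
    return mean


# Stand-in for a large file on disk; in practice this might be open("data.csv").
sample = io.StringIO("x,y\n1.0,0\n2.0,1\n3.0,0\n4.0,1\n")

# The generator expression pulls rows lazily, one per loop iteration.
mean_x = running_mean(float(row["x"]) for row in read_rows(sample))
print(mean_x)
```

Because both the reader and the consumer work one row at a time, memory use stays flat however long the input is. The same pattern extends to any statistic or model that can be updated incrementally.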
Really great comments, thanks. R is my go-to platform when I’m looking to get the most out of a problem.
I’ve explored using theano with Python on GPUs and played a lot with various parallel packages on R to get speed-ups. In the end, I’ve found rolling my own implementation the best when speed is the highest priority.
Thanks for the article! Weka is now a part of our toolkit.
Glad to hear it!
Another language to consider is Lua. Specifically the LuaJIT implementation with Torch7. This is what Google and Facebook AI groups use, probably because they hired folks from Yann LeCun’s lab. Torch7 has been extended further with more ML stuff produced at Facebook and they have made it available to the public. Probably check out stuff on why Lua/LuaJIT over Python and LuaJIT’s interface with c-code. Also, LuaJIT is used a lot by gamers and I heard LuaJIT (or was it Lua) will replace Action Script in Adobe’s products.
Hi Jason, thanks for your nice introduction. Do you have any good books on machine learning in C?
Sorry, no, not off the top of my head. I can say that there are great libs written in c like libsvm that are often used via wrappers in python or R. Learning the native lib in c might be a fun experience!
Hi Jason. I’m new to machine learning. I’ve gone through the AI online course from Berkeley and plan to go through Yaser Abu-Mostafa “Learning from Data”. It is a language agnostic course however, which, according to what was stated in some reviews, demands intense effort in implementing algorithms by ourselves, without guidance. I like this approach, since it really forces one to research and deal with real challenges of implementation, not just concepts. The problem is that my language of choice, for other reasons, is C#, which I don’t see listed, here and elsewhere, among used languages for machine learning. I have limited experience with python, from the AI and linear algebra courses, which made most of the framework available.
The question is: how far apart is C# from Python, in terms of libraries useful for machine learning? How would it compare to Java, in the same terms?
Should I use a language like Python to develop machine learning code and make it interact with C# code, considering it will continue to be my main developing language? What about Accord.Net? Is it any good?
You make several good points about context. I would add that there is a dimension which runs from “scripting” (summoning existing machine learning routines) to “programming” (writing the machine learning routines oneself). Some languages lend themselves more to one of these operations more than the other. In SAS, for instance, analysts tend to call existing SAS “procs”: They are not writing logistic regression from scratch.
If a script-writing analyst and I fit the same model form to the same data, we will get the same model parameters. The differences are that I know how and why that modeling process works (and when it won’t), and I can modify it directly when needed.
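As a rough illustration of the “programming” end of that spectrum, here is a minimal sketch of logistic regression written from scratch in plain Python. It assumes a single input feature and batch gradient descent on the log loss. The function names, learning rate, epoch count and toy data are all made up for the example, and a packaged procedure would typically use a different, faster optimizer.

```python
import math


def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit y ~ sigmoid(b0 + b1 * x) by batch gradient descent on log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = 0.0
        grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # prediction minus target
            grad0 += error
            grad1 += error * x
        # Step against the average gradient.
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


# Toy data: larger x values are mostly labelled 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(b0, b1)
print(sigmoid(b0 + b1 * 1.0), sigmoid(b0 + b1 * 3.5))
```

Writing the update loop by hand is what makes it possible to change the loss, add a penalty term or alter the stopping rule directly, which is the kind of modification described above.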
Without a doubt – Python.
The Python ecosystem is growing fast and seeing great adoption.
I tend to agree that Python is a force Victor.
A little update to this question 😉
Python is leading the way.
When I started a Data Science course, I had two choices: Python or R. As I have always had a passion for programming, I chose Python and worked with it throughout the course. Though the course series preferred R for time series, I was following your blog on time series using Python.
Some friends are suggesting Andrew Ng’s course on Coursera as a next step. But I felt that, as a newbie to the machine learning field, I should stick to one language and get used to various algorithms using it. Once comfortable, I can explore more of R and MATLAB.
What do you Suggest?
Sounds fine Nandhini, generally, you should get comfortable jumping from tool to tool or platform to platform, but not when starting out.
Regardless of tool, the skill to focus on is working through predictive modeling problems end to end and delivering a result (model or set of predictions).
At this moment I’m using scikit-learn in production and it’s working with very good performance.
I recommend scikit-learn.
Nice work Paulo! Thanks for the tip.
Even I (I am the author of Accord.NET mentioned a few comments above) use scikit-learn on a daily basis for production use at work. However, if for any reason you or any of your blog readers would like to use machine learning in contexts where Python just wouldn’t be available (such as embedded devices through Xamarin, UWP apps or even Java), please give Accord.NET a try.
If you find issues in your application, or something that you believe should have been done better, register it at the project’s issue tracker and it should be taken care of in no time. The goal of this project is also to address platforms which have not historically been served very well by Python-only implementations.
Hi Jason, how to implement competitive learning algorithm using R? thanks for your time.
Sorry, I do not have an example. I would recommend coding an example yourself from scratch in any language.
A SOM (Kohonen self-organizing map) is technically competitive learning so you could use an existing SOM implementation. I’ve coded a few in my time, for example:
More on competitive learning here:
I guess we should take a look at the latest poll (to my best knowledge) from Kaggle: http://www.kdnuggets.com/2017/01/most-popular-language-machine-learning-data-science.html
And notice that, yes, Python did take the lead.
Also check out this post on why I recommend Python:
In C++, I found dlib, and it comes with tons of very well commented examples. You can also run them on GPU.
What about Microsoft Azure Machine Learning? I am new to the ML domain. How about if I start with Azure ML? I do not have any knowledge of R or Python. Please advise.
Sorry, I have not used it.
I think that as happens with many other computer science fields, there’s an excessive focus on the language and the tools when the important stuff, really, is to know the theory well. I agree that there are languages with a richer ecosystem for data science and machine learning, as is the case of Python and R, but I think the domain of the project you want to start will lead you to a particular set of tools that are better suited for the demands of that endeavor. For instance, if you need to work with massive amounts of data, you’ll be better off using Apache Spark and Spark MLLib than, say, sklearn 🙂
What do you think? I’d love to know!
Thanks for the article!
Shall I implement machine learning algorithms in JavaScript? Please guide me. How can I implement it?
Perhaps for fun and for learning, but I would not recommend using JavaScript to implement machine learning to solve business problems. I cannot see how it could be justified.
I think this depends on the business. When you’re talking about Enterprise, I absolutely agree.
But with tensorflow.js now available, having AI functionality available on phones without network dependency opens up lots of new applications for machine learning.
Heavy compute is not something you want on a hand held. It kills battery.
I’d probably stick to R and Java.
As a software developer I know the IT world is itself quite dynamic. With new and upcoming changes in programming languages, frameworks and technologies, language trends are ever changing. We developers must keep up with these changes, so I was looking to learn some languages that will be beneficial for my future. Thank you.
I would recommend Python given the demand for skills in the area:
Java, Python, Lisp, Prolog, and C++ are the major programming languages used for artificial intelligence, capable of satisfying different needs in the development and design of different software. It is up to the developer to choose which of these languages will satisfy the functionality and feature requirements of the application.
I have just started exploring ML, but I am planning to prepare for the new Oracle AI Platform offering. Many details are not yet available, but it is mentioned that it will support Keras, Caffe and TensorFlow.
Shall I start exploring Python or R?
I recommend Python, here’s why:
Thanks Jason !!
I am new to programming, and I want to know what programming language will help me grow in artificial intelligence and also as a web developer.
I want a guide, as I will start learning from scratch.
It’s been 5 years, and the picture is entirely different now. Python has replaced R in the above images.
Thanks for sharing this information. Programming languages are very important for improving machine learning.
Python is a great place to start.
Can you please give me detailed information on machine learning and data science using Python? Which one is better?
Here’s more on the relationship between data science and machine learning:
Julia, She crushes them all. 🙂
It is a good language.
Hi, your blogs are really helpful. I have a question: how can I compare machine learning results output from different platforms? For example, suppose I have results of some models in Python. I don’t know Python, but I want to compare them with my own model results, which are written in Java. Is there any way to do that?
Perhaps output predictions to a file, then use a new application to load predictions from each model/platform and perform comparisons?
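A minimal sketch of that idea, assuming each platform writes its predictions to a CSV file with `id` and `label` columns. The file layout, helper names and toy values here are hypothetical, and the in-memory strings stand in for the files each model would write.

```python
import csv
import io


def load_predictions(handle):
    """Read (id, label) rows into a dict keyed by row id."""
    return {row["id"]: row["label"] for row in csv.DictReader(handle)}


def agreement(preds_a, preds_b):
    """Fraction of shared ids on which the two models give the same label."""
    shared = sorted(set(preds_a) & set(preds_b))
    if not shared:
        raise ValueError("no overlapping ids to compare")
    same = sum(1 for key in shared if preds_a[key] == preds_b[key])
    return same / len(shared)


def accuracy(preds, truth):
    """Fraction of ids in truth that the model labelled correctly."""
    return sum(1 for key in truth if preds.get(key) == truth[key]) / len(truth)


# Stand-ins for files written by each platform, e.g. open("python_preds.csv").
python_file = io.StringIO("id,label\n1,a\n2,b\n3,a\n4,b\n")
java_file = io.StringIO("id,label\n1,a\n2,b\n3,b\n4,b\n")
truth = {"1": "a", "2": "b", "3": "a", "4": "a"}

python_preds = load_predictions(python_file)
java_preds = load_predictions(java_file)

print(agreement(python_preds, java_preds))
print(accuracy(python_preds, truth), accuracy(java_preds, truth))
```

Keying the predictions by row id rather than by file position keeps the comparison valid even if the two platforms write their rows in a different order.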
Thanks for your answer, but I wanted to ask a different question. When we use ML models we need random numbers for sampling or initialization. Is it okay to use different languages in that case? Do random number generators play a big role? If yes, how can we achieve the same results on different platforms? Thanks in advance.
Small differences in the implementation of an algorithm across libraries can result in differences of results.
I recommend using one tool to prepare the data that is then used across different languages, for consistency.
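One way to sketch this: shuffle and split the data once with a fixed seed, then write the train and test sets to files that every language reads. The function names, file names, seed and toy rows below are hypothetical.

```python
import csv
import os
import random
import tempfile


def split_rows(rows, test_fraction=0.25, seed=1):
    """Shuffle with a fixed seed and split once into train and test rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, header, rows):
    """Write a header line followed by the given rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Toy dataset of (id, x, y) rows.
rows = [(i, i * 2.0, i % 2) for i in range(8)]
train, test = split_rows(rows)

# Each platform then loads these same two files instead of re-sampling.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")
write_csv(train_path, ["id", "x", "y"], train)
write_csv(test_path, ["id", "x", "y"], test)
print(len(train), len(test))
```

This removes the sampling step as a source of difference between platforms. Random initialization inside each library's algorithm can still differ, so small differences in results may remain.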
Thank you so much Jason. Have a nice day.
Just a little request: in the five years since you wrote this article there have been many changes. For instance Matlab has evolved substantially, python has become a standard (even, I think, in your blog)… Do you think that this article needs a new version? Many thanks!!
I think it does not matter what language you use while learning ML.
If you want to be productive and get a job, Python is the winner, at least for now:
Dear Dr Jason,
Suppose you write a program in Python involving libraries for machine learning. Is there a way of converting a Python-written program into another language like C or Java in order to improve the execution speed of the program?
Anthony of Sydney
There are many options.
You can use cython to speed up python code.
You can use c against the same or similar enough backend libs.
You can re-write everything from scratch.
I am a bioinformatician and I’m interested in learning machine learning through genomics, but I need to know where to begin. Are the Biopython packages the best option or not? And is this a valuable skill for getting a job in genomics via machine learning?
You can start here:
Swift is becoming big with S4TF becoming popular.
Not sure about machine learning, but I know that learning Python was far more useful to me than R.
R was limited in available resources whereas Python’s community has produced more reusable code that you can build upon.
Also, Python is more flexible. Once you learn it you can build websites, automate processes, even build robots if you want. R might let you do that, but you’ll have to build most of it because the community is not as active.
Also, Python forces structure upon you. It might seem annoying at first, but it will help you write better code over time.
Having learned R first, and Python after, I highly recommend Python.
I would recommend Python too at this stage.
Is Python still the undisputed King? How do all these automated and NoCode ML tools fit into everything now?
Hi Shaun… Python is still the way to go. We believe that it is important to continue to build skill in Python for machine learning, however it is also beneficial to understand emerging tools to increase efficiency.