Microsoft recently launched support for machine learning in their Azure cloud computing platform.
Buried in some of their technical documentation for the platform are some resources that you may find useful for thinking about what machine learning algorithm to use in different situations.
In this post we take a look at the Microsoft recommendations for machine learning algorithms and the lessons that we can use when working through machine learning problems on any platform.
Machine Learning Algorithm Cheatsheet
Microsoft released a PDF cheatsheet of what machine learning algorithms to use, when.
The one-pager lists various problem types as groups and the algorithms supported by Azure in each group.
These groups are:
- Regression: for predicting values.
- Anomaly detection: for finding unusual data points.
- Clustering: for discovering structure.
- Two-class classification: for predicting two categories.
- Multi-class classification: for predicting three or more categories.
The first problem with this approach is that the algorithm names seemingly map onto the Azure API documentation, and are not standard. A few common names jump out but others are just names for standard algorithms given a spin for simplicity (or I suspect to avoid some kind of name infringement).
Along with the algorithm names are a few words on why you might choose a given algorithm. It’s a nice idea and, given that it is a cheatsheet, the notes are brief and to the point.
Download the cheatsheet (PDF) from its companion blog post titled “Machine learning algorithm cheat sheet for Microsoft Azure Machine Learning Studio”.
How To Choose Machine Learning Algorithms
The goal of the cheatsheet is to help you quickly select an algorithm for your problem.
Is that the right goal? Perhaps not.
The reason is that you probably should never analytically select one algorithm for your problem. You should spot check a number of algorithms and evaluate them against whatever requirements your problem has.
For more information on spot checking algorithms, see the post “Why you should be Spot-Checking Algorithms on your Machine Learning Problems”.
I think the cheatsheet is best used to get an idea of what algorithms to throw into your spot check, viewed through the lens of your problem requirements.
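As a rough sketch of what a spot check might look like, the Python example below scores a few candidate algorithms on the same data with k-fold cross validation and ranks them. Everything in it is an assumption made for illustration: the synthetic data set, the three toy algorithms and all function names are hypothetical and do not come from the cheatsheet or from Azure. It uses only the standard library so that it runs anywhere.

```python
import random
from collections import Counter


def make_data(n=200, seed=0):
    """Synthetic two-class data: two noisy 2-D blobs (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Each toy "algorithm" takes training rows and returns a predict(x) function.
def majority_class(train):
    winner = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: winner


def nearest_centroid(train):
    sums = {}
    for x, y in train:
        s = sums.setdefault(y, [0.0, 0.0, 0])
        s[0] += x[0]
        s[1] += x[1]
        s[2] += 1
    centroids = {y: (s[0] / s[2], s[1] / s[2]) for y, s in sums.items()}
    return lambda x: min(centroids, key=lambda y: dist2(x, centroids[y]))


def one_nearest_neighbor(train):
    return lambda x: min(train, key=lambda row: dist2(x, row[0]))[1]


def cross_val_accuracy(fit, data, k=5):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    folds = [data[i::k] for i in range(k)]  # rows are i.i.d., so no shuffle
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = fit(train)
        hits = sum(predict(x) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def spot_check(algorithms, data, k=5):
    """Score every candidate on the same folds and return them best-first."""
    results = [(name, cross_val_accuracy(fit, data, k))
               for name, fit in algorithms.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


ALGORITHMS = {
    "majority_class": majority_class,  # baseline to beat
    "nearest_centroid": nearest_centroid,
    "one_nearest_neighbor": one_nearest_neighbor,
}

if __name__ == "__main__":
    for name, score in spot_check(ALGORITHMS, make_data()):
        print(f"{name:22s} {score:.3f}")
```

On a real problem you would swap in the algorithms suggested by the cheatsheet and replace accuracy with whatever measure reflects your requirements. The baseline is worth keeping in the list, since any candidate that cannot beat it can be dropped early.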
A sister post in the same Azure documentation, titled “How to choose algorithms for Microsoft Azure Machine Learning”, gives more context that aligns with these ideas.
The post starts out by posing the question “What machine learning algorithms should I use?” and answers it correctly with “it depends”. They comment:
Even the most experienced data scientists can’t tell which algorithm will perform best before trying them.
The valuable take-away from this post is the considerations they provide for thinking about algorithm selection in the context of your requirements. These algorithm selection considerations are:
- Accuracy: Whether the goal is the best possible score or an approximate (“good enough”) solution that trades some accuracy against overfitting.
- Training time: The amount of time available to train the model (I would guess, to verify and tune as well).
- Linearity: An aspect of model complexity in terms of how the problem is modelled, in contrast to non-linear models, which are often more complex to understand and tune.
- Number of parameters: Another aspect of model complexity impacting time and expertise in tuning and sensitivity.
- Number of features: Really, the problem of having more attributes than instances, the p>>n problem. This often requires specialized handling or specialized techniques.
The post also provides a cute table of the algorithms supported by Azure and their mapping onto some of the considerations listed above.
I think this is good.
I also think that it is very expensive to create (requires an expert), does not scale to the hundreds (thousands?) of machine learning algorithms available and would require constant updating as new and more powerful algorithms are developed and released.
How Do We Choose Algorithms Efficiently?
Often the goal of predictive modelling is to create the most accurate models given reasonable time and resources.
Algorithm complexity, in terms of the linearity of the model and the number of parameters, is often a concern only when the model is for descriptive purposes rather than for actually making predictions.
With a well designed problem test harness, the selection of which algorithm and what parameter values to set becomes a combinatorial problem for the computer to figure out, not the data scientist. In fact, like intuition in A/B testing, selecting algorithms by hand is biased and is probably crippling performance.
This is the spot checking approach to machine learning and is only possible because of the large number of algorithms that have been implemented, because of powerful systematic testing methods (like cross validation) and because of cheap and plentiful computation.
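To make the combinatorial framing concrete, here is a minimal sketch in Python that enumerates every combination of parameter values for one algorithm and lets the computer rank them. The synthetic data, the k-nearest-neighbor classifier, the parameter grid and all function names are assumptions chosen for illustration. For brevity it scores each combination on a single hold-out split, where a real harness would more likely use cross validation.

```python
import random
from collections import Counter
from itertools import product


def make_data(n=300, seed=1):
    """Synthetic two-class data: two overlapping 2-D blobs (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.5 * label
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows


# Distance functions selectable by name in the parameter grid.
METRICS = {
    "manhattan": lambda a, b: sum(abs(p - q) for p, q in zip(a, b)),
    "euclidean": lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5,
}


def knn_predict(train, x, k, metric):
    """Majority vote among the k training rows closest to x."""
    dist = METRICS[metric]
    nearest = sorted(train, key=lambda row: dist(x, row[0]))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]


def holdout_accuracy(train, test, k, metric):
    """Fraction of held-out rows classified correctly."""
    hits = sum(knn_predict(train, x, k, metric) == y for x, y in test)
    return hits / len(test)


def grid_search(train, test, grid):
    """Score every combination in the grid and return them best-first."""
    names = list(grid)
    scored = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scored.append((holdout_accuracy(train, test, **params), params))
    return sorted(scored, key=lambda item: item[0], reverse=True)


GRID = {"k": [1, 3, 5, 9], "metric": ["manhattan", "euclidean"]}

if __name__ == "__main__":
    data = make_data()
    train, test = data[:200], data[200:]
    for score, params in grid_search(train, test, GRID):
        print(f"{score:.3f}  {params}")
```

The grid here has only eight combinations, but the same loop extends to several algorithms and larger grids; the cost grows with the product of the option counts, which is why cheap computation matters.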
What are the odds of your favorite machine learning algorithm performing well on a problem you have not worked on before? Not very good (unless you’re using random forest! – I’m joking).
The point I’m making is that we can study machine learning algorithms and get a feeling for how they work and what they are suited for, but I would argue that this level of selection comes later. It comes when you are trying to choose between three or four high-performing models. It comes when you have good results and you need to dive deeper to get better results.
Check out the cheatsheet and take some notes and think about how you can use the ideas in your own process.
Take a look at my tour of machine learning algorithms to get a firm idea of the most popular machine learning algorithms and how they relate to each other.
If you want to know more about how to systematically work a machine learning problem from end-to-end, check out my post “Process for working through Machine Learning Problems”.