How do you choose the best algorithm for your dataset?
Machine learning is a problem of induction where general rules are learned from specific observed data from the domain.
It is infeasible (impossible?) to know beforehand which representation or algorithm will learn best from the data on a specific problem, without knowing the problem so well that you probably don’t need machine learning to begin with.
So what algorithm should you use on a given problem? It’s a question of trial and error, or searching for the best representation, learning algorithm and algorithm parameters.
In this post, you will discover the simple 3-step methodology for finding the best algorithm for your problem proposed by some of the best predictive modelers in the business.
Max Kuhn is the creator and owner of the caret package, which provides a suite of tools for predictive modeling in R. It might be the best R package and the one reason why R is the top choice for serious competitive and applied machine learning.
In their excellent book, “Applied Predictive Modeling”, Kuhn and Johnson outline a process to select the best model for a given problem.
I paraphrase their suggested approach as:
- Start with the least interpretable and most flexible models.
- Investigate simpler models that are less opaque.
- Consider using the simplest model that reasonably approximates the performance of the more complex models.
Using this methodology, the modeler can discover the “performance ceiling” for the data set before settling on a model. In many cases, a range of models will be equivalent in terms of performance so the practitioner can weigh the benefits of different methodologies…
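To make the third step concrete, here is a minimal sketch in Python (my choice for illustration; the book itself works in R). It assumes you already have one score per model, where higher is better, plus a hand-assigned complexity rank. The `Candidate` type, the `simplest_near_ceiling` helper, the `tolerance` parameter, and the scores are all hypothetical and not taken from the book.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    score: float      # e.g. mean cross-validated accuracy (higher is better)
    complexity: int   # hypothetical rank: 1 = simplest, larger = more opaque


def simplest_near_ceiling(candidates, tolerance=0.01):
    """Return (ceiling, choice): the best score seen and the simplest
    candidate whose score is within `tolerance` of that ceiling."""
    ceiling = max(c.score for c in candidates)
    near = [c for c in candidates if ceiling - c.score <= tolerance]
    # Prefer lower complexity; break ties by higher score.
    choice = min(near, key=lambda c: (c.complexity, -c.score))
    return ceiling, choice


# Made-up scores for illustration only, not results from any real data set.
candidates = [
    Candidate("gradient boosting", 0.912, 5),
    Candidate("random forest", 0.908, 4),
    Candidate("elastic net", 0.905, 2),
    Candidate("naive Bayes", 0.861, 1),
]

ceiling, choice = simplest_near_ceiling(candidates, tolerance=0.01)
print(f"ceiling={ceiling:.3f}, choice={choice.name}")
```

What counts as "reasonably approximates" is a judgment call, so the tolerance is something you would set per problem rather than a fixed rule.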
For example, here is a general interpretation of this methodology that you could use on your next one-off modeling project:
1. Investigate a suite of complex models and establish a performance ceiling, such as:
   - Support Vector Machines
   - Gradient Boosting Machines
   - Random Forest
   - Bagged Decision Trees
   - Neural Networks
2. Investigate a suite of simpler, more interpretable models, such as:
   - Generalized Linear Models
   - LASSO and Elastic-Net Regularized Generalized Linear Models
   - Multivariate Adaptive Regression Splines
   - k-Nearest Neighbors
   - Naive Bayes
3. Select the model from (2) that best approximates the accuracy from (1).
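The three steps above could be sketched roughly as follows. Kuhn and Johnson work in R with caret; this sketch swaps in Python and scikit-learn, uses a synthetic data set as a stand-in for your own, and tries only a few models from each list with mostly default settings. The `mean_cv_accuracy` helper and the model names in the dictionaries are my own labels, not part of either library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption); replace with your own X and y.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Step 1: complex, flexible models used to estimate the performance ceiling.
complex_models = {
    "svm_rbf": make_pipeline(StandardScaler(), SVC()),
    "gbm": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Step 2: simpler, more interpretable models.
simple_models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "naive_bayes": GaussianNB(),
}


def mean_cv_accuracy(models, X, y, folds=5):
    """Mean k-fold cross-validated accuracy for each named model."""
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


complex_scores = mean_cv_accuracy(complex_models, X, y)
simple_scores = mean_cv_accuracy(simple_models, X, y)

# Step 3: pick the simple model whose accuracy is closest to the ceiling.
ceiling = max(complex_scores.values())
chosen = min(simple_scores, key=lambda name: abs(ceiling - simple_scores[name]))

for name, score in {**complex_scores, **simple_scores}.items():
    print(f"{name:20s} {score:.3f}")
print(f"ceiling={ceiling:.3f}, chosen simple model={chosen}")
```

On a real project you would also tune each model and use the same resampling splits for every candidate so the comparison is fair; this sketch skips both to stay short.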
Quick One-Off Models
I think this is a great methodology to use for a one-off project where you need a good result quickly, such as within minutes or hours. Some benefits are:
- You have a good idea of the spread of accuracy on a problem across models.
- You have a model that is easier to understand and explain to others.
- You have a reasonably high-quality model very quickly (maybe in the top 10-to-25% of what is achievable on the problem if you spent days or weeks).
I don’t think this is the best methodology for all problems. Perhaps some downsides to this methodology are:
- More complex methods are slower to run and return a result.
- Sometimes you want the complex model over the simpler models (e.g. domains where accuracy trumps explainability).
- The performance ceiling is pursued first rather than last, when there might be the time, pressure and motivation to extract the most from the best methods.
For more information on this strategy, check out Section 4.8 Choosing Between Models, page 78 of Applied Predictive Modeling. It is a must-have book for any serious machine learning practitioner using R.
Do you have a methodology for finding the best machine learning algorithm for a problem? Leave a comment and share the broader strokes.
Have you used this methodology? Did it work for you?
Any questions? Leave a comment.