How do you choose the best algorithm for your dataset?

Machine learning is a problem of induction where general rules are learned from specific observed data from the domain.

It is infeasible (impossible?) to know beforehand which representation or which algorithm will best learn from the data on a specific problem, without knowing the problem so well that you probably don’t need machine learning to begin with.

So what algorithm should you use on a given problem? It’s a question of trial and error: searching for the best representation, learning algorithm, and algorithm parameters.

In this post, you will discover the simple 3-step methodology for finding the best algorithm for your problem proposed by some of the best predictive modelers in the business.

## 3-Step Methodology

Max Kuhn is the creator and maintainer of the caret package for R, which provides a suite of tools for predictive modeling. It might be the best R package and one reason why R is a top choice for serious competitive and applied machine learning.

In their excellent book, “Applied Predictive Modeling“, Kuhn and Johnson outline a process to select the best model for a given problem.

I paraphrase their suggested approach as:

- Start with the least interpretable and most flexible models.
- Investigate simpler models that are less opaque.
- Consider using the simplest model that reasonably approximates the performance of the more complex models.

They comment:

> Using this methodology, the modeler can discover the “performance ceiling” for the data set before settling on a model. In many cases, a range of models will be equivalent in terms of performance so the practitioner can weigh the benefits of different methodologies…

For example, here is a general interpretation of this methodology that you could use on your next one-off modeling project:

1. Investigate a suite of complex models and establish a performance ceiling, such as:
   - Support Vector Machines
   - Gradient Boosting Machines
   - Random Forest
   - Bagged Decision Trees
   - Neural Networks
2. Investigate a suite of simpler, more interpretable models, such as:
   - Generalized Linear Models
   - LASSO and Elastic-Net Regularized Generalized Linear Models
   - Multivariate Adaptive Regression Splines
   - k-Nearest Neighbors
   - Naive Bayes
3. Select the model from (2) that best approximates the accuracy from (1).
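The three steps above can be sketched in code. The book works in R with caret; the following is a rough Python equivalent using scikit-learn, where the dataset, the specific model choices, and the 3-percentage-point tolerance for "reasonably approximates" are my illustrative assumptions, not prescriptions from the book:

```python
# Sketch of the 3-step methodology: complex models set a performance
# ceiling, then the simplest model close to that ceiling is chosen.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Step 1: least interpretable, most flexible models.
complex_models = {
    "random_forest": RandomForestClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
    "svm": SVC(),
}
# Step 2: simpler, less opaque models.
simple_models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "knn": KNeighborsClassifier(),
    "naive_bayes": GaussianNB(),
}

def mean_cv_accuracy(models):
    """Mean 5-fold cross-validation accuracy for each named model."""
    return {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}

ceiling = max(mean_cv_accuracy(complex_models).values())
simple_scores = mean_cv_accuracy(simple_models)

# Step 3: simplest model within tolerance of the ceiling
# (3 percentage points here is an arbitrary, problem-dependent choice).
candidates = {n: s for n, s in simple_scores.items() if s >= ceiling - 0.03}
best_simple = max(candidates, key=candidates.get) if candidates else None
print(f"ceiling={ceiling:.3f}, chosen simple model={best_simple}")
```

In practice you would also tune each model's parameters (e.g. with a grid search) before comparing, exactly as caret's `train()` does in R.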

## Quick One-Off Models

I think this is a great methodology to use for a one-off project where you need a good result quickly, such as within minutes or hours. The benefits are:

- You have a good idea of the spread of accuracy on the problem across models.
- You have a model that is easier to understand and explain to others.
- You have a reasonably high-quality model very quickly (maybe top 10-to-25% of what is achievable on the problem if you spent days or weeks).

I don’t think this is the best methodology for all problems. Perhaps some downsides to the methodology are:

- More complex methods are slower to run and return a result.
- Sometimes you want the complex model over the simpler models (e.g. in domains where accuracy trumps explainability).
- The performance ceiling is pursued first rather than last, when there might be more time, pressure, and motivation to extract the most from the best methods.

For more information on this strategy, check out Section 4.8, “Choosing Between Models,” on page 78 of Applied Predictive Modeling. It is a must-have book for any serious machine learning practitioner using R.

Do you have a methodology for finding the best machine learning algorithm for a problem? Leave a comment and share the broader strokes.

Have you used this methodology? Did it work for you?

Any questions? Leave a comment.

Some applications of Machine Learning and tutorials can be found at http://www.data-blogger.com

John, can you please provide two examples to elaborate this.

Are there templates available for 1. and 2.?

After several weeks with your stuff Jason, now I see light at the end of the tunnel 🙂

Learned from your R-tutorials a lot:

http://machinelearningmastery.com/evaluate-machine-learning-algorithms-with-r/

http://machinelearningmastery.com/compare-models-and-select-the-best-using-the-caret-r-package/

This is indeed a complete new universe!

On the caret website there are 233 Models available:

https://topepo.github.io/caret/available-models.html

Is there a way to gather only the ones suitable for time series prediction?

I’m glad to hear it.

For time series, you can frame the problem as either regression or classification. You can therefore gather all classification and regression algorithms, depending on how you frame the problem.

In practice, many of the algorithms will not be worth it or will require special data preparation.
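To illustrate what "framing time series as regression" means in practice, here is a minimal sliding-window transform that turns a univariate series into lag-feature inputs and a target; the window size of 3 and the toy series are arbitrary choices for the example:

```python
# Reframe a univariate time series as a supervised regression dataset:
# each row uses the previous n_lags values to predict the next value.
def series_to_supervised(series, n_lags=3):
    """Inputs: [t-n_lags, ..., t-1]; target: value at t."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])
        y.append(series[i])
    return X, y

series = [10, 20, 30, 40, 50, 60]
X, y = series_to_supervised(series, n_lags=3)
print(X)  # [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
print(y)  # [40, 50, 60]
```

Once the series is in this tabular form, any of the regression algorithms listed on the caret site can be applied to it, subject to the caveats about data preparation mentioned above.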

A)

In R and Caret we can even predict unseen data.

And the R-code seems much more compact compared to the Python ML-stack.

Why or in which situation should we choose the whole ‘Python-Enchilada’ over R and Caret?

B) Is there anywhere a top-100 list of time series forecasting algorithms?

Or which are currently the hot ones (newly invented)?

I love R, but Python is in demand so that is why I am focusing on it:

http://machinelearningmastery.com/python-growing-platform-applied-machine-learning/

I recommend R for deep, one-off projects and R&D. I recommend the Python stack for code that needs to be developed for reliability/maintainability (e.g. classical software engineering for production environments).

Machine learning is an innovative way of learning and communicating. The way it is being taken up by organizations is very progressive, and the well-described steps here are also very useful for algorithm programmers.

Thanks.

Which machine learning algorithm best fits predictive analysis for identifying illegal activities?

This is a common question that I answer here:

https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use

Thank you very much for this insight. I have spent months searching for the best methodology to apply in my PhD research. This is so educative.

Thanks, I’m glad it helps.