Simple 3-Step Methodology To The Best Machine Learning Algorithm

By Jason Brownlee on August 15, 2020 in Machine Learning Process

How do you choose the best algorithm for your dataset?

Machine learning is a problem of induction where general rules are learned from specific observed data from the domain.

It is infeasible (impossible?) to know beforehand what representation or what algorithm will best learn from the data on a specific problem, without knowing the problem so well that you probably don’t need machine learning to begin with.

So what algorithm should you use on a given problem? It’s a question of trial and error, or searching for the best representation, learning algorithm and algorithm parameters.

In this post, you will discover the simple 3-step methodology for finding the best algorithm for your problem proposed by some of the best predictive modelers in the business.

[Image: Steps To The Best Machine Learning Algorithm. Photo by David Goehring, some rights reserved.]

3-Step Methodology

Max Kuhn is the creator and owner of the caret package that provides a suite of tools for predictive modeling in R. It might be the best R package and the one reason why R is the top choice for serious competitive and applied machine learning.

In their excellent book, “Applied Predictive Modeling”, Kuhn and Johnson outline a process to select the best model for a given problem.

I paraphrase their suggested approach as:

1. Start with the least interpretable and most flexible models.
2. Investigate simpler models that are less opaque.
3. Consider using the simplest model that reasonably approximates the performance of the more complex models.

They comment:

Using this methodology, the modeler can discover the “performance ceiling” for the data set before settling on a model. In many cases, a range of models will be equivalent in terms of performance so the practitioner can weight the benefits of different methodologies…
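As a rough illustration of the “performance ceiling” idea, here is a minimal Python sketch. It assumes you already have per-fold cross-validation scores for each model. The scores below are made up, the function names are hypothetical, and the "within one standard error of the best" rule for treating models as equivalent is my assumption, not something stated in the quote.

```python
from math import sqrt
from statistics import mean, stdev


def performance_ceiling(fold_scores):
    """Return (model_name, mean_score) for the model with the highest mean score."""
    means = {name: mean(scores) for name, scores in fold_scores.items()}
    best = max(means, key=means.get)
    return best, means[best]


def equivalent_models(fold_scores):
    """Return models whose mean score is within one standard error of the best.

    The one-standard-error threshold is an assumption made for this sketch.
    """
    best, ceiling = performance_ceiling(fold_scores)
    best_scores = fold_scores[best]
    std_error = stdev(best_scores) / sqrt(len(best_scores))
    return sorted(
        name
        for name, scores in fold_scores.items()
        if mean(scores) >= ceiling - std_error
    )


# Made-up per-fold accuracies, for illustration only.
fold_scores = {
    "svm": [0.90, 0.92, 0.91, 0.89, 0.93],
    "random_forest": [0.91, 0.90, 0.92, 0.90, 0.91],
    "naive_bayes": [0.80, 0.82, 0.79, 0.81, 0.80],
}

best, ceiling = performance_ceiling(fold_scores)
equivalents = equivalent_models(fold_scores)
print(best, round(ceiling, 3), equivalents)
```

With these made-up numbers, the two strongest models land in the same equivalence group, which is the situation where you can weigh other factors such as run time or interpretability.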

For example, here is a general interpretation of this methodology that you could use on your next one-off modeling project:

1. Investigate a suite of complex models and establish a performance ceiling, such as:
    1. Support Vector Machines
    2. Gradient Boosting Machines
    3. Random Forest
    4. Bagged Decision Trees
    5. Neural Networks
2. Investigate a suite of simpler more interpretable models, such as:
    1. Generalized Linear Models
    2. LASSO and Elastic-Net Regularized Generalized Linear Models
    3. Multivariate Adaptive Regression Splines
    4. k-Nearest Neighbors
    5. Naive Bayes
3. Select the model from (2) that best approximates the accuracy from (1).
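The selection step can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the accuracy values are made up, `select_model` and its `tolerance` parameter are hypothetical names, and falling back to the best complex model when no simple model gets close enough is my addition rather than part of the methodology.

```python
def select_model(complex_scores, simple_scores, tolerance=0.02):
    """Pick the simple model that best approximates the complex-model ceiling.

    Returns (chosen_name, ceiling, gap). If the best simple model is more than
    `tolerance` below the ceiling, the best complex model is returned instead
    (an assumption of this sketch).
    """
    # Step 1: the performance ceiling is the best score among the complex models.
    ceiling_name = max(complex_scores, key=complex_scores.get)
    ceiling = complex_scores[ceiling_name]

    # Step 2: find the simple model with the best score.
    best_simple = max(simple_scores, key=simple_scores.get)
    gap = ceiling - simple_scores[best_simple]

    # Step 3: keep the simple model only if it is close enough to the ceiling.
    if gap <= tolerance:
        return best_simple, ceiling, gap
    return ceiling_name, ceiling, 0.0


# Made-up cross-validated accuracies, for illustration only.
complex_scores = {"svm": 0.91, "gbm": 0.93, "random_forest": 0.92}
simple_scores = {"glm": 0.90, "lasso": 0.915, "knn": 0.88}

chosen, ceiling, gap = select_model(complex_scores, simple_scores)
print(chosen, ceiling, round(gap, 3))
```

How large a gap counts as a reasonable approximation depends on the problem. The default of 0.02 here is arbitrary and should be chosen to suit your domain.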

Quick One-Off Models

I think this is a great methodology to use for a one-off project where you need a good result quickly, such as within minutes or hours.

You have a good idea of the spread of accuracy on a problem across models.
You have a model that is easier to understand and explain to others.
You have a reasonably high-quality model very quickly (maybe the top 10% to 25% of what is achievable on the problem if you spent days or weeks).

I don’t think this is the best methodology for all problems. Some possible downsides of the methodology are:

More complex methods are slower to run and return a result.
Sometimes you want the complex model over the simpler models (e.g. domains where accuracy trumps explainability).
The performance ceiling is pursued first, rather than last, when there might be time, pressure, and motivation to extract the most from the best methods.

For more information on this strategy, check out Section 4.8, Choosing Between Models, on page 78 of Applied Predictive Modeling. It is a must-have book for any serious machine learning practitioner using R.

Do you have a methodology for finding the best machine learning algorithm for a problem? Leave a comment and share the broader strokes.

Have you used this methodology? Did it work for you?

Any questions? Leave a comment.

13 Responses to Simple 3-Step Methodology To The Best Machine Learning Algorithm

Kevin January 22, 2016 at 11:23 pm #

Some applications of Machine Learning and tutorials can be found at http://www.data-blogger.com

Pranav Verma January 23, 2016 at 3:15 am #

John, can you please provide two examples to elaborate this.

Hans May 2, 2017 at 10:27 am #

Are there templates available for 1. and 2.?

- Hans May 4, 2017 at 10:10 pm #
  
  After several weeks with your stuff Jason, now I see light at the end of the tunnel 🙂
  
  Learned from your R-tutorials a lot:
  
  https://machinelearningmastery.com/evaluate-machine-learning-algorithms-with-r/
  
  https://machinelearningmastery.com/compare-models-and-select-the-best-using-the-caret-r-package/
  
  This is indeed a completely new universe!
  
  On the caret website there are 233 Models available:
  
  https://topepo.github.io/caret/available-models.html
  
  Is there a way to gather only the ones suitable for time series prediction?
  
  - Jason Brownlee May 5, 2017 at 7:30 am #
    
    I’m glad to hear it.
    
    For time series, you can frame it as either regression or classification. You can therefore gather all classification or regression models, depending on how you frame the problem.
    
    In practice, many of the algorithms will not be worth it or will require special data preparation.
    
Hans May 5, 2017 at 10:16 am #

A)
In R and Caret we can even predict unseen data.
And the R-code seems much more compact compared to the Python ML-stack.
Why or in which situation should we choose the whole ‘Python-Enchilada’ over R and Caret?

B) Is there a top 100 list of time series forecasting algorithms anywhere?
Or which are currently the hot ones (newly invented)?

- Jason Brownlee May 5, 2017 at 11:27 am #
  
  I love R, but Python is in demand so that is why I am focusing on it:
  https://machinelearningmastery.com/python-growing-platform-applied-machine-learning/
  
  I recommend R for deep one off projects and R&D. I recommend the Python stack for code that needs to be developed for reliability/maintainability (e.g. classical software eng for production environments).
  
Riberto mark May 29, 2018 at 8:05 pm #

Machine learning is the new innovative way of learning and communication. The way it is being taken by the organizations is very progressive and the steps that are described well are also very useful for the algorithm programmers.

- Jason Brownlee May 30, 2018 at 6:40 am #
  
  Thanks.
  
Ganga Keerthi February 28, 2019 at 5:55 pm #

Which machine learning algorithm best fits predictive analysis, meaning identifying illegal activities?

- Jason Brownlee March 1, 2019 at 6:14 am #
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use
  
Blessing Iduh April 22, 2019 at 3:50 pm #

Thank you very much for this insight. I have spent months searching for the best methodology to apply in my PhD research. This is so educative.

- Jason Brownlee April 23, 2019 at 7:50 am #
  
  Thanks, I’m glad it helps.
  
