Simple 3-Step Methodology To The Best Machine Learning Algorithm

How do you choose the best algorithm for your dataset?

Machine learning is a problem of induction where general rules are learned from specific observed data from the domain.

It is infeasible (impossible?) to know beforehand what representation or what algorithm to use to best learn from the data on a specific problem, without knowing the problem so well that you probably don’t need machine learning to begin with.

So what algorithm should you use on a given problem? It’s a question of trial and error, or searching for the best representation, learning algorithm and algorithm parameters.

In this post, you will discover the simple 3-step methodology for finding the best algorithm for your problem proposed by some of the best predictive modelers in the business.

Steps To The Best Machine Learning Algorithm
Photo by David Goehring, some rights reserved.

3-Step Methodology

Max Kuhn is the creator and owner of the caret package, which provides a suite of tools for predictive modeling in R. It might be the best R package and the one reason why R is the top choice for serious competitive and applied machine learning.

In their excellent book, “Applied Predictive Modeling”, Kuhn and Johnson outline a process to select the best model for a given problem.

I paraphrase their suggested approach as:

  1. Start with the least interpretable and most flexible models.
  2. Investigate simpler models that are less opaque.
  3. Consider using the simplest model that reasonably approximates the performance of the more complex models.

They comment:

Using this methodology, the modeler can discover the “performance ceiling” for the data set before settling on a model. In many cases, a range of models will be equivalent in terms of performance so the practitioner can weight the benefits of different methodologies…
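To make the “performance ceiling” idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the accuracy numbers are invented, the complexity ranking is my own judgment call, and the function names and the 0.02 tolerance are hypothetical, not taken from the book.

```python
# Hypothetical cross-validated accuracies; the numbers are invented for illustration.
scores = {
    "gradient_boosting": 0.912,
    "random_forest": 0.905,
    "svm": 0.898,
    "elastic_net_glm": 0.894,
    "naive_bayes": 0.861,
}

# Assumed ranking from simplest (1) to most opaque (5); this is a judgment call.
complexity = {
    "naive_bayes": 1,
    "elastic_net_glm": 2,
    "svm": 3,
    "random_forest": 4,
    "gradient_boosting": 5,
}


def performance_ceiling(scores):
    """Return the best score observed across all models."""
    return max(scores.values())


def simplest_near_ceiling(scores, complexity, tolerance=0.02):
    """Return the simplest model whose score is within `tolerance` of the ceiling."""
    ceiling = performance_ceiling(scores)
    candidates = [name for name, s in scores.items() if ceiling - s <= tolerance]
    return min(candidates, key=complexity.get)


print(performance_ceiling(scores))                # 0.912
print(simplest_near_ceiling(scores, complexity))  # elastic_net_glm
```

How much accuracy you are willing to give up for a simpler model is a project decision, so treat the tolerance as something to tune rather than a fixed rule.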

For example, here is a general interpretation of this methodology that you could use on your next one-off modeling project:

  1. Investigate a suite of complex models and establish a performance ceiling, such as:
    1. Support Vector Machines
    2. Gradient Boosting Machines
    3. Random Forest
    4. Bagged Decision Trees
    5. Neural Networks
  2. Investigate a suite of simpler, more interpretable models, such as:
    1. Generalized Linear Models
    2. LASSO and Elastic-Net Regularized Generalized Linear Models
    3. Multivariate Adaptive Regression Splines
    4. k-Nearest Neighbors
    5. Naive Bayes
  3. Select the model from (2) that best approximates the accuracy from (1).
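Comparing suites of models only makes sense if every model is scored the same way. The sketch below shows one way to do that with k-fold cross-validation, using only the Python standard library. The two models are deliberately tiny placeholders that I made up for the example (a majority-class baseline and a one-dimensional nearest-neighbor rule), and the data is synthetic. On a real project each entry in the suite would wrap one of the algorithms listed above from your library of choice.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy of a model over k folds.

    `fit(X_train, y_train)` must return a function that predicts one label.
    """
    fold_scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in fold)
        fold_scores.append(correct / len(fold))
    return mean(fold_scores)


def fit_majority(X_train, y_train):
    """Placeholder model: always predict the most common training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


def fit_nearest_neighbor(X_train, y_train):
    """Placeholder model: copy the label of the closest training point (1-D inputs)."""
    def predict(x):
        nearest = min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        return y_train[nearest]
    return predict


# Synthetic 1-D data: the label is 1 when the input is above 0.5.
rng = random.Random(42)
X = [rng.random() for _ in range(100)]
y = [int(x > 0.5) for x in X]

# Score every model in the suite with the same folds.
suite = {"majority_baseline": fit_majority, "nearest_neighbor": fit_nearest_neighbor}
results = {name: cross_val_accuracy(fit, X, y) for name, fit in suite.items()}

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Because every model sees the same folds (same seed), differences in the printed accuracies reflect the models rather than a lucky split.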

Quick One-Off Models

I think this is a great methodology to use for a one-off project where you need a good result quickly, such as within minutes or hours. The benefits are:

  • You have a good idea of the spread of accuracy on a problem across models.
  • You have a model that is easier to understand and explain to others.
  • You have a reasonably high quality model very quickly (maybe top 10-to-25% of what is achievable on the problem if you spent days or weeks).

I don’t think this is the best methodology for all problems. Perhaps some downsides to the methodology are:

  • More complex methods are slower to run and return a result.
  • Sometimes you want the complex model over the simpler models (e.g. domains where accuracy trumps explainability).
  • The performance ceiling is pursued first, rather than last, when there might be time, pressure and motivation to extract the most from the best methods.

For more information on this strategy, check out Section 4.8 Choosing Between Models, page 78 of Applied Predictive Modeling. It is a must-have book for any serious machine learning practitioner using R.

Do you have a methodology for finding the best machine learning algorithm for a problem? Leave a comment and share the broader strokes.

Have you used this methodology? Did it work for you?

Any questions? Leave a comment.

13 Responses to Simple 3-Step Methodology To The Best Machine Learning Algorithm

  1. Kevin January 22, 2016 at 11:23 pm #

    Some applications of Machine Learning and tutorials can be found at http://www.data-blogger.com

  2. Pranav Verma January 23, 2016 at 3:15 am #

    John, can you please provide two examples to elaborate on this?

  3. Hans May 2, 2017 at 10:27 am #

    Are there templates available for 1. and 2.?

  4. Hans May 5, 2017 at 10:16 am #

    A)
    In R and Caret we can even predict unseen data.
    And the R-code seems much more compact compared to the Python ML-stack.
    Why or in which situation should we choose the whole ‘Python-Enchilada’ over R and Caret?

    B) Is there a top 100 of time series forecasting algorithms anywhere?
    Or which ones are currently hot (newly invented)?

  5. Riberto mark May 29, 2018 at 8:05 pm #

    Machine learning is the new innovative way of learning and communication. The way it is being taken by the organizations is very progressive and the steps that are described well are also very useful for the algorithm programmers.

  6. Ganga Keerthi February 28, 2019 at 5:55 pm #

    Which machine learning algorithm best fits predictive analysis, which means identifying illegal activities?

  7. Blessing Iduh April 22, 2019 at 3:50 pm #

    Thank you very much for this insight. I have spent months searching for the best methodology to apply in my PhD research. This is so educative.
