If you don’t know what algorithm to use on your problem, try a few.
Alternatively, you could just try Random Forest and maybe a Gaussian SVM.
In a recent study these two algorithms were demonstrated to be the most effective when raced against nearly 200 other algorithms averaged over more than 100 data sets.
In this post we will review this study and consider some implications for testing algorithms on our own applied machine learning problems.
Do We Need Hundreds of Classifiers
The title of the paper is “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” and it was published in the Journal of Machine Learning Research in October 2014.
In the paper, the authors evaluate 179 classifiers arising from 17 families across 121 standard datasets from the UCI machine learning repository.
As a taste, here is a list of the families of algorithms investigated and the number of algorithms in each family.
- Discriminant analysis (DA): 20 classifiers
- Bayesian (BY) approaches: 6 classifiers
- Neural networks (NNET): 21 classifiers
- Support vector machines (SVM): 10 classifiers
- Decision trees (DT): 14 classifiers
- Rule-based methods (RL): 12 classifiers
- Boosting (BST): 20 classifiers
- Bagging (BAG): 24 classifiers
- Stacking (STC): 2 classifiers
- Random Forests (RF): 8 classifiers
- Other ensembles (OEN): 11 classifiers
- Generalized Linear Models (GLM): 5 classifiers
- Nearest neighbor methods (NN): 5 classifiers
- Partial least squares and principal component regression (PLSR): 6 classifiers
- Logistic and multinomial regression (LMR): 3 classifiers
- Multivariate adaptive regression splines (MARS): 2 classifiers
- Other Methods (OM): 10 classifiers
This is a huge study.
Some algorithms were tuned before contributing their final score, and algorithms were evaluated using 4-fold cross-validation.
Cutting to the chase, they found that Random Forest (specifically parallel random forest in R) and Gaussian Support Vector Machines (specifically from libSVM) performed the best overall.
From the abstract of the paper:
The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets.
This is consistent with our experience running hundreds of Kaggle competitions: for most classification problems, some variation on ensembles decision trees (random forests, gradient boosted machines, etc.) performs the best.
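To make the abstract's figures concrete, here is a minimal sketch of how I read them: for each data set, divide a classifier's accuracy by the best accuracy any classifier achieved on that data set, then average those percentages and count how often they exceed 90%. The accuracies, classifier names and function names below are made up for illustration and are not results from the paper.

```python
# Hypothetical accuracies (fraction correct), keyed by data set and then classifier.
accuracies = {
    "dataset_a": {"rf": 0.95, "svm": 0.93, "knn": 0.80},
    "dataset_b": {"rf": 0.70, "svm": 0.75, "knn": 0.72},
    "dataset_c": {"rf": 0.88, "svm": 0.99, "knn": 0.85},
}


def percent_of_max(accuracies, classifier):
    """Per data set: the classifier's accuracy as a percentage of the best accuracy there."""
    return {
        name: 100.0 * scores[classifier] / max(scores.values())
        for name, scores in accuracies.items()
    }


def summarize(accuracies, classifier, threshold=90.0):
    """Return (mean percent of max accuracy, percent of data sets above the threshold)."""
    ratios = percent_of_max(accuracies, classifier)
    mean_pct = sum(ratios.values()) / len(ratios)
    share_above = 100.0 * sum(r > threshold for r in ratios.values()) / len(ratios)
    return mean_pct, share_above


mean_pct, share_above = summarize(accuracies, "rf")
print(f"rf: {mean_pct:.1f}% of max accuracy on average, "
      f"above 90% on {share_above:.1f}% of data sets")
```

Normalizing by the per-data-set maximum means an easy data set and a hard data set contribute on the same scale, which is why a classifier can score well here without ever being the single best.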
Be Very Careful Preparing Data
Some algorithms only work with categorical data and others require numerical data. A few can handle whatever you throw at them. The datasets in the UCI machine learning repository are generally standardized, but not enough to be used in their raw state for a study like this.
This has been pointed out in the post “A Comment on Preparing Data for Classifiers”.
In this commentary, the author points out that categorical data in the relevant datasets was systematically transformed into numerical values, but in a way that likely hindered some of the algorithms being tested.
The Gaussian SVM likely performed well because of the transformation of categorical attributes to numerical and the standardization of the datasets that was performed.
Nevertheless, I commend the courage the authors had in taking on this challenge, and the problems may be addressed by those willing to take on follow-up studies.
The authors also note the OpenML project, which looks like a citizen science effort to take on the same challenge.
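The commentary's details are not reproduced here, but one way to see why a categorical-to-numerical mapping can matter is to compare two common encodings. The sketch below is my own illustration, not the transformation used in the study: the color categories and helper names are hypothetical. Integer coding imposes an order and unequal distances on categories that have none, while one-hot coding keeps every pair of categories equally far apart.

```python
colors = ["red", "green", "blue"]  # hypothetical unordered categories


def integer_encode(value, categories):
    """Map a category to its position in the list; this imposes an arbitrary order."""
    return [float(categories.index(value))]


def one_hot_encode(value, categories):
    """Map a category to a binary indicator vector with one slot per category."""
    return [1.0 if value == c else 0.0 for c in categories]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


for encode in (integer_encode, one_hot_encode):
    d_red_green = euclidean(encode("red", colors), encode("green", colors))
    d_red_blue = euclidean(encode("red", colors), encode("blue", colors))
    print(f"{encode.__name__}: red-green={d_red_green:.3f}, red-blue={d_red_blue:.3f}")
```

Distance-based methods such as nearest neighbors or a Gaussian kernel see these two encodings very differently, whereas a decision tree can often split on either, so the choice of encoding can quietly favor some families of algorithms over others.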
Why Do Studies Like This?
It is easy to snipe at this study with arguments along the lines of the No Free Lunch Theorem (NFLT): that the performance of all algorithms is equivalent when averaged over all problems.
I dislike this argument. The NFLT requires that you have no prior knowledge: that you don’t know what problem you are working on or what algorithms you are trying. These conditions are not practical.
In the paper, the authors list four goals for the project:
- To select the globally best classifier for the selected data set collection
- To rank each classifier and family according to its accuracy
- To determine, for each classifier, its probability of achieving the best accuracy, and the difference between its accuracy and the best one
- To evaluate the classifier behavior varying the data set properties (complexity, number of patterns, number of classes and number of inputs)
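The second and third goals can be sketched in a few lines. The example below computes an average rank per classifier across data sets and the share of data sets on which each classifier is the best. The accuracy table and helper names are hypothetical, and the tie handling (tied classifiers share an average rank) is my assumption rather than the paper's stated procedure.

```python
# Hypothetical accuracies (fraction correct), keyed by data set and then classifier.
accuracies = {
    "dataset_a": {"rf": 0.95, "svm": 0.93, "knn": 0.80},
    "dataset_b": {"rf": 0.70, "svm": 0.75, "knn": 0.72},
    "dataset_c": {"rf": 0.88, "svm": 0.99, "knn": 0.85},
}


def ranks_for_dataset(scores):
    """Rank classifiers on one data set: 1 is best, ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over any classifiers tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = average_rank
        i = j + 1
    return ranks


def mean_ranks(accuracies):
    """Average each classifier's rank over all data sets (lower is better)."""
    totals = {}
    for scores in accuracies.values():
        for name, rank in ranks_for_dataset(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(accuracies) for name, total in totals.items()}


def share_best(accuracies):
    """Fraction of data sets on which each classifier attains the top accuracy."""
    counts = {name: 0 for name in next(iter(accuracies.values()))}
    for scores in accuracies.values():
        best = max(scores.values())
        for name, accuracy in scores.items():
            if accuracy == best:
                counts[name] += 1
    return {name: count / len(accuracies) for name, count in counts.items()}


ranks = mean_ranks(accuracies)
best = share_best(accuracies)
for name in sorted(ranks, key=ranks.get):
    print(f"{name}: mean rank {ranks[name]:.2f}, best on {100 * best[name]:.0f}% of data sets")
```

Ranks and raw accuracy gaps answer different questions: a classifier can rank well on average while rarely being the outright winner, which is why the paper's goals list both.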
The authors of the study acknowledge that practical problems we want to solve are a subset of all possible problems and that the number of effective algorithms is not infinite but manageable.
The paper is a statement that indeed we may have something to say about the capability of the suite of most used (or implemented) algorithms on a suite of known (but small) problems, much like the STATLOG project from the mid-1990s.
In Practice: Choose a Middle Ground
You cannot know which algorithm (or algorithm configuration) will perform well or even best on your problem before you get started.
You must try multiple algorithms and double down your efforts on those few that demonstrate their ability to pick out the structure in the problem.
In the context of this study, spot checking is a middle ground between going with your favorite algorithm on one hand and testing all known algorithms on the other hand.
- Pick your favorite algorithm. Fast but limited to whatever your favorite algorithm or library happens to be.
- Spot check a dozen algorithms. A balanced approach that allows better performing algorithms to rise to the top for you to focus on.
- Test all known/implemented algorithms. A time-consuming, exhaustive approach that can sometimes deliver surprising results.
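As a rough illustration of the spot-checking option, here is a minimal harness in plain Python. Everything in it is an assumption for the sake of the sketch: the synthetic two-class data, the three toy classifiers standing in for a real shortlist, and the helper names. It uses 4 folds only to mirror the study. In practice you would swap in real implementations from your library of choice.

```python
import random
from collections import Counter


def make_data(n=120, seed=1):
    """Hypothetical two-class problem: two Gaussian blobs in 2D."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 3.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    rng.shuffle(rows)
    return rows


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def majority_class(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def one_nearest_neighbor(train):
    """Predict the label of the closest training row."""
    return lambda x: min(train, key=lambda row: dist2(row[0], x))[1]


def nearest_centroid(train):
    """Predict the label whose class mean is closest."""
    centroids = {}
    for label in {y for _, y in train}:
        points = [x for x, y in train if y == label]
        centroids[label] = [sum(col) / len(points) for col in zip(*points)]
    return lambda x: min(centroids, key=lambda lab: dist2(centroids[lab], x))


def cross_val_accuracy(fit, rows, k=4):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = fit(train)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(scores) / k


rows = make_data()
candidates = {
    "majority": majority_class,
    "1-NN": one_nearest_neighbor,
    "centroid": nearest_centroid,
}
results = {name: cross_val_accuracy(fit, rows) for name, fit in candidates.items()}
for name, accuracy in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {accuracy:.3f}")
```

Including a trivial baseline such as the majority-class predictor is a cheap way to confirm that the algorithms rising to the top are actually picking out structure in the problem.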
Where you land on this spectrum depends on the time and resources you have at your disposal. And remember that trialling algorithms on a problem is but one step in the process of working through a problem.
Testing all algorithms requires a robust test harness. This cannot be overstated.
When I have attempted this in the past, I have found that most algorithms pick out most of the structure in the problem. The results form a bunched distribution with a fat head and a long tail, and the differences within the fat head are often very minor.
It is this minor difference that you would like to be meaningful. Hence the need for you to invest a lot of upfront time in designing a robust test harness (cross-validation, a good number of folds, perhaps a separate validation dataset) without data leakage (data scaling/transforms performed within cross-validation folds, etc.).
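To show what scaling within cross-validation folds looks like, here is a small sketch in plain Python. The data and function names are hypothetical. The point is only that the mean and standard deviation are learned from the training part of each fold and then applied unchanged to the held-out part, so nothing about the test rows leaks into the preparation step.

```python
def fit_standardizer(train_X):
    """Learn per-column mean and standard deviation from the training rows only."""
    columns = list(zip(*train_X))
    means = [sum(col) / len(col) for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        variance = sum((v - mean) ** 2 for v in col) / len(col)
        stds.append(variance ** 0.5 or 1.0)  # fall back to 1.0 for constant columns
    return means, stds


def apply_standardizer(X, means, stds):
    """Scale rows with previously learned statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def folds_with_scaling(X, k=4):
    """Yield (train, test) pairs scaled with statistics from the training fold only."""
    indices = list(range(len(X)))
    for i in range(k):
        test_idx = indices[i::k]
        held_out = set(test_idx)
        train_X = [X[j] for j in indices if j not in held_out]
        test_X = [X[j] for j in test_idx]
        means, stds = fit_standardizer(train_X)  # the test rows are never seen here
        yield apply_standardizer(train_X, means, stds), apply_standardizer(test_X, means, stds)


# Hypothetical data with one extreme value in the first column of the last row.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0],
     [5.0, 50.0], [6.0, 60.0], [7.0, 70.0], [100.0, 80.0]]

for fold, (train_scaled, test_scaled) in enumerate(folds_with_scaling(X)):
    largest = max(abs(v) for row in test_scaled for v in row)
    print(f"fold {fold}: {len(train_scaled)} train rows, "
          f"{len(test_scaled)} test rows, largest scaled test value {largest:.2f}")
```

If the scaler were fit once on all rows before splitting, the extreme row would shift the statistics used for every fold, including the folds where it is supposed to be unseen, and scores would look slightly better than they should.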
I take this for granted on applied problems now. I don’t even care which algorithms rise up. I focus my efforts on data preparation and on ensembling the results of a diverse set of good enough models.
Where do you fall on the spectrum when working a machine learning problem?
Do you stick with a favorite or favorite set of algorithms? Do you spot check or do you try to be exhaustive and test everything that your favorite libraries have to offer?