Do you want some tips and tricks that are useful in developing successful machine learning applications?
This is the subject of a journal article from 2012 titled “A Few Useful Things to Know about Machine Learning” (PDF) by University of Washington professor Pedro Domingos.
It’s an interesting read with a great opening hook:
developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks
This post summarizes the 12 key lessons learned by machine learning researchers and practitioners outlined in his article.
1. Learning = representation + evaluation + optimization
When faced with a machine learning problem, don’t get lost in the weeds with the hundreds of possible machine learning algorithms you could be using.
Focus on three key components:
- Representation. The classifier you pick defines the representation the solution will take and the space of all learnable classifiers, called the hypothesis space. Examples include instances, hyperplanes, decision trees, rule sets and neural networks.
- Evaluation. The evaluation function that you will use to tell a good classifier from a bad classifier. This may be the loss function used internally by the algorithm. Examples include accuracy, squared error, likelihood and information gain.
- Optimization. The method to search for a good classifier. This is internally how the algorithm traverses the hypothesis space in the context of the evaluation function to a final classifier capable of making accurate predictions. Examples include combinatorial optimization and continuous optimization, and subsets thereof.
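As a rough illustration of how the three components fit together, here is a minimal sketch. The threshold classifier, the toy data and all function names are assumptions made for this example; they do not come from the paper.

```python
import random

# Representation: a one-dimensional threshold classifier (a "decision stump").
# The hypothesis space is every possible value of `threshold`.
def predict(threshold, x):
    return 1 if x >= threshold else 0

# Evaluation: accuracy, the fraction of examples classified correctly.
def accuracy(threshold, examples):
    correct = sum(1 for x, y in examples if predict(threshold, x) == y)
    return correct / len(examples)

# Optimization: exhaustive search over candidate thresholds taken from the data.
def fit(examples):
    candidates = sorted(x for x, _ in examples)
    return max(candidates, key=lambda t: accuracy(t, examples))

# Hypothetical toy data: the true class is 1 when x >= 0.6.
rng = random.Random(0)
data = [(x, 1 if x >= 0.6 else 0) for x in (rng.random() for _ in range(200))]

best_threshold = fit(data)
train_accuracy = accuracy(best_threshold, data)
print(best_threshold, train_accuracy)
```

Swapping any one of the three pieces (say, squared error instead of accuracy, or a greedy search instead of an exhaustive one) gives a different learner, which is the point of the decomposition.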
2. It’s generalization that counts
The fundamental goal of machine learning is to generalize beyond the examples in the training set.
We must expect that we will not encounter samples the same as those in the training set again. Evaluating the accuracy of a model on the training set alone gives little idea of the usefulness of the model on unseen data.
Be careful not to contaminate the training process with test data. Use cross validation on the training dataset and create a hold-out set for final validation.
We have to use training error as a surrogate for test error, and this is fraught with danger.
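The advice above about a hold-out set and cross validation can be sketched as follows. This is a minimal illustration using only the standard library; the function names and the 80/20 split are assumptions for the example, not something the paper prescribes.

```python
import random

def train_holdout_split(examples, holdout_fraction=0.2, seed=0):
    """Shuffle once, then set aside a hold-out set that is never used for tuning."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def k_fold_indices(n_examples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    for fold in range(k):
        validation = indices[fold::k]
        validation_set = set(validation)
        train = [i for i in indices if i not in validation_set]
        yield train, validation

examples = list(range(100))  # stand-ins for (features, label) pairs
train_set, holdout_set = train_holdout_split(examples)

# Cross validation only ever sees the training portion.
folds = list(k_fold_indices(len(train_set), k=5))
print(len(train_set), len(holdout_set), len(folds))
```

Model selection and tuning would use the folds; the hold-out set is touched once, at the very end.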
3. Data alone is not enough
Data alone is not enough regardless of how much you have.
Every learner must embody some knowledge or assumptions beyond the data it’s given in order to generalize beyond it.
We must make assumptions. These create biases, but allow us to avoid the trap of no free lunch and assuming nothing about the problem we are working on.
The simple assumptions we make in our classifiers take us a long way, such as:
- smoothness of the error function
- similar examples have similar classes
- limited dependencies
- limited complexity
Induction (the type of learning used by machine learning methods) turns a small amount of data into a large amount of output knowledge. It is more powerful than deduction. Induction requires a small amount of knowledge as input and we must use this effectively.
- If we have a lot of knowledge about what makes examples similar in our domain we could use instance methods.
- If we have knowledge about probabilistic dependencies we could use graphical models.
- If we have knowledge about what kinds of preconditions are required by each class we could use rule sets.
Machine learning is not magic; it can’t get something from nothing. What it does is get more from less.
4. Overfitting has many faces
We must be wary of learning the random fluctuations in training data. This is called overfitting and can be recognized when accuracy is high on the training data and low on the testing dataset.
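The symptom is easy to reproduce. The sketch below is a hypothetical example, not taken from the paper: a 1-nearest-neighbor classifier memorizes a training set in which 20% of the labels have been flipped at random, then is scored on fresh data drawn the same way.

```python
import random

def make_data(n, rng, noise=0.2):
    """1-D points in [0, 1); true class is 1 when x >= 0.5, with some labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x >= 0.5 else 0
        if rng.random() < noise:
            y = 1 - y  # random fluctuation unrelated to the real signal
        data.append((x, y))
    return data

def predict_1nn(train, x):
    """Return the label of the single closest training point."""
    _, label = min(train, key=lambda example: abs(example[0] - x))
    return label

def accuracy(train, examples):
    hits = sum(1 for x, y in examples if predict_1nn(train, x) == y)
    return hits / len(examples)

rng = random.Random(42)
train = make_data(200, rng)
test = make_data(200, rng)

train_accuracy = accuracy(train, train)  # each point is its own nearest neighbor
test_accuracy = accuracy(train, test)
print(train_accuracy, test_accuracy)
```

Training accuracy is perfect because the learner has memorized the noise along with the signal; the gap to the test accuracy is the overfitting.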
Generalization error can be decomposed into bias and variance:
- Bias is a learner’s tendency to consistently learn the same wrong thing. A linear learner has high bias because it is limited to separating classes using a hyperplane.
- Variance is a learner’s tendency to learn random things regardless of the real signal. Decision trees have high variance as they are highly influenced by the specifics in the training data.
Sometimes strong false assumptions (read: bias) can be better than weak true assumptions, explaining why naive Bayes with strong independence assumptions can do better than powerful decision trees like C4.5 that require more data to avoid overfitting.
- Cross validation helps, but can cause problems if we check too often and end up overfitting the entire training dataset.
- Regularization can help by penalizing more complex classifiers.
- Statistical significance tests can help to decide whether a change in results is meaningful or not.
It’s easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch).
For more, see the Wikipedia entry on the Bias-variance tradeoff.
5. Intuition fails in high dimensions
The second biggest problem in machine learning is the curse of dimensionality.
Domingos states it very well:
Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space.
Similarity-based reasoning breaks down quickly in high-dimensional space. In high dimensions all examples look alike. Our intuitions also break down: for example, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant shell around it.
The effect that counteracts this problem is called the “blessing of non-uniformity” (Domingos’s term I think). This refers to the fact that observations from real-world domains are often not distributed uniformly, but grouped or clustered in useful and meaningful ways.
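The claim that all examples look alike in high dimensions can be checked numerically. The following is a small sketch under assumed conditions (uniformly distributed points in the unit hypercube, Euclidean distance); the function name and parameters are made up for the illustration.

```python
import math
import random

def distance_contrast(dimensions, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a random query point.

    A ratio close to 1 means every point is roughly equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)

contrast_by_dimension = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, ratio in contrast_by_dimension.items():
    print(d, round(ratio, 3))
```

As the number of dimensions grows, the nearest and farthest neighbors end up at nearly the same distance, so "nearest" carries less and less information. Uniform data is the worst case; clustered real-world data suffers less, which is the blessing of non-uniformity.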
6. Theoretical guarantees are not what they seem
Theoretical guarantees should be taken with a large grain of salt. Two common kinds of guarantee are:
- A bound on the number of samples needed for an algorithm to ensure good generalization.
- Given infinite data, the algorithm is guaranteed to output the correct classifier.
If you are not from a theoretical machine learning background, this lesson may seem esoteric. Domingos summarizes it well:
The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design.
7. Feature engineering is the key
The factor that makes the biggest difference between machine learning projects that fail and those that succeed is the features used.
Learning is easy when all of the features correlate with the class, but more often the class is a complex function of the features.
machine learning is not a one-shot process of building a data set and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating
Raw data often does not contain enough structure for the learning algorithms; features must be constructed from the data available to better expose the structure to the algorithms. As such, feature engineering is often domain specific.
One approach is to generate a large number of features and select those that best correlate with the class. This can work well, but a trap is to ignore features that look irrelevant in isolation yet are useful in combination, through non-linear relationships with the output variable.
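XOR is the classic case of this trap. In the sketch below (a constructed toy example with hypothetical variable names), each raw feature has zero correlation with the class, so correlation-based selection would discard both, yet a feature built from their product determines the class completely.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# XOR with inputs coded as -1/+1: the class is 1 when the two inputs differ.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
label = [1 if a != b else 0 for a, b in zip(x1, x2)]

# Each raw feature is uncorrelated with the class on its own...
corr_x1 = pearson(x1, label)
corr_x2 = pearson(x2, label)

# ...but a constructed feature (their product) is perfectly (negatively) correlated.
product = [a * b for a, b in zip(x1, x2)]
corr_product = pearson(product, label)

print(corr_x1, corr_x2, corr_product)
```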
For more on feature engineering see the post: Discover Feature Engineering, How to Engineer Features and How to Get Good at It
8. More data beats a cleverer algorithm
When you reach a limit and still need better results, you have two options:
- Design a better learning algorithm
- Gather more data (more observations and/or more features)
The quickest path to better results is often to get more data.
As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.
Computer science is constrained by time and memory; machine learning adds a third constraint, which is training data.
Today we often have more data than we can use. Complex classifiers can take too long to train or do not work well at scale. This means more often than not simpler classifiers are used in practice. Also, at scale, most classifiers achieve very similar results.
All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of “nearby.”
As a rule, use simpler algorithms before more complex algorithms. Look at complexity in terms of the number of parameters or terms used by the algorithm.
9. Learn many models, not just one
Don’t pick a favorite algorithm and optimize it to hell on your problem. Try lots of different algorithms then ensemble them together to get the best results.
In the early days of machine learning, everyone had their favorite learner, together with some a priori reasons to believe in its superiority.
Consider taking a closer look at the three most popular ensemble methods:
- Bagging: generate different samples of the training data, prepare a learner on each and combine the predictions using voting.
- Boosting: weight training instances by their difficulty during training to put special focus on those difficult to classify instances.
- Stacking: use a higher-level classifier to learn how to best combine the predictions of other classifiers.
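Of the three, bagging is the simplest to sketch. The example below is a minimal illustration, not a reference implementation: the base learner (a one-dimensional threshold stump), the toy data and the function names are all assumptions chosen to keep the code short.

```python
import random
from collections import Counter

def fit_stump(examples):
    """Pick the threshold (from the observed x values) that classifies most examples correctly."""
    def correct_count(t):
        return sum(1 for x, y in examples if (1 if x >= t else 0) == y)
    return max((x for x, _ in examples), key=correct_count)

def bagging_fit(examples, n_models, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(examples) for _ in examples]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the stumps' predictions by majority vote."""
    votes = Counter(1 if x >= t else 0 for t in models)
    return votes.most_common(1)[0][0]

rng = random.Random(1)
# Hypothetical toy data: class is 1 when x >= 0.5, with about 10% of labels flipped.
train = []
for _ in range(300):
    x = rng.random()
    y = 1 if x >= 0.5 else 0
    train.append((x, y if rng.random() >= 0.1 else 1 - y))

models = bagging_fit(train, n_models=25, rng=rng)
predictions = [bagging_predict(models, x) for x in (0.1, 0.9)]
print(predictions)
```

An odd number of models avoids tied votes. Boosting and stacking follow the same pattern of training several learners, but differ in how the training data is reweighted and how the predictions are combined.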
10. Simplicity does not imply accuracy
Choosing the simpler of two classifiers with the same training error is a good rule of thumb. However, the simpler classifier does not always have the best accuracy on the test dataset.
A perhaps more interesting perspective is to consider complexity in terms of the size of the hypothesis space for each classifier, that is, the space of possible classifiers that each algorithm could generate. A larger space is likely to be sampled less and the resulting classifier may be less likely to have been overfit to the training data.
The conclusion is that simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy.
11. Representable does not imply learnable
Related to picking favorite algorithms, practitioners can fall into the trap of picking favorite representations and justifying them with theoretical claims of universal approximation (e.g. that the representation can be used to approximate any arbitrary target function).
Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations.
Focus on whether the target function can be learned, not on whether it can be represented.
12. Correlation does not imply causation
Classifiers can only learn correlations. They are statistical in nature.
The predictions made by predictive models are intended to aid human decision making in complex domains where only historical observations are available and controlled experiments are not possible.
Interestingly, correlation can be a guide to causation, and could be used as a starting point for investigation.
In this post we looked at the 12 lessons learned by machine learning researchers and practitioners outlined in Domingos’ 2012 paper.
Those lessons again were:
- Learning = representation + evaluation + optimization
- It is generalization that counts
- Data alone is not enough
- Overfitting has many faces
- Intuition fails in high dimensions
- Theoretical guarantees are not what they seem
- Feature engineering is the key
- More data beats a cleverer algorithm
- Learn many models, not just one
- Simplicity does not imply accuracy
- Representable does not imply learnable
- Correlation does not imply causation
You can download a PDF of the original paper titled “A Few Useful Things to Know about Machine Learning”.
Domingos is also the author of an online machine learning course on Coursera titled “Machine Learning”, presumably recorded at the University of Washington. All of the videos for the course can be viewed for free by clicking the “Preview Lectures” button.
Finally, Domingos has a new book titled “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”. My copy arrived today.