Machine Learning Q&A: Concept Drift, Better Results and Learning Faster

I get a lot of questions about machine learning via email and I love answering them.

I get to see what real people are doing and help to make a difference. (Do you have a question about machine learning? Contact me).

In this post I highlight a few of the interesting questions I have received recently and summarize my answers.

Machine Learning Q&A
Photo by Angelo Amboldi, some rights reserved

Why does my spam classifier get worse when I train it on lots of old emails?

This is a great question as it highlights an important concept in machine learning called concept drift.

The content of email changes over time. The user will change who they converse with and on which topics. Email spammers will send different offers and will actively change their tactics to avoid spam detection.

These changes affect the modeling.

The best source of information about which emails are and are not spam is the most recently received email. The further you go back in time, the less useful the emails will be to the modeling problem.

The idea of what is and is not spam is captured in the model and is based on the data you used to train that model. If the idea or concept of what is and is not spam changes, then you need to collect more examples and update your model.

This is an important property of a problem and can influence decisions that you make about modeling the problem. For example, you may want to select a model that can be easily updated incrementally rather than being rebuilt from scratch.

How can I get better results on my machine learning problem?

Like a piece of software or a piece of art, a model is never done. One day you just stop working on it.

There are a lot of things you can try, some broad areas include:

  • Work the data: Look into feature engineering in an attempt to expose more of the useful structure in the problem to the modeling algorithms. See if you can collect additional data that can inform the problem. Investigate data preparation such as scaling and other data transforms that may better expose structure in the problem.
  • Work other algorithms: Are there other algorithms you can spot check? There are always more algorithms, and there are often very powerful algorithms that you can seek out and try.
  • Work the algorithm: Have you gotten the most from the algorithms you have tried? Tune the algorithm parameters using grid or random search.
  • Combine predictions: Try combining the predictions from multiple well-performing but different algorithms. Use ensemble methods like bagging, boosting and blending.
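Two of these levers can be sketched in a few lines with scikit-learn (an assumed library; the dataset, parameter grid, and model choices below are illustrative): tuning a single algorithm with a grid search, then combining it with different well-performing models in a voting ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Work the algorithm: tune parameters with a grid search.
grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X, y)

# Combine predictions: soft-vote across different model types.
ensemble = VotingClassifier(
    [
        ("rf", grid.best_estimator_),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", SVC(probability=True, random_state=1)),
    ],
    voting="soft",
)
score = cross_val_score(ensemble, X, y, cv=3).mean()
```

Swapping `GridSearchCV` for `RandomizedSearchCV` covers the random-search option mentioned above, and bagging or boosting estimators can replace the voting ensemble.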

The further you push accuracy, the higher the likelihood that you are overfitting your model to the training data and limiting its applicability to unseen data.

Revisit your problem definition and set minimum accuracy thresholds. Often a “good enough” model is more practically applicable than a finely tuned (and fragile) model.

See this post titled “Model Prediction Accuracy Versus Interpretation in Machine Learning”.

How can I learn machine learning faster?

Practice. A lot.

  1. Read books, take courses, study and leverage what others have figured out.
  2. Get good at the process of working problems end-to-end.
  3. Study machine learning algorithms.
  4. Work problems, reproduce results from papers and competitions.
  5. Design and execute small self-study projects and build up a portfolio of results.

Learning new things is not good enough.

To learn faster you need to work harder. You need to put the things you are learning into action. You need to work and rework problems.

What are some problems to work on?

Start with the datasets in the UCI machine learning repository. They are small, they fit into memory and they are used by academics to demonstrate algorithm properties and behaviors, so they are somewhat well understood.

The list of the most popular datasets would be a good place to start.
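For example, the iris flowers dataset, one of the most popular UCI datasets, ships with scikit-learn (an assumed library choice), so you can load it and evaluate a first model in a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Iris originated in the UCI repository: 150 samples, 4 features,
# small enough to fit comfortably in memory.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
mean_accuracy = scores.mean()
```

Other UCI datasets can be downloaded as CSV files from the repository and loaded the same way with pandas.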

Move on to competition datasets. Get good enough results, then attempt to reproduce the results of competition winners (in broad strokes; often there is insufficient information to reproduce them exactly).

Datasets from the most recent KDDCup and Kaggle competitions would be a good place to start.

Finally, move on to raising your own questions (or taking on others’), defining your own problems, collecting the data, and generally working problems end-to-end.

How to go beyond driving machine learning tools?

I advise beginners to learn how to drive machine learning tools and libraries and get good at working machine learning problems end-to-end.

I do this because this is the bread and butter of applied machine learning and there is a lot to learn in this process, from data preparation to algorithms, to communicating results.

Going deeper involves specialization. For example, you can go deeper into machine learning algorithms. You can study them, make lists, describe them and implement them from scratch. In fact, there’s no limit to how deep you can dive, but you do want to pick an area that you find compelling.

A general framework I suggest for going deeper via self-study is my small projects methodology. That is where you define a small project (5 to 10 hours of effort), execute it and share the results, then repeat.

I suggest four classes of project: investigate a tool, investigate an algorithm, investigate a problem and implement an algorithm. The latter three may appeal if you are eager to move beyond driving a machine learning tool or library.

Ask a Question

If you have a machine learning question, contact me.

If you are interested in my approach to machine learning, take a look at my start-here page. It links to lots of useful blog posts and resources.
