There was a recent question that asked “How to not waste-time/procrastinate while ml scripts are running?”. I think this is an important question. I think answers to this question show a level of organization or maturity in your approach to work. I left a small comment on this question, but in this post I elaborate […]
Archive | Machine Learning Process
Common Pitfalls In Machine Learning Projects
In a recent presentation, Ben Hamner described the common pitfalls in machine learning projects he and his colleagues have observed during competitions on Kaggle. The talk was titled “Machine Learning Gremlins” and was presented in February 2014 at Strata. In this post we take a look at the pitfalls from Ben’s talk, what they look like and how […]
How To Work Through A Problem Like A Data Scientist
In a 2010 post Hilary Mason and Chris Wiggins described the OSEMN process as a taxonomy of tasks that a data scientist should feel comfortable working on. The title of the post was “A Taxonomy of Data Science” on the now defunct dataists blog. This process has also been used as the structure of a […]
Lessons Learned from Building Machine Learning Systems
In a recent presentation at MLConf, Xavier Amatriain described 10 lessons that he has learned about building machine learning systems as the Research/Engineering Manager at Netflix. In this post you will discover these 10 lessons in a summary from his talk and slides. 10 Lessons Learned The 10 lessons that Xavier presents can be summarized as […]
Assessing and Comparing Classifier Performance with ROC Curves
The most commonly reported measure of classifier performance is accuracy: the percent of correct classifications obtained. This metric has the advantage of being easy to understand and makes comparison of the performance of different classifiers trivial, but it ignores many of the factors which should be taken into account when honestly assessing the performance of […]
Understand Your Problem and Get Better Results Using Exploratory Data Analysis
You often jump from problem-to-problem in applied machine learning and you need to get up to speed on a new dataset, fast. A classical and under-utilised approach that you can use to quickly build a relationship with a new data problem is Exploratory Data Analysis. In this post you will discover Exploratory Data Analysis (EDA), […]
Data Management Matters And Why You Need To Take It Seriously
We live in a world drowning in data. Internet tracking, stock market movement, genome sequencing technologies and their ilk all produce enormous amounts of data. Most of this data is someone else’s responsibility, generated by someone else, stored in someone else’s database, which is maintained and made available by… you guessed it… someone else. But. […]
Why Aren’t My Results As Good As I Thought? You’re Probably Overfitting
We all know the satisfaction of running an analysis and seeing the results come back the way we want them to: 80% accuracy; 85%; 90%? The temptation is strong just to turn to the Results section of the report we’re writing, and put the numbers in. But wait: as always, it’s not that straightforward. Succumbing […]
How To Get Baseline Results And Why They Matter
In my courses and guides, I teach the preparation of a baseline result before diving into spot checking algorithms. A student of mine recently asked: If a baseline is not calculated for a problem, will it make the results of other algorithms questionable? He went on to ask: If other algorithms do not give better accuracy […]
Model Selection Tips From Competitive Machine Learning
After spot checking algorithms on your problem and tuning the better few, you ultimately need to select one or two best models with which to proceed. This problem is called model selection and can be vexing because you need to make a choice given incomplete information. This is where the test harness you create and […]