Phil Brierley won the Heritage Health Prize Kaggle machine learning competition. Phil was trained as a mechanical engineer and has a background in data mining with his company Tiberius Data Mining. He is heavily into R these days and keeps a blog at Another Data Mining Blog.
In October 2013 he presented to the Melbourne Users of R special interest group. The title of his talk was “Techniques to improve the accuracy of your Predictive Models” and a recording of it is available to watch.
This is a great presentation if you would like insight into how a highly pragmatic and effective machine learning practitioner approaches problem solving. I want to highlight three points I took away from this presentation.
Phil opens the presentation with the comment that “the proof of the pudding is in the eating”: you can only say that something works after you have tried it out. Phil is not interested in grand theory; he wants evidence that a model works, and he gets it by looking at its results.
He comments that most problems involve data that relates to humans rather than to laws of nature, which can make the problems complex. He also comments that he is not interested in inventing new algorithms, but in getting the best out of the algorithms that are already available. R offers a lot of algorithms, which is why he uses it.
Phil is a huge proponent of ensembles. He used them in his Heritage Health Prize entry, he demonstrates their power with a simple football tipping example, and he even crowdsources guesses of the weight of people in the room as an example.
Phil comments: don’t build one great model; ask 10 people to create one model each and average them.
Phil comments that bad models should not simply be thrown away. What you should be looking for is a diversity of model results that you can recombine into an improved solution. Diversity is evaluated by looking at the correlation between the predictions of different models: the lower the correlation, the more diverse the models, and it is this diversity that should be maximized.
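To make the averaging and diversity points concrete, here is a minimal sketch in Python. It is my own illustration, not code from the talk (which uses R). The "models" are hypothetical stand-ins: each one is just the true value plus random noise, and the names `model_a`, `model_b`, `model_c` and `crowd` are made up for the example. It compares blending two models with independent errors against blending a model with a near-copy of itself, and then averages ten models.

```python
import math
import random


def rmse(predictions, targets):
    """Root mean squared error between predictions and true values."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_predictions(model_predictions):
    """Average several models' predictions row by row."""
    return [sum(row) / len(row) for row in zip(*model_predictions)]


rng = random.Random(0)  # fixed seed so the run is repeatable
targets = [rng.uniform(0, 100) for _ in range(500)]

# Assumption: simulated models, each the truth plus its own independent noise.
model_a = [t + rng.gauss(0, 10) for t in targets]
model_b = [t + rng.gauss(0, 10) for t in targets]
# A near-copy of model A, so it adds almost no diversity.
model_c = [p + rng.gauss(0, 1) for p in model_a]

diverse_blend = average_predictions([model_a, model_b])
similar_blend = average_predictions([model_a, model_c])

print("corr(A, B):", round(pearson(model_a, model_b), 3))
print("corr(A, C):", round(pearson(model_a, model_c), 3))
print("RMSE of A alone:", round(rmse(model_a, targets), 2))
print("RMSE of A+B blend:", round(rmse(diverse_blend, targets), 2))
print("RMSE of A+C blend:", round(rmse(similar_blend, targets), 2))

# "Ask 10 people to create one model each and average them."
crowd = [[t + rng.gauss(0, 10) for t in targets] for _ in range(10)]
crowd_blend = average_predictions(crowd)
print("Best single RMSE:", round(min(rmse(m, targets) for m in crowd), 2))
print("RMSE of 10-model average:", round(rmse(crowd_blend, targets), 2))
```

In this simulation the less correlated pair gives the better blend, and the ten-model average beats every individual model. Real models share data and features, so their errors are rarely this independent and the gains are usually smaller.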
Phil comments that visualization is an important and underutilized tool. He stresses the utility of eyeballing the distributions of attributes to get a feeling for how sensible they are and to highlight issues with the data. He comments that a visual inspection can help you pick up on strangeness in the data that a statistical summary will not.
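A small sketch of why a summary can hide what a picture shows, again in Python and again my own illustration rather than anything from the talk. The two samples are made-up values for a hypothetical attribute, constructed so that they share the same mean and standard deviation, and `text_histogram` is a crude text stand-in for a real plot.

```python
import statistics
from collections import Counter


def text_histogram(values):
    """Return one line per distinct value, with a bar of '#' for its count."""
    counts = Counter(values)
    return [f"{value:>3} | {'#' * counts[value]}" for value in sorted(counts)]


# Assumption: invented data for a hypothetical attribute, 16 records each.
unimodal = [30] * 1 + [40] * 4 + [50] * 6 + [60] * 4 + [70] * 1
bimodal = [40] * 8 + [60] * 8

for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    # The summary statistics come out the same for both samples...
    print(name, "mean:", statistics.mean(sample),
          "stdev:", statistics.pstdev(sample))
    # ...but the shape of the distribution is clearly different.
    print("\n".join(text_histogram(sample)))
```

Both samples have a mean of 50 and a population standard deviation of 10, yet one is a single hump and the other is two separate spikes, which is the kind of strangeness that only shows up when you look at the distribution.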
This is a great talk and I highly recommend watching it. Also keep an eye out for an insightful comment on data calibration across years in the Heritage Health Prize.