What does it take to do well in competitive machine learning? To really answer this question, you need to look closely at the people who do well.
In 2010 I participated in a Kaggle competition to predict the outcomes of future chess games. It was a fascinating problem because it required you to model the ratings of the players from historical games and propagate those ratings into the future to make predictions.
I did so-so in the competition (17th place, roughly the top 10%), but Diogo Ferreira took 4th place. Diogo is a fascinating guy and an excellent communicator, and in 2011 I had the opportunity to interview him about his participation in the competition. This post is an edited version of that interview.
The interview is divided into four parts:
- Part 1 explores Diogo’s background and problem-solving methodology
- Part 2 covers the Kaggle chess ratings competition
- Part 3 dives into Diogo’s solution to the competition
- Part 4 leaves us with some final thoughts on being a competitive machine learning practitioner (the best part!)
Grab a drink, sit back, and enjoy!
Part 1: Background and Methodology
Jason: Could you please introduce yourself?
Diogo: My name is Diogo R. Ferreira and I’m a professor of information systems at the Technical University of Lisbon (Portugal), where I teach Database Systems, Enterprise Integration, and Business Process Management (BPM). I’m also an active researcher in the area of BPM, with a particular focus on Process Mining, i.e., the techniques associated with extracting process models from event logs recorded by information systems.
Jason: What experiences from your education and work history, if any, aided you in your participation and success in the chess rating competition?
The fact that Process Mining, my research area, is related to the areas of Business Intelligence and Data Mining puts me in contact with a range of techniques that can often be adapted or serve as inspiration to address different problems.
As for the chess rating competition, I have a long-time interest in chess that made me become particularly motivated for this competition.
Jason: Your description of process mining provokes thoughts of mining web server logs and relating the results to flow through a website.
Precisely, this is the kind of thing we do, only with event logs (i.e., events recorded by an information system), either from process-aware systems (such as workflow/BPM systems) or otherwise (such as application server logs).
Jason: What is an example of process mining the uninitiated like me may be familiar with, for example, is there a canonical case study or use case?
For an introduction, I can perhaps direct you to the Process Mining Manifesto (PDF) which has been recently released.
Jason: What are some recent examples of BI and Data Mining techniques that you have adapted to your efforts, if any, and the types of problems you were trying to tackle?
Sequence clustering, expectation-maximization, sequence analysis, and graph analysis are all techniques that I have used or adapted for process mining applications.
Jason: Generally, when presented with a new dataset, ratings or otherwise, where do you start and what is your methodology?
First, I usually develop a set of simple programs to extract features from the dataset in order to understand it better. It is often the case that datasets are too large for manual inspection, so having some basic tools to begin with is extremely useful in order to find out in which direction one should go further.
Then, after I have a feeling for the data, I usually try some rather simple, perhaps even trivial approaches just to perform a “sanity check”, i.e., just to verify that the outcome is what I expect in order to gain confidence with the dataset.
After I feel confident enough, I start trying more elaborate approaches. These approaches are often of my own doing. At a later stage, I usually compare my results with well-known recipes or techniques from others.
Jason: I find your methodology description fascinating, thank you. What tools or techniques have you used on recent efforts to extract features from a new dataset?
To extract features from a new dataset, I mostly use ad-hoc, custom code, and I find it very productive to use Python and associated libraries such as numpy, matplotlib, networkx, etc.
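As an illustration of this kind of first-pass tooling (a sketch of my own, not Diogo’s actual scripts; the row layout is hypothetical), a few lines of Python are enough to pull summary statistics out of a games dataset:

```python
from collections import Counter

# Hypothetical rows: (month, white_id, black_id, score_for_white).
games = [
    (1, 10, 20, 1.0),
    (1, 20, 30, 0.5),
    (2, 10, 30, 0.0),
    (2, 10, 20, 1.0),
]

# Summary statistic: number of games recorded per month.
games_per_month = Counter(month for month, *_ in games)

# Summary statistic: number of games each player appears in.
games_per_player = Counter()
for _, white, black, _ in games:
    games_per_player[white] += 1
    games_per_player[black] += 1

print(dict(games_per_month))   # {1: 2, 2: 2}
print(dict(games_per_player))  # {10: 3, 20: 3, 30: 2}
```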
Jason: What do you mean when you talk about features in this case? (summary statistics? structure in the data? predictive rules?)
Summary statistics, precisely, and/or ways of finding commonalities and frequent behavior in a dataset.
Jason: If you can inspect a dataset manually, what are some examples or tools you have used to do this?
I just use simple search-and-find/find all/find-in-files/grep capabilities, such as those available in most text editors.
Jason: When working on data problems, for what types of tasks would you prepare tailored code and scripts, and for what types would you turn to tools and libraries?
I often code as much as possible myself, even if there is similar code available. This is because developing code incrementally is also a way to learn more about the data.
I turn to tools and libraries especially when I want to compare my results with those of others, or with competing approaches.
Jason: You mentioned: “developing code incrementally is a way to learn about the data” which is very interesting. Could you please elaborate, if possible? (for example, are you referring to anything specific such as: prototyping ideas? making mistakes? exploring dead-ends? time to think? discovering edge cases?)
As I start processing the data, I usually take the time to check that every intermediate step (of coding) is working as expected, and in that process one learns a lot about the data itself. I usually do not explore dead-ends or edge cases; I just take the opportunity to make a few stops along the way. (If coding were compared to driving, I would say I stop once in a while just to look around and see the landscape, rather than driving directly to the destination.)
Jason: To what degree do you have a general interest in Puzzles, Machine Learning, Algorithms, Programming, Statistics, Mathematics, Psychology?
I have a general interest in all these areas, but I’m often motivated by the particular problem at hand, regardless of area. I’m not interested in every problem in these areas, but I may be interested in a few particular problems coming from different areas.
Jason: In general, which methods do you find yourself returning to for common data tasks?
I would say I get back recurrently to text/string processing and graph analysis, by virtue of my research area (Process Mining), and I find these skills often useful in different problems.
Jason: What techniques, tools, and/or libraries do you find useful for a) text/string processing and b) graph analysis in your efforts?
For text/string processing I use the simple functions available in the standard Python modules (find, split, etc.).
For graph analysis it is often useful to use specialized libraries such as networkx and Graphviz.
Jason: What are some creative ways or recent examples you have used text/string processing and/or graph analysis in different problems?
Recently I have been analyzing event logs from two different hospitals and I’m using precisely those tools/techniques.
Jason: Generally, how do you maintain motivation when your approach does not give the results you were expecting?
I may feel that there is something wrong with the approach, and in that case I try to fix it. Or, if there’s nothing wrong with it, then I just start thinking about a completely different approach to the same problem. Either way, this often keeps me motivated.
Jason: What skills do you think make a competitive participant in such data competitions?
Rather than skill, I think a competitive participant must have resolute and unshakable motivation. I’ve seen very skilled people not even put in an effort, and less skilled people resort to all sorts of things, some of which prove effective in the end. So it’s not a matter of how much one knows, but of how much effort one is willing to put in.
Jason: I’m interested in exploring your motivation-vs-skill philosophy a little more. Can you please talk to ‘creativity and/or the diversity of ideas’ and ‘adaptability to try different things’ being more useful than ‘raw skill, intelligence, experience’ (forgive my terse paraphrasing)?
A basic fact is that, in the end, what matters is whether things get done or not. Skill means potential to do it, but only motivation makes one start doing it.
The fact that things are evaluated by their outcome means that in some cases there are unfortunate consequences, such as when someone invests a lot of time and gets no results in the end. This is where skill plays a role: it makes it more likely that the investment will bear fruit.
But skill alone is not enough to deal with the complexity of some challenges. Even with skill, the task at hand may be difficult, so this is where motivation comes in again. It makes one start, and it makes one keep going. Of course, a Ferrari (more skill) will get one there faster, but that doesn’t mean it’s not possible to get there with a less-equipped vehicle, or even a vehicle that cannot drive on the same road/path. If getting there is all that matters, then it doesn’t matter which car or path one is taking. Having gas (motivation) to get there becomes more important than having a fast car (skill).
Jason: Tuning a model may have a place somewhere in the process, when addressing a data problem, what could one do to seek a fruitful balance between trying new models and tuning a model that has promise?
The rate of improvement tells us when to change strategy. When the rate of improvement starts falling, it’s time to start thinking about a new model.
Jason: Generally, do you prefer to spend your time tuning a given model for a data problem or in trying different models?
Personally, I would prefer to spend time trying different models, but in practice what happens is when a model starts looking promising I often invest a lot of time tuning it, which leaves me with less time than I would desire to try out different models. This happened in the chess competition, where despite having thought about trying other approaches, in the end I spent most of my time tuning and improving a single model.
Jason: Why do you think model tuning is less satisfying than exploring new models (for example, incremental gains vs alternative paradigms)?
Because one may spend a lot of time tuning, and in the end still be limited by the assumptions of the original model. Say, by tuning one gets a 20% improvement, and with a new model one could get 50%. Breakthrough thinking, if possible, often brings more return – at least that is my personal feeling.
Part 2: Chess Rating Competition
Jason: You mentioned a long term interest in chess, what is the long term interest and how long have you had it?
In my teens I used to play in chess tournaments, and today I still keep an eye on what’s going on in the chess world, although I’m not an active player anymore. Once in a while, I still play online, which is something that I wish was available when I was in my teens.
Jason: In your own words, what was the problem that needed to be solved in the chess rating competition?
In my view, the purpose of the chess rating competition was to devise a rating system to provide more accurate predictions for the outcome of games between rated players.
Jason: I read that you started out with pencil and paper, what were you thinking about or exploring when you first started with the competition and how did you resist jumping on the computer and working on a script?
When I first started thinking about the competition, I was looking for a suitable probability model, and I scratched some things out on paper; sometimes the same formulas would appear over and over again.
At the same time, I started to code some simple tasks such as reading/parsing the data and getting some summary statistics.
After some time, I was switching regularly from coding to scratching things on paper, and things started rolling fast in both camps.
Jason: From a high-level, please summarise your linear steps over the course of the competition?
The underlying model for my approach was rather quick to develop, and throughout the course of the competition I used basically the same model, with some variations now and then.
What took me time to develop was a suitable weighting function to account for the time factor (i.e., the fact that more recent games are more important than older ones).
I kept trying several options, and on my private dataset I was observing some improvements, although the public score gave little indication of whether I was on the right track or not.
Jason: What collaborations were you engaged in over the course of the competition and with whom? (email exchanges, forum discussions, etc.)
I just had a couple of short conversations with a colleague, about the competition and about data mining, overfitting, etc., in general. In these conversations we mostly discussed the way the competition was set up, rather than particular approaches. I remember also getting some advice on using a normal distribution for the weighting function, but eventually I devised my own function.
Jason: I read that you expected to spend a few days on the competition, what kept you involved for so long?
There were two strong reasons for this:
- Seeing the progress of others every day on the public leaderboard, and comparing it to my own, made me really thrilled about this competition.
- The fact that there was little correlation between my own cross-validation results and the results on the public leaderboard. This kept me busy for some time, as I was unsure about how my approach would actually perform in the end.
Jason: In simple words what was your (most successful) approach to solving the problem presented by the competition?
I think it was the simplest one, i.e., when I reduced the code to what was really fundamental and what I thought was a solid base. Other ideas, such as smart fixes to get some improvements, proved to be very time-consuming and provided little additional reward.
Jason: What data summarization and data visualisation did you perform over the course of the competition, if any?
I used the networkx Python library to study the connections (games) between players in the dataset.
Other than that, I used summary statistics such as games-per-month, games-per-player-per-month, etc.
Jason: What classical methods to chess rating did you apply to the dataset, if any, and when over the course of the competition?
I used none. My idea was always to get a normalized rating between 0.0 and 1.0 for every player.
Jason: You have released a detailed description of your method and your source code – why did you put in such an effort into the description and release the code for free?
Well, I’m used to writing scientific papers and it makes sense to me to present a description in that form. As for the source code, I thought hardly anyone would go through the trouble of implementing the approach, and even if they did, perhaps they would not exert the necessary care to ensure that everything was implemented as originally intended; this could lead to reports on my approach with results worse than what I obtained with my own code. So why not just release the source code as a reference implementation.
Jason: Looking back, do you think compute time or computer hardware was a limitation for you or your method in this competition?
Definitely yes. I remember having my laptop running for days on end. At some point I even tried to use the servers at our university, only to find that it would run faster on my laptop. Parameter tuning usually took hours, even days, to complete. I usually found myself checking how things were going every hour, even in the middle of the night, and wishing I would have more machines at my disposal to run things in parallel.
The only reason this did not become absolutely critical was the fact that we were limited to 2 submissions a day, which I found to be quite a reasonable rule. I remember dividing the number of my actual submissions by the total number of submissions I would have been allowed during the same period, and I got a number close to 50%. So overall I think I managed to make good use of the submissions I had at my disposal.
Jason: Discuss working with the leaderboard and the differences you were seeing in your private dataset?
If I remember right, I divided the public dataset for months 1-to-100 into a cross-validation dataset: 1-to-95 (training) + 96-to-100 (test). (I also worked with 1-to-90 + 91-to-95, as well as other splits.)
What happened was that a better result on the cross-validation dataset often meant a worse result on the public leaderboard. Perhaps a previous submission was better due to overfitting, or just luck. In any case, this did not help to build confidence in the things I was trying. The cross-validation results indicated that a change should definitely be better, but the public leaderboard said otherwise. At some point, I (as well as many other participants) started feeling that there was something very different about the public dataset and the leaderboard. In the end, I think it was just due to the fact that the leaderboard score was being calculated on a rather small amount of data (20%).
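The split Diogo describes can be sketched in a few lines (the tuple layout of the games here is hypothetical, kept small on purpose):

```python
# Hypothetical games as (month, white, black, score_for_white),
# one game per month just to keep the sketch minimal.
games = [(m, m % 7, (m + 3) % 7, 0.5) for m in range(1, 101)]

# Months 1-95 become the local training set, months 96-100 the
# local test set, mirroring the 1-to-95 + 96-to-100 split above.
train = [g for g in games if g[0] <= 95]
test = [g for g in games if 96 <= g[0] <= 100]
```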
Jason: What statistics did you explore in your private results compared to the public leaderboard, if any (for example, did you explore correlations between the two)?
I did not explore any correlation between the two since it appeared there was none.
As for the statistics, I remember the competition used the month-aggregated RMSE as the error measure. I remember trying to use other measures, such as absolute deviation per game, RMSE per game, etc. So I did some parameter tuning to try to minimize these other error measures. However, in the end it seemed like using the month-aggregated RMSE was better.
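My loose reading of a month-aggregated RMSE can be sketched as follows (the official aggregation details may differ, and the row layout here is hypothetical):

```python
import math
from collections import defaultdict

def month_aggregated_rmse(rows):
    """rows: (player, month, predicted_score, actual_score), one
    per game. Sum predictions and actual scores per (player, month),
    then take the RMSE over those aggregates."""
    pred, actual = defaultdict(float), defaultdict(float)
    for player, month, p, a in rows:
        pred[(player, month)] += p
        actual[(player, month)] += a
    errs = [(pred[k] - actual[k]) ** 2 for k in pred]
    return math.sqrt(sum(errs) / len(errs))

rows = [
    (1, 96, 0.6, 1.0),
    (1, 96, 0.5, 0.0),  # same player-month: aggregated together
    (2, 96, 0.4, 0.5),
]
print(round(month_aggregated_rmse(rows), 4))  # → 0.1
```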
Jason: What are some examples of ‘smart fixes‘ based on your observations in the data that were successful or that you later removed?
Here is an example: in some games the prediction of either a white or black win was close to 1.0 (e.g., 0.98 probability). So what I did was to set a threshold (e.g., 0.98) above which the result would be converted to 1.0. This helped to slightly reduce the overall prediction error; the problem was that the choice of threshold was very sensitive, so eventually I gave up on such tricks.
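The trick takes only a couple of lines to sketch; the 0.98 threshold is the illustrative value from the example above:

```python
def snap_extremes(p, threshold=0.98):
    """Round near-certain predictions to exact 0 or 1. The
    threshold value is illustrative; its choice proved very
    sensitive, which is why the trick was eventually dropped."""
    if p >= threshold:
        return 1.0
    if p <= 1.0 - threshold:
        return 0.0
    return p

assert snap_extremes(0.99) == 1.0   # near-certain white win → 1.0
assert snap_extremes(0.01) == 0.0   # near-certain black win → 0.0
assert snap_extremes(0.75) == 0.75  # everything else unchanged
```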
Jason: Can you recall any interesting observations from reviewing summary statistics and exploration of the data?
One thing I remember is that there were many more games in recent months than in older months. I think this also explains the need for such a fast-declining function of time: with few games it’s hard to draw conclusions, so the old games were not very useful. However, even the most recent games were not that many (in total number), so the old games still proved useful for prediction purposes. This, I think, is another reason why it was so hard to choose a time-weighting function.
Jason: Why were you focused on a normalised rating throughout your participation?
Because I was working mostly with probabilities, and intuitively I wanted to express the rating of a player as a probability measure as well (more on this below).
Jason: Did you keep notes or a log book of things you were thinking about or trying?
I saved the code for every submission in a separate file. However, I had a separate program for parameter tuning, and I did not save every version of it. In the midst of so many submissions (100+ in the end), there were times when I was unsure whether I had already tried something before.
Part 3: Questions on Your Specific Solution
Jason: Your method is just over 100 lines of Python code (without comments) which I think is incredible. Could you please describe your algorithm/s (in plain language)?
Basically, I estimate the strength of each player X as the expected score against a common reference player Z. (This reference player Z is an abstract entity, it does not exist, its strength does not need to be computed, and its purpose is just to bring the strength of all players to comparable terms.)
If we had the results of each player X against the reference player Z, we would have the rating of player X, and that’s it. But we don’t have the results of X against Z; what we have are the results of X against a set of opponents, whose strengths are also unknown.
In my approach I devised a way to compute the strength of X based on the strength of its opponents. And we compute the strength of X’s opponents based on the opponents of X’s opponents, and on the opponents of the opponents of X’s opponents, and so on; it becomes an iterative algorithm where the strength of every player must be computed at the same time and across several iterations, until it finally converges.
On top of this I added a time-weighting function to make older games have less impact on the results.
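The overall shape of such an iterative scheme can be sketched as follows. Note this is the classic minorization-maximization update for the Bradley-Terry model (with draws counted as half a win for each side), standing in for Diogo’s own update rule, which his report derives; the games data is hypothetical:

```python
import math

# Hypothetical games: (white, black, score_for_white).
games = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 0.5), (2, 0, 0.0)]
players = {p for g in games for p in g[:2]}

# Latent strengths, iterated until self-consistent: each player's
# strength is recomputed from the strengths of their opponents.
w = {p: 1.0 for p in players}

for _ in range(200):
    new_w = {}
    for i in players:
        score, denom = 0.0, 0.0
        for a, b, s in games:
            if a == i:
                score += s
                denom += 1.0 / (w[i] + w[b])
            elif b == i:
                score += 1.0 - s
                denom += 1.0 / (w[i] + w[a])
        new_w[i] = score / denom
    # Pin the scale so the geometric mean strength is 1.0, i.e.
    # the fictitious reference player Z is an "average" player.
    gm = math.prod(new_w.values()) ** (1.0 / len(new_w))
    w = {p: v / gm for p, v in new_w.items()}

# Each rating is the expected score against reference player Z.
rating = {p: w[p] / (w[p] + 1.0) for p in players}
```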
Jason: Your method focused on a weighting function, where did you start, and how did it change over time?
I started with an exponential function, then a normal one, then linear, etc., until I finally settled on the simple function presented in my report and in the figure (above).
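Since the actual function lives in Diogo’s report, here is only a placeholder with the same intent: a simple decay that down-weights older games, with a shape and `horizon` parameter that are assumptions for illustration:

```python
def time_weight(game_month, current_month, horizon=48):
    """Linearly decaying weight: the newest games get weight 1.0,
    and games older than `horizon` months get weight 0.0. Both the
    linear shape and the 48-month horizon are illustrative only;
    the real function is given in Diogo's report."""
    age = current_month - game_month
    return max(0.0, 1.0 - age / horizon)
```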
Jason: Your method makes use of the Bradley-Terry model, what is this model and how did you make use of it?
The Bradley-Terry model is perhaps one of the oldest things in the field of “paired comparisons” (i.e., estimating the strength of players/teams based on the games between them).
It is really simple: given the strengths of two players X and Y, a simple formula provides a number between 0 and 1, which can be taken as the probability of X winning over Y. See formula (1) in my report.
In my approach, I used the Bradley-Terry formula directly, where I took the strength of each player to be the probability of winning over a fictitious reference player Z, as explained above.
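The standard Bradley-Terry expected score, which I believe is what formula (1) in Diogo’s report expresses, is a one-liner. Fixing the reference strength `s_z = 1.0` is an assumption made here only to show how rating and strength become interchangeable:

```python
def bradley_terry(s_x, s_y):
    """Standard Bradley-Terry expected score of X against Y,
    given positive strengths s_x and s_y."""
    return s_x / (s_x + s_y)

# Diogo's normalized rating is a player's expected score against a
# fictitious reference player Z; with s_z fixed at 1.0 (assumed),
# rating and strength can be converted back and forth.
def rating_from_strength(s, s_z=1.0):
    return bradley_terry(s, s_z)

def strength_from_rating(r, s_z=1.0):
    return s_z * r / (1.0 - r)

assert bradley_terry(1.0, 1.0) == 0.5  # equal strengths: even game
assert bradley_terry(9.0, 1.0) == 0.9  # much stronger player
assert abs(strength_from_rating(rating_from_strength(3.0)) - 3.0) < 1e-12
```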
Jason: How did you go about exploring the convergence of initial ratings for players before making predictions?
Convergence was slow and this was the major computational problem in my approach. I wanted to do parameter tuning, but each time I tried a different parameter value I had to wait a long time for the whole thing (the ratings) to converge again. Anyway, I tried to live with it, for lack of a faster method.
Jason: Given that a player’s predicted rating is based on their estimated rating, did you spend more time exploring the effects of your model(s) on rating estimation or on rating prediction?
I spent all my time trying to devise a method to estimate player ratings in order to achieve the lowest error when predicting the outcome of games. I did not study any aspect regarding the uncertainty of player ratings. However, I’m inclined to think that TrueSkill works well because it incorporates uncertainty about player ratings.
Jason: What were some ideas or strategies you tried that didn’t pan out?
Extending the Bradley-Terry model with draws didn’t work. Despite reading a lot about it (work by other authors) and implementing it, and despite the fact that chess results are plagued with draws, using a Bradley-Terry model that incorporates the possibility of draws didn’t work for me. I think that was due to the fact that, all-in-all, the training data that we had was relatively scarce for that purpose.
Jason: What issues were you concerned with in preparing your private cross-validation train/test dataset?
I tried several cross-validation datasets because I was unsure of which one would be most similar to the actual test data, and I was worried that my parameter tuning would just overfit to a cross-validation dataset and perform worse-than-desirable in the actual test data.
At some point, I remember doing parameter tuning as an average of the results over 6 cross-validation datasets!
Jason: Did you filter or exclude any data in the preparation of your model, if not, did you consider excluding the older data?
I tried throwing away half of past history (the oldest half) but the predictions for future games didn’t get better.
Jason: I read that you used a hill-climbing local search to tune the model parameters, did you choose the initial values by hand?
Yes, but I often chose these values based on previous results.
Jason: Did you consider other model tuning methods (global search algorithms, non-linear interdependence assumptions, stratified search)?
No, I didn’t have time to read about and implement other optimization algorithms, I was just trying to get to a (local) maximum, even if this required using the crudest of ways.
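A crude hill climber of this sort takes only a few lines; this is a generic sketch of the technique, not Diogo’s tuning code, and the toy objective merely stands in for a cross-validation error:

```python
import random

def hill_climb(objective, x0, step=0.3, iters=2000, seed=0):
    """Crude local search: perturb the current parameter vector
    and keep the perturbation only if the objective improves."""
    rng = random.Random(seed)
    best_x, best_f = list(x0), objective(x0)
    for _ in range(iters):
        cand = [v + rng.uniform(-step, step) for v in best_x]
        f = objective(cand)
        if f < best_f:  # minimizing, e.g. a cross-validation RMSE
            best_x, best_f = cand, f
    return best_x, best_f

# Toy objective with its minimum at (1, 2), standing in for the
# prediction error as a function of two model parameters.
objective = lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2
x, f = hill_climb(objective, [0.0, 0.0])
```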
Jason: I found it fascinating that you went back and derived new model parameters to result in a score above first place in the final standings. Why did you do this (what if the model couldn’t do better), and did you consider exploring this further (change in assumptions of the cross-validation suite, back-engineering further improvements to the model)?
I did this because I was confident that if my model was so close to first place, then it could certainly surpass it with a different choice of parameters. But note that this better-than-first result is overfitted to the actual test data. I’m sure that other participants would be able to improve their score as well once they got their hands on the full test data.
On the other hand, in an e-mail after the competition, Jeff Sonas wrote me that there was actually an issue in the way the final scores were calculated. Because of how the test data was prepared and aggregated by player-month, some games were shared between public and private test data. This may have influenced the final results (e.g. favouring some participants who had overfitted to the public test data)
On that same e-mail, Jeff wrote that when he took away the shared games from the test data, my approach hit first place.
Part 4: Finale
Jason: What changes to your general problem-solving methodology did your participation in the Chess Rating Competition have, if any?
Overall, I feel like participating in this contest made me become much more careful about analyzing and drawing conclusions from data.
Being able to predict future results is perhaps the best way to discover how much one really knows about the data. It’s rather easy to fit models to data and jump to conclusions; but it is considerably more difficult to acquire enough knowledge to make accurate predictions.
Developing a feeling for the amount of uncertainty in the data (versus the amount of consistency) is not that easy, and it may require a significant amount of time and hard work.
In the end, when one realizes the amount of uncertainty there is, one can also develop an idea for the uncertainty that will be inherent to any prediction. Beyond a certain point, a very accurate prediction can only be the product of sheer luck.
So seeing others with better predictions doesn’t necessarily mean they have better approaches (even if it’s a good indicator). Somehow, in the midst of uncertainty, they just happened to hit closer to the spot.
In this competition, the change of positions from the public leaderboard to the final standings shows just how large a role uncertainty played in this contest. I was shocked when I first started suspecting this (even before the contest ended), and that is why I became more careful about making definite statements about data.
Jason: What advice do you have for someone getting started with data competitions?
I think one should not let one’s motivation be diminished until one gets a feeling for what exactly is at stake, and for what is going on with the data.
When getting started in such a competition, one may see a quick succession of results, and it’s better not to think about those results until one’s own ideas have matured. It’s only when we’ve made a decent attempt that we should start looking at what others are doing.
Jason: Given practically unlimited resources (time, money, a team of PhDs), what is your idealised general strategy or methodology for a data competition? (how would you make good use of resources?)
The way I see it, parallel computing is only worthwhile if the message-passing between processors takes significantly less computational power than the actual computations at each node.
So having several minds (or machines) working together at the same problem doesn’t help if they need to be kept synchronizing all the time.
What I find really useful is the exchange of ideas. When different people work on the same problem on their own and then, after they have thought seriously about it for some time, meet to exchange ideas, this often brings new light to the discussion and becomes the source of breakthroughs.
So, as a strategy for a team effort, I would just let everyone work on their own, and as soon as they are ready or in need of it, they should meet and exchange experiences and results. These meetings could be repeated for several rounds, where I would expect some noticeable improvements in the results from one round to the next.
Jason: Are there any final comments you would like to make?
The competition was great fun, and although its practical interest was limited to chess ratings, I felt I was putting my effort to good use.
Given the amount of interest that this type of competition can generate, I would definitely invest time in double- and triple-checking the way such a competition is set up, in order to ensure that all the effort everyone puts into it brings the best benefit to the community.
- The website for the competition titled “Chess ratings – Elo versus the Rest of the World” including the final private leaderboard
- Diogo’s how-I-did-it post titled “How I did it: Diogo Ferreira on 4th place in Elo chess ratings competition”.
- Diogo has a page hosting his details and final submission, including the paper titled “Predicting the Outcome of Chess Games based on Historical Data” and Python code.
- Diogo has a staff webpage at the Technical University of Lisbon including a list of publications