I first heard of kaggle.com six months ago when a friend who manages a team of statisticians commented, “My analysts are obsessed with Kaggle, they can’t get enough free time to work on the contests.” I created an account of my own the next day to see what was pulling analytics professionals to the site.
Anthony Goldbloom, Kaggle's CEO, launched this online data analytics contest site as a platform for any organization to host predictive analytics contests similar to the Netflix $1M Prize from 2009. Kaggle was recently featured in the Wall Street Journal for a new $3M prize in healthcare. I wanted to know more about Kaggle and the opportunities the analytics community has not only to participate in these contests, but also to build our own successful contests using the site. Anthony gave us his take on all things Kaggle below:
Jeremy: Would you give us a quick lay of the land? How many contests are you hosting today? How many participants are working on them?
Anthony: Kaggle has hosted 17 predictive modeling competitions, ranging from predicting HIV progression to building chess rating systems and forecasting travel time.
We have around 10,000 members. Our user base is growing rapidly thanks to our latest competition, the $3 million Heritage Health Prize.
Participants range from the Netflix Prize winners and the engineers behind the Google Prediction API, through to students. The geographic spread is also really wide, with many strong performers coming from eastern European countries like Poland and Slovenia.
Jeremy: What’s the most interesting problem a Kaggle contest has solved so far?
Anthony: My two favorites are the chess rating competitions and the traffic forecasting competition.
The chess rating competition required participants to build a chess rating system based on the results of past chess games. Participants then used their systems to predict the outcomes of future chess games. The first version of the competition showed that the official Elo rating system was far from optimal.
Given this result, FIDE, the official chess governing body, donated all its data for another competition and Deloitte put up $10,000 in cash. This competition is currently live and has a really talented field, including three of the Netflix Prize winners, the Microsoft researchers behind the Xbox TrueSkill rating system, and Mark Glickman, the author of the “Glicko” rating system.
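For readers unfamiliar with how a rating system turns past results into a prediction, here is a minimal sketch of the standard Elo formulas that the competition used as its baseline. It is not any contestant's entry or Kaggle's scoring code. The function names and the K-factor of 32 are illustrative assumptions; real federations use their own K values.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B.

    Returns a value between 0 and 1 that can be read as A's predicted
    result (1 = win, 0.5 = draw, 0 = loss).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 32.0) -> tuple[float, float]:
    """Return both players' new ratings after one game.

    score_a is A's actual result: 1.0 for a win, 0.5 for a draw,
    0.0 for a loss. k=32 is an assumed K-factor chosen for illustration.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated players: the predicted score for each is 0.5.
    print(expected_score(1500, 1500))
    # Player A wins, so A's rating rises and B's falls by the same amount.
    print(update_ratings(1500, 1500, 1.0))
```

A competition entry would replace `expected_score` with a better predictor and be judged on how closely its predictions match games it has not seen.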
We also did a great competition for the New South Wales (NSW) government in Australia to develop a traffic forecasting system. The competition attracted over 350 teams and was won by a pair of North American PhD students. The NSW government plans to roll this out to give Sydney motorists information on how long it will take to get from A to B depending on when they leave.
Jeremy: Some contest platform sites like topcoders.com have landed participants high-profile jobs at Google and Microsoft. Has this happened yet with Kaggle?
Anthony: One reason companies have taken a strong interest in us is as a way to find talent. We recently met with several high-profile technology companies that like the idea of hosting a competition specifically to find talented recruits. They say they battle with Google, LinkedIn, Amazon, etc. for the same group of Stanford, MIT, Caltech and CMU graduates. Meanwhile, competitions showcase people who are just as talented but may be missed in other recruiting efforts.
Jeremy: Let’s talk about the contests themselves. What makes a successful contest? What kind of problems do you recommend organizations post on Kaggle?
Anthony: We’ve never hosted a competition that hasn’t significantly outperformed the previous state of the art. Moreover, for just about every competition we’ve hosted, the best entries reach a plateau, which we interpret to be the limit of what’s possible given the amount of “information” available in the dataset. Given the diversity of problems we’ve tackled, this suggests that competitions are suitable for a huge range of problems.
For our most popular competitions, the competition host is active in the forums. Having hundreds of eyes on your data often brings up new insights and raises questions that had never been asked.
Jeremy: What do you know now about data mining or data miners that you didn’t know a year ago?
Anthony: Lots! First off, the best answers often come from unusual places. We find electrical engineers and physicists tend to do really well. My theory on this is that they spend more time on the preprocessing (or the common sense part of the problem). Meanwhile, statisticians and computer scientists spend too much time thinking about what algorithms to use.
Competitions can be more efficient in other ways too. For example, an internal modeler or researcher may not realize they have more opportunity to learn from a dataset. A competition drives participants to continue until they’ve managed to squeeze everything out of the problem.
Jeremy: What’s the roadmap for Kaggle?
Anthony: I would love for Kaggle rankings to be a recognized credential in our industry. Competitions are really powerful because everybody is being judged on a comparable basis, so they’re far more rigorous as a reputation system than a CV. We’re also working on several products that will allow companies to access our talent.
Jeremy: Where do you see the next innovations coming from in data mining algorithms? Can Kaggle become a source of primary research?
Anthony: Kaggle can do wonderful things in research. I believe competitions are superior to peer review for a certain class of problems because:
a. Competitions generate much faster feedback on what works and what doesn’t.
b. Everyone is evaluated on a comparable dataset and using the same metric, so it’s easier to compare methods to find those that show promise.
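To make the "same dataset, same metric" point concrete, here is a small sketch of how submissions might be ranked against one held-out answer set. The interview does not say how Kaggle scores entries, so the choice of root mean squared error and the names `rmse` and `leaderboard` are assumptions made for illustration.

```python
import math


def rmse(predicted: list[float], actual: list[float]) -> float:
    """Root mean squared error between predictions and held-out answers."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


def leaderboard(submissions: dict[str, list[float]],
                actual: list[float]) -> list[tuple[str, float]]:
    """Score every team on the same answers and sort best (lowest) first."""
    scores = {team: rmse(preds, actual) for team, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical held-out outcomes and two hypothetical team submissions.
    held_out = [1.0, 0.0, 1.0, 0.5]
    entries = {
        "team_a": [0.9, 0.1, 0.8, 0.5],
        "team_b": [0.5, 0.5, 0.5, 0.5],
    }
    for team, score in leaderboard(entries, held_out):
        print(f"{team}: {score:.3f}")
```

Because every entry is scored by the same function on the same answers, the ordering directly compares methods rather than evaluation setups.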
Jeremy: Final thoughts – what do you want corporate managers of hard problems to know that they don’t know already about
Anthony: Whether they use Kaggle or set something up themselves, competitions are the only way (or at least the only way I know of) to get every bit of information out of a dataset. After hosting a competition, you can be confident that the problem is solved and move on to the next one (at least until you get new data or there’s a structural break).
Have you used Kaggle either as a problem solver or as a host? Agree or disagree with Anthony on the power of competitions in our space? I’d love to hear your thoughts in the comments section.