Using Statistical Algorithms for Success in Kaggle’s Data Science Competitions

Features

  • Author: Lillian Pierson P.E.
  • Date: 19 Feb 2015
  • Copyright: Image appears courtesy of iStock Photo

Recently, the predictive modeling platform Kaggle hosted a Big Data Combine competition to predict short-term changes in stock prices. The competition was hosted by the tournament platform BattleFin – a platform dedicated to crowdsourcing investment analysis talent. Big Data Combine competitors were supplied news and sentiment data by RavenPack, and then asked to use that data to build predictive models for forecasting the price changes. With these predictions in hand, traders and investors would have the information needed to manage risk better when making investment decisions.

Dr. Steve Donaho was the winner of the Big Data Combine competition, as well as three other competitions hosted by Kaggle. In fact, Dr. Donaho’s outstanding performance in Kaggle competitions has earned him a current rank of 3rd out of 250,987 total competitors. At one point, Donaho was the top-ranked competitor on the entire Kaggle platform. This success speaks volumes about Donaho’s ingenuity, acumen, and agility in data science. In an exclusive interview for the Statistics Views website, Donaho discusses his interest in and success with data science and Kaggle competitions.


1. What statistical machine learning algorithms have you found most useful throughout the course of Kaggle competitions and what are the biggest accomplishments you’ve had by using those specific methods?

Over the last few years I’ve found the GBM algorithm (Generalized Boosted Regression Models) in R to be very useful and broadly applicable to a wide variety of problems. I used GBM in a 2nd place finish in the Allstate Purchase Prediction competition and 3rd place in the Deloitte Insurance Churn Prediction competition. More recently, I’ve started to use the XGBoost (eXtreme Gradient Boosting) algorithm which is similar in nature to GBM, but is much faster and has some improved features. Most recently, I’ve also been intrigued by some of the online learning algorithms posted by the user tinrtgu and others in contests for Criteo, Tradeshift, and Avazu. For very large volumes of data, the online learning techniques give pretty good results quickly and without using a lot of memory.
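To make the online-learning point concrete, here is a minimal pure-Python sketch of the hashing-trick style used in scripts like tinrtgu's: all weights live in one fixed-size array, so memory stays bounded no matter how many distinct raw feature values the data stream contains. This is an illustration only, not tinrtgu's actual code; the dimensions, learning rate, and toy data are invented for the example.

```python
import math

# Online logistic regression with the hashing trick (a sketch).
# Each example is a dict of categorical feature=value strings.

D = 2 ** 18          # size of the hashed weight vector (the memory bound)
ALPHA = 0.1          # SGD learning rate (illustrative value)

def hash_features(example):
    """Map raw feature strings to indices in the fixed-size weight vector."""
    return [hash(f"{k}={v}") % D for k, v in example.items()]

def predict(w, idx):
    """Logistic prediction from the weights at the hashed indices."""
    wx = sum(w[i] for i in idx)
    return 1.0 / (1.0 + math.exp(-max(min(wx, 20.0), -20.0)))

def update(w, idx, p, y):
    """One SGD step on log loss; touches only the active indices."""
    g = p - y
    for i in idx:
        w[i] -= ALPHA * g

def train(stream, passes=1):
    """Single-machine online training: one weight update per example seen."""
    w = [0.0] * D
    for _ in range(passes):
        for example, y in stream:
            idx = hash_features(example)
            update(w, idx, predict(w, idx), y)
    return w

# Tiny illustrative stream: the label depends only on the 'site' feature.
data = [({"site": "a", "ad": "x"}, 1),
        ({"site": "b", "ad": "x"}, 0),
        ({"site": "a", "ad": "y"}, 1),
        ({"site": "b", "ad": "y"}, 0)]
w = train(data, passes=50)
p_a = predict(w, hash_features({"site": "a", "ad": "x"}))
p_b = predict(w, hash_features({"site": "b", "ad": "x"}))
```

Because each update touches only the handful of hashed indices active in the current example, the whole model fits in one array and a single pass over the data, which is exactly why these techniques scale to very large click-through datasets.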

2. What standard approach do you take when competing in Kaggle contests?

I usually spend quite a bit of time at the beginning of a contest just sifting through the data and getting to know it before I apply any learning algorithms. This has sometimes given me a competitive edge – for example, in an Allstate competition, I found that certain combinations of products never occurred in certain U.S. states. Ruling out those combinations as a post-pass to our algorithm gave my partner and me an edge in the competition. Something else I do at the beginning of a competition is to try simple approaches first. I create what I call “improved baselines,” where I choose a simple idea and tweak it in a few ways to see how much mileage I can get out of it. I do this for a few reasons: 1) Sometimes I find relatively simple solutions that perform quite well (complex is not necessarily better), 2) In practice I’ve found that customers prefer simple solutions that they are able to grasp, and 3) If a solution is doing well, I like to understand what is driving its success, and that is easier to do with simple solutions. If you jump straight to complex solutions, it is hard to know what is driving the success and whether all the complexity is necessary.
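The "post-pass" idea described above – ruling out combinations that never co-occur in the training data – can be sketched as a simple filter on a model's ranked predictions. The state codes, product codes, and data below are hypothetical stand-ins, not the actual Allstate fields:

```python
from collections import defaultdict

# Collect the product combinations actually observed in each state,
# then use that set as a post-pass filter on model predictions.
# (Hypothetical feature names and data, for illustration only.)

train = [("CA", ("A1", "B2")),
         ("CA", ("A1", "B3")),
         ("NY", ("A2", "B2"))]

seen = defaultdict(set)
for state, combo in train:
    seen[state].add(combo)

def post_pass(state, ranked_predictions):
    """Drop predicted combinations never observed in this state,
    falling back to the model's top choice if all are ruled out."""
    allowed = [c for c in ranked_predictions if c in seen[state]]
    return allowed[0] if allowed else ranked_predictions[0]

# The model's top pick ("A2", "B2") never occurs in CA, so the
# post-pass falls through to the next-ranked combination.
best = post_pass("CA", [("A2", "B2"), ("A1", "B3")])
```

The appeal of a rule like this is exactly the point made about improved baselines: it is trivial to implement, easy to explain, and its contribution to the score can be measured in isolation.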

3. What inspired you to get started in Kaggle competitions?

I first heard about Kaggle from an article in the Wall Street Journal in 2011. The data science competitions sounded like fun. I had a one week lull in my normal work so I entered a contest with about one week left until it finished. I used a pseudonym BreakfastPirate when I signed up because I thought I might not be any good. It turns out I got 10th place in that first contest, and the thrill of placing well got me addicted to Kaggle contests.


4. Why do you participate in Kaggle? What do you get out of it?

First of all, it’s fun! I’m an unabashed data-lover. I love to get my hands on a new set of data and start digging through it and analyzing it. It is fun to learn about industries that I have not worked in before, such as retail sales, airline arrival times, soil composition in Africa, flu forecasting, click-through prediction, etc. Second, it forces me to learn new techniques and new algorithms. I always sift through the solutions posted by winners, and I often learn clever new approaches. Kaggle has definitely become more competitive even in the last 12 months. If I see that people are winning using an algorithm that I have not used before, I’m forced to learn about that algorithm in order to stay competitive. That is how I started using XGBoost. Third, it is fun to be part of a community of data scientists where we are sharing ideas. Yes, it is a competition, but there is a lot of idea sharing that happens on the message boards, and it is fun to contribute to that when possible.

5. What originally led you into the data science field?

When I was in high school, the only career advice I got was, “You’re good at math. You should be an engineer.” So I went off to college to study to be an engineer. I knew I liked computers, so I majored in Computer and Electrical Engineering. While working on my Bachelor’s degree, I found that I was more interested in the software than the hardware. So I went on to work on a Master’s degree and PhD in Computer Science. About the time I was finishing my PhD, I came to the realization, “I don’t really like computers nearly as much as all the students around me. What I *really* like to do is analyze data; computers are simply a handy tool for pursuing my passion of analysis.” So it took me all those years and all those degrees to figure out that my real underlying skill is *not* math. My real underlying skill is that I have good analytical skills, and I enjoy analyzing things. Unfortunately, when I was in high school, analytical skills were not specifically identified, so no one was able to say, “You’ve got good analytical skills, and here are a set of career paths for people who enjoy analysis.” Hopefully schools do a better job these days of identifying skills and going beyond, “You’re good at math. You should be an engineer.” But just in case, maybe there are readers out there whose real passions are in analysis – in these cases, these people should be told that math, computers, etc. are simply supporting skills and tools that are available to help them in their analysis endeavors. They need not be ends in themselves, but rather serve as means to an end.

More about Steve Donaho: Dr. Steve Donaho has 20 years of experience architecting solutions for discovering interesting patterns in large quantities of data. He has placed in the Top 10 in multiple Kaggle competitions across a wide variety of areas including stock market sentiment analysis, insurance, name resolution, retail sales prediction, pharmaceutical sales prediction, and airline arrival times. Prior to starting Donaho Analytics, he was Director of Research at Mantas (now part of Oracle Financial Services), a leader in delivering business intelligence to the financial services industry. At Mantas he was a driving force behind the creation of much of their new analytics technology and was an inventor on four of the company’s patents. He has published and spoken at multiple Knowledge Discovery in Databases (KDD) conferences on topics including algorithms for detecting fraud and insider trading. His areas of domain expertise include fraud detection, money laundering detection, financial markets, banking and brokerage, healthcare, telecommunications, and customer analytics.

This website is provided by John Wiley & Sons Limited, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ (Company No: 00641132, VAT No: 376766987)

Published features on StatisticsViews.com are checked for statistical accuracy by a panel from the European Network for Business and Industrial Statistics (ENBIS), to whom Wiley and StatisticsViews.com express their gratitude. The panel members are: Ron Kenett, David Steinberg, Shirley Coleman, Irena Ograjenšek, Fabrizio Ruggeri, Rainer Göb, Philippe Castagliola, Xavier Tort-Martorell, Bart De Ketelaere, Antonio Pievatolo, Martina Vandebroek, Lance Mitchell, Gilbert Saporta, Helmut Waldl and Stelios Psarakis.