How to Get Started with Kaggle

4-Step Process for Getting Started and Getting Good at Competitive Machine Learning

Kaggle is a community and site for hosting machine learning competitions.

Competitive machine learning can be a great way to develop and practice your skills, as well as demonstrate your capabilities.

In this post, you will discover a simple 4-step process to get started and get good at competitive machine learning on Kaggle.

Let’s get started.

How to Get Started with Kaggle
Photo by David Mulder, some rights reserved.

Advice on Kaggle?

I get a lot of questions via email asking:

How can I get started on Kaggle?

I took my last response to this question and decided to turn it into this blog post.

I hope you find it useful.

Why Kaggle?

There are many ways to learn and practice applied machine learning.

Kaggle has some specific benefits that you should seriously consider:

  • The problems are well defined and all of the available data is provided directly.
  • It is harder to fool yourself with a bad test setup given the harsh truth of the public and private leaderboards.
  • There is often great discussion and sharing around each competition that you can learn from and to which you can contribute.
  • You can build up a portfolio of projects on difficult real-world datasets that can demonstrate your skill.
  • It is a complete meritocracy where skill and the ability to deliver are the defining factors, not where you went to school, the math you know, or how many degrees you have.

Overview

I recommend a simple 4-step process. The steps are:

  1. Pick a platform.
  2. Practice on standard datasets.
  3. Practice old Kaggle problems.
  4. Compete on Kaggle.

The process is easy to describe, but difficult to implement. It is going to take time and effort. It is going to be hard work.

But…

It will pay off, and if you are methodical and stick to it, you can become a world-class machine learning practitioner.

You could dive straight into step 4, and that may be right for you, but I designed the process to maximize the chance you’ll stick with it and get an above-average outcome.

Let’s take a look at each step in a little more detail.

1. Pick a Platform

There are many machine learning platforms to choose from, and you may end up using many of them, but start with one.

I recommend Python.

Why?

  • Demand for Python machine learning skills is growing.
  • Python is a general-purpose programming language (R is more narrowly focused on statistics).
  • The ecosystem is sufficiently mature (sklearn, pandas, statsmodels, xgboost, etc.).
  • The platform has some of the best deep learning tools (Theano, TensorFlow, Keras).

Pick a platform and start learning how to use it.
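A quick first exercise is to check which of the libraries listed above are already installed. The snippet below is a minimal sketch that uses only the standard library. The function name missing_packages is my own, and the package list assumes the usual import names (for example, scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec

# Import names for the libraries mentioned above (assumed, not exhaustive).
PACKAGES = ["sklearn", "pandas", "statsmodels", "xgboost",
            "theano", "tensorflow", "keras"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(PACKAGES)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All listed packages are available.")
```

The output depends on your environment. You do not need everything on the list to get started; one general machine learning library and one data handling library are enough for the next step.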

2. Practice on Standard Datasets

Once you pick a platform, you need to get very good at using it on real datasets.

I recommend working through a suite of standard machine learning problems on the UCI machine learning repository or similar.

Treat each dataset as a mini competition.

  • Split it into a training set and a held-back test set, then split the test set into a public and a private leaderboard set.
  • Outline a process for working through each dataset, stick to it, and add to it until you can easily get top results on any small dataset you tackle.
  • Time-box each dataset to one or a few hours.
  • Leverage publications on and related to the dataset to aid in better defining a given problem and interpreting the features.
  • Learn how to get the most out of the tool, out of the algorithms, and out of a dataset.
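The mini-competition split in the first bullet can be sketched as follows. This is a minimal example using only the standard library; the function name, the 30% test fraction, and the 50/50 public/private split are assumptions for illustration, not fixed rules. In practice, a library helper such as scikit-learn's train_test_split does the same job.

```python
import random


def mini_competition_split(rows, test_fraction=0.3, private_fraction=0.5, seed=42):
    """Split rows into train, public-leaderboard, and private-leaderboard sets."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    # Hold back a test set first.
    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Then divide the held-back test set into private and public parts.
    n_private = round(len(test) * private_fraction)
    private, public = test[:n_private], test[n_private:]
    return train, public, private


rows = list(range(100))  # stand-in for 100 dataset rows
train, public, private = mini_competition_split(rows)
print(len(train), len(public), len(private))  # 70 15 15
```

Tune your models on the training set, check progress on the public set as often as you like, and score the private set only once at the end, just as a real competition would.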

Think of this part as drills. Get good.
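One way to make these drills repeatable is a small spot-check harness that evaluates a few algorithms the same way on every dataset. The sketch below assumes scikit-learn is installed and uses its built-in iris dataset as a stand-in for a UCI problem. The choice of three models and 5-fold cross-validation is mine and is only a starting point.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Built-in dataset, used here in place of a downloaded UCI dataset.
X, y = load_iris(return_X_y=True)

# A small suite of algorithms to spot-check; scaling lives inside the
# pipeline so it is fit on the training folds only.
models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "tree": DecisionTreeClassifier(random_state=0),
}

# The same folds for every model, so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Swap in a new dataset and extend the models dictionary as your process grows, and keep the evaluation setup fixed so results stay comparable across projects.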

Keep the projects as part of your portfolio to leverage on each new project you tackle.

3. Practice Old Kaggle Problems

Now that you know your tools and how to use them, it’s time to practice on old Kaggle datasets.

You can access the datasets for past Kaggle competitions. You can also post candidate solutions and have them evaluated on the public and private leaderboard.
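A candidate solution is typically uploaded as a CSV file of predictions. The sketch below writes one with the standard library. The column names id and prediction and the helper name write_submission are hypothetical; each competition defines its own required format, so check the sample submission file it provides.

```python
import csv
import io


def write_submission(ids, predictions, out, id_column="id", target_column="prediction"):
    """Write one row per test example: its identifier and the predicted value."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_column, target_column])  # header row
    writer.writerows(zip(ids, predictions))


# Made-up identifiers and predictions, written to an in-memory buffer.
buffer = io.StringIO()
write_submission([101, 102, 103], [0, 1, 1], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To produce a real file, pass an open file handle instead of the buffer, for example open("submission.csv", "w", newline="").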

I recommend working through a suite of Kaggle problems from the last few years.

This step is designed to help you learn how top performers approach competitive machine learning and to learn how to integrate their methods into your processes.

  • Select a variety of different problem types that force you to learn and apply new and different techniques.
  • Study the forum posts, winner write-up blog posts, GitHub repositories, and all other information for a problem and incorporate the methods into your process.
  • Aim for a score in the top 10% or better on the public or private leaderboard.
  • Try to incorporate as much of the winners’ methods as possible into your candidate solutions.
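To check a result against a target such as the top 10%, convert a leaderboard rank into a percentile. The sketch below shows the arithmetic; the function names and the example ranks are made up for illustration.

```python
def leaderboard_percentile(rank, total_teams):
    """Return the rank as a percentage of the field (lower is better)."""
    if not 1 <= rank <= total_teams:
        raise ValueError("rank must be between 1 and total_teams")
    return 100.0 * rank / total_teams


def in_top(rank, total_teams, percent):
    """True if the rank falls within the top `percent` of the leaderboard."""
    return leaderboard_percentile(rank, total_teams) <= percent


# Hypothetical leaderboard with 2000 teams.
print(leaderboard_percentile(150, 2000))  # 7.5
print(in_top(150, 2000, 10))              # True
print(in_top(450, 2000, 10))              # False
print(in_top(450, 2000, 25))              # True
```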

Think of this as advanced drills. Get good at thinking like competition winners and using their methods and tools.

Again, add each completed project to your portfolio for leverage on future projects.

4. Compete on Kaggle

You are now ready to compete on Kaggle.

Get after it.

  • Consider working on one problem at a time until you top out or get stuck.
  • Aim for achieving a top 25% or top 10% result on the private leaderboard for each competition you tackle.
  • Share liberally on the forum; this will lead to collaborations.
  • Minimize the time between reading about or thinking of a good idea and implementing it (e.g. minutes).

Have fun.

They may be competitions, but you’re participating to learn and share.

Summary

In this post, you discovered a simple 4-step process for getting started with and getting good at competitive machine learning on Kaggle.

Have you participated in a Kaggle competition?
How did you get started?

Have you decided to follow this process?
How are you doing with it?

16 Responses to How to Get Started with Kaggle

  1. Carlos Aguayo March 10, 2017 at 2:31 pm

    Typo, you mention 3 step process but describe 4.

    Great post! Kaggle is great!

  2. Mohamed Abbas March 29, 2017 at 6:33 am

    Very useful post. I registered on Kaggle two years ago but could not force myself to join competitions. After reading this post, I decided to start competing. Thanks a lot.

  3. Gurvijay October 26, 2017 at 8:50 am

    This is an awesome post!
    I am a computer engineer who lost his path, did his MBA, and went into finance, only to learn that typical “IB/Sales and Trading” roles are few and far between and that the momentum is towards all things machine learning and data science.

    So I need to hone my skills and this post is exactly the motivation I was looking for!

    Cheers.

  4. Himanshu April 13, 2018 at 2:09 am

    Thanks a lot for this post. I have learnt about R through various websites and want to gain hands-on experience, but didn’t know where to start. Hopefully I will start and keep getting better. Thanks once again.

  5. Arun Kumar Sahoo September 2, 2018 at 1:39 am

    I chose Data Science with R as my platform. Please suggest some steps to get started with Kaggle.

  6. JFeng November 14, 2018 at 9:46 am

    Thank you for the post, Jason! I always learn something from you!

    Are there any classic Kaggle datasets that you recommend starting with in Step 3?

  7. Mayur Pande January 4, 2019 at 10:11 pm

    Thanks, I am going to follow this, but a quick question: I have used pandas quite a lot and feel mostly comfortable with it in my projects, but these aren’t really data science related; it is mostly used for sorting data. So is it worth skipping to step 2 straight away? Thanks

    • Jason Brownlee January 5, 2019 at 6:57 am

      Start with the steps with which you’re the most comfortable.
