How a Beginner Used Small Projects To Get Started in Machine Learning and Compete on Kaggle

It is valuable to get insight into how real people are getting started in machine learning.

In this post you will discover how a beginner (just like you) got started and is making great progress in applying machine learning.

I find interviews like this absolutely fascinating because of all of the things you can learn. I’m sure you will too.

Use Small Projects To Get Started

Photo by pixonomy, some rights reserved.

Q. What resources did you use to get started in machine learning?

  1. The famous online course “Machine Learning” by Andrew Ng.
  2. The book “An Introduction to Statistical Learning: with Applications in R” by Gareth James et al.
  3. Seriously competing in a kaggle featured competition, trying to get the highest rank possible. Very intense, great learning experience.
  4. Your newsletter and blog, and a couple of guides you wrote. Especially “Small Projects Methodology“.

This approach worked well to get a working knowledge of different techniques of machine learning: What technique works in which scenarios and what are the rough theoretical basics of each one.

The above approach didn’t help so much in understanding the theoretical basics of statistics and Data Analysis, which are a big part of “Data Science” outside of kaggle.

Q. How did you go on the Kaggle competition?

I found kaggle through some online course about Data Analysis with Excel (of all places).

Probably like most kaggle users I started with the famous “Titanic” learning competition in R. Later I completed another entry level competition and even wrote a tutorial for that, which got a lot of attention.

I started diving into the whole matter with very little programming knowledge. After messing around with R, a colleague and I competed in a serious kaggle competition using Python.

Although the learning curve was very steep, the experience did wonders for my programming and data analysis skills.

After that however I got diverted from Data Analysis, concentrating more on Software Development and trying to make a (new) career out of that.

A great help to me during the competition was your website.

As a rough guideline we used the workflow described in your process for working through machine learning problems.

Later on I gathered some more useful tips and information from other posts on your site, for example the post on competitive machine learning or the post on how to get good at feature engineering.

Q. How would you compare Python and R for working on Kaggle competitions?

Coming from an Excel background, I found R and its concept of data frames to be more accessible at first, especially when working with RStudio.

Diving deeper into the matter, I found Python more comfortable, however. The Anaconda distribution offers an out-of-the-box package with everything you need: Pandas, scikit-learn and IPython Notebook.

Pandas offers DataFrames that work much like the ones from R. Scikit-learn offers an immense range of ready-to-use machine learning tools that share a common API. The documentation is very good, with a lot of examples. If you’re picking up programming and machine learning at the same time, this is an immense help.
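That shared API is worth seeing in action. Here is a minimal sketch (with a tiny made-up DataFrame, not data from any particular competition) showing how two very different scikit-learn models are trained and used through the exact same fit/predict interface:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# A tiny invented dataset held in a pandas DataFrame
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feature_b": [0.5, 1.5, 1.0, 3.5, 4.0, 5.5],
    "label":     [0,   0,   0,   1,   1,   1],
})
X = df[["feature_a", "feature_b"]]
y = df["label"]

# Every scikit-learn estimator exposes the same fit/predict methods,
# so swapping one model for another is a one-line change.
for model in (LogisticRegression(), RandomForestClassifier(n_estimators=10)):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X))
```

Because the interface is uniform, trying a handful of algorithms on the same dataset costs almost nothing, which is exactly what you want when you are still learning which technique fits which problem.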

For me, this was one of Python’s major advantages over R. R offers an enormous amount of ready-to-use libraries. However, you have to seek them out on your own, and the documentation is often written in very scientific language, which can leave you lost if you are a beginner who just wants to classify a dataset.

Working in a team, IPython Notebook is an invaluable tool for sharing and editing code together. For beginners it also makes code a lot more understandable when you can see the result of each line immediately. As a bonus, a lot of documentation and examples are written as IPython notebooks; there are even whole books written in it.

If you don’t know IPython notebook, check out nbviewer. Once you’ve worked with it, it’s hard to work without it, really.

Q. Could you elaborate on experience with the Coursera Machine Learning course?

As a total beginner the course provided me with valuable insights into the basic principles behind statistical data analysis.

The course assumes virtually no prior knowledge beyond high school math and gives you a basic rundown of some of the most used techniques and practices in machine learning. The homework in Octave/Matlab was challenging but not too hard, and the community is helpful. The programming skills required are very basic. I would recommend it to anyone who wants to get into machine learning and is looking for a place to start.

The only downside of the course in my opinion was that it completely neglected learning methods based on decision trees or ensemble methods. As a beginner (especially on kaggle) you will see a lot of Random Forests and Gradient Boosting around. This course will not help you very much in understanding them.
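The good news is that the tree-ensemble methods the course skips are easy to experiment with in scikit-learn. A minimal sketch, using synthetic data from `make_classification` as a stand-in for a real competition dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (500 rows, 10 features)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train the two ensemble methods you see everywhere on kaggle
# and compare their held-out accuracy.
for Model in (RandomForestClassifier, GradientBoostingClassifier):
    clf = Model(random_state=0).fit(X_tr, y_tr)
    print(Model.__name__, "test accuracy:", round(clf.score(X_te, y_te), 3))
```

Running both on the same split gives a quick feel for how bagging (Random Forest) and boosting (Gradient Boosting) behave before you dig into the theory behind them.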

The format of weekly lectures and homework made completing the course a rewarding experience. Unfortunately, they recently changed the course to a different format so your experience may vary.

Q. How did you make use of the Small Projects Methodology?

I bought it early on in search of an angle to approach the vast field of machine learning as a total beginner.

Each person is a different learner, but for me I’ve discovered that I learn best when solving practical problems. So the idea of getting to know the field by completing small projects – each one a little “deeper” into the matter than the previous one – makes a lot of sense to me. If you look at the Small projects tree and compare it with my resources mentioned above, you can see that there are a lot of similarities.

As for studying algorithms: In my opinion the best way to do it is implementing them.

A big part of Andrew Ng’s machine learning course consists of implementing basic versions of popular algorithms, and it tremendously boosted my understanding of what really happens “behind the scenes” and why. I would argue that fiddling with parameters and running experiments will not give you such a clear picture. When it comes to deciding which algorithm is best suited for which task and how to optimize it, having implemented an algorithm yourself will give you much better insight into how to use it. But ultimately that comes down to personal preference.
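To make the "implement it yourself" idea concrete: one of the first exercises in Ng's course is fitting linear regression with batch gradient descent. A short Python sketch of that same exercise (the data here is made up so the true parameters are known):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=2000):
    """Fit linear regression by batch gradient descent,
    in the spirit of the early course exercises (here in Python)."""
    X = np.c_[np.ones(len(X)), X]             # prepend a bias column
    theta = np.zeros(X.shape[1])              # start from all-zero parameters
    for _ in range(steps):
        error = X @ theta - y                 # prediction error on all samples
        theta -= lr * (X.T @ error) / len(y)  # step against the mean gradient
    return theta

# Noiseless data generated from the known line y = 2x + 1,
# so a correct implementation should recover theta ≈ [1.0, 2.0]
x = np.linspace(0, 1, 50)
theta = gradient_descent(x.reshape(-1, 1), 2 * x + 1)
print("fitted parameters:", theta)
```

Writing those ten lines forces you to understand the cost function, the gradient, and the role of the learning rate – exactly the "behind the scenes" picture that calling a library's `fit()` never shows you.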

For anyone looking for a scope for small projects, I can definitely recommend competing on kaggle. It will “force” you (so to speak) to study your machine learning tool and to study the algorithms. You will need both for a spot near the top of the leaderboard. Combining the small projects methodology with kaggle competitions will make you proficient in machine learning very fast. And it’s fun at the same time.

Final Word

This was a great interview and I especially like the honest commentary on his experience with Python and R.

I know that the small projects approach is a powerful tool and it is fantastic to see first hand examples of beginners putting it to great use and getting impressive results.

Do you have a story of how you got started in machine learning? Share it in the comments!
