Last Updated on September 5, 2016
Real-world examples make the abstract description of machine learning concrete.
In this post you will go on a tour of real-world machine learning problems. You will see how machine learning can actually be used in fields like education, science, technology and medicine.
Each machine learning problem listed also includes a link to the publicly available dataset. This means that if a particular machine learning problem interests you, you can download the dataset and start practicing immediately.
Most Popular Kaggle Datasets
These first 10 examples of machine learning problems were taken from the competitive machine learning website Kaggle.com. Popularity was based on the number of participating teams.
- Otto Group Product Classification Challenge. Given features of products, classify each product into one of 9 product categories.
- Rossmann Store Sales. Given historical sales data for products across stores, forecast future sales.
- Bike Sharing Demand. Given hourly bike rental and weather records, predict future hourly bike rental demand.
- The Analytics Edge. Given details of New York Times articles, predict which newspaper articles will be popular.
- Restaurant Revenue Prediction. Given the details of a restaurant site, predict the revenue of the restaurant in a given year.
- Liberty Mutual Group: Property Inspection Prediction. Given the details of inspected properties, predict a hazard score for properties.
- Springleaf Marketing Response. Given features of customers, predict whether they are a marketing target or not.
- Higgs Boson Machine Learning Challenge. Given the description of simulated particle collisions, predict whether an event is a Higgs boson decay signal or background.
- Forest Cover Type Prediction. Given cartographic variables, predict forest cover type.
- Amazon.com Employee Access Challenge. Given historical resource access changes for employees, predict the resources required by employees.
Most Popular Research Datasets
The next 12 machine learning problems are the most popular on the University of California, Irvine (UCI) Machine Learning Repository, a website that traditionally hosts datasets used by the machine learning research community.
- Iris dataset. Given flower measurements in centimeters, predict the species of iris.
- Adult dataset. Given census data, predict whether an individual will earn more than $50,000 a year.
- Wine dataset. Given a chemical analysis of wines, predict the origin of the wine.
- Car evaluation dataset. Given details about cars (including price, capacity and estimated safety), predict the overall acceptability of the car.
- Breast Cancer Wisconsin dataset. Given the results of a diagnostic test on breast tissue, predict whether the mass is malignant or benign.
- Abalone dataset. Given the measurements of abalone, predict the age of the abalone.
- Wine Quality dataset. Given various measurements of wine, predict the quality of the wine.
- Heart Disease dataset. Given the results of various diagnostic tests on a patient, predict the presence and severity of heart disease in the patient.
- Poker Hand dataset. Given a database of poker hands, predict the quality of the hand.
- Human activity recognition using smartphones dataset. From smartphone movement data, predict the type of activity performed by the person holding the smartphone.
- Forest fires dataset. Given meteorological and other factors, predict the burned area of forest fires.
- Internet Advertisements dataset. Given the details of images on web pages, predict whether an image is an advertisement or not.
We took a whirlwind tour of 22 real-world machine learning problems.
These are actual problems posed or investigated by science and business organizations around the world.
What’s even more exciting is that these diverse problems have publicly available datasets and are also widely studied and understood.
This means you can download the data right now and explore the problem by implementing your own model, or reproduce someone else’s from a paper or blog post.
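As a starting point, here is a minimal sketch of what "exploring the problem" might look like for the Iris dataset. It assumes you have scikit-learn installed and uses the copy of the dataset bundled with that library rather than a download from the repository. The choice of a decision tree, the 70/30 split and the random seed are illustrative assumptions, not a recommended recipe.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the copy of the Iris dataset that ships with scikit-learn.
iris = load_iris()
X, y = iris.data, iris.target  # 4 flower measurements, 3 species labels

# Hold back 30% of the rows to estimate performance on unseen data.
# The split size and seed are arbitrary choices for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Fit a simple baseline model on the training rows only.
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

# Score the model on the held-out rows.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

The same load, split, fit and score pattern should carry over to the other classification datasets listed above once you have read them into arrays.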
I am very much impressed by this article, sir. It really helped a lot. Thank you, sir.
Dear Mr. Jason,
Hundreds of thousands of students decide to take up machine learning, but more than half of them drop out from sheer fear of the subject's complexity. You, on the other hand, did a fantastic job of explaining the subject with such ease. I just wanted to extend a warm gesture of gratitude. Thanks a lot for helping me and thousands of others like me. Thank you.
Thanks for your kind words Paul.
Hi Jason! 🙂
I’m planning on playing around with the poker data set above and was going to try it with LDA, CART and finally Gradient Boosted Decision Trees (GBDT) with XGBoost, but I’m concerned about the classification process since some hands could fit into more than one class. Ideally, you want to predict the best possible hand out of multiple possibilities so I wasn’t quite sure how this may be done. Logically, I guess, you’d somehow determine all possible classes a hand could fit in and then use the class with the greatest value as the final answer since the classes increase as the hand improves. Any suggestions on this approach? What other models would you suggest trying for multi-class classification?
Thanks! Love your books so far!!! 😀
Sounds like an interesting problem. Sorry, I’m not familiar with it, so I’m hesitant to make suggestions.
Awesome! Thank you, Jason.
Thanks, I’m glad it helped.
Your knowledge is very vast and details over here are excellent. Thanks a lot.
I was looking for prediction models of application behavior, for example to predict when an application may crash or when it may start behaving differently.
Any help on this would be excellent.
Perhaps try searching on scholar.google.com
Thanks a lot. Let me search over there.
Thanks, Jason, for the wonderful tip. I am from a non-computer-science background. I hear cool things about data science, so I wanted to learn machine learning. I just wanted to ask you a few questions. I can see a lot of POCs, research projects and sample datasets for practicing machine learning, but:
If I get a job as a data scientist, what level of work would I be doing?
Is it using existing libraries to come up with a model, or inventing new algorithms?
If the big companies have ready-made drag-and-drop models available on the cloud platforms, what is the need for a data scientist there?
Regarding jobs/roles, this might help:
Yes, existing libraries like scikit-learn are recommended and will do all the hard work:
Models are easy; preparing the data and discovering which model is appropriate (via experimentation/prototyping) requires humans/domain knowledge/intuition/data scientists.
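To make "discovering which model is appropriate via experimentation" a little more concrete, here is a rough sketch that compares a few scikit-learn models on the Wine dataset using 5-fold cross-validation. The three candidate models, their settings and the use of the copy of the dataset bundled with scikit-learn are all assumptions made for illustration; on a real problem the candidates and the data preparation would depend on the domain.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the copy of the Wine dataset that ships with scikit-learn.
X, y = load_wine(return_X_y=True)

# Candidate models to prototype. Scaling is placed inside the pipeline so it
# is re-fit on each training fold rather than on the full dataset.
candidates = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k-nearest neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "decision tree": DecisionTreeClassifier(random_state=1),
}

# Estimate the accuracy of each candidate with 5-fold cross-validation.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Pick the candidate with the highest mean accuracy as a starting point.
best = max(results, key=results.get)
print("Best candidate:", best)
```

The library does the fitting and scoring; deciding which candidates to try, how to prepare the data and whether accuracy is even the right measure is still up to you.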
Thanks, Jason, for all the inputs on ML. I was browsing through different study material but could not find out how an ML model stores the information of a trained model. Is it a binary file created by pickle, or does the model have its own database where it memorizes the patterns used to predict on the next dataset?
Any study material would be helpful. Thanks once again in advance.
Different models have a different internal representation.
For example CART is a decision tree, a neural network is a set of weights, etc.
The model specific representation is saved to file.
Does that help?
This helped a lot. Thanks. Where do we get this mapping, given that once models are trained and saved using pickle they are stored as a binary file?
If you use pickle, then the internal representation does not matter as pickle will handle the saving and loading.
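Here is a small sketch of that idea using only the Python standard library. The "model" is a hypothetical hand-written decision tree stored as nested dictionaries, with made-up thresholds rather than learned ones, and the `predict` helper and the file name are also just for illustration. The point is that pickle writes whatever internal representation the object has to a binary file and rebuilds the same object on load.

```python
import os
import pickle
import tempfile

# Hypothetical decision tree: each internal node tests one feature against a
# threshold, and each leaf holds a class label. The thresholds are made up
# for illustration and were not learned from data.
tree = {
    "feature": 2, "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": 3, "threshold": 1.75,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, row):
    """Walk from the root to a leaf and return the label stored there."""
    while "label" not in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

# Save the model to a binary file, then load it back.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(tree, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored object has the same structure and makes the same predictions.
sample = [5.1, 3.5, 1.4, 0.2]
print(restored == tree)
print(predict(restored, sample))
```

A fitted scikit-learn model can be saved and loaded with `pickle.dump` and `pickle.load` in the same way. Only load pickle files that you trust, because unpickling can execute arbitrary code.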
Thanks once again for your input.
Thank you Jason for this post. It gives motivation to look at different applications of Machine Learning before diving into it.
Hello Mr. Jason, thanks a lot for sharing your intelligence with us. God will bless you for your good work.
Thank you for the feedback Suganya!