How to Define Your Machine Learning Problem

The first step in any project is defining your problem. You can use the most powerful and shiniest algorithms available, but the results will be meaningless if you are solving the wrong problem.

In this post you will learn the process for thinking deeply about your problem before you get started. This is unarguably the most important aspect of applying machine learning.

What is the problem?
Photo attributed to Eleaf, some rights reserved

Problem Definition Framework

I use a simple framework when defining a new problem to address with machine learning. The framework helps me to quickly understand the elements and motivation for the problem and whether machine learning is suitable or not.

The framework involves answering three questions to varying degrees of thoroughness:

  • Step 1: What is the problem?
  • Step 2: Why does the problem need to be solved?
  • Step 3: How would I solve the problem?

Step 1: What is the problem?

The first step is defining the problem. I use a number of tactics to collect this information.

Informal description

Describe the problem as though you were describing it to a friend or colleague. This can provide a great starting point for highlighting gaps that you might need to fill. It also provides the basis for a one-sentence description you can use to share your understanding of the problem.

For example: I need a program that will tell me which tweets will get retweets.

Formalism

In a previous blog post defining machine learning you learned about Tom Mitchell’s machine learning formalism. Here it is again to refresh your memory.

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Use this formalism to define the T, P, and E for your problem.

For example:

  • Task (T): Classify a tweet that has not been published as going to get retweets or not.
  • Experience (E): A corpus of tweets for an account where some have retweets and some do not.
  • Performance (P): Classification accuracy: the number of tweets predicted correctly out of all tweets considered, as a percentage.
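
The performance measure above can be written down in a few lines. This is a minimal sketch, not a definitive implementation: the function name and the example labels are made up for illustration, with True standing for "the tweet was retweeted".

```python
def classification_accuracy(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one tweet is required")
    # Count the positions where the prediction agrees with the real outcome.
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


# Hypothetical labels for five tweets (True means the tweet was retweeted).
actual = [True, False, True, True, False]
predicted = [True, False, False, True, True]

# Three of the five predictions match, so this prints 60.0.
print(classification_accuracy(actual, predicted))
```

Writing P as code early on forces you to decide exactly what counts as a correct prediction before any model exists.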

Assumptions

Create a list of assumptions about the problem and its phrasing. These may be rules of thumb and domain-specific information that you think will get you to a viable solution faster.

It can be useful to highlight questions that can be tested against real data because breakthroughs and innovation occur when assumptions and best practice are demonstrated to be wrong in the face of real data. It can also be useful to highlight areas of the problem specification that may need to be challenged, relaxed or tightened.

For example:

  • The specific words used in the tweet matter to the model.
  • The specific user that retweets does not matter to the model.
  • The number of retweets may matter to the model.
  • Older tweets are less predictive than more recent tweets.
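
An assumption such as "the specific words used in the tweet matter" can be turned into a question you test against data. The sketch below is one possible way to do that, assuming tweets are stored as (text, was_retweeted) pairs. The function names and the tiny corpus are invented for illustration only, so a real check needs your own data.

```python
def retweet_rate(tweets):
    """Fraction of (text, was_retweeted) pairs that were retweeted."""
    if not tweets:
        return 0.0
    return sum(1 for _, retweeted in tweets if retweeted) / len(tweets)


def compare_word(tweets, word):
    """Return retweet rates for tweets with and without the given word."""
    word = word.lower()
    with_word = [t for t in tweets if word in t[0].lower().split()]
    without_word = [t for t in tweets if word not in t[0].lower().split()]
    return retweet_rate(with_word), retweet_rate(without_word)


# Made-up toy corpus, for illustration only.
tweets = [
    ("new tutorial on decision trees", True),
    ("coffee time", False),
    ("tutorial on gradient boosting", True),
    ("monday again", False),
    ("free tutorial inside", False),
]

rate_with, rate_without = compare_word(tweets, "tutorial")
print(rate_with, rate_without)
```

A large gap between the two rates on real data would support the assumption, and no gap would be a reason to challenge it.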

Question Everything!
Photo attributed to dullhunk, some rights reserved

Similar problems

What other problems have you seen or can you think of that are like the problem you are trying to solve? Other problems can inform the problem you are trying to solve by highlighting limitations in your phrasing of the problem, such as time dimensions and concept drift (where the concept being modeled changes over time). Other problems can also point to algorithms and data transformations that could be adopted to spot check performance.

For example: A related problem would be email spam discrimination, which uses text messages as input data and requires a binary classification decision.

Step 2: Why does the problem need to be solved?

The second step is to think deeply about why you want or need the problem solved.

Motivation

Consider your motivation for solving the problem. What need will be fulfilled when the problem is solved?

For example, you may be solving the problem as a learning exercise. This is useful to clarify as you can decide that you don’t want to use the most suitable method to solve the problem, but instead you want to explore methods that you are not familiar with in order to learn new skills.

Alternatively, you may need to solve the problem as part of a duty at work, ultimately to keep your job.

Solution Benefits

Consider the benefits of having the problem solved. What capabilities does it enable?

It is important to be clear on the benefits of the problem being solved to ensure that you capitalize on them. These benefits can be used to sell the project to colleagues and management to get buy-in and additional time or budget.

If it benefits you personally, then be clear on what those benefits are and how you will know when you have got them. For example, if it’s a tool or utility, then what will you be able to do with that utility that you can’t do now and why is that meaningful to you?

Solution Use

Consider how the solution to the problem will be used and what type of lifetime you expect the solution to have. As programmers we often think the work is done as soon as the program is written, but really the project is just beginning its maintenance lifetime.

The way the solution will be used will influence the nature and requirements of the solution you adopt.

Consider whether you are looking to write a report to present results or you want to operationalize the solution. If you want to operationalize the solution, consider the functional and nonfunctional requirements you have for a solution, just like a software project.

Step 3: How would I solve the problem?

In this third and final step of the problem definition, explore how you would solve the problem manually.

List step by step what data you would collect, how you would prepare it, and how you would design a program to solve the problem. This may include prototypes and experiments you would need to perform. These are a gold mine because they highlight questions and uncertainties you have about the domain that could be explored.

This is a powerful tool. It can highlight problems that actually can be solved satisfactorily using a manually implemented solution. It also flushes out important domain knowledge that has been trapped up until now, such as where the data is actually stored, what types of features would be useful, and many other details.
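
For the retweet example, a manually specified solution might look like the sketch below. The rules are hypothetical guesses, not findings from data: the 100-character cut-off and the idea that links or hashtags help are assumptions made up for illustration. Writing them down is still useful, because each rule becomes an assumption you can later test.

```python
def predict_retweet(text):
    """Hand-written guess: will this tweet be retweeted?"""
    words = text.lower().split()
    has_link = any(w.startswith("http") for w in words)
    has_hashtag = any(w.startswith("#") for w in words)
    is_short = len(text) <= 100  # hypothetical cut-off, not derived from data
    # Hypothetical rule: short tweets that carry a link or a hashtag get retweeted.
    return is_short and (has_link or has_hashtag)


print(predict_retweet("New post on #machinelearning https://example.com"))
print(predict_retweet("good morning"))
```

If hand-written rules like these already perform well enough, you may not need machine learning at all. If they do not, they still give you a baseline and a list of candidate features.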

Collect all of these details as they occur to you and update the previous sections of the problem definition, especially the assumptions and rules of thumb.

We have considered a manually specified solution before when describing complex problems in why machine learning matters.

Summary

In this post you learned the value of being clear on the problem you are solving. You discovered a three-step framework for defining your problem with practical tactics at each step:

  • Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar problems.
  • Step 2: Why does the problem need to be solved? List your motivation for solving the problem, the benefits a solution provides and how the solution will be used.
  • Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush out domain knowledge.
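
If you prefer to keep the definition next to your code, the three steps can be captured in a small record. This is only one possible layout: the class and field names are invented for illustration, and the example values reuse the tweet problem from this post.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemDefinition:
    # Step 1: What is the problem?
    informal_description: str
    task: str
    experience: str
    performance: str
    assumptions: list = field(default_factory=list)
    similar_problems: list = field(default_factory=list)
    # Step 2: Why does the problem need to be solved?
    motivation: str = ""
    benefits: str = ""
    intended_use: str = ""
    # Step 3: How would I solve the problem?
    manual_solution: str = ""

    def missing_sections(self):
        """Names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


definition = ProblemDefinition(
    informal_description="I need a program that will tell me which tweets will get retweets.",
    task="Classify a tweet that has not been published as going to get retweets or not.",
    experience="A corpus of tweets for an account where some have retweets and some do not.",
    performance="Classification accuracy.",
    assumptions=["The specific words used in the tweet matter to the model."],
    similar_problems=["Email spam discrimination"],
)

# Lists the Step 2 and Step 3 sections that have not been filled in yet.
print(definition.missing_sections())
```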

How do you define your problem for machine learning? Have you used any of the above tactics and if so, what were your experiences? Leave a comment.

131 Responses to How to Define Your Machine Learning Problem

  1. Avatar
    Gusseppe May 2, 2016 at 6:12 pm #

    This is so useful!

  2. Avatar
    Joe Arakkal July 10, 2016 at 2:12 pm #

    Very well thought out way of the very first step – purpose. Aka problem definition.
    Thanks Jason. Always a joy to read your blog. Keep it coming so you can continue to educate the world. Many thanks to you !!

  3. Avatar
    Kola July 31, 2016 at 9:28 am #

    Thank you so much for this post. It has given me a direction on how to kick-start my new task. Please, is there another post that extends the steps discussed here?

    • Avatar
      Jason Brownlee August 1, 2016 at 6:28 am #

      Glad to hear it Kola. This might be the only place where I discuss these ideas.

  4. Avatar
    Federica September 19, 2016 at 6:35 pm #

    This post is very useful, thank you. It helps a lot in keeping the focus on the essential. I am currently writing my first paper and I do really find this very very helpful!

  5. Avatar
    Vicky October 2, 2016 at 4:23 pm #

    I am still trying to understand mine 🙁 (Detection of malicious Office documents using machine learning algorithm) please help

  6. Avatar
    Prakriti October 25, 2016 at 6:54 am #

    Under Step 3, do you mean decide what algorithm are we going to use?

    • Avatar
      Jason Brownlee October 25, 2016 at 8:35 am #

      Not quite Prakriti.

      It suggests thinking of the problem as a programming exercise and thinking about what you might have to do to solve it, what data you would need, what structures, etc.

      This can help to force you to think deeply about the problem upfront and think about what other data or data transforms and maybe even what techniques you might need to use later on.

      See the full process here:
      https://machinelearningmastery.com/start-here/#process

  7. Avatar
    Andrey Naumov October 27, 2016 at 1:32 am #

    Jason, thank you for these thoughts.
    Great way to start and understand what I want/can.

  8. Avatar
    Ivan November 8, 2016 at 7:28 am #

    The questions to steps 1 & 2 were very helpful for writing the introduction to my thesis!

    • Avatar
      Jason Brownlee November 8, 2016 at 10:00 am #

      Glad to hear it Ivan. Nice work! Best of luck with your thesis (been there…).

  9. Avatar
    yasH February 9, 2017 at 1:01 am #

    Chronic Kidney Disease Detection..which Kind of data sets needed

    • Avatar
      Jason Brownlee February 9, 2017 at 7:26 am #

      I don’t know, perhaps you should contact a domain expert?

  10. Avatar
    Norman Abraham September 7, 2017 at 11:59 am #

    Good Evening,

    I’ve been looking for info regarding k-means output and what it means.

    Would you have information regarding same?

    Thank you,

    Norm

    • Avatar
      Jason Brownlee September 7, 2017 at 12:58 pm #

      Sorry, I don’t have material on k-means clustering.

  11. Avatar
    Abhishek September 17, 2017 at 1:51 pm #

    Sir, I want to create a performance analysis tool. In which I will keep track of the amount of work that I do and the amount of rest time I take in between. And I will assign intensity level to my each work. I will multiply that intensity with hour I did that work. And at the end of the day I will sum those which will give my performance. Every day I will try to check my performance with taking different combinations of the intensity of the work and the time that I give. So at the end will ML be able to help me plan my works.So that I will be able to complete my work with least amount of time. Or will ML be able to give me a prediction such that if I enter my total day plan which consists of how much amount i will work and rest I take it will give me %age of possibility of its happening.

  12. Avatar
    Roaa November 16, 2017 at 8:13 am #

    That’s perfect, long live your hand

  13. Avatar
    Alex Lange November 28, 2017 at 7:59 am #

    Great collection of material for this newbie. Thank you! This particular model reminds me of a (once) famous book by George Pólya: How To Solve It, a heuristic approach to mathematics. He has a single-page summary of questions that elaborate on your three questions. You might find it interesting to compare. Full PDF is at https://notendur.hi.is/hei2/teaching/Polya_HowToSolveIt.pdf

    In particular see questions on p. xvi — xvii

    Wikipedia summary: https://en.wikipedia.org/wiki/How_to_Solve_It

    • Avatar
      Jason Brownlee November 28, 2017 at 8:43 am #

      Thanks, yes I remember reading that book as a student.

  14. Avatar
    Rizwan Mian December 23, 2017 at 3:14 am #

    Hi Jason and fellow ML lovers, wonder if somebody can point me to some examples of defining the problem?

  15. Avatar
    Jesús Martínez February 7, 2018 at 10:18 am #

    Thank you very much for this post. As you point out in the post, the third point is key (and arguably the most important when we are defining the context of the problem) and could greatly improve the solution development speed. As in any other area of software engineering, jumping immediately into code is a great recipe for a subpar solution (and bugs!)

  16. Avatar
    Hussain Liyaqatdar February 16, 2018 at 6:17 pm #

    Hi Jason,
    I like the ‘TEP’ approach of problem definition basis the famous definition of Machine Learning by Tom Mitchell . Also I have always thought of ‘why should this be solved’ but I feel that documenting in terms of Solution Benefits or Solution Use and thinking of the Lifetime Maintenance of the project makes complete sense

    Appreciate all your content and learning by the day !

    Also I wanted to ask if investing in a formal distance learning course from the University of Chicago Graham School (for example) will be helpful in terms of validating my abilities in the long run, or whether I should continue with self-learning?

    Thanks,
    Hussain

    • Avatar
      Jason Brownlee February 17, 2018 at 8:43 am #

      Thanks.

      If you like the course, then go for it. I cannot give advice on what would be a good fit for you.

  17. Avatar
    Mahesh February 20, 2018 at 9:29 pm #

    what is difference between q-learning and deep q-learning in terms of high dimensionality of data.

    • Avatar
      Jason Brownlee February 21, 2018 at 6:40 am #

      I hope to cover reinforcement learning in more detail soon.

  18. Avatar
    Hamid April 4, 2018 at 10:38 pm #

    First, many thanks about the contents.
    Second: What are the RL algorithm types? Or do you know a reference in which I can find all possible types? (need it for my thesis)

  19. Avatar
    Hans April 17, 2018 at 8:22 am #

    Hi Jason, I have a question about data: is it possible to conduct fraud detection research without datasets about fraud? If not, is there an alternative?

    • Avatar
      Jason Brownlee April 17, 2018 at 2:48 pm #

      Not with machine learning. Developing a predictive model with machine learning requires a dataset.

  20. Avatar
    Elza April 28, 2018 at 12:10 pm #

    It is instructive, yet of little help with my current problem of training a model on the STL-10 dataset. I am confused about why the loss always fluctuates within a tiny range, with no distinct descent over a large number of epochs. I am new to this field. How could I fine-tune my model parameters? What might be causing this?

  21. Avatar
    Raisa Rasul May 14, 2018 at 3:44 am #

    I am just starting out with Machine Learning and this is very helpful!

  22. Avatar
    PTR July 6, 2018 at 1:42 am #

    It is very helpful…Good one..Keep posting.!!!

  23. Avatar
    Faisal Mohammad July 6, 2018 at 3:55 pm #

    Good, I am starting machine learning using Matlab. Is it better than python?
    My job is related to forecasting the electricity demand. We have many predefined algos in matlab, do you think they will work well?

  24. Avatar
    vinky July 28, 2018 at 1:56 am #

    Thanks, Jason, it really helped me to understand the basic concepts of ML.

  25. Avatar
    Ahmed July 30, 2018 at 2:43 pm #

    Thanks a lot, this is very useful

  26. Avatar
    Kamalendu August 11, 2018 at 12:40 pm #

    It’s amazing..!! thanks Mr Jason

  27. Avatar
    vedaste August 19, 2018 at 6:48 am #

    I am interested! I appreciate all the explanation!

    I was wondering if you could share with us the code for building the algorithm step by step!

  28. Avatar
    zsoh August 27, 2018 at 9:23 pm #

    Really useful. Sometimes (as in my case), we have a bunch of problems but want to solve them in one shot.
    Thanks!

  29. Avatar
    Bewketu October 7, 2018 at 5:55 pm #

    Thank you for this valuable information.

  30. Avatar
    Dan Gustafsson November 4, 2018 at 10:35 pm #

    Hi, great resource, I like step 3, but not step 2! Why? Most people in tech nowadays seem to work in a matrix organisation, therefore collaboration and pushing for finding business value through collaboration is more important than anything else. So step 2, “motivation” section, might need to include others besides yourself, as real beneficiaries. So you need to be close to them, and understand their real need. Solving an ML problem for yourself will benefit no-one else, mostly (unless you assume the role of someone else in your organisation, and solve a problem they have not asked for/thought of). Cheers!

  31. Avatar
    dy November 26, 2018 at 1:04 pm #

    always a helpful writing!

  32. Avatar
    Nidhi December 12, 2018 at 7:02 pm #

    Thanks Jason. This is a very good help to start with machine learning.

  33. Avatar
    Eli Mead December 17, 2018 at 4:53 am #

    Great article, many thanks!

  34. Avatar
    Vijay January 13, 2019 at 5:31 pm #

    Nice information to start Machine Learning.I have read all the steps but can you explain more on Step 3: How would I solve the problem?
    Here I am not getting “Manually” means what ?
    Can you please explain with example…
    Thanks,

    • Avatar
      Jason Brownlee January 14, 2019 at 5:23 am #

      In that case, I’m prompting you to think about how you might approach the problem if you had to code a solution from scratch.

      This cannot be done with a real predictive modeling problem, but can with other problems. It helps sort out that question.

  35. Avatar
    Ali March 2, 2019 at 1:30 am #

    Hello,
    I am a beginner in deep learning, and I would like to classify an MRI image into three classes: white matter, gray matter, and cerebrospinal fluid...?

    • Avatar
      Jason Brownlee March 2, 2019 at 9:33 am #

      Sounds like a fascinating problem.

      Perhaps start with a CNN, and try some transfer learning to get good results very quickly.

  36. Avatar
    adarsh v kumar March 16, 2019 at 4:15 pm #

    sir,could you give your email id

  37. Avatar
    Muhammad iqbal bazmi April 27, 2019 at 4:08 pm #

    Sir, thanks for this post.
    I want to remove music from the song but the human voice.
    How can I do this, any clue?
    Please answer.

    • Avatar
      Jason Brownlee April 28, 2019 at 6:54 am #

      Sounds like an amazing project!

      I don’t have any tutorials related to this, but perhaps search arXiv for related projects and discover what types of methods they are using?

  38. Avatar
    Muhammad iqbal bazmi April 27, 2019 at 4:10 pm #

    Sorry Sir, I don’t want to remove the human voice.

  39. Avatar
    teimoor April 29, 2019 at 5:53 am #

    hi can you please send me a book about machine learning

  40. Avatar
    Gaurav Jain July 3, 2019 at 5:35 pm #

    Thank you, Jason!

  41. Avatar
    Wael Elsayegh August 11, 2019 at 12:55 am #

    Thanks, about the data collection! what if i have an issue with detecting it from videos?
    do you have any tutorials on here?

    • Avatar
      Jason Brownlee August 11, 2019 at 6:01 am #

      I don’t have tutorials on working with video data, I hope to cover the topic in the future.

  42. Avatar
    Kate August 16, 2019 at 9:04 pm #

    Thank you for this article. I found it concise and well written. It will help with my primary research. Having worked with a colleague software dev I was wondering how much ‘data’ you would need to be able to use machine learning algorithms. I was under the impression you need at least a couple thousand per label to get anywhere close to it being indicative. Does it make sense to develop an algorithm on a data set that is 2,500 items spread over 25 different labels?

  43. Avatar
    Bayangmbe October 23, 2019 at 5:55 pm #

    Hi, it’s me again.
    I suggest my machine learning problem: today African farmers practice agriculture without knowing the potential of the soil, with my program, depending on the richness of the soil, I predict an agricultural crop that the farmer can grow without the need for fertilizer.

  44. Avatar
    Jod Baka October 28, 2019 at 2:23 am #

    I would like to know asking yourself those questions, do we need to write down all feature and answer to those question because touching the data that we gonna use ?

    • Avatar
      Jason Brownlee October 28, 2019 at 6:03 am #

      Perhaps use the parts of the framework that best help you work through your problem?

      • Avatar
        hjh February 6, 2020 at 3:15 am #

        I would like to introduce these projects to each other and teach the whole project with Python code. That is, project-based education.
        https://morioh.com/p/b56ae6b04ffc

        • Avatar
          Jason Brownlee February 6, 2020 at 8:33 am #

          Sorry, I don’t understand. Can you elaborate?

  45. Avatar
    ali May 20, 2020 at 6:40 pm #

    hello and thanks
    how to reference your contents in my thesis?

  46. Avatar
    Snehal May 31, 2020 at 10:52 am #

    Wow, this was a great read. One quick question though, Where does actual algorithm development happen. I assume post step 3. Step 3 is collecting data setting models. I think I m a bit unclear on how exactly algorithm will learn over time, or given near to accurate the answer. This could be due to my s/w code development background. Is there any reference page you could guide me to? Thanks much. This page has been super helpful to me.

  47. Avatar
    William Priest June 2, 2020 at 6:53 am #

    Came across your article through my MIT Sloan CSAIL: AI course. This structured approach to problem solving in this space is spot on. As an engineer and data geek I prefer structure that aids in consistently tackling problems. Very helpful.

  48. Avatar
    JB June 3, 2020 at 5:59 pm #

    I think this is really useful, even though I am at the very beginning of understanding machine learning.
    I would say this could be a good process for any problem solving!

  49. Avatar
    Vidya June 5, 2020 at 2:46 pm #

    Hi Jason .

    How would I know that my ML solution is kind of good and there would not be a better way to do it ? Yes , one check point is the model metric , but I feel stuck up when I can not improve on the metric. Is this normal or is it that it’s better to work on a project as a group of 2 at least so that there can be a peer review ?? Or do we look for a mentor who could do the peer review ?

  50. Avatar
    Lama July 6, 2020 at 7:40 pm #

    Hey Jason
    Many thanks for this article!
    Could please explain this phrase more?
    “ If you want to operationalize the solution, consider the functional and nonfunctional requirements you have for a solution”

  51. Avatar
    Joy July 18, 2020 at 12:04 am #

    This is the most useful article/tutorial I’ve come across as a beginner. Thank you for the invaluable lesson.

  52. Avatar
    Pravinkumar Balasaheb More September 23, 2020 at 1:32 am #

    Many thanks for the article, Jason. This really helped me a lot in writing a blog as a project assignment. Do you have any links written by you focusing on the below subhead 2,3,4,5 & 6.

    1. Problem solving
    2. Data Analysis
    3. EDA Concluding Remarks
    4. Pre-processing Pipeline
    5. Building Machine Learning Models
    6. Concluding Remarks

    Please if you could post links. However, this truly helped a lot.

    • Avatar
      Jason Brownlee September 23, 2020 at 6:41 am #

      Yes, you can use the blog search to locate them.

  53. Avatar
    sarah October 14, 2020 at 6:48 pm #

    thank you so much
    I think in this way you return more time before starting writing the code

  54. Avatar
    Samuel Nyarko October 29, 2020 at 8:33 am #

    Very helpful.

  55. Avatar
    A December 20, 2020 at 7:36 pm #

    My first time reading this and I am an ML novice, but I followed it quite well. Very clearly and succinctly put. Thank you.

  56. Avatar
    Harshit Gupta December 23, 2020 at 6:29 pm #

    Great work!!

  57. Avatar
    George February 24, 2021 at 10:29 am #

    This is a great article. It’s clear, well written. Thank you!

  58. Avatar
    Felix Lluberes February 26, 2021 at 4:56 am #

    Thanks for the information. The article helped me frame some challenges as well as opportunities as we look to implement AI in our organization.

  59. Avatar
    Abhishmita March 3, 2021 at 3:10 am #

    Thank You Jason, I am making a new data science project for my portfolio and this has helped me immensely. I had an interview some time ago and I had a similar question being asked, I have an understanding of this process, but was not able to put it forward properly. This is very helpful.

  60. Avatar
    jean-christophe chouinard March 26, 2021 at 9:28 am #

    Machine learning is not all about data and programming. It takes human to support the project too. Thanks

  61. Avatar
    jagriti May 26, 2021 at 7:52 pm #

    can we apply machine learning algorithms in supermarkets to reduce queue wait time at bill paying counter.

  62. Avatar
    Amaal June 2, 2021 at 8:41 pm #

    thanks a lot, it was very helpful for me.

  63. Avatar
    luthe August 5, 2021 at 2:35 am #

    Thank you for this article, I am really happy to read you.
    I am currently working on a project to create a sculpin that can offer services (restaurant, pharmacy, etc.) to users based on the data they have entered through the messaging system. Now, thanks to this post, I have questioned some things. First I would like to know if I should train the model with a database containing information related to each partner company, which will allow the bot to be able to offer a service or a company to the customer based on what he has entered. Or rather, I first start to enter the model on databases containing conversational data in order to bring the bot to understand and exchange with the user. Also, I would like you to propose me databases related to my problem because the ones I found do not contain enough data.
    Please help me sir.

    • Avatar
      Jason Brownlee August 5, 2021 at 5:24 am #

      This is specific to your project, I don’t know. Perhaps you can discuss the issue with project stakeholders or experts in your domain.

  64. Avatar
    LUMBOL TITYEM ZUZUL August 27, 2021 at 6:09 am #

    Beautiful. Pls continue the good work is only God that will reward you

  65. Avatar
    Anandan September 29, 2021 at 3:37 pm #

    1. Learning from experience has limitations. Some event under some environment, a result and most importantly an interpretation of the result is what we call experience. No two individuals have the same experience from the same event or same individual under different frame of mind on the same event carries different experiences. Human beings are prone to errors in judgement. Machines don’t learn by themselves, it is fundamentally guided by human beings in some way or the other. How can one accept or trust learning of machine as a Holy Grail
    2. Some of the greatest inventions have started with solutions, identifying the problems came later.

  66. Avatar
    Muhammad Waseem June 15, 2022 at 4:37 am #

    You addressed all the questions that arise in mind about why we are doing this, thanks a lot

    • Avatar
      James Carmichael June 15, 2022 at 7:15 am #

      Thank you for the great feedback Muhammad!
