How to Define Your Machine Learning Problem

The first step in any project is defining your problem. You can use the most powerful and shiniest algorithms available, but the results will be meaningless if you are solving the wrong problem.

In this post you will learn the process for thinking deeply about your problem before you get started. This is unarguably the most important aspect of applying machine learning.

What is the problem?
Photo attributed to Eleaf, some rights reserved

Problem Definition Framework

I use a simple framework when defining a new problem to address with machine learning. The framework helps me to quickly understand the elements and motivation for the problem and whether machine learning is suitable or not.

The framework involves answering three questions to varying degrees of thoroughness:

  • Step 1: What is the problem?
  • Step 2: Why does the problem need to be solved?
  • Step 3: How would I solve the problem?

Step 1: What is the problem?

The first step is defining the problem. I use a number of tactics to collect this information.

Informal description

Describe the problem as though you were describing it to a friend or colleague. This can provide a great starting point for highlighting gaps that you might need to fill. It also provides the basis for a one-sentence description you can use to share your understanding of the problem.

For example: I need a program that will tell me which tweets will get retweets.

Formalism

In a previous blog post defining machine learning you learned about Tom Mitchell’s machine learning formalism. Here it is again to refresh your memory.

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Use this formalism to define the T, P, and E for your problem.

For example:

  • Task (T): Classify a tweet that has not been published as going to get retweets or not.
  • Experience (E): A corpus of tweets for an account where some have retweets and some do not.
  • Performance (P): Classification accuracy, the number of tweets predicted correctly out of all tweets considered as a percentage.
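
The T, P, and E above can be written down concretely. What follows is a minimal sketch in Python, not a definitive implementation. The `ProblemDefinition` class and the `classification_accuracy` function are hypothetical names introduced here for illustration, and the labels at the bottom are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemDefinition:
    """Holds the T, P, and E of the formalism (hypothetical helper)."""
    task: str
    experience: str
    performance: str


def classification_accuracy(predicted, actual):
    """Return the percentage of predictions that match the actual labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one tweet is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


retweet_problem = ProblemDefinition(
    task="Classify an unpublished tweet as going to get retweets or not.",
    experience="A corpus of tweets for an account, some with retweets and some without.",
    performance="Classification accuracy as a percentage.",
)

# Made-up labels for illustration only: True means "gets retweets".
predicted = [True, False, True, True]
actual = [True, False, False, True]
print(classification_accuracy(predicted, actual))  # 3 of 4 predictions are correct
```

Writing the performance measure as a function early on forces you to decide exactly what counts as a correct prediction before any model exists.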

Assumptions

Create a list of assumptions about the problem and its phrasing. These may be rules of thumb and domain-specific information that you think will get you to a viable solution faster.

It can be useful to highlight questions that can be tested against real data because breakthroughs and innovation occur when assumptions and best practice are demonstrated to be wrong in the face of real data. It can also be useful to highlight areas of the problem specification that may need to be challenged, relaxed or tightened.

For example:

  • The specific words used in the tweet matter to the model.
  • The specific user that retweets does not matter to the model.
  • The number of retweets may matter to the model.
  • Older tweets are less predictive than more recent tweets.
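
An assumption such as "the specific words used in the tweet matter" can be phrased as a question for the data. The sketch below is one possible way to do that, assuming tweets are available as (text, was_retweeted) pairs. The function names and the toy corpus are invented for illustration and do not come from real data.

```python
def retweet_rate(tweets):
    """Fraction of (text, was_retweeted) pairs that were retweeted."""
    if not tweets:
        return 0.0
    return sum(1 for _, was_retweeted in tweets if was_retweeted) / len(tweets)


def compare_by_word(tweets, word):
    """Return retweet rates for tweets with and without the given word."""
    word = word.lower()
    with_word = [t for t in tweets if word in t[0].lower().split()]
    without_word = [t for t in tweets if word not in t[0].lower().split()]
    return retweet_rate(with_word), retweet_rate(without_word)


# Toy corpus invented for illustration; real tweets would come from your account.
toy_tweets = [
    ("new tutorial on decision trees", True),
    ("tutorial on data preparation", True),
    ("coffee time", False),
    ("another tutorial is live", False),
    ("good morning everyone", False),
]

rate_with, rate_without = compare_by_word(toy_tweets, "tutorial")
print(rate_with, rate_without)
```

On a real corpus, a large and stable gap between the two rates would support the assumption, while a negligible gap would be a reason to challenge it. Five tweets are far too few to conclude anything.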

Question Everything!
Photo attributed to dullhunk, some rights reserved

Similar problems

What other problems have you seen, or can you think of, that are like the problem you are trying to solve? Other problems can inform the problem you are trying to solve by highlighting limitations in your phrasing of the problem, such as time dimensions and concept drift (where the concept being modeled changes over time). Other problems can also point to algorithms and data transformations that could be adopted to spot check performance.

For example: A related problem would be email spam detection, which uses text messages as input data and requires a binary classification decision.

Step 2: Why does the problem need to be solved?

The second step is to think deeply about why you want or need the problem solved.

Motivation

Consider your motivation for solving the problem. What need will be fulfilled when the problem is solved?

For example, you may be solving the problem as a learning exercise. This is useful to clarify as you can decide that you don’t want to use the most suitable method to solve the problem, but instead you want to explore methods that you are not familiar with in order to learn new skills.

Alternatively, you may need to solve the problem as part of a duty at work, ultimately to keep your job.

Solution Benefits

Consider the benefits of having the problem solved. What capabilities does it enable?

It is important to be clear on the benefits of the problem being solved to ensure that you capitalize on them. These benefits can be used to sell the project to colleagues and management to get buy-in and additional time or budget.

If it benefits you personally, then be clear on what those benefits are and how you will know when you have got them. For example, if it’s a tool or utility, then what will you be able to do with that utility that you can’t do now and why is that meaningful to you?

Solution Use

Consider how the solution to the problem will be used and what type of lifetime you expect the solution to have. As programmers we often think the work is done as soon as the program is written, but really the project is just beginning its maintenance lifetime.

The way the solution will be used will influence the nature and requirements of the solution you adopt.

Consider whether you are looking to write a report to present results or you want to operationalize the solution. If you want to operationalize the solution, consider the functional and nonfunctional requirements you have for a solution, just like a software project.

Step 3: How would I solve the problem?

In this third and final step of the problem definition, explore how you would solve the problem manually.

List out step-by-step what data you would collect, how you would prepare it and how you would design a program to solve the problem. This may include prototypes and experiments you would need to perform. These are a gold mine because they highlight questions and uncertainties you have about the domain that could be explored.

This is a powerful tool. It can highlight problems that actually can be solved satisfactorily using a manually implemented solution. It also flushes out important domain knowledge that has been trapped up until now like where the data is actually stored, what types of features would be useful and many other details.
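
For the retweet example, a manually specified solution might be a handful of hand-written rules. The sketch below is hypothetical: the `predict_retweet` function and its three rules are illustrative guesses, not findings from real data. Writing them down is useful mainly because each rule is an assumption you can later test.

```python
def predict_retweet(text):
    """Hand-written rule: guess whether a tweet will be retweeted.

    The rules below are illustrative guesses, not findings from real data.
    """
    words = text.lower().split()
    has_link = any(w.startswith("http://") or w.startswith("https://") for w in words)
    has_hashtag = any(w.startswith("#") and len(w) > 1 for w in words)
    is_short = len(text) <= 100
    # Predict a retweet when at least two of the three signals are present.
    return sum([has_link, has_hashtag, is_short]) >= 2


print(predict_retweet("New post on #machinelearning https://example.com/post"))
print(predict_retweet("good morning"))
```

A rule set like this also gives you a baseline: if a learned model cannot beat it on your performance measure, the manual solution may be good enough.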

Collect all of these details as they occur to you and update the previous sections of the problem definition, especially the assumptions and rules of thumb.

We have considered a manually specified solution before when describing complex problems in why machine learning matters.

Summary

In this post you learned the value of being clear on the problem you are solving. You discovered a three-step framework for defining your problem with practical tactics at each step:

  • Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar problems.
  • Step 2: Why does the problem need to be solved? List your motivation for solving the problem, the benefits a solution provides and how the solution will be used.
  • Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush out domain knowledge.

How do you define your problem for machine learning? Have you used any of the above tactics and if so, what were your experiences? Leave a comment.

17 Responses to How to Define Your Machine Learning Problem

  1. Gusseppe May 2, 2016 at 6:12 pm #

    This is so useful!

  2. Joe Arakkal July 10, 2016 at 2:12 pm #

    Very well thought out way of the very first step – purpose. Aka problem definition.
    Thanks Jason. Always a joy to read your blog. Keep it coming so you can continue to educate the world. Many thanks to you !!

  3. Kola July 31, 2016 at 9:28 am #

    Thank you so much for this post. It has given me a direction on how to kick-start my new task. Is there another post that extends the steps discussed here?

    • Jason Brownlee August 1, 2016 at 6:28 am #

      Glad to hear it Kola. This might be the only place where I discuss these ideas.

  4. Federica September 19, 2016 at 6:35 pm #

    This post is very useful, thank you. It helps a lot in keeping the focus on the essential. I am currently writing my first paper and I do really find this very very helpful!

  5. Vicky October 2, 2016 at 4:23 pm #

    I am still trying to understand mine 🙁 (Detection of malicious Office documents using machine learning algorithm) please help

  6. Prakriti October 25, 2016 at 6:54 am #

    Under Step 3, do you mean decide what algorithm are we going to use?

    • Jason Brownlee October 25, 2016 at 8:35 am #

      Not quite Prakriti.

      It suggests thinking of the problem as a programming exercise and thinking about what you might have to do to solve it, what data you would need, what structures, etc.

      This can help to force you to think deeply about the problem upfront and think about what other data or data transforms and maybe even what techniques you might need to use later on.

      See the full process here:
      http://machinelearningmastery.com/start-here/#process

  7. Andrey Naumov October 27, 2016 at 1:32 am #

    Jason, thank you for these thoughts.
    Great way to start and understand what I want/can.

  8. Ivan November 8, 2016 at 7:28 am #

    The questions to steps 1 & 2 were very helpful for writing the introduction to my thesis!

    • Jason Brownlee November 8, 2016 at 10:00 am #

      Glad to hear it Ivan. Nice work! Best of luck with your thesis (been there…).

  9. yasH February 9, 2017 at 1:01 am #

    Chronic Kidney Disease Detection..which Kind of data sets needed

    • Jason Brownlee February 9, 2017 at 7:26 am #

      I don’t know, perhaps you should contact a domain expert?
