
Applied Machine Learning Process

The Systematic Process for Working Through Predictive Modeling Problems That Delivers Above-Average Results

Over time, as you work on applied machine learning problems, you develop a pattern or process for getting to good, robust results quickly.

Once developed, you can use this process again and again on project after project. The more robust and developed your process, the faster you can get to reliable results.

In this post, I want to share with you the skeleton of my process for working a machine learning problem.

You can use this as a starting point or template on your next project.

5-Step Systematic Process

I like to use a 5-step process:

  1. Define the Problem
  2. Prepare Data
  3. Spot Check Algorithms
  4. Improve Results
  5. Present Results

There is a lot of flexibility in this process. For example, the “Prepare Data” step is typically broken down into analyze data (summarize and graph) and prepare data (prepare samples for experiments). The “Spot Check Algorithms” step may involve multiple formal experiments.

It’s a great big production line that I try to move through in a linear manner. The great thing in using automated tools is that you can go back a few steps (say from “Improve Results” back to “Prepare Data”) and insert a new transform of the dataset and re-run experiments in the intervening steps to see what interesting results come out and how they compare to the experiments you executed before.

Production Line
Photo by East Capital, some rights reserved

The process I use has been adapted from the standard data mining process of knowledge discovery in databases (KDD). See the post What is Data Mining and KDD for more details.

1. Define the Problem

I like to use a three-step process to define the problem. I like to move quickly, and this mini process lets me see the problem from a few different perspectives in a short time:

  • Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar problems.
  • Step 2: Why does the problem need to be solved? List your motivation for solving the problem, the benefits a solution provides and how the solution will be used.
  • Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush out domain knowledge.

You can learn more about this process in the post:

2. Prepare Data

I preface data preparation with a data analysis phase that involves summarizing the attributes and visualizing them using scatter plots and histograms. I also like to describe in detail each attribute and the relationships between attributes. This grunt work forces me to think about the data in the context of the problem before it is lost to the algorithms.
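As a rough illustration of the summarizing part of this analysis phase, the sketch below computes basic statistics and a text histogram for one numeric attribute. It uses only the Python standard library; the function names `summarize` and `histogram` and the sample values are assumptions made for this example, not something taken from the post.

```python
from collections import Counter
from statistics import mean, stdev

def summarize(values):
    """Basic descriptive statistics for one numeric attribute."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stdev": stdev(values),
    }

def histogram(values, bins=5):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant attributes
    # The maximum value would land in bin index `bins`, so clamp it.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    return [counts.get(b, 0) for b in range(bins)]

# Hypothetical attribute values, chosen only for illustration.
attribute = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

print(summarize(attribute))
for bin_index, count in enumerate(histogram(attribute)):
    print(f"bin {bin_index}: {'#' * count}")
```

A single large value (the 9 here) shows up as an isolated bar, which is the kind of observation worth writing down before any modeling starts.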

The actual data preparation process has three steps:

  • Step 1: Data Selection: Consider what data is available, what data is missing and what data can be removed.
  • Step 2: Data Preprocessing: Organize your selected data by formatting, cleaning and sampling from it.
  • Step 3: Data Transformation: Transform preprocessed data ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation.
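The three steps above can be sketched on a toy dataset. This is a minimal example under stated assumptions: the records, the field names (`signup`, `purchases`, `age`) and the helper names `select`, `preprocess` and `transform` are all hypothetical, and dropping rows with a missing value is just one of many possible cleaning choices.

```python
from datetime import date
from statistics import mean

# Hypothetical raw records; every field name here is illustrative.
raw = [
    {"id": 1, "signup": date(2014, 1, 15), "purchases": [20.0, 35.0], "age": 25},
    {"id": 2, "signup": date(2014, 6, 3), "purchases": [5.0], "age": 40},
    {"id": 3, "signup": date(2015, 2, 28), "purchases": [], "age": None},
]

def select(rows, keep=("signup", "purchases", "age")):
    """Step 1, data selection: keep only the attributes judged relevant."""
    return [{name: row[name] for name in keep} for row in rows]

def preprocess(rows):
    """Step 2, preprocessing: clean by dropping rows with a missing age."""
    return [row for row in rows if row["age"] is not None]

def transform(rows):
    """Step 3, transformation: scale, decompose and aggregate attributes."""
    ages = [row["age"] for row in rows]
    low, span = min(ages), (max(ages) - min(ages)) or 1
    return [{
        "age_scaled": (row["age"] - low) / span,   # scaling to [0, 1]
        "signup_year": row["signup"].year,         # attribute decomposition
        "signup_month": row["signup"].month,       # attribute decomposition
        "n_purchases": len(row["purchases"]),      # attribute aggregation
        "mean_purchase": mean(row["purchases"]) if row["purchases"] else 0.0,
    } for row in rows]

prepared = transform(preprocess(select(raw)))
for row in prepared:
    print(row)
```

Keeping each step as its own function makes it easy to swap in a different transform later and re-run the downstream experiments.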

You can learn more about this process for preparing data in the post:

3. Spot Check Algorithms

I use 10-fold cross-validation in my test harnesses by default. All experiments (algorithm and dataset combinations) are repeated 10 times, and the mean and standard deviation of the accuracy are collected and reported. I also use statistical significance tests to separate meaningful results from noise. Box plots are very useful for summarizing the distribution of accuracy results for each algorithm and dataset pair.
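A test harness of this kind can be sketched in a few lines of standard-library Python. The sketch reads "repeated 10 times" as ten repeats of 10-fold cross-validation with a fresh shuffle each time, which is an assumption. The helper names, the trivial majority-class "algorithm" and the toy data are all hypothetical; in practice a library routine such as scikit-learn's `cross_val_score` would normally do this work.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_class(train_X, train_y):
    """Baseline 'algorithm': always predict the most common training label."""
    label = max(set(train_y), key=train_y.count)
    return lambda row: label

def repeated_cv_accuracy(fit, X, y, k=10, repeats=10, seed=1):
    """Return one accuracy score per fold, for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for fold in k_fold_indices(len(X), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            predict = fit([X[i] for i in train], [y[i] for i in train])
            correct = sum(predict(X[i]) == y[i] for i in fold)
            scores.append(correct / len(fold))
    return scores

# Toy data for illustration: 70 examples of class 0 and 30 of class 1.
X = [[i] for i in range(100)]
y = [0] * 70 + [1] * 30

scores = repeated_cv_accuracy(majority_class, X, y)
print(f"accuracy: mean={mean(scores):.3f} stdev={stdev(scores):.3f}")
```

The list of per-fold scores is also the raw material for a box plot or a significance test, so it is worth returning all of them rather than only the mean.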

I spot check algorithms, which means loading up a bunch of standard machine learning algorithms into my test harness and performing a formal experiment. I typically run 10-20 standard algorithms from all the major algorithm families across all the transformed and scaled versions of the dataset I have prepared.

The goal of spot checking is to flush out the types of algorithms and dataset combinations that are good at picking out the structure of the problem so that they can be studied in more detail with focused experiments.
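To make the idea concrete, here is a small sketch that spot checks three very simple algorithms across two views of a dataset (raw and min-max scaled). Everything in it is an assumption for illustration: the algorithms are hand-written stand-ins rather than a standard library suite, the data is synthetic, and the scaling is done once on the whole dataset for brevity, whereas a stricter harness would fit the scaling inside each fold.

```python
import random

def majority_class(X, y):
    """Baseline: always predict the most common training label."""
    label = max(sorted(set(y)), key=y.count)
    return lambda row: label

def one_nearest_neighbour(X, y):
    """Predict the label of the closest training row (squared Euclidean)."""
    def predict(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, other)) for other in X]
        return y[dists.index(min(dists))]
    return predict

def nearest_centroid(X, y):
    """Predict the class whose mean vector is closest to the row."""
    centroids = {}
    for label in sorted(set(y)):
        members = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*members)]
    def predict(row):
        return min(centroids, key=lambda lab: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[lab])))
    return predict

def min_max_scale(X):
    """Dataset view: rescale every column to [0, 1]."""
    cols = list(zip(*X))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in X]

def cv_accuracy(fit, X, y, k=10, seed=1):
    """Mean accuracy over k cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(predict(X[i]) == y[i] for i in fold) / len(fold))
    return sum(scores) / len(scores)

# Synthetic data: one informative column and one large-scale noise column.
rng = random.Random(0)
X = [[rng.gauss(4.0 * label, 1.0), rng.gauss(0.0, 1000.0)]
     for label in (0, 1) for _ in range(100)]
y = [label for label in (0, 1) for _ in range(100)]

views = {"raw": X, "scaled": min_max_scale(X)}
algorithms = [majority_class, one_nearest_neighbour, nearest_centroid]

# One experiment per (dataset view, algorithm) combination.
results = {(view_name, fit.__name__): cv_accuracy(fit, view, y)
           for view_name, view in views.items() for fit in algorithms}

for (view_name, algo_name), acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{view_name:>6}  {algo_name:<22} {acc:.3f}")
```

The sorted printout acts as a simple leaderboard: the combinations at the top are the candidates for more focused experiments.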

More focused experiments with well-performing families of algorithms may be performed in this step, but algorithm tuning is left for the next step.

You can discover more about defining your test harness in the post:

You can discover the importance of spot checking algorithms in the post:

4. Improve Results

After spot checking, it’s time to squeeze out the best result from the rig. I do this by running an automated sensitivity analysis on the parameters of the top performing algorithms. I also design and run experiments using standard ensemble methods of the top performing algorithms. I put a lot of time into thinking about how to get more out of the dataset or of the family of algorithms that have been shown to perform well.

Again, statistical significance of results is critical here. It is so easy to focus on the methods and play with algorithm configurations. The results are only meaningful if they are significant, all configurations are thought out in advance, and the experiments are executed in batch. I also like to maintain my own personal leaderboard of top results on a problem.
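The post does not say which significance test is used, so the following is only one possible choice: a paired sign-flip permutation test on the per-fold accuracy differences between two configurations. The function name and the score lists are hypothetical. Note that cross-validation folds are not fully independent, so the p-value should be read as a rough guide rather than an exact figure.

```python
import random
from statistics import mean

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=1):
    """Two-sided sign-flip test on paired score differences.

    Returns an estimated p-value for the null hypothesis that the two
    configurations have the same mean score.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each difference is equally likely
        # to have either sign, so flip signs at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical per-fold accuracies for two tuned configurations.
config_a = [0.90, 0.92, 0.91, 0.93, 0.89, 0.94, 0.90, 0.92, 0.91, 0.93]
config_b = [0.80, 0.83, 0.82, 0.85, 0.78, 0.84, 0.81, 0.82, 0.80, 0.84]

p_value = paired_permutation_test(config_a, config_b)
print(f"p-value: {p_value:.4f}")
```

Both score lists must come from the same folds, in the same order, for the pairing to be valid.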

In summary, the process of improving results involves:

  • Algorithm Tuning: where discovering the best models is treated like a search problem through model parameter space.
  • Ensemble Methods: where the predictions made by multiple models are combined.
  • Extreme Feature Engineering: where the attribute decomposition and aggregation seen in data preparation is pushed to the limits.
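The first two items can be sketched together: a grid search that treats tuning as a search over parameter combinations, and a majority-vote ensemble that combines several tuned models. This is a minimal sketch under stated assumptions. The k-nearest-neighbours model, the parameter names `k` and `power`, the toy data and the single validation split are all choices made for this example; a fuller harness would score each combination with cross-validation, and libraries such as scikit-learn offer `GridSearchCV` and `VotingClassifier` for the same purpose.

```python
from collections import Counter
from itertools import product

def knn(X, y, k, power):
    """k-nearest neighbours using sum(|a - b| ** power) as the distance."""
    def predict(row):
        def dist(other):
            return sum(abs(a - b) ** power for a, b in zip(row, other))
        nearest = sorted(range(len(X)), key=lambda i: dist(X[i]))[:k]
        return Counter(y[i] for i in nearest).most_common(1)[0][0]
    return predict

def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == target for row, target in zip(X, y)) / len(y)

def grid_search(X_train, y_train, X_val, y_val, grid):
    """Try every parameter combination; return (best_score, best_params)."""
    best = None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = accuracy(knn(X_train, y_train, **params), X_val, y_val)
        if best is None or score > best[0]:
            best = (score, params)
    return best

def vote(models):
    """Ensemble: predict the label chosen by most of the given models."""
    def predict(row):
        return Counter(model(row) for model in models).most_common(1)[0][0]
    return predict

# Toy one-dimensional data, for illustration only.
X_train = [[0], [1], [2], [3], [10], [11], [12], [13]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_val = [[1.5], [2.5], [10.5], [12.5]]
y_val = [0, 0, 1, 1]

best_score, best_params = grid_search(
    X_train, y_train, X_val, y_val, grid={"k": [1, 3, 5], "power": [1, 2]})
print(best_score, best_params)

ensemble = vote([
    knn(X_train, y_train, k=1, power=1),
    knn(X_train, y_train, k=3, power=2),
    knn(X_train, y_train, k=5, power=2),
])
print(accuracy(ensemble, X_val, y_val))
```

An odd number of voters avoids ties in a two-class problem like this one.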

You can discover more about this process in the post:

5. Present Results

The results of a complex machine learning problem are meaningless unless they are put to work. This typically means a presentation to stakeholders. Even if it is a competition or a problem I am working on for myself, I still go through the process of presenting the results. It’s a good practice and gives me clear learnings I can build upon next time.

The template I use to present results is below and may take the form of a text document, formal report or presentation slides.

  • Context (Why): Define the environment in which the problem exists and set up the motivation for the research question.
  • Problem (Question): Concisely describe the problem as a question that you went out and answered.
  • Solution (Answer): Concisely describe the solution as an answer to the question you posed in the previous section. Be specific.
  • Findings: Bulleted lists of discoveries you made along the way that interest the audience. They may be discoveries in the data, methods that did or did not work, or the model performance benefits you achieved along your journey.
  • Limitations: Consider where the model does not work or questions that the model does not answer. Do not shy away from these questions; your account of where the model excels is more trustworthy if you can also define where it does not excel.
  • Conclusions (Why+Question+Answer): Revisit the “why”, research question and the answer you discovered in a tight little package that is easy to remember and repeat for yourself and others.

You can discover more about using the results of a machine learning project in the post:

Summary

In this post, you have learned my general template for processing a machine learning problem.

I use this process almost without fail and I use it across platforms, from Weka and R to scikit-learn, and even on new platforms I have been playing around with, like pylearn2.

What is your process? Leave a comment and share.

Will you copy this process, and if so, what changes will you make to it?

115 Responses to Applied Machine Learning Process

  1. Avatar
    sadegh May 7, 2015 at 4:31 pm #

    thanx

    • Avatar
      Vipin July 29, 2020 at 8:56 pm #

      Hi Jason! I graduated college almost 2 years ago, and after eliminating lots of choices about my career path, I want to learn machine learning and AI in the long term to work on some of my dream projects.

      So I just have a doubt. May be I miss this somewhere in you articles I don’t know. May be it sound silly, but if you can answer this, it will be helpful.

      I understand your 5 – Step systematic process, But Can you give me a little hint , After spot check all the compatible algorithm and it complementary process in Weka tool , How are you going to to extract your model from tools like Weka and how are you going to implement them. For example take most popular iris dataset.

  2. Avatar
    json July 15, 2015 at 5:10 am #

    very helpful post

    • Avatar
      Paul June 30, 2019 at 11:03 am #

      Thank you very much for a great article, has made me question some of the processes I have implemented around this. Excellent level and language.

  3. Avatar
    Raihan Masud September 22, 2015 at 5:52 am #

    Well thought process. How about adding visualization after/during data section/pre-processing to view distribution of the data? You might be doing it implicitly as part of the Data Prep. step. Thanks for sharing your process. Very useful.

  4. Avatar
    Robert Chumley April 2, 2016 at 12:45 am #

    I like to take an Agile approach to Machine Learning where we apply and look for the highest priority outcomes first. The data can have a ton of value and looking at it from the stakeholders perspective and what they want to accomplish from the data is important. Then, based on the list of outcomes the stakeholders are looking for, work backward to find the individual value based on the results. I also add an additional step of formalization where the results are put into code and placed into reusable modules. This way, we can always reuse the result for later applications.

    • Avatar
      Utku December 21, 2018 at 9:40 am #

      This article does not consider a full business-oriented view. It intends to give an idea how to break down and analyze a problem for studies of algorithms like machine learning.

      From the industry point of view, I fully agree with Jason. In business language, this flow is called “waterfall” (process-based). The current trend in the industry is the agile approach.

      • Avatar
        Danilo Burbano January 24, 2019 at 9:51 pm #

        Interesting, agile approach can also be used to ML projects, can you share some references please

  5. Avatar
    ali July 1, 2016 at 9:37 pm #

    hi
    which NN has a better outcome for spam detection ?
    can i try that on Weka or not?
    i am looking for dataset with content and not content attributes
    i mean that has content features and none content features somethings this features length of email time and date of send email IP and some like that.

    • Avatar
      Jason Brownlee July 2, 2016 at 6:19 am #

      My advice would be to try a suite of different algorithms to see what works best on your problem.

      • Avatar
        Ali March 2, 2019 at 1:54 am #

        Hello,
        I am a beginner in deep learning and I have a grayscale MRI image with a maximum size of 256*256.
        What is the simplest language to use for segmentation with deep learning?
        How can I use deep learning to segment MRI images (which architecture fits best?
        how many neurons for each layer?
        what type of activation function for each layer?)

        • Avatar
          Jason Brownlee March 2, 2019 at 9:34 am #

          Perhaps look into CNNs.

          Specifically, look into methods like R-CNN and YOLO.

          I hope to have tutorials on these topics soon.

  6. Avatar
    Murali December 21, 2016 at 11:15 pm #

    I am very much thankful to your guidance

  7. Avatar
    Jay January 3, 2017 at 10:40 pm #

    Hi Jason, we prepare configurations for applying on Switches are routers. So we know all the parameters and we know the intended outcome. Can we look at automating the process of preparing the configuration using ML? Can you provide any pointers here, please?

    • Avatar
      Jason Brownlee January 4, 2017 at 8:54 am #

      I’m not sure Jay, it almost sounds like a constraint optimization problem rather than a predictive modeling problem.

      Try this process to define your problem and let me know how you go:
      https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

    • Avatar
      Dante Perez January 27, 2017 at 6:59 am #

      @Jay, i work on wireless packet core engineering but I do understand what you’re intending to do. Automating scripts won’t be applicable to ML since parameters are all predefined values for switches/routers to perform. But if you’re trying to predict why switch is overloading and all metrics are green then as what Jason mentioned you can look at several attributes of the switch processors like CPU, port utilization, counters etc and other variables that contributes to switch/router spiking up then yes you can use ML to create a predictive modelling.

  8. Avatar
    Preeti Agarwal January 10, 2017 at 5:34 pm #

    Great I am going to try the above steps

  9. Avatar
    Ted January 12, 2017 at 9:47 am #

    Awesome!!

  10. Avatar
    Mixymol March 13, 2017 at 3:50 pm #

    Sir, Thank you. I started.

  11. Avatar
    Giri March 16, 2017 at 6:04 pm #

    Jason,
    What would be your recommendations (blog posts, books, courses etc) for learning/mastering feature engineering? Seems to me that it is as important as selecting proper algorithm.

  12. Avatar
    Karunakaran April 6, 2017 at 6:39 pm #

    Perfect outline

  13. Avatar
    prasanna May 19, 2017 at 7:22 pm #

    very helpful.

  14. Avatar
    Winayak Wagle May 31, 2017 at 8:22 pm #

    Very useful in getting an overview and proceeding to next steps.

  15. Avatar
    harouna June 10, 2017 at 12:13 am #

    Thanks you a lot of ! very helpful !

    i find your blogs very interesting

  16. Avatar
    Lautaro June 23, 2017 at 12:15 pm #

    I love this types of post, it makes my head Clear about this topic.

    Thanks!

  17. Avatar
    ARUNESH GUPTA June 24, 2017 at 1:12 am #

    can u tell me how to start with machine learning from scratch. like which course should i take or tell me some sources

  18. Avatar
    Raj Kumar Thapa August 29, 2017 at 10:53 am #

    first thanks lot,i am doing master in cs so after more than 14 years back to academic. i was working in database in SQL .i want to use machine learning to predict some thing in feature . what should i do my (task model ) which could guide me to get result effectively and quickly

  19. Avatar
    Bill Ern September 7, 2017 at 2:14 pm #

    Jason,

    Great post on giving a newbie to machine learning a place to start and work through a problem. I will use your outline and make modifications as needed. Hopefully after working some problems I can post any modifications made.

  20. Avatar
    Anu September 9, 2017 at 6:53 pm #

    I went through multiple sites and courses, the way you explain is incomparable. I was about to give up, now i will never give up

  21. Avatar
    Shiloh November 9, 2017 at 10:24 am #

    Great post. Very well thought out and informative. It’s almost like you might be a person who thinks logically 🙂

  22. Avatar
    Connie November 27, 2017 at 4:11 am #

    Thank you very much for sharing & guidance Dr. Brownlee.

    This is like fitting the last piece of puzzle to the complex puzzle for me.

  23. Avatar
    Omotayo Oshiga December 21, 2017 at 4:24 pm #

    Great post Dr. Brownlee,

    Please, does your Machine Learning Mastery With Python book follow this process and explain them in details?

  24. Avatar
    SarahM January 6, 2018 at 4:13 am #

    Thank you for this guiding process. Can you explain more the 3rd step in defining the problem? Because as I know solving it depends on the exploration of the data, so how could I know how to solve the problem before and answer the question: Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush domain knowledge

    • Avatar
      Jason Brownlee January 6, 2018 at 5:56 am #

      It is a question to help developers think about the problem and how they might code a non-ML solution to it.

      Does that help?

  25. Avatar
    Mohammad Ehtasham Billah January 31, 2018 at 9:13 pm #

    Hi,
    can you explain attribute decomposition and attribute aggregation?

  26. Avatar
    Mohammad Ehtasham Billah January 31, 2018 at 10:13 pm #

    When representing result, do we need to go through the all 6 points that you mentioned for all of the algorithms we used for solving the problem?Or just the final algorithm that performed best with that specific problem?

    • Avatar
      Jason Brownlee February 1, 2018 at 7:22 am #

      You can choose how to work through the process for your project.

  27. Avatar
    Jesús Martínez February 28, 2018 at 12:59 am #

    Really nice process. Thank you very much.

    I’d like to know how much time do you devote to each phase? And how many times you go through all the five phases, on average?

    Thanks a lot for your time and attention.

    • Avatar
      Jason Brownlee February 28, 2018 at 6:05 am #

      As much as I have.

      Some projects are fast, just a few hours, some are days or weeks.

  28. Avatar
    Sachidanand Tripathi April 2, 2018 at 4:02 pm #

    Thanks Jason, it does give me some insight as to where to start plus the cross validation technique illustrated is awesome, I am putting it at work.

  29. Avatar
    phil April 24, 2018 at 3:32 am #

    how do u build an attrition model for credit cards . any idea or reading material.what are the variables of interest?

  30. Avatar
    phil April 24, 2018 at 3:33 am #

    any special algorithms to use?

    • Avatar
      Jason Brownlee April 24, 2018 at 6:37 am #

      Yes, random forest and stochastic gradient boosting seem to do very well on lots of problems.

  31. Avatar
    Emmanuel June 13, 2018 at 11:12 am #

    Great work, and awesome information. Thanks Jason. Your page has been a guide a for me.

  32. Avatar
    Rick July 7, 2018 at 5:42 pm #

    Great Article,I think the five steps should be a circle when you find the best result

  33. Avatar
    ragav July 10, 2018 at 8:17 pm #

    Being a researcher too and understand from a data practitioner perspective , providing solutions and methods is really awesome. I am also getting confidence by looking your blogs Jason

  34. Avatar
    Marco September 17, 2018 at 10:03 pm #

    Did you describe or applied your 5-step process for ML in one of your books in detail?

    • Avatar
      Jason Brownlee September 18, 2018 at 6:14 am #

      I show how to use the process in each of my “Machine Learning Mastery with …” books, e.g. in R, Python and Weka.

  35. Avatar
    Bisoi November 21, 2018 at 9:16 pm #

    my process is lockbox back-end operation service. can i get advise.

  36. Avatar
    kooshi November 22, 2018 at 5:10 am #

    Dear jason

    i want to do a machine learning project but i have big problem that is every subject that i choose , someone worked on it for example diabetes predict and heart attack

    and main question is that how can i understand that data set need for predict?(the predicted attribute)

    for example: https://archive.ics.uci.edu/ml/datasets/Iris or https://archive.ics.uci.edu/ml/datasets/Primary+Tumor

    what they want from us?

  37. Avatar
    Rabiu December 12, 2018 at 7:46 pm #

    Hi do you have any matlab code for prediction using deep learning.
    Thanks for the help.

  38. Avatar
    madusha January 8, 2019 at 7:27 pm #

    thank you sir!

  39. Avatar
    Ahmed January 18, 2019 at 3:25 am #

    Thanks a lot, your articles are really helpful, I always go around learning more and more, and again come here accidentally, and stay for days just reading your articles and learning from your long experiences .

  40. Avatar
    Nicko February 22, 2019 at 7:40 pm #

    Thank you for a comprehensive post and related ones! I am quite new in ML and this framework is exactly what I need now. I will share some ideas when I come up with them.

  41. Avatar
    Mohamed March 21, 2019 at 8:33 pm #

    Thanks alot for this useful information, I just have small question, What are the basics or theory part like when you say classifier or modelling or supervied vs non supervised, it seems there is a steps i am missing or a flow i am not aware of . from where can i have these info ?

  42. Avatar
    Ngel Rojas April 3, 2019 at 1:05 pm #

    great job.!

  43. Avatar
    sandipan sarkar June 30, 2019 at 5:30 am #

    I think jason as a beginner if I follow the website whole heartedly I will be the master in machine learning

  44. Avatar
    Ethan Day August 22, 2019 at 10:52 am #

    What software do you use to code on? I apologize if it is an obvious answer, but I am a young student, aspiring to become an engineer. I love solving problems and I wanted an early start on Machine Learning.

  45. Avatar
    franklin September 18, 2019 at 4:33 am #

    thanks.
    in a situation like electricity consumption data. how is the data captured? is it through sensors or other means?
    thank you

  46. Avatar
    Yoni Krichevsky December 12, 2019 at 9:19 pm #

    Hi Jason,

    Awesome website and resources! Have been on the developer to machine learning journey for the last half a year, too bad didn’t find the website earlier.

    Question about the Spot Check Algorithms step and the connection to Feature Engineering.

    In preparing the data, and feature engineering step, we might have some doubts about features: whether a feature is helpful, whether to leave the feature continuous or to bin it, whether to have 2 highly correlated features, or just one of them, and which one etc. etc.

    We might at this stage select an initial way forward (like keeping all features and pruning them later), with the goal of checking these assumptions later. Then according to the process you described, we would do the Spot Checking. This would leave us with just a few of the best algorithms.

    However, it’s possible that when we later do some changes in the features (drop / add / change format etc.), an algorithm that we dropped at the Spot Checking stage could have performed better than the alternatives we are left with.

    One way to do it is not to do the Spot Checking, and work with all algorithms. However, then the process is computationally and time expensive.

    How do you handle this issue?

    Thank you,
    Yoni

    • Avatar
      Yoni Krichevsky December 13, 2019 at 1:32 am #

      Also, how do you deal with the issue of different algorithms needing different features / transformation to perform the best? I sometimes find 2 different feature sets for different models which increases complexity.

      • Avatar
        Jason Brownlee December 13, 2019 at 6:04 am #

        A “model” is the data prep + algorithm + config.

    • Avatar
      Jason Brownlee December 13, 2019 at 6:00 am #

      Thanks!

      Yes, the process can be iterative as you prepare different views of your data.

      It can be made simpler by focusing on a subset of views/data prep methods, and a subset of models in order to see what works generally well, then use that as a starting point for a more detailed exploration.

      Does that help?

      • Avatar
        Yoni Krichevsky December 13, 2019 at 6:34 pm #

        Hi Jason,
        Not sure I fully follow. I have done numerous models, am usually very methodological about it, and reach great results.
        However,
        1. It takes quite some calendar time to reach good results (a week or a few weeks): find a very good feature engineering view, algorithms, settings per algorithm and ensembling.
        2. The model is sometimes too complex due to different algorithms requiring different views.
        3. The model takes a somewhat long time to train due to having a few different algorithms and views etc.
        4. I often times find myself taking the code that I have, and changing something on a very initial step, which requires to redo many steps in the middle. I do it not on a whim, but as part of methodological process and decision.

        I was wondering how I can streamline the process, make it simpler and faster (both runtime and overall time it takes until I’m happy with the results).

        The process you described above seems like something that can help me.

        However, although you write in other places that the process is usually iterative, and you many times start from the beginning, specifically in this post it seems that the process is one step after another one which would indeed speed it up.

        However, because of a few questions I raised, I’m not sure how one can prune out completely models in spot-cleaning step, unless they perform totally awfully. If they are close to other models, if we are not sure that we won’t change features, how can one remove algorithms?

        In general, do you have more tips and tricks how to make the process faster / not to have to start over too many times / work with less algorithms?

        Thank you!

        • Avatar
          Jason Brownlee December 14, 2019 at 6:12 am #

          Hahah, yes I experience the same steps – this is not uncommon.

          I have gone down the road of automating large parts of the process many times, and it is always a waste of time given the specifics that change with each new dataset. Applied predictive modeling is a hard problem and will remain so. Like software engineering. We keep trying to automate way the work.

          Yes, this process is a one-shot to get you a “good enough” result quickly, which is what most people need. Rarely do we need a really great result – unless it’s kaggle. You lose a lot with the one-step approach, as you point out.

          Re automation, I might have some code around that can help, e.g.:
          https://machinelearningmastery.com/spot-check-machine-learning-algorithms-in-python/

          I have a wonderful codebase I have developed that I might open source one day. CSV in, summary of data prep + model + config that gives good/best results as output. I basically built a ML SaaS for myself for regression, classification, time series, imbalanced classification, etc. 🙂

  47. Avatar
    Bhaskaran February 19, 2020 at 4:47 pm #

    Can you give me title of books that you have published and how to procure them (Amazon or any other way) Thanks for all your knowledge sharing.

  48. Avatar
    Lamone March 21, 2020 at 6:39 am #

    Is there a section on your website where you write about combining applied machine learning to applications? Like exactly how do you built a model, then implement it into software. I’m focused on making smarter software for the consumer. Just need some general guidance.

    Thank You

  49. Avatar
    George Mills July 13, 2020 at 10:47 pm #

    You really seem to know your stuff inside and out. I am rather new to this whole field, although I would like to think my general programming knowledge is rather deep, been doing it since ’94. I am trying to take, for starters, the World Factbook as a data source and determine the optimum point on Maslow’s hierarchy of needs which the world can hope to achieve. Any thoughts on the nature of the problem and methods and techniques I can use would be so greatly appreciated.

    Thank you,
    George

  50. Avatar
    Michael Mora-Poveda October 24, 2020 at 4:00 am #

    Hello Jason,

    Thanks for this post, in extremely useful. I saved your recommendations in my github for guides futures!!!

    Cheers,

    Michael

  51. Avatar
    JC Chouinard February 24, 2021 at 10:47 am #

    Well, I don’t have a process. But making SEO experiments is similar in some ways. Repeating a test multiple times, on different websites, or different parts of a website, discarding test that have no statistical significance. Summarising multiple experiment with box-plot is something that I never thought of, but I will start implementing. Thanks Jason.

  52. Avatar
    hamidreza mazandarani April 29, 2021 at 11:17 pm #

    hello and thanks
    i install python from python.org
    but i dont install component library numpy or comcv
    >>pip install numpy
    traceback (most recent call last):
    file””,lin1,in
    name error:name ‘pip’ is not defined

    thank you

    • Avatar
      Jason Brownlee April 30, 2021 at 6:06 am #

      The “pip” command is run from the command prompt, not the python interpreter.
