Applied Machine Learning Process

The Systematic Process For Working Through Predictive Modeling Problems That Delivers Above-Average Results

Over time, as you work on applied machine learning problems, you develop a pattern or process for quickly getting to good, robust results.

Once developed, you can use this process again and again on project after project. The more robust and developed your process, the faster you can get to reliable results.

In this post, I want to share with you the skeleton of my process for working a machine learning problem.

You can use this as a starting point or template on your next project.

5-Step Systematic Process

I like to use a 5-step process:

  1. Define the Problem
  2. Prepare Data
  3. Spot Check Algorithms
  4. Improve Results
  5. Present Results

There is a lot of flexibility in this process. For example, the “prepare data” step is typically broken down into analyze data (summarize and graph) and prepare data (prepare samples for experiments). The “Spot Checks” step may involve multiple formal experiments.

It’s a great big production line that I try to move through in a linear manner. The great thing about using automated tools is that you can go back a few steps (say, from “Improve Results” back to “Prepare Data”), insert a new transform of the dataset, and re-run the experiments in the intervening steps to see what interesting results come out and how they compare to the experiments you executed before.

Production line. Photo by East Capital, some rights reserved.

The process I use has been adapted from the standard data mining process of knowledge discovery in databases (KDD). See the post What is Data Mining and KDD for more details.

1. Define the Problem

I like to use a three-step process to define the problem. I like to move fast, and this mini-process lets me see the problem from a few different perspectives very quickly:

  • Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar problems.
  • Step 2: Why does the problem need to be solved? List your motivation for solving the problem, the benefits a solution provides and how the solution will be used.
  • Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush out domain knowledge.

You can learn more about this process in the post How to Define Your Machine Learning Problem: http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

2. Prepare Data

I preface data preparation with a data analysis phase that involves summarizing the attributes and visualizing them using scatter plots and histograms. I also like to describe in detail each attribute and the relationships between attributes. This grunt work forces me to think about the data in the context of the problem before it is lost to the algorithms.
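
A minimal sketch of this analysis phase in Python, assuming a pandas DataFrame loaded from a hypothetical data.csv with all-numeric attributes, might look something like:

    # Summarize each attribute and visualize distributions and relationships
    import pandas as pd
    from pandas.plotting import scatter_matrix
    import matplotlib.pyplot as plt

    # 'data.csv' is a hypothetical file standing in for your own dataset
    df = pd.read_csv('data.csv')

    print(df.describe())  # per-attribute summary statistics
    print(df.corr())      # pairwise correlations between attributes

    df.hist()             # one histogram per attribute
    scatter_matrix(df)    # pairwise scatter plots
    plt.show()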

The actual data preparation process consists of three steps:

  • Step 1: Data Selection: Consider what data is available, what data is missing and what data can be removed.
  • Step 2: Data Preprocessing: Organize your selected data by formatting, cleaning and sampling from it.
  • Step 3: Data Transformation: Transform the preprocessed data so it is ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation (a small sketch follows this list).
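
To make the transformation step concrete, here is a tiny illustrative sketch of attribute aggregation and scaling with pandas and scikit-learn. The attributes are hypothetical, not from any particular dataset:

    # Engineer an aggregate feature, then scale attributes to
    # zero mean and unit variance.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # hypothetical numeric attributes standing in for your cleaned data
    df = pd.DataFrame({'height': [1.5, 1.8, 1.6, 1.7],
                       'weight': [65.0, 85.0, 70.0, 75.0]})

    # attribute aggregation: combine two attributes into a derived feature
    df['bmi'] = df['weight'] / df['height'] ** 2

    # scaling: standardize each attribute
    scaled = StandardScaler().fit_transform(df)
    print(scaled)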

You can learn more about this process for preparing data in the post:

3. Spot Check Algorithms

I use 10 fold cross validation in my test harnesses by default. All experiments (algorithm and dataset combinations) are repeated 10 times and the mean and standard deviation of the accuracy is collected and reported. I also use statistical significance tests to flush out meaningful results from noise. Box-plots are very useful for summarizing the distribution of accuracy results for each algorithm and dataset pair.
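
A minimal sketch of such a harness in scikit-learn might look like the following. The synthetic dataset is a stand-in for your own prepared data:

    # Test harness sketch: 10-fold cross validation, repeated 10 times,
    # reporting the mean and standard deviation of accuracy.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.linear_model import LogisticRegression

    # synthetic stand-in for your prepared dataset
    X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

    # 10 folds x 10 repeats = 100 accuracy scores per experiment
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             scoring='accuracy', cv=cv)
    print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))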

I spot check algorithms, which means loading up a bunch of standard machine learning algorithms into my test harness and performing a formal experiment. I typically run 10-20 standard algorithms from all the major algorithm families across all the transformed and scaled versions of the dataset I have prepared.

The goal of spot checking is to flush out the types of algorithms and dataset combinations that are good at picking out the structure of the problem so that they can be studied in more detail with focused experiments.

More focused experiments with well-performing families of algorithms may be performed in this step, but algorithm tuning is left for the next step.
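
Continuing the harness sketch above, spot checking a handful of standard algorithms and comparing their accuracy distributions with box plots might look like this. The five algorithms are illustrative picks from different families, not a recommendation:

    # Spot-check sketch: run several standard algorithms through the same
    # harness and compare their accuracy distributions with box plots.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    import matplotlib.pyplot as plt

    X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)

    models = [('LR', LogisticRegression(max_iter=1000)),
              ('KNN', KNeighborsClassifier()),
              ('CART', DecisionTreeClassifier()),
              ('NB', GaussianNB()),
              ('SVM', SVC())]

    names, results = [], []
    for name, model in models:
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
        names.append(name)
        results.append(scores)
        print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))

    plt.boxplot(results)
    plt.xticks(range(1, len(names) + 1), names)  # one box per algorithm
    plt.title('Spot-check accuracy distributions')
    plt.show()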

You can discover more about defining your test harness in the post:

You can discover the importance of spot checking algorithms in the post:

4. Improve Results

After spot checking, it’s time to squeeze the best result out of the rig. I do this by running an automated sensitivity analysis on the parameters of the top-performing algorithms. I also design and run experiments using standard ensemble methods of the top-performing algorithms. I put a lot of time into thinking about how to get more out of the dataset, or out of the family of algorithms that has been shown to perform well.
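
One way to automate this kind of parameter sensitivity analysis in scikit-learn is an exhaustive grid search over the parameter space of a top-performing algorithm. A minimal sketch, with an illustrative grid rather than recommended values:

    # Tuning sketch: search the parameter space of a well-performing
    # algorithm with an exhaustive grid search.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
    from sklearn.svm import SVC

    # synthetic stand-in for your prepared dataset
    X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

    # illustrative grid, not a recommendation
    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.1, 0.01]}
    search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=cv)
    search.fit(X, y)
    print('Best: %.3f using %s' % (search.best_score_, search.best_params_))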

Again, statistical significance of results is critical here. It is easy to get lost playing with algorithm configurations; the results are only meaningful if they are statistically significant, all configurations are thought out in advance, and the experiments are executed in batch. I also like to maintain my own personal leaderboard of the top results on a problem.

In summary, the process of improving results involves:

  • Algorithm Tuning: where discovering the best models is treated like a search problem through model parameter space.
  • Ensemble Methods: where the predictions made by multiple models are combined (see the sketch after this list).
  • Extreme Feature Engineering: where the attribute decomposition and aggregation seen in data preparation is pushed to the limits.
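
As a minimal sketch of the ensemble idea above, combining a few models with soft voting in scikit-learn might look like this. The component models are illustrative:

    # Ensemble sketch: combine the predictions of several models
    # with a soft-voting classifier.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

    # soft voting averages predicted probabilities across the models
    ensemble = VotingClassifier(estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('cart', DecisionTreeClassifier()),
        ('svm', SVC(probability=True))], voting='soft')

    scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv)
    print('Ensemble accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))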

You can discover more about this process in the post:

5. Present Results

The results of a complex machine learning problem are meaningless unless they are put to work. This typically means a presentation to stakeholders. Even if it is a competition or a problem I am working on for myself, I still go through the process of presenting the results. It’s a good practice and gives me clear learnings I can build upon next time.

The template I use to present results is below and may take the form of a text document, formal report or presentation slides.

  • Context (Why): Define the environment in which the problem exists and set up the motivation for the research question.
  • Problem (Question): Concisely describe the problem as a question that you went out and answered.
  • Solution (Answer): Concisely describe the solution as an answer to the question you posed in the previous section. Be specific.
  • Findings: Bulleted lists of the discoveries you made along the way that interest the audience. They may be discoveries in the data, methods that did or did not work, or the model performance benefits you achieved along your journey.
  • Limitations: Consider where the model does not work or questions that the model does not answer. Do not shy away from these questions: a definition of where the model excels is more trusted if you can also define where it does not excel.
  • Conclusions (Why+Question+Answer): Revisit the “why”, research question and the answer you discovered in a tight little package that is easy to remember and repeat for yourself and others.

You can discover more about using the results of a machine learning project in the post:

Summary

In this post, you have learned my general template for processing a machine learning problem.

I use this process almost without fail, and I use it across platforms, from Weka to R and scikit-learn, and even newer platforms I have been playing around with, like pylearn2.

What is your process? Leave a comment and share.

Will you copy this process, and if so, what changes will you make to it?

20 Responses to Applied Machine Learning Process

  1. sadegh May 7, 2015 at 4:31 pm #

    thanx

  2. json July 15, 2015 at 5:10 am #

    very helpful post

  3. Raihan Masud September 22, 2015 at 5:52 am #

    Well-thought-out process. How about adding visualization after/during data selection/pre-processing to view the distribution of the data? You might be doing it implicitly as part of the Data Prep step. Thanks for sharing your process. Very useful.

  4. Robert Chumley April 2, 2016 at 12:45 am #

    I like to take an Agile approach to Machine Learning where we apply and look for the highest-priority outcomes first. The data can have a ton of value, and looking at it from the stakeholders' perspective and what they want to accomplish from the data is important. Then, based on the list of outcomes the stakeholders are looking for, work backward to find the individual value based on the results. I also add an additional step of formalization where the results are put into code and placed into reusable modules. This way, we can always reuse the result for later applications.

  5. ali July 1, 2016 at 9:37 pm #

    hi
    which NN has a better outcome for spam detection?
    can i try that on Weka or not?
    i am looking for a dataset with content and non-content attributes
    i mean one that has content features and non-content features, things like the length of the email, the time and date it was sent, the IP, and so on.

    • Jason Brownlee July 2, 2016 at 6:19 am #

      My advice would be to try a suite of different algorithms to see what works best on your problem.

  6. Murali December 21, 2016 at 11:15 pm #

    I am very much thankful for your guidance.

  7. Jay January 3, 2017 at 10:40 pm #

    Hi Jason, we prepare configurations to apply on switches and routers. So we know all the parameters and we know the intended outcome. Can we look at automating the process of preparing the configuration using ML? Can you provide any pointers here, please?

    • Jason Brownlee January 4, 2017 at 8:54 am #

      I’m not sure Jay, it almost sounds like a constraint optimization problem rather than a predictive modeling problem.

      Try this process to define your problem and let me know how you go:
      http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

    • Dante Perez January 27, 2017 at 6:59 am #

      @Jay, I work on wireless packet core engineering but I do understand what you're intending to do. Automating scripts won't be applicable to ML, since the parameters are all predefined values for the switches/routers to act on. But if you're trying to predict why a switch is overloading while all metrics are green, then, as Jason mentioned, you can look at several attributes of the switch processors like CPU, port utilization, counters etc. and other variables that contribute to the switch/router spiking up; then yes, you can use ML to create a predictive model.

  8. Preeti Agarwal January 10, 2017 at 5:34 pm #

    Great, I am going to try the above steps.

  9. Ted January 12, 2017 at 9:47 am #

    Awesome!!

  10. Mixymol March 13, 2017 at 3:50 pm #

    Sir, Thank you. I started.

  11. Giri March 16, 2017 at 6:04 pm #

    Jason,
    What would be your recommendations (blog posts, books, courses etc.) for learning/mastering feature engineering? It seems to me that it is as important as selecting the proper algorithm.
