How to Prepare Data For Machine Learning

Machine learning algorithms learn from data. It is critical that you feed them the right data for the problem you want to solve. Even if you have good data, you need to make sure that it is in a useful scale and format, and that meaningful features are included.

In this post you will learn how to prepare data for a machine learning algorithm. This is a big topic, so we will cover just the essentials.

Lots of Data. Photo attributed to cibomahto, some rights reserved.

Data Preparation Process

The more disciplined you are in your handling of data, the more consistent and better results you are likely to achieve. The process for getting data ready for a machine learning algorithm can be summarized in three steps:

  • Step 1: Select Data
  • Step 2: Preprocess Data
  • Step 3: Transform Data

You can follow this process in a linear manner, but it is very likely to be iterative with many loops.

Step 1: Select Data

This step is concerned with selecting the subset of all available data that you will be working with. There is always a strong desire to include all of the data that is available, on the assumption that the maxim “more is better” will hold. This may or may not be true.

You need to consider what data you actually need to address the question or problem you are working on. Make some assumptions about the data you require and be careful to record those assumptions so that you can test them later if needed.

Below are some questions to help you think through this process:

  • What is the extent of the data you have available? For example, through time, across database tables, and from connected systems. Ensure you have a clear picture of everything that you can use.
  • What data is not available that you wish you had available? For example data that is not recorded or cannot be recorded. You may be able to derive or simulate this data.
  • What data don’t you need to address the problem? Excluding data is almost always easier than including data. Note down which data you excluded and why.

It is only in small problems, like competition or toy datasets, that the data has already been selected for you.

Step 2: Preprocess Data

After you have selected the data, you need to consider how you are going to use the data. This preprocessing step is about getting the selected data into a form that you can work with.

Three common data preprocessing steps are formatting, cleaning and sampling; a short sketch of the cleaning and sampling steps follows the list:

  • Formatting: The data you have selected may not be in a format that is suitable for you to work with. The data may be in a relational database and you would like it in a flat file, or the data may be in a proprietary file format and you would like it in a relational database or a text file.
  • Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are incomplete and do not carry the data you believe you need to address the problem. These instances may need to be removed. Additionally, there may be sensitive information in some of the attributes and these attributes may need to be anonymized or removed from the data entirely.
  • Sampling: There may be far more selected data available than you need to work with. More data can result in much longer running times for algorithms and larger computational and memory requirements. You can take a smaller representative sample of the selected data that may be much faster for exploring and prototyping solutions before considering the whole dataset.
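
As a rough illustration only, the cleaning and sampling steps above might look something like the following pandas sketch. The file name, the -999 sentinel and the dropped column are hypothetical placeholders, not details taken from any particular dataset.

```python
# A rough sketch of cleaning and sampling with pandas (hypothetical data).
import numpy as np
import pandas as pd

# Load the selected data, assumed to already be in a flat CSV file.
data = pd.read_csv("selected_data.csv")

# Cleaning: treat a sentinel value as missing, drop incomplete instances
# and remove a sensitive attribute entirely.
data = data.replace(-999, np.nan)
data = data.dropna()
data = data.drop(columns=["customer_name"])

# Sampling: take a smaller representative sample for faster prototyping.
sample = data.sample(frac=0.1, random_state=1)
print(sample.shape)
```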

It is very likely that the machine learning tools you use on the data will influence the preprocessing you will be required to perform. You will likely revisit this step.

So Much Data. Photo attributed to Marc_Smith, some rights reserved.

Step 3: Transform Data

The final step is to transform the preprocessed data. The specific algorithm you are working with and the knowledge of the problem domain will influence this step and you will very likely have to revisit different transformations of your preprocessed data as you work on your problem.

Three common data transformations are scaling, attribute decompositions and attribute aggregations; a short sketch follows the list below. This step is also referred to as feature engineering.

  • Scaling: The preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume. Many machine learning methods like data attributes to have the same scale, such as between 0 and 1 for the smallest and largest value for a given feature. Consider any feature scaling you may need to perform.
  • Decomposition: There may be features that represent a complex concept that may be more useful to a machine learning method when split into the constituent parts. An example is a date that may have day and time components that in turn could be split out further. Perhaps only the hour of day is relevant to the problem being solved. Consider what feature decompositions you can perform.
  • Aggregation: There may be features that can be aggregated into a single feature that would be more meaningful to the problem you are trying to solve. For example, there may be data instances for each time a customer logged into a system that could be aggregated into a count of the number of logins, allowing the additional instances to be discarded. Consider what types of feature aggregation you could perform.
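
As a rough sketch of what these three transformations can look like with pandas and scikit-learn (the file and column names are hypothetical placeholders, not from any real dataset):

```python
# A rough sketch of scaling, decomposition and aggregation (hypothetical data).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv("preprocessed_data.csv", parse_dates=["login_timestamp"])

# Scaling: rescale numeric attributes to lie between 0 and 1.
scaler = MinMaxScaler()
data[["dollars", "kilograms"]] = scaler.fit_transform(data[["dollars", "kilograms"]])

# Decomposition: split a date-time into components; perhaps only the hour matters.
data["login_hour"] = data["login_timestamp"].dt.hour

# Aggregation: collapse one-row-per-login into a login count per customer.
logins_per_customer = data.groupby("customer_id").size().rename("login_count")
print(logins_per_customer.head())
```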

You can spend a lot of time engineering features from your data and it can be very beneficial to the performance of an algorithm. Start small and build on the skills you learn.

Summary

In this post you learned the essence of data preparation for machine learning. You discovered a three step framework for data preparation and tactics in each step:

  • Step 1: Data Selection. Consider what data is available, what data is missing and what data can be removed.
  • Step 2: Data Preprocessing. Organize your selected data by formatting, cleaning and sampling from it.
  • Step 3: Data Transformation. Transform preprocessed data ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation.

Data preparation is a large subject that can involve a lot of iterations, exploration and analysis. Getting good at data preparation will make you a master at machine learning. For now, just consider the questions raised in this post when preparing data and always be looking for clearer ways of representing the problem you are trying to solve.

Resources

If you are looking to dive deeper into this subject, you can learn more in the resources below.

Do you have some data preparation process tips and tricks? Please leave a comment and share your experiences.

29 Responses to How to Prepare Data For Machine Learning

  1. Fraser March 31, 2014 at 4:30 am #

    I enjoyed your concise overview, Jason.

    Perhaps you can delve a little into the dangers/opportunities in your Step 2: Cleaning stage.

    It has been my experience that those data you may want to remove contain the more interesting data to the client (perhaps only after the requested client questions are addressed).

    Fraser

    • jasonb March 31, 2014 at 5:36 am #

      Hi Fraser, good question.
      Indeed, it can be difficult to know if data is bad and you may not always have a domain expert at hand to comment. Sometimes it is obvious though, like 0 values that are impossible in the domain, such as a blood pressure of 0. I’ve also seen -999 used to signal “not provided”. In these cases we can mark attributes as missing and think about possible rules for imputing them if we so desire.
      Where do you draw the line though? Should severe outliers be marked as missing? Sometimes. I like to try a lot of stuff: for example, I would try removing instances with large outliers in one dimension and see what that did to my models; I’d also try removing instances with missing values, and try models on variations of the data with imputed values. Almost always, modeling the ground truth is not the goal; there are performance metrics like classification accuracy or AUC that are being optimized.
      You’re right though, sometimes the broken data can represent something very interesting – anomalies that signal something useful in and of themselves in the domain.
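
      For example, marking sentinel values as missing and imputing with the column median might look something like this sketch (the DataFrame and the blood pressure column are made up for illustration):

      ```python
      # A made-up example: mark sentinel values as missing, then impute.
      import numpy as np
      import pandas as pd

      df = pd.DataFrame({"blood_pressure": [120, 0, -999, 135, 110]})

      # Mark impossible or sentinel values as missing.
      df["blood_pressure"] = df["blood_pressure"].replace([0, -999], np.nan)

      # One possible imputation rule: fill with the column median.
      df["blood_pressure"] = df["blood_pressure"].fillna(df["blood_pressure"].median())
      ```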

  2. Fraser March 31, 2014 at 6:05 am #

    Yes, indeed. Is it an outlier, or a poorly encoded result, or a result with atypical calibration, or does it represent a distinct and real combination of natural conditions …

    I work a lot with chemical concentration data in water and sediment and I run into censored data routinely. Mostly from the 1000 mg/L. Censored data of this particular type is handled differently by different people and as you suggest values need to be imputed (with an appropriate sampling distribution) if the rest of a multi-parameter time-sample result is to remain in the analysis.

    For me this is what makes data analysis fun.

    I just arrived at your site, and I see so many articles of interest. Thank you for making this available.

    Fraser

  3. Fraser March 31, 2014 at 6:07 am #

    The use of the angle brackets got lost in my post above.

    “Mostly of the type “less than” .01 pg/L but occasionally the other side, say “greater than” 1000 mg/L.”

    • jasonb March 31, 2014 at 7:42 am #

      Insightful comments Fraser, thanks. Reach out any time if you want to kick around some ideas on a tough problem.

  4. Fraser March 31, 2014 at 11:50 am #

    Thanks, Jason. I will do that. Fraser

  5. Surajit August 25, 2015 at 10:23 pm #

    I like “Getting good at data preparation will make you a master at machine learning”. This is indeed a good post.

    Thanks Dr Jason.

    • Rohita Gupta February 10, 2017 at 5:43 pm #

      Can you please share the link to this article?

      • Jason Brownlee February 11, 2017 at 4:55 am #

        I believe Surajit was quoting from this article.

  6. Kiran Garimella November 6, 2015 at 9:20 am #

    Great set of articles!

    One issue that I run into is that the data sometimes lacks semantic integrity. This is not an issue of missing values, but just having improper values. When values are of different data types within a column, it is easy to detect and fix.

    However, when the data type is the same but the meaning changes, then it’s much more difficult. For example, I’ve seen sales data where a column named ‘marketing plan code’ would have string data type denoting marketing plan codes, except in a few cases where the users put in vendor codes because they didn’t have any other field to record that information.

    Any insights and anecdotes about this issue?

  7. KLeyn May 4, 2016 at 10:14 pm #

    Jason, does it affect an algorithm if, during the preparation process, I transform a list of rows (like tables, where the key column repeats) into a pivot table, where the key column shows once and a lot of columns (say hundreds) hold partial sums or counts for different conditions (let’s say sales of January in one column, sales of February in a second column and so on)?
    Would it create multicollinearity, as some columns could be aggregated into one?

  8. mokhtar May 12, 2016 at 4:44 am #

    Thank you for your valuable information on this important area of machine learning, starting with the data structure and going further to build it out completely.

  9. ali October 21, 2016 at 9:50 am #

    How can I make one attribute the decision attribute in the data set so that the classification model depends on the selected attribute?

    • Jason Brownlee October 22, 2016 at 6:54 am #

      Hi Ali,

      Different algorithms will choose which variables to use and how to use them. You can force a model to use one variable by deleting all of the other variables.

  10. Avin October 31, 2016 at 11:21 am #

    Hi Jason,

    Appreciate the effort you put into the great article.

    I am currently working on a project on a government data set to find if an entity (a person or an individual) was involved in a positive or a negative way. I took a flat file containing some test data and prepared the code to perform sentiment analysis using the Naive Bayes algorithm with NLTK Python modules.

    – In most cases we have a defined training data set tagged as ‘positive’ or ‘negative’ (e.g. movie reviews, a Twitter data set). In my case there is no existing trained government data set.
    – The training data is available but I need to categorize the training data set as ‘positive’ or ‘negative’.
    – My question here is, how do we go about classifying my government data as ‘positive’ or ‘negative’?

    I’m looking forward to your advice on how to categorize my government training data as positive or negative. This is very important for me to get my sentiment analysis with the best possible accuracy.

    • Jason Brownlee November 1, 2016 at 7:58 am #

      Hi Avin, I would advise you locate a subject matter expert to prepare a high-quality training dataset for you (manual classifications).

  11. Mayur November 4, 2016 at 6:01 am #

    What is the best way to process large amount of data for machine learning?

    • Jason Brownlee November 4, 2016 at 11:14 am #

      Hi Mayur,

      That depends on the problem and how the data is currently represented and stored. No silver bullets, sorry.

  12. Ivan November 9, 2016 at 12:27 am #

    My current and first ML project has natural language as its input and I spent a huge chunk of time on preparing it.

    I stopped once the data reached a “reasonable” level so that I could continue with the project, i.e. I’m dropping the hard to parse cases and might return to them later once the whole pipeline is ready for testing.

    Keeping the 80/20 rule in mind.

  13. ted January 3, 2017 at 3:14 am #

    Thank you for your valuable posts. My question is: how do I apply machine learning to a Cancer Registry data set?

    I have two datasets:

    1. Dataset1:

    About 18K observations and 22 variables: the five-year data set includes demographics, diagnoses, and treatments.

    2. Dataset2:
    aggregate vitals based on race grouping on: regions, stages, vitals

    Thank you for your help

    Ted

  14. José Alberto Ramos Silva March 11, 2017 at 11:46 pm #

    Hi Jason, thank you for the great effort and knowledge put into all these posts!
    My question will probably be silly, but since I’m a complete n00b I’ll do it just the same.
    Data prep, feature analysis and engineering will get you a set of data in a format completely different from the original data. These data transformation steps may be very hard to do automatically. My problem is related to classification; I am using a NN, which may not be the best choice, but hey, humor me 😉
    So, cutting short. Originally, I get raw data, I prep and transform it. The transformed data will train and test “my” NN. Now, the “real world” will challenge my model with raw data, presumably with the same format as my original training set, minus the classification (of course…). Now, I suppose I’ll have to go through the same data transformation before the trained model can be fed with it. Right? Doesn’t this mean extra care must be taken to make the data transformation process (at least ideally) automatic itself?
    Sorry for the long question, hope to hear your thoughts on these points. And thank you once again!

    • Jason Brownlee March 12, 2017 at 8:27 am #

      Very good question José!

      Yes. Any data transformation performed on data used to fit your model must be performed on data when making predictions.

      This means we need a very clear recipe for this transform, ideally automatic, and in the case of regression problems it must also be reversible so that we can convert predictions back into their original scale for use.
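
      For example, with scikit-learn that recipe might look roughly like the sketch below (the numbers and scalers are made up for illustration, not code from the post):

      ```python
      # A made-up sketch: fit the transform on training data only, reuse it
      # on new data, and invert it for regression predictions.
      import numpy as np
      from sklearn.preprocessing import MinMaxScaler

      X_train = np.array([[10.0], [20.0], [30.0]])
      X_new = np.array([[25.0]])  # raw data arriving at prediction time

      scaler = MinMaxScaler().fit(X_train)
      X_train_scaled = scaler.transform(X_train)
      X_new_scaled = scaler.transform(X_new)  # identical transform applied to new data

      # For regression targets, the fitted scaler can be inverted so that
      # predictions come back in their original scale.
      y_scaler = MinMaxScaler().fit(np.array([[100.0], [200.0], [300.0]]))
      y_pred = y_scaler.inverse_transform(np.array([[0.5]]))
      ```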

      • José Alberto Ramos Silva March 16, 2017 at 9:47 am #

        Thank you very much Jason. And keep up the excellent job you are doing!

  15. dhanpal singh April 29, 2017 at 7:51 am #

    What is the best book to learn how to prepare datasets for machine learning models?

  16. Eric Kraemer August 12, 2017 at 7:19 am #

    I would like to suggest that within your topic of “Select Data” you offer a bit more explicit guidance on assessing and characterizing data quality. It’s cliché, but garbage-in-garbage-out is a fundamental concept. I so often come across advanced analytic initiatives that have started out with assumptions about the quality of “selected” data and moved on – only to find out months later that everything has to reset to basic principles of data acquisition and management.

    What transforms have been applied to source data by systems that precede the database you are selecting from?

    If sensor data is involved, what formatting, precision, transformations, signal processing, etc. have been applied?

    If data is being acquired from multiple, disparate systems what formatting, scale, and precision differences are being masked by the database system you are selecting from?

    Just a few examples.

    • Jason Brownlee August 13, 2017 at 9:43 am #

      Really good points Eric.

      It’s hard to give general advice on data prep because all of the detail in the specific data matters.

      It’s not like algorithms where you can say “try everything and see what works on your data”.
