
Tour of Data Preparation Techniques for Machine Learning

Predictive modeling machine learning projects, such as classification and regression, always involve some form of data preparation.

The specific data preparation required for a dataset depends on the specifics of the data, such as the variable types, as well as on the algorithms that will be used to model it, which may impose expectations or requirements on the data.

Nevertheless, there is a collection of standard data preparation algorithms that can be applied to structured data (e.g. data that forms a large table like in a spreadsheet). These data preparation algorithms can be organized or grouped by type into a framework that can be helpful when comparing and selecting techniques for a specific project.

In this tutorial, you will discover the common data preparation tasks performed in a predictive modeling machine learning task.

After completing this tutorial, you will know:

  • Techniques such as data cleaning can identify and fix errors in data like missing values.
  • Data transforms can change the scale, type, and probability distribution of variables in the dataset.
  • Techniques such as feature selection and dimensionality reduction can reduce the number of input variables.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Tour of Data Preparation Techniques for Machine Learning
Photo by Nicolas Raymond, some rights reserved.

Tutorial Overview

This tutorial is divided into six parts; they are:

  1. Common Data Preparation Tasks
  2. Data Cleaning
  3. Feature Selection
  4. Data Transforms
  5. Feature Engineering
  6. Dimensionality Reduction

Common Data Preparation Tasks

We can define data preparation as the transformation of raw data into a form that is more suitable for modeling.

Nevertheless, there are steps in a predictive modeling project before and after the data preparation step that are important and that inform the data preparation to be performed.

The process of applied machine learning consists of a sequence of steps.

We may jump back and forth between the steps for any given project, but all projects have the same general steps; they are:

  • Step 1: Define Problem.
  • Step 2: Prepare Data.
  • Step 3: Evaluate Models.
  • Step 4: Finalize Model.

We are concerned with the data preparation step (step 2), and there are common or standard tasks that you may use or explore during that step in a machine learning project.

The types of data preparation performed depend on your data, as you might expect.

Nevertheless, as you work through multiple predictive modeling projects, you see and require the same types of data preparation tasks again and again.

These tasks include:

  • Data Cleaning: Identifying and correcting mistakes or errors in the data.
  • Feature Selection: Identifying those input variables that are most relevant to the task.
  • Data Transforms: Changing the scale or distribution of variables.
  • Feature Engineering: Deriving new variables from available data.
  • Dimensionality Reduction: Creating compact projections of the data.

This provides a rough framework that we can use to think about and navigate different data preparation algorithms we may consider on a given project with structured or tabular data.

Let’s take a closer look at each in turn.


Data Cleaning

Data cleaning involves fixing systematic problems or errors in “messy” data.

The most useful data cleaning involves deep domain expertise and could involve identifying and addressing specific observations that may be incorrect.

There are many reasons data may have incorrect values, such as being mistyped, corrupted, duplicated, and so on. Domain expertise may allow obviously erroneous observations to be identified as they are different from what is expected, such as a person’s height of 200 feet.

Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed. This might involve removing a row or a column. Alternately, it might involve replacing observations with new values.

Nevertheless, there are general data cleaning operations that can be performed, such as the following (a short code sketch follows the list):

  • Using statistics to define normal data and identify outliers.
  • Identifying columns that have the same value or no variance and removing them.
  • Identifying duplicate rows of data and removing them.
  • Marking empty values as missing.
  • Imputing missing values using statistics or a learned model.
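
As a rough illustration, several of these operations can be expressed in a few lines of pandas and scikit-learn. The sketch below is minimal and assumes an all-numeric dataset in a hypothetical file named data.csv; the most valuable cleaning still relies on the domain expertise described above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the dataset ("data.csv" is a hypothetical placeholder).
df = pd.read_csv("data.csv")

# Identify duplicate rows of data and remove them.
df = df.drop_duplicates()

# Identify columns that have a single value (no variance) and remove them.
df = df.loc[:, df.nunique() > 1]

# Mark empty strings and placeholder values as missing (NaN).
df = df.replace(["", "?"], np.nan)

# Impute missing values with the column mean (assumes numeric columns).
df[df.columns] = SimpleImputer(strategy="mean").fit_transform(df)

# Use statistics to define normal data: drop rows with any value more
# than three standard deviations from its column mean.
z_scores = (df - df.mean()) / df.std()
df = df[(z_scores.abs() < 3).all(axis=1)]
```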

Data cleaning is an operation that is typically performed first, prior to other data preparation operations.

Overview of Data Cleaning

For more on data cleaning, see the tutorial:

Feature Selection

Feature selection refers to techniques for selecting a subset of input features that are most relevant to the target variable that is being predicted.

This is important as irrelevant and redundant input variables can distract or mislead learning algorithms, possibly resulting in lower predictive performance. Additionally, it is desirable to develop models using only the data that is required to make a prediction, e.g. to favor the simplest possible well-performing model.

Feature selection techniques are generally grouped into those that use the target variable (supervised) and those that do not (unsupervised). Additionally, the supervised techniques can be further divided into models that automatically select features as part of fitting the model (intrinsic), those that explicitly choose features that result in the best performing model (wrapper) and those that score each input feature and allow a subset to be selected (filter).
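
As a concrete example of the filter approach, the minimal scikit-learn sketch below scores each feature on a synthetic classification dataset with the ANOVA F-test and keeps the five highest-scoring features; the dataset and the choice of k are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# A synthetic classification dataset with 10 input features.
X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# Filter method: score each feature against the target (ANOVA F-test)
# and keep the 5 highest-scoring features.
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
print(X_selected.shape)  # (100, 5)
```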

Overview of Feature Selection Techniques

Statistical methods, such as correlation, are popular for scoring input features. The features can then be ranked by their scores and a subset with the largest scores used as input to a model. The choice of statistical measure depends on the data types of the input variables, so a review of the different statistical measures that can be used is helpful here.

For an overview of how to select statistical feature selection methods based on data type, see the tutorial:

Additionally, there are different common feature selection use cases we may encounter in a predictive modeling project.

When a mixture of input variable data types is present, different filter methods can be used. Alternately, a wrapper method, such as the popular RFE method, can be used, as it is agnostic to the input variable type.
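
For instance, a minimal RFE sketch might look like the following; the decision tree estimator and the number of features to keep are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# A synthetic classification dataset with 10 input features.
X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# Wrapper method: RFE repeatedly fits the model and prunes the weakest
# features until only 5 remain.
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)
print(X_selected.shape)  # (100, 5)
```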

The broader field of scoring the relative importance of input features is referred to as feature importance. Many model-based techniques exist whose outputs can be used to aid in interpreting the model, interpreting the dataset, or selecting features for modeling.

For more on feature importance, see the tutorial:

Data Transforms

Data transforms are used to change the type or distribution of data variables.

This is a large umbrella of different techniques, and they may be applied just as easily to input and output variables.

Recall that data may have one of a few types, such as numeric or categorical, with subtypes for each, such as integer and real-valued for numeric, and nominal, ordinal, and boolean for categorical.

  • Numeric Data Type: Number values.
    • Integer: Integers with no fractional part.
    • Real: Floating point values.
  • Categorical Data Type: Label values.
    • Ordinal: Labels with a rank ordering.
    • Nominal: Labels with no rank ordering.
    • Boolean: Values True and False.

The figure below provides an overview of this same breakdown of high-level data types.

Overview of Data Variable Types

We may wish to convert a numeric variable to an ordinal variable in a process called discretization. Alternatively, we may encode a categorical variable as integers or boolean variables, as required by most machine learning algorithms. (A short code sketch follows the list below.)

  • Discretization Transform: Encode a numeric variable as an ordinal variable.
  • Ordinal Transform: Encode a categorical variable into an integer variable.
  • One-Hot Transform: Encode a categorical variable into binary variables.
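
The minimal scikit-learn sketch below shows all three transforms on made-up values; note that the OneHotEncoder argument is named sparse in older scikit-learn releases and sparse_output in newer ones.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

# Discretization transform: map a numeric variable onto 3 ordered bins.
numeric = np.array([[0.1], [2.5], [7.3], [9.9]])
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(discretizer.fit_transform(numeric))

# Ordinal transform: map each category to an integer.
labels = np.array([["red"], ["green"], ["blue"]])
print(OrdinalEncoder().fit_transform(labels))

# One-hot transform: map each category to a binary vector.
print(OneHotEncoder(sparse_output=False).fit_transform(labels))
```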

For real-valued numeric variables, the way they are represented in a computer means there is dramatically more resolution in the range 0-1 than in the broader range of the data type. As such, it may be desirable to scale variables to this range, called normalization. If the data has a Gaussian probability distribution, it may be more useful to shift the data to a standard Gaussian with a mean of zero and a standard deviation of one, called standardization. (A short code sketch follows the list below.)

  • Normalization Transform: Scale a variable to the range 0 and 1.
  • Standardization Transform: Scale a variable to a standard Gaussian.
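
Both transforms are one-liners in scikit-learn; the sample values in this minimal sketch are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[100.0], [250.0], [400.0], [550.0]])

# Normalization transform: rescale to the range [0, 1].
print(MinMaxScaler().fit_transform(data))

# Standardization transform: rescale to zero mean, unit standard deviation.
print(StandardScaler().fit_transform(data))
```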

The probability distribution for numerical variables can be changed.

For example, if the distribution is nearly Gaussian, but is skewed or shifted, it can be made more Gaussian using a power transform. Alternatively, quantile transforms can be used to force a probability distribution, such as a uniform or Gaussian, onto a variable with an unusual natural distribution. (A short code sketch follows the list below.)

  • Power Transform: Change the distribution of a variable to be more Gaussian.
  • Quantile Transform: Impose a probability distribution such as uniform or Gaussian.
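
A minimal sketch of both transforms applied to a synthetic skewed variable follows; the exponential data and the n_quantiles value are illustrative choices.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# A skewed, exponential-looking variable.
rng = np.random.default_rng(1)
data = rng.exponential(size=(1000, 1))

# Power transform: make the distribution more Gaussian (Yeo-Johnson).
more_gaussian = PowerTransformer(method="yeo-johnson").fit_transform(data)

# Quantile transform: impose an (approximately) Gaussian distribution.
forced_gaussian = QuantileTransformer(
    n_quantiles=100, output_distribution="normal"
).fit_transform(data)
```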

An important consideration with data transforms is that the operations are generally performed separately for each variable. As such, we may want to perform different operations on different variable types.
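
One common way to express this in scikit-learn is the ColumnTransformer, which routes each transform to the columns it applies to. The column indices below are hypothetical (columns 0 and 1 numeric, column 2 categorical).

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Scale the (hypothetical) numeric columns 0 and 1, and one-hot encode
# the (hypothetical) categorical column 2.
transformer = ColumnTransformer(transformers=[
    ("num", MinMaxScaler(), [0, 1]),
    ("cat", OneHotEncoder(), [2]),
])
# X_transformed = transformer.fit_transform(X)  # X is your dataset
```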

Overview of Data Transform Techniques

We may also want to use the transform on new data in the future. This can be achieved by saving the transform objects to file along with the final model trained on all available data.
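
One way to do this, sketched below with a hypothetical scaler and training set X, is to fit the transform, save the fitted object (for example with joblib), and load it later to apply the identical transform to new data.

```python
import joblib
from sklearn.preprocessing import MinMaxScaler

# Fit the transform on all available training data (X is hypothetical).
scaler = MinMaxScaler()
# scaler.fit(X)

# Save the fitted transform object alongside the final model.
joblib.dump(scaler, "scaler.joblib")

# Later: load it and apply the exact same transform to new data.
scaler = joblib.load("scaler.joblib")
# X_new_scaled = scaler.transform(X_new)
```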

Feature Engineering

Feature engineering refers to the process of creating new input variables from the available data.

Engineering new features is highly specific to your data and data types. As such, it often requires the collaboration of a subject matter expert to help identify new features that could be constructed from the data.

This specificity makes it a challenging topic to generalize into standard methods.

Nevertheless, there are some techniques that can be reused, such as the following (a short code sketch follows the list):

  • Adding a boolean flag variable for some state.
  • Adding a group or global summary statistic, such as a mean.
  • Adding new variables for each component of a compound variable, such as a date-time.
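
The pandas sketch below illustrates all three ideas on a made-up dataset; every column name here is hypothetical.

```python
import pandas as pd

# A tiny hypothetical dataset with a compound date-time variable.
df = pd.DataFrame({
    "sold_at": pd.to_datetime(["2020-06-19 09:30", "2020-12-05 17:45"]),
    "price": [105.0, 220.0],
})

# Decompose the compound date-time into component variables.
df["year"] = df["sold_at"].dt.year
df["month"] = df["sold_at"].dt.month
df["hour"] = df["sold_at"].dt.hour

# Add a boolean flag variable for some state, e.g. a weekend sale.
df["is_weekend"] = df["sold_at"].dt.dayofweek >= 5

# Add a global summary statistic: deviation from the mean price.
df["price_vs_mean"] = df["price"] - df["price"].mean()
```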

A popular approach drawn from statistics is to create copies of numerical input variables that have been changed with a simple mathematical operation, such as raising them to a power or multiplying them with other input variables; these are referred to as polynomial features. (A short code sketch follows below.)

  • Polynomial Transform: Create copies of numerical input variables that are raised to a power.
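
A minimal scikit-learn sketch of the polynomial transform on two made-up input variables:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3],
              [4, 5]])

# Degree-2 polynomial features: bias, x1, x2, x1^2, x1*x2, x2^2.
print(PolynomialFeatures(degree=2).fit_transform(X))
# [[ 1.  2.  3.  4.  6.  9.]
#  [ 1.  4.  5. 16. 20. 25.]]
```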

The theme of feature engineering is to add broader context to a single observation or decompose a complex variable, both in an effort to provide a more straightforward perspective on the input data.

I like to think of feature engineering as a type of data transform, although it would be just as reasonable to think of data transforms as a type of feature engineering.

Dimensionality Reduction

The number of input features for a dataset may be considered the dimensionality of the data.

For example, two input variables together can define a two-dimensional area where each row of data defines a point in that space. This idea can then be scaled to any number of input variables to create large multi-dimensional hyper-volumes.

The problem is, the more dimensions this space has (e.g. the more input variables), the more likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that space. This is referred to as the curse of dimensionality.

This motivates feature selection, although an alternative is to create a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data.

This is referred to generally as dimensionality reduction. Unlike feature selection, the variables in the projected data are not directly related to the original input variables, making the projection difficult to interpret.

The most common approach to dimensionality reduction is to use a matrix factorization technique (a short code sketch follows the list):

  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)
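
Both are available as transforms in scikit-learn; in this minimal sketch, the synthetic dataset and the choice of five components are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD

# A synthetic dataset with 20 input variables.
X, _ = make_classification(n_samples=100, n_features=20, random_state=1)

# Project the 20 input variables down to 5 components.
print(PCA(n_components=5).fit_transform(X).shape)           # (100, 5)
print(TruncatedSVD(n_components=5).fit_transform(X).shape)  # (100, 5)
```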

The main impact of these techniques is that they remove linear dependencies between input variables, e.g. correlated variables.

Other approaches exist that learn a lower-dimensional projection of the data. We might refer to these as model-based methods, such as LDA (sketched below) and perhaps autoencoders.

  • Linear Discriminant Analysis (LDA)
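
A minimal LDA sketch on a synthetic three-class dataset follows; note that LDA is supervised (it uses the class labels) and can project to at most one fewer dimensions than there are classes.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# A synthetic dataset with 20 input variables and 3 classes.
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           n_classes=3, random_state=1)

# LDA projects to at most (n_classes - 1) = 2 dimensions.
X_projected = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_projected.shape)  # (100, 2)
```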

Sometimes manifold learning algorithms can also be used, such as Kohonen self-organizing maps and t-SNE.
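
A minimal t-SNE sketch is shown below. Note that scikit-learn's TSNE only offers fit_transform, not a separate transform for new data, which limits it to visualization and analysis rather than modeling pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

# A synthetic dataset with 20 input variables.
X, _ = make_classification(n_samples=100, n_features=20, random_state=1)

# Embed the data in 2 dimensions for visualization.
X_embedded = TSNE(n_components=2, random_state=1).fit_transform(X)
print(X_embedded.shape)  # (100, 2)
```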

Overview of Dimensionality Reduction Techniques

For more on dimensionality reduction, see the tutorial:

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


Summary

In this tutorial, you discovered the common data preparation tasks performed in a predictive modeling machine learning task.

Specifically, you learned:

  • Techniques such as data cleaning can identify and fix errors in data like missing values.
  • Data transforms can change the scale, type, and probability distribution of variables in the dataset.
  • Techniques such as feature selection and dimensionality reduction can reduce the number of input variables.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

38 Responses to Tour of Data Preparation Techniques for Machine Learning

  1. Santosh June 19, 2020 at 6:57 am #

    Hi Jason, it's an amazing journey going through the articles that you publish. Tons of thanks.

  2. Sangamithra Varadarajan June 19, 2020 at 3:07 pm #

    Flow diagram representation is amazing and gives more clarity on the topics… Thanks a lot, this is something to be bookmarked and revisited again and again.

  3. Desbois June 19, 2020 at 3:22 pm #

    I like the breakdown of data preparation techniques; the graphs are so helpful. I always read your articles, and you make them easy to understand.

    Thanks for your time and dedication!!!!

  4. KVS Setty June 19, 2020 at 5:57 pm #

    Thanks a lot for another interesting and eye opening article on “Data Preparation” step of ML project.

    Just a typo, I think, in the following sentence:

    “Other approaches exist that discover a lower dimensionality reduction. We might refer to these as based methods such as LDA and perhaps autoencoders.”

    The word “Model” is missing; I think it should be “model-based” rather than just “based”.

  5. Sachin Prabhu June 19, 2020 at 9:35 pm #

    Thank you so much for this insightful series on data preparation 🙂

  6. Shiya June 19, 2020 at 10:40 pm #

    Your posts are very informative and thanks for sharing your knowledge. Please keep doing it.

  7. Nur Asikin June 21, 2020 at 3:42 am #

    Hi.
    I’d love it if I am allowed to print the pages, for me to peruse conveniently offline. But every time I click “print”, the box “Thank you for signing up” (or the others) keeps popping up! Will you make it go away? Thanks a lot.

  8. Anthony The Koala June 21, 2020 at 6:06 am #

    Dear Dr Jason,
    Thank you for providing a summary chart of the reduction techniques located at the bottom of your tutorial.

    I am aware you have done all the techniques such as SVD and PCA.

    Is it possible to have links to each of the five topics please? These are Manifold Learning – SOM & tSNE, Model based – LDA, PCA & SVD.

    Thank you,
    Anthony of Sydney

    • Jason Brownlee June 21, 2020 at 6:37 am #

      I have another tutorial on this topic written and scheduled with some manifold learning methods.

      Thanks for the suggestion. SOM is not provided by scikit-learn, and t-SNE is, but it does not support this form of usage in supervised learning.

  9. Antonio June 22, 2020 at 6:01 am #

    Jason, thanks for the material, high quality as usual. Don’t you think it’s worth mentioning Factor Analysis together with PCA for dimensionality reduction? It is also very helpful during feature extraction.

    Regards,

  10. Ronke Babatunde June 27, 2020 at 12:19 am #

    Thanks for the informative tutorials; please keep encouraging us.

  11. Eduardo Rojas July 5, 2020 at 2:06 pm #

    I love how clearly you structure the topics, I am very visual and your diagrams provide me with an easier way to integrate knowledge.

    There is only one issue that recently brings me confusion: identifying whether we are facing ‘outliers’ or ‘anomalies’ and how to deal with these values, such as techniques to differentiate them, how to process anomalies during cleaning or transformation, and whether it is convenient to do so.

  12. Swayonok August 6, 2020 at 4:51 am #

    It’s amazing and very apt, and it sheds light on the classification of the diverse methods that you have bucketed under parent nodes; that is a godsend to me. Thank you so much.

  13. Felix Farinola August 8, 2020 at 2:55 am #

    It’s awesome going through your article, it gives good insight on the go.

  14. Prem March 30, 2021 at 10:43 am #

    It’s a wonderful article, Jason.

    One question in the case of one-to-many relational data, like a bank loan dataset:

    If one customer has 2 dependents, each with different salaries and different employment information,

    I find two approaches could be used:

    1st, sum up the salaries, create a new column “Number of Applicants” and populate it with 2, sum up the credit limits, and skip some fields.

    2nd approach, we can split the dependents into two records with the same application number, with each record populated with the corresponding dependent’s information.

    May I know which approach is better?

    • Jason Brownlee March 31, 2021 at 5:57 am #

      Thanks.

      I would expect that reducing all data down into one record might be appropriate, but it really depends on the specifics of the data and the project goals.

      • Prem March 31, 2021 at 9:20 am #

        Thank you very much, Jason. Much appreciated for the reply.

  15. C. Moore August 15, 2021 at 3:26 am #

    This is a nice overview, but you did not get into data cleaning for image data. Admittedly, this is my pet topic, as I am the author of the Python library cleanX, but I hope you will cover data cleaning of images in the future!

    • Adrian Tam August 17, 2021 at 7:17 am #

      Thank you for your suggestion!

  16. Laura August 18, 2021 at 1:03 am #

    Hello Jason,

    In your overview figure of Data Transforms, you mention that ordinal-type data should receive a Label Encoder transformation. I thought the Label Encoder was for the case of a label (1D), whether ordinal or categorical. Why couldn’t ordinal-type data receive a dummy transform?

    Thank you very much for all the work you are doing here.

    • Adrian Tam August 18, 2021 at 3:32 am #

      Ordinal data is 1st, 2nd, 3rd, etc. A label encoder treats these as labels and converts them into 0, 1, 2, etc. This is a single value. A dummy encoder is one-hot encoding, which converts the data into multiple columns of 0 or 1 values. Of course you can do that to ordinal data, but you will not preserve the information that 2 is greater than 1, and 1 is greater than 0. Usually that is not something we want.

  17. Meriem August 24, 2021 at 1:37 pm #

    Hi,

    Thank you for this post.

    My question is the following:

    Should we perform feature selection before normalization? Also, if I have an imbalanced dataset, which problem to handle first?

    Thank you,

  18. Tony May 30, 2022 at 10:01 pm #

    Hej,

    Thank you for the learning materials.

    I have three large files and the common column is “ID”. What is the best approach in this scenario: clean the data (impute, remove duplicates, scale, etc.) before merging/concatenating, or after merging?

    The merging process, before or after cleaning the data, will produce a lot of extra missing values! That really confuses me about how to solve this issue and where to start.

    Thank you,

    Tony
