Last Updated on June 30, 2020

Predictive modeling machine learning projects, such as classification and regression, always involve some form of data preparation.

The specific data preparation required for a dataset depends on the specifics of the data, such as the variable types, as well as the algorithms that will be used to model them that may impose expectations or requirements on the data.

Nevertheless, there is a collection of standard data preparation algorithms that can be applied to structured data (e.g. data that forms a large table like in a spreadsheet). These data preparation algorithms can be organized or grouped by type into a framework that can be helpful when comparing and selecting techniques for a specific project.

In this tutorial, you will discover the common data preparation tasks performed in a predictive modeling machine learning task.

After completing this tutorial, you will know:

- Techniques such as data cleaning can identify and fix errors in data like missing values.
- Data transforms can change the scale, type, and probability distribution of variables in the dataset.
- Techniques such as feature selection and dimensionality reduction can reduce the number of input variables.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

## Tutorial Overview

This tutorial is divided into six parts; they are:

- Common Data Preparation Tasks
- Data Cleaning
- Feature Selection
- Data Transforms
- Feature Engineering
- Dimensionality Reduction

## Common Data Preparation Tasks

We can define data preparation as the transformation of raw data into a form that is more suitable for modeling.

Nevertheless, there are steps in a predictive modeling project before and after the data preparation step that are important and inform the data preparation that is to be performed.

The process of applied machine learning consists of a sequence of steps.

We may jump back and forth between the steps for any given project, but all projects have the same general steps; they are:

**Step 1**: Define Problem.**Step 2**: Prepare Data.**Step 3**: Evaluate Models.**Step 4**: Finalize Model.

We are concerned with the data preparation step (step 2), and there are common or standard tasks that you may use or explore during the data preparation step in a machine learning project.

The types of data preparation performed depend on your data, as you might expect.

Nevertheless, as you work through multiple predictive modeling projects, you see and require the same types of data preparation tasks again and again.

These tasks include:

**Data Cleaning**: Identifying and correcting mistakes or errors in the data.**Feature Selection**: Identifying those input variables that are most relevant to the task.**Data Transforms**: Changing the scale or distribution of variables.**Feature Engineering**: Deriving new variables from available data.**Dimensionality Reduction**: Creating compact projections of the data.

This provides a rough framework that we can use to think about and navigate different data preparation algorithms we may consider on a given project with structured or tabular data.

Let’s take a closer look at each in turn.

### Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Data Cleaning

Data cleaning involves fixing systematic problems or errors in “*messy*” data.

The most useful data cleaning involves deep domain expertise and could involve identifying and addressing specific observations that may be incorrect.

There are many reasons data may have incorrect values, such as being mistyped, corrupted, duplicated, and so on. Domain expertise may allow obviously erroneous observations to be identified as they are different from what is expected, such as a person’s height of 200 feet.

Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed. This might involve removing a row or a column. Alternately, it might involve replacing observations with new values.

Nevertheless, there are general data cleaning operations that can be performed, such as:

- Using statistics to define normal data and identify outliers.
- Identifying columns that have the same value or no variance and removing them.
- Identifying duplicate rows of data and removing them.
- Marking empty values as missing.
- Imputing missing values using statistics or a learned model.

Data cleaning is an operation that is typically performed first, prior to other data preparation operations.

For more on data cleaning see the tutorial:

## Feature Selection

Feature selection refers to techniques for selecting a subset of input features that are most relevant to the target variable that is being predicted.

This is important as irrelevant and redundant input variables can distract or mislead learning algorithms possibly resulting in lower predictive performance. Additionally, it is desirable to develop models only using the data that is required to make a prediction, e.g. to favor the simplest possible well performing model.

Feature selection techniques are generally grouped into those that use the target variable (**supervised**) and those that do not (**unsupervised**). Additionally, the supervised techniques can be further divided into models that automatically select features as part of fitting the model (**intrinsic**), those that explicitly choose features that result in the best performing model (**wrapper**) and those that score each input feature and allow a subset to be selected (**filter**).

Statistical methods are popular for scoring input features, such as correlation. The features can then be ranked by their scores and a subset with the largest scores used as input to a model. The choice of statistical measure depends on the data types of the input variables and a review of different statistical measures that can be used.

For an overview of how to select statistical feature selection methods based on data type, see the tutorial:

Additionally, there are different common feature selection use cases we may encounter in a predictive modeling project, such as:

- Categorical inputs for a classification target variable.
- Numerical inputs for a classification target variable.
- Numerical inputs for a regression target variable.

When a mixture of input variable data types a present, different filter methods can be used. Alternately, a wrapper method such as the popular RFE method can be used that is agnostic to the input variable type.

The broader field of scoring the relative importance of input features is referred to as feature importance and many model-based techniques exist whose outputs can be used to aide in interpreting the model, interpreting the dataset, or in selecting features for modeling.

For more on feature importance, see the tutorial:

## Data Transforms

Data transforms are used to change the type or distribution of data variables.

This is a large umbrella of different techniques and they may be just as easily applied to input and output variables.

Recall that data may have one of a few types, such as **numeric** or **categorical**, with subtypes for each, such as integer and real-valued for numeric, and nominal, ordinal, and boolean for categorical.

**Numeric Data Type**: Number values.**Integer**: Integers with no fractional part.**Real**: Floating point values.

**Categorical Data Type**: Label values.**Ordinal**: Labels with a rank ordering.**Nominal**: Labels with no rank ordering.**Boolean**: Values True and False.

The figure below provides an overview of this same breakdown of high-level data types.

We may wish to convert a numeric variable to an ordinal variable in a process called discretization. Alternatively, we may encode a categorical variable as integers or boolean variables, required on most classification tasks.

**Discretization Transform**: Encode a numeric variable as an ordinal variable.**Ordinal Transform**: Encode a categorical variable into an integer variable.**One-Hot Transform**: Encode a categorical variable into binary variables.

For real-valued numeric variables, the way they are represented in a computer means there is dramatically more resolution in the range 0-1 than in the broader range of the data type. As such, it may be desirable to scale variables to this range, called normalization. If the data has a Gaussian probability distribution, it may be more useful to shift the data to a standard Gaussian with a mean of zero and a standard deviation of one.

**Normalization Transform**: Scale a variable to the range 0 and 1.**Standardization Transform**: Scale a variable to a standard Gaussian.

The probability distribution for numerical variables can be changed.

For example, if the distribution is nearly Gaussian, but is skewed or shifted, it can be made more Gaussian using a power transform. Alternatively, quantile transforms can be used to force a probability distribution, such as a uniform or Gaussian on a variable with an unusual natural distribution.

**Power Transform**: Change the distribution of a variable to be more Gaussian.**Quantile Transform**: Impose a probability distribution such as uniform or Gaussian.

An important consideration with data transforms is that the operations are generally performed separately for each variable. As such, we may want to perform different operations on different variable types.

We may also want to use the transform on new data in the future. This can be achieved by saving the transform objects to file along with the final model trained on all available data.

## Feature Engineering

Feature engineering refers to the process of creating new input variables from the available data.

Engineering new features is highly specific to your data and data types. As such, it often requires the collaboration of a subject matter expert to help identify new features that could be constructed from the data.

This specialization makes it a challenging topic to generalize to general methods.

Nevertheless, there are some techniques that can be reused, such as:

- Adding a boolean flag variable for some state.
- Adding a group or global summary statistic, such as a mean.
- Adding new variables for each component of a compound variable, such as a date-time.

A popular approach drawn from statistics is to create copies of numerical input variables that have been changed with a simple mathematical operation, such as raising them to a power or multiplied with other input variables, referred to as polynomial features.

**Polynomial Transform**: Create copies of numerical input variables that are raised to a power.

The theme of feature engineering is to add broader context to a single observation or decompose a complex variable, both in an effort to provide a more straightforward perspective on the input data.

I like to think of feature engineering as a type of data transform, although it would be just as reasonable to think of data transforms as a type of feature engineering.

## Dimensionality Reduction

The number of input features for a dataset may be considered the dimensionality of the data.

For example, two input variables together can define a two-dimensional area where each row of data defines a point in that space. This idea can then be scaled to any number of input variables to create large multi-dimensional hyper-volumes.

The problem is, the more dimensions this space has (e.g. the more input variables), the more likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that space. This is referred to as the curse of dimensionality.

This motivates feature selection, although an alternative to feature selection is to create a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data.

This is referred to generally as dimensionality reduction and provides an alternative to feature selection. Unlike feature selection, the variables in the projected data are not directly related to the original input variables, making the projection difficult to interpret.

The most common approach to dimensionality reduction is to use a matrix factorization technique:

- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)

The main impact of these techniques is that they remove linear dependencies between input variables, e.g. correlated variables.

Other approaches exist that discover a lower dimensionality reduction. We might refer to these as model-based methods such as LDA and perhaps autoencoders.

- Linear Discriminant Analysis (LDA)

Sometimes manifold learning algorithms can also be used, such as Kohonen self-organizing maps and t-SNE.

For on dimensionality reduction, see the tutorial:

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Tutorials

- How to Prepare Data For Machine Learning
- Applied Machine Learning Process
- How to Perform Data Cleaning for Machine Learning with Python
- How to Choose a Feature Selection Method For Machine Learning
- Introduction to Dimensionality Reduction for Machine Learning

### Books

- Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019.
- Applied Predictive Modeling, 2013.
- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

### Articles

## Summary

In this tutorial, you discovered the common data preparation tasks performed in a predictive modeling machine learning task.

Specifically, you learned:

- Techniques, such data cleaning, can identify and fix errors in data like missing values.
- Data transforms can change the scale, type, and probability distribution of variables in the dataset.
- Techniques such as feature selection and dimensionality reduction can reduce the number of input variables.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

Hi Jason, its an amazing journey going with your articles that you publish. Tons of thanks.

Thanks, I’m happy it helps.

Flow diagram representation is amazing and gives more clarity on the topics…Thanks a lot.. this is something to be booarked and revisited again and again.

You’re welcome, I’m glad it helps!

I like the breakdown of data preparation techniques-the graphs are so helpful. I always read your articles, and you make them easy to understand.

Thanks for your time and dedications!!!!

Thanks!

Thanks a lot for another interesting and eye opening article on “Data Preparation” step of ML project.

just a typo , I think , in the following sentence:

“Other approaches exist that discover a lower dimensionality reduction. We might refer to these as based methods such as LDA and perhaps autoencoders.”

the word “Model” is missing , I think it is “Model based” rather than just “based”.

Thanks, fixed!

Thank you so much for insightful series on Data preparation 🙂

You’re very welcome!

Your posts are very informative and thanks for sharing your knowledge. Please keep doing it.

Thanks!

HI.

I’d love it if I am allowed to print the pages, for me to peruse conveniently offline. But everytime I click “print” , the box “Thank you for signing up” (or the others) keep popping up! Will you make it go away? Thanks a lot.

Good question, I answer it here:

https://machinelearningmastery.com/faq/single-faq/how-can-i-print-a-tutorial-page-without-the-sign-up-form

Dear Dr Jason,

Thank you for providing a summary chart of the reduction techniques located at the bottom of your tutorial.

I am aware you have done all the techniques such as SVD and PCA.

Is it possible to have links to each of the five topics please? These are Manifold Learning – SOM & tSNE, Model based – LDA, PCA & SVD.

Thank you,

Anthony of Sydney

I have another tutorial on this topic written and scheduled with some manifold learning methods

Thanks for the suggestion. SOM is not provided by scikit-learn, and tsne is but it does not support this form of usage in supervised learning.

Jason, thanks for the material, high quality as usual. Don’t you think its worthy to mention Factor Analysis together with PCA for dimension reduction, also very helpful during feature extraction

Regards.,

Great suggestion, thanks!

Thanks for the well informing tutorials, please keep encouraging

You’re welcome.

I love how clearly you structure the topics, I am very visual and your diagrams provide me with an easier way to integrate knowledge.

There is only one issue that recently brings me to confusion, it is the identification if we are facing ‘outliers’ or ‘anomalies’ and how to deal with these values, such as: techniques to differentiate them, how to process anomalies during cleaning, or transformation and if it is convenient to do so.

Thanks!

Good question, see this:

https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/

And this:

https://machinelearningmastery.com/robust-scaler-transforms-for-machine-learning/

Its amazing and very apt, and it shades a light on the classfication of the diverse methods that you have bucketed with parent nodes ,thhat is god send to me.Thank you so much

Thanks!

It’s awesome going through your article, it gives good insight on the go.

Thanks.

Its wonderful article Jason,

One question in case of one to many relation data like Bank Loan Dataset,

If one customer has 2 dependents, each have different salaries and different employee information,

If find two approaches could be done

1st, sum-up the salary, create a new column “Number of Applicants” and populate as 2, sum-up the credit limits, have to skip some.

2nd approach, we can split the dependents as two records with same application number, each details of record is populated with their corresponding dependent information.

May I know which approach is better?

Thanks.

I would expect that reducing all data down into one record might be appropriate, but it really depends on the specifics of the data and the project goals.

Thank you very much Jason, Much appreciated for the reply

You’re welcome.

This is a nice overview, but you did not get into data cleaning for image data. Admittedly, this is my pet topic, as I am the author of the Python library cleanX, but I hope you will cover data cleaning of images in the future!

Thank you for your suggestion!

Hello Jason,

In your overview figure of Data Transform, you mention that ordinal type data should receive a Label Encoder transformation, I thought this Label Encoder was in the case of a label (1D), should it be ordinal or categorical. Why an ordinal type data couldn’t receive a dummy transform?

Thank you very much for all the work you are doing here.

Ordinal data is 1st, 2nd, 3rd, etc. A label encoder treat these as labels, and convert it into 0, 1, 2, etc. This is a single value. Dummy encoder is one-hot encoding, which convert into multiple columns of 0 or 1 values. Of course you can do so to ordinal data, but you will not preserve the information that 2 is greater than 1, and 1 is greater than 0. Usually this is not something we want.

Hi,

Thank you for this post.

My question is the following:

Should we perform feature selection before normalization? Also, if I have an imbalanced dataset, which problem to handle first?

Thank you,

Yes you should. That would be better. For imbalanced data, you may leave it as is in preparation but take care of it during training. See, for example, https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/