Improve Model Accuracy with Data Pre-Processing

By Jason Brownlee on August 15, 2020 in Data Preparation

Data preparation can make or break the predictive ability of your model.

In Chapter 3 of their book Applied Predictive Modeling, Kuhn and Johnson introduce the process of data preparation. They refer to it as the addition, deletion or transformation of training set data.

In this post you will discover the data pre-processing steps that you can use to improve the predictive ability of your models.

Let’s get started.

I Love Spreadsheets
Photo by Craig Chew-Moulding, some rights reserved

Data Preparation

You must pre-process your raw data before you model your problem. The specific preparation may depend on the data that you have available and the machine learning algorithms you want to use.

Sometimes, pre-processing of data can lead to unexpected improvements in model accuracy. This may be because a relationship in the data has been simplified or unobscured.

Data preparation is an important step and you should experiment with data pre-processing steps that are appropriate for your data to see if you can get that desirable boost in model accuracy.

There are three types of pre-processing you can consider for your data:

Add attributes to your data
Remove attributes from your data
Transform attributes in your data

We will dive into each of these three types of pre-processing and review some specific examples of operations that you can perform.

Add Data Attributes

Advanced models can extract the relationships from complex attributes, although some models require those relationships to be spelled out plainly. Deriving new attributes from your training data to include in the modeling process can give you a boost in model performance.

Dummy Attributes: Categorical attributes can be converted into n binary attributes, where n is the number of categories (or levels) that the attribute has. These denormalized or decomposed attributes are known as dummy attributes or dummy variables.
Transformed Attribute: A transformed variation of an attribute can be added to the dataset in order to allow a linear method to exploit possible linear and non-linear relationships between attributes. Simple transforms like log, square and square root can be used.
Missing Data: Attributes with missing data can have that missing data imputed using a reliable method, such as k-nearest neighbors.
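
The three operations above can be sketched in plain Python. This is a minimal illustration, not the method from the book: the function names, the toy rows and the choice of k are all assumptions made for the example, and the imputation simply averages the k nearest complete rows.

```python
import math

def add_dummy_attributes(rows, attribute):
    """Replace a categorical attribute with one binary attribute per category."""
    categories = sorted({row[attribute] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != attribute}
        for category in categories:
            new_row[f"{attribute}_{category}"] = 1 if row[attribute] == category else 0
        encoded.append(new_row)
    return encoded

def add_log_attribute(rows, attribute):
    """Add log(x) of a strictly positive numeric attribute as a new attribute."""
    return [{**row, f"log_{attribute}": math.log(row[attribute])} for row in rows]

def knn_impute(rows, attribute, predictors, k=2):
    """Fill missing (None) values with the mean of the k nearest complete rows."""
    complete = [r for r in rows if r[attribute] is not None]
    imputed = []
    for row in rows:
        if row[attribute] is not None:
            imputed.append(dict(row))
            continue
        # Rank complete rows by Euclidean distance on the predictor attributes.
        nearest = sorted(
            complete,
            key=lambda r: math.dist(
                [r[p] for p in predictors], [row[p] for p in predictors]
            ),
        )[:k]
        value = sum(r[attribute] for r in nearest) / len(nearest)
        imputed.append({**row, attribute: value})
    return imputed

# Hypothetical toy data, for illustration only.
rows = [
    {"color": "red", "size": 1.0, "weight": 10.0},
    {"color": "blue", "size": 2.0, "weight": 20.0},
    {"color": "red", "size": 3.0, "weight": None},
    {"color": "blue", "size": 10.0, "weight": 100.0},
]

rows = knn_impute(rows, "weight", predictors=["size"], k=2)  # fill the gap first
rows = add_log_attribute(rows, "weight")                     # add a transformed copy
rows = add_dummy_attributes(rows, "color")                   # color -> color_blue, color_red
print(rows[2])
```

Imputation runs first here so that the log transform never sees a missing value. In a real project these steps should be fit on the training data only and then applied to the test data.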

Remove Data Attributes

Some methods perform poorly with redundant or duplicate attributes. You can get a boost in model accuracy by removing attributes from your data.

Projection: Training data can be projected into lower dimensional spaces, but still characterize the inherent relationships in the data. A popular approach is Principal Component Analysis (PCA) where the principal components found by the method can be taken as a reduced set of input attributes.
Spatial Sign: A spatial sign projection of the data will transform data onto the surface of a multidimensional sphere. The results can be used to highlight the existence of outliers that can be modified or removed from the data.
Correlated Attributes: Some algorithms degrade in performance in the presence of highly correlated attributes. Pairs of attributes with high correlation can be identified and the most correlated attributes can be removed from the data.
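
Two of the ideas above, removing correlated attributes and the spatial sign projection, can be sketched in a few lines of plain Python. The function names, the 0.9 threshold and the toy columns are assumptions for illustration. The filter is a simple greedy pass that keeps the first attribute of each correlated pair, which is a simplification rather than the procedure described by Kuhn and Johnson.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.9):
    """Keep an attribute only if |r| with every already-kept attribute is below threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

def spatial_sign(row):
    """Project one row (assumed already centered and scaled) onto the unit sphere."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)

# Hypothetical toy data: height_in is a near-duplicate of height_cm.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40.0, 25.0, 33.0, 58.0, 21.0],
}

reduced = drop_correlated(columns, threshold=0.9)
print(list(reduced))             # names of the attributes that survive the filter
print(spatial_sign([3.0, 4.0]))  # a row rescaled to unit length
```

The threshold is a tuning choice. It is worth trying a few values and checking the effect on model skill rather than treating any single cut-off as correct.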

Transform Data Attributes

Transformations of training data can reduce the skewness of data as well as the prominence of outliers in the data. Many models expect data to be transformed before you can apply the algorithm.

Centering: Transform the data so that it has a mean of zero. When each attribute is also divided by its standard deviation, so that it has a standard deviation of one, the result is typically called data standardization.
Scaling: A standard scaling transformation is to map the data from the original scale to a scale between zero and one. This is typically called data normalization.
Remove Skew: Skewed data is data that has a distribution that is pushed to one side or the other (larger or smaller values) rather than being normally distributed. Some methods assume normally distributed data and can perform better if the skew is removed. Try replacing the attribute with the log, square root or inverse of the values.
Box-Cox: A Box-Cox transform or family of transforms can be used to reliably adjust data to remove skew.
Binning: Numeric data can be made discrete by grouping values into bins. This is typically called data discretization. This process can be performed manually, although it is more reliable if performed systematically and automatically using a heuristic that makes sense in the domain.
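
The transforms listed above are simple enough to write out directly. The sketch below uses only the Python standard library; the function names and the sample values are assumptions for illustration. Note that the Box-Cox function takes a fixed lambda supplied by the caller, whereas in practice lambda is usually estimated from the data, and the transform is only defined for strictly positive values.

```python
import math
import statistics

def standardize(values):
    """Center to mean zero and scale to a standard deviation of one."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def normalize(values):
    """Rescale to the range [0, 1] using y = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins, numbered from 0."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is placed in the last bin rather than in a bin of its own.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical right-skewed attribute, for illustration only.
incomes = [1.0, 2.0, 4.0, 8.0, 100.0]

print(standardize(incomes))
print(normalize(incomes))
print(box_cox(incomes, 0))           # lambda = 0 is the log transform
print(box_cox(incomes, 0.5))         # lambda = 0.5 behaves like a square root
print(equal_width_bins(incomes, 4))
```

Equal-width bins are only one possible heuristic. With skewed data such as this, most values fall into the first bin, which is one reason to remove skew before binning or to choose bin edges that make sense in the domain.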

Summary

Data pre-processing is an important step that can be required to prepare raw data for modeling and to meet the data expectations of a specific machine learning algorithm, and it can give unexpected boosts in model accuracy.

In this post we discovered three groups of data pre-processing methods:

Adding Attributes
Removing Attributes
Transforming Attributes

The next time you are looking for a boost in model accuracy, consider what new perspectives you can engineer on your data for your models to explore and exploit.

28 Responses to Improve Model Accuracy with Data Pre-Processing

Juhyoung Lee July 19, 2016 at 3:54 pm #

Can you explain in more detail the concepts of attribute and dummy attribute?

- Jason Brownlee July 20, 2016 at 5:17 am #
  
  Sure Juhyoung Lee,
  
  You can take a categorical attribute like “color” with the values “red” and “blue” and turn it into two binary attributes has_red and has_blue.
  
  These new binary variables are dummy variables.
  
  You can learn more here:
  https://en.wikipedia.org/wiki/Dummy_variable_(statistics)
  
kay December 12, 2016 at 8:23 am #

It was a great article, although I had a question. Suppose there is a data set consisting of means, modes, mins, maxes, etc. How can we represent all those values on a common scale, or generalize the values? For example, let’s say the mean of heights in a group of people is x, the mode is y, and the min value is z, and there is a group 2 with the same data. Can the values be represented on a common scale?

- Jason Brownlee December 13, 2016 at 8:04 am #
  
  Hi Kay,
  
  You can scale each column (data type or feature) separately to a range of 0-1.
  
  You can use the formula:
  
  y = (x - min) / (max - min)
  
  Where x is a given value and min and max are the limits of values on the column.
  
  I hope that helps.
  
Jan July 20, 2017 at 8:25 pm #

Thanks for the article, Jason!

If I have some normally distributed features and some skewed features, can I just transform the skewed data and leave the normally distributed data untouched? Can I, e.g., log transform some features and leave others?

Regards!

- Jason Brownlee July 21, 2017 at 9:33 am #
  
  Absolutely.
  
- Efatathios Chatzikyriakidis April 20, 2020 at 7:28 am #
  
  Hi!
  
  You can also add to the list aggregation features, or statistical features in general (5-number summaries), and outlier removal.
  
  - Jason Brownlee April 20, 2020 at 7:37 am #
    
    Great tip.
    
Shabnam December 1, 2017 at 6:26 am #

About “Correlated Attributes” that you mentioned in this post:
I was wondering if you have any post on using it (for example in sklearn), so I can read and understand more.

- Jason Brownlee December 1, 2017 at 7:44 am #
  
  I may have an example in the R book. Perhaps search the blog?
  
Shabnam December 1, 2017 at 6:31 am #

I have a question about scaling. If we have a binary classification (0,1), is it better to keep it as it is or change it to (-1,1), for example? Or does it depend on the data?

When I read posts about machine learning, I am not sure if the notes are always true or not.
I noticed that in many cases, the response depends on the type of data. How can we say if a note/rule/point is dependent on a data type?

- Jason Brownlee December 1, 2017 at 7:44 am #
  
  Really depends on the data and the algorithms being used.
  
Andrea Grandi December 11, 2017 at 3:58 am #

When you say “Correlated Attributes: Some algorithms degrade in importance with the existence of highly correlated attributes. Pairwise attributes with high correlation can be identified and the most correlated attributes can be removed from the data.”

I was totally convinced of the opposite :O

I mean: I usually remove those attributes which are not correlated and keep those highly correlated… this can explain why I couldn’t optimise my models too much.

Do you have more documentation about this particular subject?

Thanks!

- Jason Brownlee December 11, 2017 at 5:33 am #
  
  Multiple correlated input features do mess up most models, at worst resulting in worse skill, at best, creating a lot of redundancy in the model.
  
  Formally, the problem is often referred to as multicollinearity:
  https://en.wikipedia.org/wiki/Multicollinearity
  
Mohammad Ehtasham Billah February 5, 2018 at 10:12 am #

Among all of these steps, which are the most important ones? E.g., if I apply PCA for dimensionality reduction and create just two new features, can I expect that other problems in the data (e.g. outliers, multicollinearity, skewed distribution) will no longer exist?

For the transformed attribute in the Add Data Attributes section, how can I choose which transformation is best among log, square and square root? Can I apply each of those transformations and keep all of them in the dataset? I feel like it may cause redundancy and multicollinearity.

Thank you for your great posts!!

- Jason Brownlee February 5, 2018 at 2:52 pm #
  
  The framing of the problem offers the biggest leverage.
  
  PCA with outliers might cause problems, better to remove them first.
  
  Try maintaining multiple “views” of the data and try models on each to explore the best combination of view/model/config.
  
  - Mohammad Ehtasham Billah February 8, 2018 at 6:18 am #
    
    Hi Jason,
    For pairwise attributes with high correlation, what is the accepted level? The correlation can be -1 to +1.
    
    - Jason Brownlee February 8, 2018 at 8:32 am #
      
      Perhaps below 0.5. I’d recommend testing to see what works best for your data and algorithms.
      
maunish September 19, 2019 at 2:04 am #

Hi Jason,

I have participated in a Kaggle competition in which I have to classify forest cover type.
I tried everything, like stacking various models, feature engineering, and feature extraction,
but my model accuracy is not increasing above 80%.

I also found that 2 of the cover types are really hard to separate, so I tried to build a model to separate these 2 cover types.

Nothing is working, so I am frustrated. It feels like there is something that others know and I do not.

It would be really helpful if you could give some insight. I have been working on this for 2 weeks and am very frustrated now.

And sorry for asking such a stupid and long question.

- Jason Brownlee September 19, 2019 at 6:04 am #
  
  I have some suggestions here that may help:
  https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
  
  And here:
  https://machinelearningmastery.com/start-here/#better
  
rahul June 8, 2020 at 8:34 am #

Hi Jason,
I have a .csv file with 10 columns and roughly 6000 rows. My data is represented in the form of only 0 and 1. Each row represents a timeframe of a video.
Let’s say I want to bring the number of rows down from 6000 to 1000 rows without losing information. What method is reliable in my case? And how can it be done?

- Jason Brownlee June 8, 2020 at 1:19 pm #
  
  Without losing information? Not sure I can help sorry.
  
Priya April 30, 2021 at 7:31 pm #

Hi Sir,

As kNN is a distance-based algorithm, data normalization may have an impact on it. But is it possible for normalization to negatively impact model accuracy with kNN?

(I am getting RMSE=50 without normalization, and RMSE=70 with normalization in the kNN algorithm.)
Is it possible, or am I making some logical mistake?

- Jason Brownlee May 1, 2021 at 6:04 am #
  
  It is possible that data scaling does not help your model.
  
  - Priya May 1, 2021 at 2:40 pm #
    
    thanks for your reply
    
    - Jason Brownlee May 2, 2021 at 5:27 am #
      
      You’re welcome.
      
John Rustam August 13, 2021 at 2:09 am #

I like what you write very much; it’s very clear. Would you mind sending me a PDF version, please?

rustam@ui.ac.id

- Adrian Tam August 13, 2021 at 5:34 am #
  
  I believe you can print the web page into PDF using your browser.
  