Rescaling Data for Machine Learning in Python with Scikit-Learn

Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation.

In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn.

Update: See this post for a more up-to-date set of examples.

Data Rescaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.

Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.

Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.
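
To make the definition concrete, here is a minimal sketch of rescaling into the range 0 to 1 done by hand with NumPy. The toy array and the variable names are made up for illustration, and the min-max formula shown is one common way to get values into that range, not necessarily the only one.

```python
import numpy as np

# Hypothetical toy data: two attributes on very different scales
# (for example, a price in dollars and a weight in kilograms).
X = np.array([[100.0, 1.0],
              [200.0, 3.0],
              [400.0, 2.0]])

# Min-max rescaling, applied to each attribute (column) separately:
#   x' = (x - min) / (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_rescaled = (X - col_min) / (col_max - col_min)

print(X_rescaled)
```

After rescaling, the smallest value in each column becomes 0 and the largest becomes 1, so both attributes contribute on a comparable scale.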

It is useful to scale the input attributes for a model that relies on the magnitude of values, such as distance measures used in k-nearest neighbors and in the preparation of coefficients in regression.

The example below demonstrates data normalization of the Iris flowers dataset.
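
A minimal sketch of such an example, assuming the Iris data is loaded with scikit-learn's load_iris helper. It calls preprocessing.normalize, and for comparison also shows MinMaxScaler, which works per attribute rather than per row. The variable names are illustrative.

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris

# Load the Iris flowers dataset bundled with scikit-learn.
iris = load_iris()
X = iris.data    # 150 rows, 4 numeric attributes
y = iris.target

# normalize() rescales each row (sample) to unit length (L2 norm by default).
# The Iris measurements are all positive, so the results lie between 0 and 1.
normalized_X = preprocessing.normalize(X)

# MinMaxScaler rescales each attribute (column) into the range 0 to 1.
minmax_X = preprocessing.MinMaxScaler().fit_transform(X)

print(normalized_X[:3])
print(minmax_X.min(axis=0), minmax_X.max(axis=0))
```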

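For more information see the normalize function in the API documentation. Note that normalize rescales each row (sample) to unit length; to rescale each attribute into the range 0 to 1, see MinMaxScaler.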

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).
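
As a rough illustration of the definition, standardization can be written by hand with NumPy: subtract each attribute's mean and divide by its standard deviation. The toy array and names below are made up for the sketch.

```python
import numpy as np

# Hypothetical toy data: two attributes on very different scales.
X = np.array([[100.0, 1.0],
              [200.0, 3.0],
              [400.0, 2.0]])

# Standardize each attribute (column) separately:
#   z = (x - mean) / std
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_standardized = (X - mean) / std

print(X_standardized.mean(axis=0))  # approximately 0 for each column
print(X_standardized.std(axis=0))   # approximately 1 for each column
```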

It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.

The example below demonstrates data standardization of the Iris flowers dataset.
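
A minimal sketch of such an example, assuming the Iris data is loaded with load_iris and standardized with preprocessing.scale. The StandardScaler part and the new_flower measurement are additions for illustration, showing one way to reuse the same mean and standard deviation on data that arrives later.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

# Load the Iris flowers dataset bundled with scikit-learn.
iris = load_iris()
X = iris.data
y = iris.target

# scale() standardizes each attribute to zero mean and unit variance.
standardized_X = preprocessing.scale(X)

print(standardized_X.mean(axis=0))
print(standardized_X.std(axis=0))

# StandardScaler remembers the mean and standard deviation it was fitted on,
# so the same shift and scale can be applied to new rows.
scaler = preprocessing.StandardScaler().fit(X)
new_flower = np.array([[5.0, 3.5, 1.5, 0.2]])  # hypothetical new measurement
print(scaler.transform(new_flower))
```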

For more information see the scale function in the API documentation.

Tip: Which Method To Use

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.

A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data with given models, and which rescaling method may be worthy of further investigation.
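
One way this tip might look in code is sketched below. The choice of algorithms (k-nearest neighbors and logistic regression), 5-fold cross-validation as the test harness, and all variable names are assumptions made for illustration.

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Rescaled copies of the dataset to race against each other.
datasets = {
    "raw": X,
    "normalized": preprocessing.normalize(X),
    "standardized": preprocessing.scale(X),
}

# A handful of algorithms to spot check.
models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validation accuracy for every dataset/model pair.
results = {}
for data_name, data in datasets.items():
    for model_name, model in models.items():
        scores = cross_val_score(model, data, y, cv=5)
        results[(data_name, model_name)] = scores.mean()
        print(f"{data_name:>12} {model_name:>6} {scores.mean():.3f}")
```

Rescaling the whole dataset before cross-validation lets statistics from the held-out folds influence the transform. That is usually acceptable for a quick spot check, but for a stricter estimate the scaler can be placed inside a scikit-learn Pipeline so it is fitted on the training folds only.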

Summary

Data rescaling is an important part of data preparation before applying machine learning algorithms.

In this post you discovered where data rescaling fits into the process of applied machine learning, and two methods, normalization and standardization, that you can use to rescale your data in Python using the scikit-learn library.


2 Responses to Rescaling Data for Machine Learning in Python with Scikit-Learn

  1. Matt July 10, 2016 at 7:13 pm #

    Hi thank you for this nice article ! I have a quick question (maybe long to explain but I think the answer is short 🙂 )
    I am using data standardization for a k-NN algorithm. I will have new instances and I’ll have to determine their class, their data won’t be standardized.
    Should I :
    1) Standardize my training set with the scale function. Then manually calculate the means and the std of my training set to standardize my new vector.
    2) Add the new data to the training set and then standardize the set with the mean function.
    3) Neither
    Thank you !
