Rescaling Data for Machine Learning in Python with Scikit-Learn

Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation.

In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn.

Update: See this post for a more up-to-date set of examples.

[Image: Data Rescaling. Photo by Quinn Dombrowski, some rights reserved.]

Data Rescaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.

Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.

Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.

It is useful to scale the input attributes for a model that relies on the magnitude of values, such as distance measures used in k-nearest neighbors and in the preparation of coefficients in regression.

The example below demonstrates data normalization of the Iris flowers dataset.
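A minimal sketch of such an example, assuming scikit-learn's bundled Iris loader. The choice of MinMaxScaler for rescaling each attribute into the range 0 to 1 is an assumption made for illustration; the normalize function is shown alongside it because it behaves differently (it works on rows by default).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, normalize

# Load the Iris flowers dataset: 150 rows, 4 numeric attributes
iris = load_iris()
X = iris.data

# Assumption: min-max scaling is used to rescale each attribute (column)
# independently into the range 0 to 1
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax.min(axis=0))  # per-attribute minimum after rescaling
print(X_minmax.max(axis=0))  # per-attribute maximum after rescaling

# By contrast, normalize() works on rows by default (axis=1):
# each sample is rescaled to have unit L2 norm
X_unit = normalize(X)
print(np.linalg.norm(X_unit, axis=1)[:5])  # row norms of the first 5 samples
```

Which of the two is appropriate depends on whether you want to rescale attributes (columns) or individual samples (rows).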

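For more information see the normalize function in the API documentation. Note that normalize rescales each sample (row) to unit norm by default (axis=1); to rescale each attribute (column) into the range 0 to 1, see MinMaxScaler or the minmax_scale function.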

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.

The example below demonstrates data standardization of the Iris flowers dataset.
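A minimal sketch of such an example, assuming scikit-learn's bundled Iris loader and the scale function referenced in this post:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

# Load the Iris flowers dataset: 150 rows, 4 numeric attributes
X = load_iris().data

# Standardize each attribute (column) to zero mean and unit variance
X_std = scale(X)

# Check the result: means should be close to 0, standard deviations close to 1
print(np.round(X_std.mean(axis=0), 6))
print(np.round(X_std.std(axis=0), 6))
```

If the same transformation has to be applied later to new data, the StandardScaler class may be a better fit, because it stores the mean and standard deviation learned from the training data.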

For more information see the scale function in the API documentation.

Tip: Which Method To Use

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.

A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data with given models, and which rescaling method may be worthy of further investigation.
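The tip above can be sketched as follows. This is only an illustration under stated assumptions: the test harness is taken to be 5-fold cross-validation on the Iris dataset, the two spot-check algorithms (k-nearest neighbors and logistic regression) are arbitrary picks, and the scaling is placed inside a pipeline so that it is refit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)

# Assumption: three variants of the data (raw, normalized, standardized)
scalers = {
    "raw": None,
    "normalized": MinMaxScaler(),
    "standardized": StandardScaler(),
}

# Assumption: a small, arbitrary set of algorithms to spot check
models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
}

results = {}
for scaler_name, scaler in scalers.items():
    for model_name, model in models.items():
        # Skip the scaling step for the raw variant
        steps = [model] if scaler is None else [scaler, model]
        pipeline = make_pipeline(*steps)
        # Assumption: 5-fold cross-validation accuracy as the test harness
        scores = cross_val_score(pipeline, X, y, cv=5)
        results[(scaler_name, model_name)] = scores.mean()
        print(f"{scaler_name:>12} + {model_name:<6}: {scores.mean():.3f}")
```

Comparing the rows for each algorithm gives a quick read on whether rescaling helps that model on your data.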

Summary

Data rescaling is an important part of data preparation before applying machine learning algorithms.

In this post you discovered where data rescaling fits into the process of applied machine learning and two methods, normalization and standardization, that you can use to rescale your data in Python using the scikit-learn library.


5 Responses to Rescaling Data for Machine Learning in Python with Scikit-Learn

  1. Matt July 10, 2016 at 7:13 pm #

    Hi thank you for this nice article ! I have a quick question (maybe long to explain but I think the answer is short 🙂 )
    I am using data standardization for a k-NN algorithm. I will have new instances and I’ll have to determine their class, their data won’t be standardized.
    Should I :
    1) Standardize my training set with the scale function. Then manually calculate the means and the std of my training set to standardize my new vector.
    2) Add the new data to the training set and then standardize the set with the mean function.
    3) Neither
    Thank you !

  2. Tzu-Yen June 16, 2017 at 6:50 am #

    I came across your website which is extremely helpful for studying machine learning. Thanks for the great effort.

    Just a friendly reminder. The normalize function has an axis parameter with a default value of 1, so it will run on rows (samples) by default. For feature normalization, you need to set axis=0.

    • Jason Brownlee June 16, 2017 at 8:12 am #

      Thanks Tzu-Yen!

    • Abel August 17, 2017 at 7:09 am #

      Hey Tzu-Yen, you saved my day…. Thanks a lot
