Rescaling Data for Machine Learning in Python with Scikit-Learn

Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation.

In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn.

Let’s get started.

Update: See this post for a more up-to-date set of examples.

Data Rescaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.

Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.

Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.

It is useful to scale the input attributes for a model that relies on the magnitude of values, such as distance measures used in k-nearest neighbors and in the preparation of coefficients in regression.
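To see why magnitude matters for distance-based methods, here is a small sketch using made-up records and assumed attribute ranges (none of these values come from a real dataset). On the raw values the dollar attribute dominates the Euclidean distance; after rescaling both attributes into the range 0 to 1, the nearest neighbor of the first record changes.

```python
import math

# Hypothetical records: (income in dollars, age in years).
a = (50_000.0, 25.0)
b = (51_000.0, 60.0)  # similar income to a, very different age
c = (58_000.0, 26.0)  # different income from a, similar age

# Assumed attribute ranges used for rescaling (illustrative only).
INCOME_RANGE = (20_000.0, 120_000.0)
AGE_RANGE = (18.0, 90.0)


def min_max(value, low, high):
    """Rescale a value from [low, high] into [0, 1]."""
    return (value - low) / (high - low)


def rescale(record):
    """Apply min-max rescaling to both attributes of a record."""
    income, age = record
    return (min_max(income, *INCOME_RANGE), min_max(age, *AGE_RANGE))


# Euclidean distances on the raw attributes: income dominates.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Euclidean distances after rescaling: both attributes contribute.
scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print("raw:     d(a,b)=%.1f  d(a,c)=%.1f" % (raw_ab, raw_ac))
print("rescaled: d(a,b)=%.3f  d(a,c)=%.3f" % (scaled_ab, scaled_ac))
```

On the raw data b looks closer to a than c does, purely because age differences are numerically small next to income differences. After rescaling, c is the closer record.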

The example below demonstrates data normalization of the Iris flowers dataset.
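A minimal sketch of this step, assuming the Iris data is loaded with load_iris from sklearn.datasets. Two calls are shown because they do different things: MinMaxScaler rescales each attribute (column) into the range 0 to 1, while preprocessing.normalize with its default axis=1 rescales each sample (row) to unit norm. Which one you want depends on what you mean by normalization; the variable names here are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, normalize

# Load the Iris dataset and separate the inputs from the target.
iris = load_iris()
X = iris.data
y = iris.target

# Per-attribute rescaling: each column is mapped into the range [0, 1].
rescaled_X = MinMaxScaler().fit_transform(X)
print(rescaled_X.min(axis=0), rescaled_X.max(axis=0))

# Per-sample rescaling: each row is scaled to unit L2 norm (default axis=1).
unit_norm_X = normalize(X)
print((unit_norm_X ** 2).sum(axis=1)[:3])
```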

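For more information see the normalize function in the API documentation. Note that normalize rescales each sample (row) to unit norm by default (axis=1); to rescale each attribute into the range 0 to 1, see MinMaxScaler.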

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.

The example below demonstrates data standardization of the Iris flowers dataset.
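A minimal sketch of this step, again assuming the Iris data from load_iris. It calls preprocessing.scale and then repeats the calculation by hand, subtracting each column's mean and dividing by its population standard deviation, to show what the function computes. The name manual_X is only illustrative.

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris

# Load the Iris dataset and separate the inputs from the target.
iris = load_iris()
X = iris.data
y = iris.target

# Standardize each attribute (column) to zero mean and unit variance.
standardized_X = preprocessing.scale(X)

# The same calculation by hand: (x - column mean) / column std (ddof=0).
manual_X = (X - X.mean(axis=0)) / X.std(axis=0)

# The means are zero and the standard deviations one, up to floating-point error.
print(standardized_X.mean(axis=0))
print(standardized_X.std(axis=0))
```

Expect the printed means to be tiny values such as 1e-16 rather than exactly zero; that is floating-point rounding, not a bug.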

For more information see the scale function in the API documentation.

Tip: Which Method To Use

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.

A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data with given models, and which rescaling method may be worthy of further investigation.
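One way to sketch such a race is shown here. The test harness (5-fold cross-validation) and the algorithm (k-nearest neighbors with default settings) are assumptions chosen for illustration; swap in your own. Each scaler is placed in a pipeline so that it is refit on the training folds only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)

# Candidate data preparations to compare; None means no rescaling.
candidates = {
    "raw": None,
    "normalized": MinMaxScaler(),
    "standardized": StandardScaler(),
}

results = {}
for name, scaler in candidates.items():
    model = KNeighborsClassifier()
    # With a pipeline, the scaler is fit on each training fold only.
    estimator = model if scaler is None else make_pipeline(scaler, model)
    scores = cross_val_score(estimator, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```

On a small, similarly scaled dataset like Iris the differences are likely to be minor; the pattern matters more on data with attributes in very different units.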

Summary

Data rescaling is an important part of data preparation before applying machine learning algorithms.

In this post you discovered where data rescaling fits into the process of applied machine learning and two methods, normalization and standardization, that you can use to rescale your data in Python using the scikit-learn library.

43 Responses to Rescaling Data for Machine Learning in Python with Scikit-Learn

  1. Matt July 10, 2016 at 7:13 pm #

    Hi thank you for this nice article ! I have a quick question (maybe long to explain but I think the answer is short 🙂 )
    I am using data standardization for a k-NN algorithm. I will have new instances and I’ll have to determine their class, their data won’t be standardized.
    Should I :
    1) Standardize my training set with the scale function. Then manually calculate the means and the std of my training set to standardize my new vector.
    2) Add the new data to the training set and then standardize the set with the mean function.
    3) Neither
    Thank you !

  2. Tzu-Yen June 16, 2017 at 6:50 am #

    I came across your website which is extremely helpful for studying machine learning. Thanks for the great effort.

    Just a friendly reminder. The normalization function has an axis parameter with a default value equal to 1, so it will run on rows/data by default. For feature normalization, you need to set axis = 0.

    • Jason Brownlee June 16, 2017 at 8:12 am #

      Thanks Tzu-Yen!

    • Abel August 17, 2017 at 7:09 am #

      Hey Tzu-Yen, you saved my day…. Thanks a lot

  3. Shud November 1, 2017 at 4:00 pm #

    Hi Jason,

    May I know how to bring the data back to original scale? I need my predictions in original scale. I normalised my data and tried .inverse_transform(data) to get back my original data. But it gave me an error – AttributeError: ‘Normalizer’ object has no attribute ‘inverse_transform’
    Any kind of help would be appreciated.

  4. Sumit November 17, 2017 at 3:37 pm #

    Hi Jason,
    I have one question.
    Suppose today my data range is 5000 to 10000 which will be scaled between 0 to 1
    And tomorrow if a new data entry comes with 10500. If I use the same 5000 to 10000 range for fitting then it produces output X1, and if I specify the 5000 to 10500 range then it produces output X2, which is not equal to X1.

    How to over come this? How to handle new data with old range?

    Thanks & Regards
    Sumit

    • Jason Brownlee November 18, 2017 at 10:12 am #

      You can estimate the expected range of data in the future and use those min/max values to scale.

      Or, you can estimate a mean/stdev and standardize instead, if the data is Gaussian.

  5. Rizwan Mian January 1, 2018 at 10:11 am #

    Hi Jason, when I use standardization as suggested in the post, I see mean and standard deviation very close to zero and one, respectively…but not exactly. Wonder if such close-enough values are acceptable in the community?

    count 7.68E+02
    mean -6.48E-17
    std 1.00E+00
    min -1.14E+00
    25% -8.45E-01
    50% -2.51E-01
    75% 6.40E-01
    max 3.91E+00

  6. Suranga April 8, 2018 at 11:25 pm #

    Hi , can you tell me the formula behind preprocessing.scale()?

  7. neethu May 16, 2018 at 9:05 pm #

    While I am training on the data set I am getting an accuracy around 100%. But when I am testing I am not getting the proper answer. What could be the reason? Can you please help me with it?

    • Jason Brownlee May 17, 2018 at 6:31 am #

      Sounds like you are overfitting the training dataset.

  8. Hagais June 27, 2018 at 6:01 pm #

    hai Mr. Jason Brownlee, I wanna ask about how to prints the values back in their original scale using normalizer

    • Jason Brownlee June 28, 2018 at 6:14 am #

      You can invert the scaling transforms; call scaler.inverse_transform()

  9. Nitin vij July 18, 2018 at 8:27 pm #

    Hi Jason,

    I tried to use standardized_X = preprocessing.scale(X).
    I checked with print(np.mean(standardized_X[:,0])) but mean is not zero (for any of the four columns) although standard deviation is one. Am I doing something incorrectly?

    Please suggest.

    • Jason Brownlee July 19, 2018 at 7:49 am #

      I’m surprised. I would expect it to be zero, perhaps confirm that you have not introduced a bug or typo?

  10. vivek September 13, 2018 at 11:29 pm #

    If i have target values in different range for prediction using regression with deep neural network will it be helpful to get better accuracy by doing normalization of target values?If yes, then which technique i should use for that.

    • Jason Brownlee September 14, 2018 at 6:37 am #

      Yes, perhaps try normalize and standardize and see which results in better skill compared to no rescaling.

  11. Abhishek Shankar September 19, 2018 at 4:43 am #

    When to normalize and when to standardize ??

    • Jason Brownlee September 19, 2018 at 6:27 am #

      Normalize when the data variables have different units.

      Standardize when a variable has a gaussian distribution.

  12. olufemi george January 1, 2019 at 2:01 am #

    i have a dummy question;

    there were 2 lines in your code.

    X = iris.data
    y = iris.target

    how does X and Y know what the dependent and independent variables are? Will this work on my own dataset, without me having to tell it ( doubt it).

    Thanks

  13. hannah February 19, 2019 at 7:52 pm #

    Hi Jason,

    First, I would like to say that I really appreciate your blog posts, they’re helping me a lot!

    I have a problem with reproducing a normalization/standardization from the article. The authors wrote that “(…)Magnitudes are scaled logarithmically. The features are normalized per frequency band to zero mean and unit variance.” – do they mean standardize, so is it enough to simply use preprocessing.scale ?

    To make it more confusing, I found another group which is reproducing the results from the first one. They wrote that “The logarithm of the normalized sum magnitude of the filter bank energies is computed for each window. These features were normalized to range between 0 and 1 before feeding to the network input.” And for me that looks like normal normalization.

    Is it the same process described? I doubt…

    Thanks a lot for your help!

    Hannah

    • Jason Brownlee February 20, 2019 at 8:00 am #

      Perhaps contact the authors directly and ask exactly what they did?

      Unless academics release code used to produce the results, their papers are a waste of time in the best case or fraud in the worst case.

  14. Murali krishna March 25, 2019 at 1:16 pm #

    Hi Jason,

    A quick question on standardization, lets say I have built a model on a selected sample data from entire population and standardized the values before running through a model. So, can I directly use these beta coefficients on the entire population or since I have found beta coefficients by standardizing the values should I standardize all the values in the entire population.

    • Jason Brownlee March 25, 2019 at 2:18 pm #

      I recommend estimating the coefficients from the training set and using them on all data going forward.

  15. Murali March 25, 2019 at 5:21 pm #

    Hey thanks for the reply, can you please elaborate on why should we be using estimates coefficients from training set and use them on all the data

  16. Anna May 20, 2019 at 1:03 pm #

    I am trying to code LSTM for household_power_consumption_days.csv data by using Pytorch. So, should I normalize or rescale data?

    • Jason Brownlee May 20, 2019 at 2:37 pm #

      Sorry, I don’t have any examples for Pytorch.

      Scaling data prior to modeling is a good practice.

  17. Liten May 29, 2019 at 12:49 am #

    Hello Jason,

    In order to train an SVM classifier, should the data be scaled to [0,1 ] or [-1, 1]?
    Thanks in advance

    • Jason Brownlee May 29, 2019 at 8:45 am #

      The targets need to be {-1, 1} I believe. But sklearn will do this for you – from memory

  18. MWh August 15, 2019 at 9:46 pm #

    Thanks Jason,

    Do i need normalising/scaling if i only have 1 feature?
    My data has x and y only, where y is the dependent variable. I am working on Random forest regression.

  19. jeff August 15, 2019 at 10:53 pm #

    How can I normalize a dataset with text values to numbers properly in sklearn? My dataset has train and test parts and I'm getting a different number of columns after normalization. How can I normalize them equally?

  20. Bilal August 27, 2019 at 7:17 pm #

    is normalization required for Decision Tree Algorithm

  21. Amit Yadav September 24, 2019 at 7:12 pm #

    Thank you very much for writing this article. When to use normalisation and when to use standardisation? I went through an article and they said “We can use Normalisation if we want to rescale every observation of dataset, We can use standardisation if we want to rescale by features in data sets.”

    • Jason Brownlee September 25, 2019 at 5:56 am #

      I’m happy that it helped.

      Normalize generally, and use standardization when a variable is gaussian.

      If in doubt, test both and use whichever results in a model with the best skill.
