
Rescaling Data for Machine Learning in Python with Scikit-Learn

Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation.

In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update: See this post for a more up-to-date set of examples.

Data Rescaling
Photo by Quinn Dombrowski, some rights reserved.

Data Rescaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities, such as dollars, kilograms and sales volume.

Many machine learning methods expect, or are more effective when, the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.


Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.

It is useful to scale the input attributes for a model that relies on the magnitude of values, such as distance measures used in k-nearest neighbors and in the preparation of coefficients in regression.

The example below demonstrates data normalization of the Iris flowers dataset.
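
A minimal sketch of such an example, assuming the Iris dataset bundled with scikit-learn and the preprocessing.normalize function:

# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()
# separate the data from the target attributes
X = iris.data
y = iris.target
# normalize the data attributes
# note: normalize() works row-wise by default (axis=1);
# pass axis=0 to normalize each feature (column) instead
normalized_X = preprocessing.normalize(X)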

For more information see the normalize function in the API documentation.

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

It is useful to standardize attributes for a model that relies on the distribution of attributes, such as Gaussian processes.

The example below demonstrates data standardization of the Iris flowers dataset.
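
A minimal sketch of such an example, again assuming the bundled Iris dataset, this time with the preprocessing.scale function:

# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()
# separate the data from the target attributes
X = iris.data
y = iris.target
# standardize the data attributes: each column is rescaled
# to zero mean and unit variance
standardized_X = preprocessing.scale(X)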

For more information see the scale function in the API documentation.

Tip: Which Method To Use

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.

A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data with given models, and which rescaling method may be worthy of further investigation. A sketch of this idea follows.
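
As a sketch of that idea, assuming the Iris dataset and k-nearest neighbors as the algorithm being spot checked:

# Race raw, normalized, and standardized copies of a dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# create rescaled copies of the dataset
candidates = {
    'raw': X,
    'normalized': preprocessing.normalize(X),
    'standardized': preprocessing.scale(X),
}
# evaluate the same model on each copy with cross-validation
for name, data in candidates.items():
    scores = cross_val_score(KNeighborsClassifier(), data, y, cv=5)
    print(name, scores.mean())

For a stricter comparison, put the rescaling step inside a Pipeline so that it is fit on the training folds only.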

Summary

Data rescaling is an important part of data preparation before applying machine learning algorithms.

In this post you discovered where data rescaling fits into the process of applied machine learning, and two methods, normalization and standardization, that you can use to rescale your data in Python using the scikit-learn library.

Update: See this post for a more up-to-date set of examples.


100 Responses to Rescaling Data for Machine Learning in Python with Scikit-Learn

  1. Matt July 10, 2016 at 7:13 pm

    Hi, thank you for this nice article! I have a quick question (maybe long to explain, but I think the answer is short 🙂 )
    I am using data standardization for a k-NN algorithm. I will have new instances and I'll have to determine their class; their data won't be standardized.
    Should I:
    1) Standardize my training set with the scale function, then manually calculate the means and the stds of my training set to standardize my new vector.
    2) Add the new data to the training set and then standardize the set with the mean function.
    3) Neither.
    Thank you!

  2. Tzu-Yen June 16, 2017 at 6:50 am

    I came across your website which is extremely helpful for studying machine learning. Thanks for the great effort.

    Just a friendly reminder: the normalize function has an axis parameter with a default value equal to 1, so it will run on rows/data by default. For feature normalization, you need to set axis=0.

    • Jason Brownlee June 16, 2017 at 8:12 am

      Thanks Tzu-Yen!

    • Abel August 17, 2017 at 7:09 am

      Hey Tzu-Yen, you saved my day…. Thanks a lot

  3. Shud November 1, 2017 at 4:00 pm

    Hi Jason,

    May I know how to bring the data back to the original scale? I need my predictions in the original scale. I normalised my data and tried .inverse_transform(data) to get back my original data, but it gave me an error: AttributeError: 'Normalizer' object has no attribute 'inverse_transform'.
    Any kind of help would be appreciated.

  4. Sumit November 17, 2017 at 3:37 pm

    Hi Jason,
    I have one question.
    Suppose today my data range is 5000 to 10000, which will be scaled to between 0 and 1.
    And tomorrow a new data entry comes in at 10500. If I use the same 5000 to 10000 range for fitting, it produces output X1, and if I specify a 5000 to 10500 range, it produces output X2, which is not equal to X1.

    How do I overcome this? How do I handle new data with the old range?

    Thanks & Regards
    Sumit

    • Jason Brownlee November 18, 2017 at 10:12 am

      You can estimate the expected range of data in the future and use those min/max to scale.

      Or, you can estimate a mean/stdev and standardize instead, if the data is Gaussian.
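
      As a sketch of the first option, with hypothetical bounds standing in for your estimates:

      # Scale with an estimated future range instead of the observed one.
      # The 5000/11000 bounds are hypothetical estimates, not fixed values.
      expected_min, expected_max = 5000.0, 11000.0

      def rescale(value):
          # map the estimated range onto [0, 1]
          return (value - expected_min) / (expected_max - expected_min)

      print(rescale(10500.0))  # a new entry above the old observed maximum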

  5. Rizwan Mian January 1, 2018 at 10:11 am

    Hi Jason, when I use standardization as suggested in the post, I see a mean and standard deviation very close to zero and one, respectively, but not exactly. I wonder whether such close-enough values are acceptable in the community?

    count 7.68E+02
    mean -6.48E-17
    std 1.00E+00
    min -1.14E+00
    25% -8.45E-01
    50% -2.51E-01
    75% 6.40E-01
    max 3.91E+00

  6. Suranga April 8, 2018 at 11:25 pm

    Hi, can you tell me the formula behind preprocessing.scale()?

  7. neethu May 16, 2018 at 9:05 pm

    While I am training the dataset I am getting an accuracy of around 100%, but when I am testing I am not getting the proper answer. What could be the reason? Can you please help me with it?

    • Jason Brownlee May 17, 2018 at 6:31 am

      Sounds like you are overfitting the training dataset.

  8. Hagais June 27, 2018 at 6:01 pm

    Hi Mr. Jason Brownlee, I want to ask how to print the values back in their original scale using the normalizer.

    • Jason Brownlee June 28, 2018 at 6:14 am

      You can invert the scaling transforms by calling scaler.inverse_transform().

  9. Nitin vij July 18, 2018 at 8:27 pm

    Hi Jason,

    I tried to use standardized_X = preprocessing.scale(X).
    I checked with print(np.mean(standardized_X[:,0])), but the mean is not zero (for any of the four columns), although the standard deviation is one. Am I doing something incorrectly?

    Please suggest.

    • Jason Brownlee July 19, 2018 at 7:49 am

      I’m surprised. I would expect it to be zero, perhaps confirm that you have not introduced a bug or typo?

  10. vivek September 13, 2018 at 11:29 pm

    If I have target values in a different range for prediction using regression with a deep neural network, will it be helpful for accuracy to normalize the target values? If yes, which technique should I use?

    • Jason Brownlee September 14, 2018 at 6:37 am

      Yes, perhaps try normalizing and standardizing and see which results in better skill compared to no rescaling.

  11. Abhishek Shankar September 19, 2018 at 4:43 am

    When should I normalize and when should I standardize?

    • Jason Brownlee September 19, 2018 at 6:27 am

      Normalize when the data variables have different units.

      Standardize when a variable has a Gaussian distribution.

  12. olufemi george January 1, 2019 at 2:01 am

    I have a dummy question.

    There were two lines in your code:

    X = iris.data
    y = iris.target

    How do X and y know what the dependent and independent variables are? Will this work on my own dataset, without me having to tell it (I doubt it)?

    Thanks

  13. hannah February 19, 2019 at 7:52 pm

    Hi Jason,

    First, I would like to say that I really appreciate your blog posts, they’re helping me a lot!

    I have a problem with reproducing a normalization/standardization from the article. The authors wrote that "(…) Magnitudes are scaled logarithmically. The features are normalized per frequency band to zero mean and unit variance." Do they mean standardize, so is it enough to simply use preprocessing.scale?

    To make it more confusing, I found another group which is reproducing the results of the first one. They wrote that "The logarithm of the normalized sum magnitude of the filter bank energies is computed for each window. These features were normalized to range between 0 and 1 before feeding to the network input." And to me that looks like normal normalization.

    Is it the same process being described? I doubt it…

    Thanks a lot for your help!

    Hannah

    • Jason Brownlee February 20, 2019 at 8:00 am

      Perhaps contact the authors directly and ask exactly what they did?

      Unless academics release code used to produce the results, their papers are a waste of time in the best case or fraud in the worst case.

  14. Murali krishna March 25, 2019 at 1:16 pm

    Hi Jason,

    A quick question on standardization: let's say I have built a model on a selected sample of data from the entire population and standardized the values before running them through the model. Can I directly use these beta coefficients on the entire population, or, since I found the beta coefficients by standardizing the values, should I standardize all the values in the entire population?

    • Jason Brownlee March 25, 2019 at 2:18 pm

      I recommend estimating the coefficients from the training set and using them on all data going forward.
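
      A minimal sketch of that workflow, with hypothetical arrays standing in for the sample and the later data:

      import numpy as np
      from sklearn.preprocessing import StandardScaler

      X_train = np.array([[5000.0], [7500.0], [10000.0]])  # hypothetical training sample
      X_new = np.array([[9000.0]])                         # hypothetical later data

      scaler = StandardScaler()
      scaler.fit(X_train)                  # estimate mean/std from the training set only
      X_train_std = scaler.transform(X_train)
      X_new_std = scaler.transform(X_new)  # reuse the same statistics on all data going forward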

  15. Murali March 25, 2019 at 5:21 pm

    Hey, thanks for the reply. Can you please elaborate on why we should estimate the coefficients from the training set and use them on all the data?

  16. Anna May 20, 2019 at 1:03 pm

    I am trying to code LSTM for household_power_consumption_days.csv data by using Pytorch. So, should I normalize or rescale data?

    • Jason Brownlee May 20, 2019 at 2:37 pm

      Sorry, I don’t have any examples for Pytorch.

      Scaling data prior to modeling is a good practice.

  17. Liten May 29, 2019 at 12:49 am

    Hello Jason,

    In order to train an SVM classifier, should the data be scaled to [0, 1] or [-1, 1]?
    Thanks in advance

    • Jason Brownlee May 29, 2019 at 8:45 am

      The targets need to be {-1, 1}, I believe. But sklearn will do this for you (from memory).

  18. MWh August 15, 2019 at 9:46 pm

    Thanks Jason,

    Do I need normalising/scaling if I only have one feature?
    My data has x and y only, where y is the dependent variable. I am working on random forest regression.

  19. jeff August 15, 2019 at 10:53 pm

    How can I properly normalize a dataset with text values to numbers in sklearn? My dataset has train and test parts, and I'm getting a different number of columns after normalization. How can I normalize them equally?

  20. Bilal August 27, 2019 at 7:17 pm

    Is normalization required for the decision tree algorithm?

  21. Amit Yadav September 24, 2019 at 7:12 pm

    Thank you very much for writing this article. When should we use normalisation and when should we use standardisation? I went through an article where they said, "We can use normalisation if we want to rescale every observation of the dataset; we can use standardisation if we want to rescale by features in the dataset."

    • Jason Brownlee September 25, 2019 at 5:56 am

      I’m happy that it helped.

      Normalize generally, and use standardization when a variable is Gaussian.

      If in doubt, test both and use whichever results in a model with the best skill.

  22. Marcos Cesar M. Pablos November 3, 2019 at 6:09 am

    Hello Jason, thanks for all the information; let's see if you can help me.

    I scale my predictors, and when I make a prediction the result comes out in scientific notation. Researching, I saw that there is a way to do an inverse_transform to reverse the scaling process. I tried, but I failed to successfully reverse it. Can you help me?
    Below is the test code.

    import pandas as pd
    import numpy as np

    base = pd.read_csv('c://udemy//ia//bd//house-prices.csv')
    X = base.iloc[:, 3:19].values
    y = base.iloc[:, 2].values

    from sklearn.model_selection import train_test_split
    X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(
        X, y, test_size=0.3, random_state=0)

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_treinamento = scaler.fit_transform(X_treinamento)

    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    poly = PolynomialFeatures(degree=4)
    X_treinamento_poly = poly.fit_transform(X_treinamento)
    X_teste_poly = poly.transform(X_teste)

    regressor = LinearRegression()
    regressor.fit(X_treinamento_poly, y_treinamento)
    score = regressor.score(X_treinamento_poly, y_treinamento)

    previsoes = regressor.predict(X_teste_poly)

    previsoes = scaler.inverse_transform(previsoes)  # this is not working

  23. Marcos Cesar M. Pablos November 3, 2019 at 8:26 am

    Hello Jason,
    Thanks for the tip. Reading what you told me, I saw that I can use inverse_transform, but applying it I get an error, which seems basic and easy to solve, but I am not managing it. It gives the error:

    ValueError: operands could not be broadcast together with shapes (6484,) (16,) (6484,)

    I tried to apply reshape to the predictions variable, but it doesn't work at all.

    Any suggestions?

    • Jason Brownlee November 4, 2019 at 6:35 am

      The shape of the data and order of the columns in the data must be identical when calling transform() and inverse_transform().

      Perhaps check this.

  24. Marcos Cesar M. Pablos November 4, 2019 at 9:20 am

    Got it, it is working now, thanks!

    import pandas as pd

    base = pd.read_csv('c://udemy//ia//bd//house-prices.csv')
    X = base.iloc[:, 3:19].values
    y = base.iloc[:, 2:3].values

    from sklearn.preprocessing import StandardScaler
    scaler_x = StandardScaler()
    X = scaler_x.fit_transform(X)
    scaler_y = StandardScaler()
    y = scaler_y.fit_transform(y)

    from sklearn.model_selection import train_test_split
    X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(
        X, y, test_size=0.3, random_state=0)

    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    poly = PolynomialFeatures(degree=4)
    X_treinamento_poly = poly.fit_transform(X_treinamento)
    X_teste_poly = poly.transform(X_teste)

    regressor = LinearRegression()
    regressor.fit(X_treinamento_poly, y_treinamento)
    score = regressor.score(X_treinamento_poly, y_treinamento)

    # predictions with the scaling reversed
    previsoes1 = scaler_y.inverse_transform(regressor.predict(X_teste_poly))

    y_teste = scaler_y.inverse_transform(y_teste)

    from sklearn.metrics import mean_absolute_error
    mae = mean_absolute_error(y_teste, previsoes1)

    # testing with y_teste and the predictions still scaled
    previsoes = regressor.predict(X_teste_poly)

    scaler_teste = StandardScaler()
    y_teste = scaler_teste.fit_transform(y_teste)

    from sklearn.metrics import mean_absolute_error
    mae = mean_absolute_error(y_teste, previsoes)

  25. Sia November 12, 2019 at 12:47 am

    Sir, how do I get a mean of 0 and a standard deviation of 1 for a given dataset in Python?

  26. Vibhaas January 20, 2020 at 11:07 pm

    Good article, Jason. I learned a few points from your post and took it forward by implementing an ML model to compare the impact. Here is the article: https://medium.com/@vibhaas.kotwal/feature-scaling-8c92bdd080a1
    Please let me know your comments.

    • Jason Brownlee January 21, 2020 at 7:13 am

      Thanks.

      Sorry, I don’t have the capacity to review your piece.

  27. Rana January 29, 2020 at 12:19 am

    Sir,
    I get intercept_ = 3.1378, 93 support_vectors_, and 93 dual_coef_;
    how can I get the hyperplane equation of a polynomial SVM in Python?
    Thank you.

    • Jason Brownlee January 29, 2020 at 6:39 am

      I believe you can retrieve all coefficients from the sklearn API.

  28. Akram March 31, 2020 at 8:53 am

    Dear Jason,
    I have a CSV dataset which needs to be normalized. However, I just need to normalize all columns except the target; how can I do that?

    import pandas as pd
    from sklearn import preprocessing

    data = pd.read_csv('File.csv')
    min_max_scaler = preprocessing.MinMaxScaler()
    np_scaled = min_max_scaler.fit_transform(data)
    df_normalized = pd.DataFrame(np_scaled)
    df_normalized.to_csv('Norm_File.csv', header=True, index=False)

    Thanks a lot.

    • Jason Brownlee March 31, 2020 at 1:33 pm

      Not sure you can use the scaler directly on dataframes, perhaps extract the numpy array from them first?

  29. Nhu April 18, 2020 at 7:12 pm

    hi

    • Jason Brownlee April 19, 2020 at 5:53 am

      Hi, do you have any questions about this tutorial?

  30. Nhu April 18, 2020 at 7:21 pm

    hi

    I have a dataset including 3 columns: recency, frequency, monetary. Should I preprocess the data by standardizing or normalizing? (I use the dataset for k-means.)
    Please help me; sorry, my English is not good.
    Thank you so much.

  31. nhu April 18, 2020 at 7:25 pm

    I am only starting to learn.
    I wrote the code below; please can you help me check whether it is right or wrong?
    I did the preprocessing.
    ***********
    from sklearn.cluster import KMeans
    import pandas as pd
    import matplotlib.pyplot as plt

    sse = {}

    # load our data from CSV
    tx_user = pd.read_csv('rfm_data.csv', sep=',', engine='python')

    # display(tx_user[['M']].boxplot())
    # PRE-PROCESSING -----------------------------------------------
    col_names = ['R', 'F', 'M']

    # Step 1: Rescale Data
    #from sklearn.preprocessing import MinMaxScaler
    #min_max_scaler = MinMaxScaler()
    #tx_user[col_names] = min_max_scaler.fit_transform(tx_user[col_names])

    # Step 2: Standardize Data
    from sklearn.preprocessing import StandardScaler
    standard_scaler = StandardScaler()
    tx_user[col_names] = pd.read_csv('rfm_data.csv', sep=',', engine='python')
    tx_user[col_names] = standard_scaler.fit_transform(tx_user[col_names])

    # Step 3: Normalize Data
    #from sklearn.preprocessing import Normalizer
    #normalizer = Normalizer()
    #tx_user[col_names] = normalizer.fit_transform(tx_user[col_names])

    #print('Descriptive statistics of preprocessed data: ')
    #display(tx_user.describe())
    # END OF PRE-PROCESSING ----------------------------------------

    for k in range(1, 10):
        kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tx_user[['R', 'F', 'M']])
        tx_user["clusters"] = kmeans.labels_
        sse[k] = kmeans.inertia_
    print('\n \n Sum of squared distances of samples to their closest cluster center: \n')

    df_sse = pd.DataFrame(sse.items(), columns=['K Cluster', 'Sum of Squared Errors'])
    display(df_sse)

    keys = list(sse.keys())
    values = list(sse.values())

    plt.figure()
    plt.plot(keys, values)
    plt.xlabel("Number of cluster - Kmean on dataraw _ group by")

    # Add title and axis names
    plt.title('Within-Cluster-Sum of Squared Errors (WSS) for different values of k')
    plt.xlabel('K cluster')
    plt.ylabel('Sum of Squared Errors (WSS)')

    plt.show()

  32. Nhu April 19, 2020 at 1:56 pm

    I want to cluster customers. I have used two methods on the same dataset: one method I coded manually using the RFM model from economics; the other uses clustering with the k-means algorithm. Now I want to compare which method is better, but I am having trouble: I still haven't figured out which method gives better results.

    • Jason Brownlee April 20, 2020 at 5:22 am

      Select a metric, design a test harness, then apply both methods in the test harness to see which does better on your metric.
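
      As a sketch, one such metric is the silhouette score, applied to the labels from each method (the data and the rule-based labels here are hypothetical stand-ins):

      from sklearn.cluster import KMeans
      from sklearn.datasets import make_blobs
      from sklearn.metrics import silhouette_score

      # hypothetical stand-in for scaled RFM features
      X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

      # method 1: k-means cluster assignments
      labels_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
      # method 2: a hypothetical manual rule, e.g. a split on one feature's mean
      labels_rules = (X[:, 0] > X[:, 0].mean()).astype(int)

      # the same metric scores both methods; higher is better
      print('k-means   :', silhouette_score(X, labels_kmeans))
      print('rule-based:', silhouette_score(X, labels_rules))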

  33. Nhu April 20, 2020 at 7:28 pm

    Thank u so much

  34. Nhu April 27, 2020 at 5:12 am

    Hi
    After I have clustered the clients (customer segmentation), I want to visualize those clusters. What do I have to do?

  35. Nhu April 28, 2020 at 3:22 am

    Yes, I want pair-wise scatter plots.

  36. Nhu April 28, 2020 at 3:25 am

    I wrote the code below, but I got "AttributeError: 'KMeans' object has no attribute 'labels'" and I still cannot fix it.
    # Modules
    import matplotlib.pyplot as plt
    from matplotlib.image import imread
    import pandas as pd
    import seaborn as sns
    from sklearn.datasets.samples_generator import (make_blobs,
                                                    make_circles,
                                                    make_moons)
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import silhouette_samples, silhouette_score

    %matplotlib inline
    sns.set_context('notebook')
    plt.style.use('fivethirtyeight')
    from warnings import filterwarnings
    filterwarnings('ignore')
    # Import the data
    df = pd.read_csv('Visualization_1.csv')

    # Plot the data
    plt.figure(figsize=(6, 6))
    plt.scatter(df.iloc[:, 0], df.iloc[:, 1])
    plt.xlabel('Eruption time in mins')
    plt.ylabel('Waiting time to next eruption')
    plt.title('Visualization of raw data');

    # Standardize the data
    X_std = StandardScaler().fit_transform(df)

    # Run local implementation of kmeans
    def cluster(n_clusters):
        km = Kmeans(n_clusters=2, max_iter=100)
        km.fit(X_std)
        centroids = km.centroids

    # Plot the clustered data
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(X_std[km.labels == 0, 0], X_std[km.labels == 0, 1],
                c='green', label='cluster 1')
    plt.scatter(X_std[km.labels == 1, 0], X_std[km.labels == 1, 1],
                c='blue', label='cluster 2')
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=300,
                c='r', label='centroid')
    plt.legend()
    plt.xlim([-2, 2])
    plt.ylim([-2, 2])
    plt.xlabel('Eruption time in mins')
    plt.ylabel('Waiting time to next eruption')
    plt.title('Visualization of clustered data', fontweight='bold')
    ax.set_aspect('equal');

  37. Nhu April 30, 2020 at 1:40 am

    thank u so much so much

  38. meems May 19, 2020 at 1:45 am

    Hello, I am new to all this, but I was given a task I am not sure how to do.
    Normalize data with pandas:
    a. Subtract the mean value of each feature from the dataset.
    b. After subtracting the mean, additionally scale (divide) the feature values by their respective standard deviations.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn import preprocessing
    # step 1
    col_names = ["Size", "Bedrooms", "Price"]  # name cols
    # importing data
    df2 = pd.read_csv("dataset2.txt", header=None, skiprows=0, names=col_names)
    # print first 5 elements of DataFrame
    print(df2.head())
    print(df2.describe())  # show some stats

    I have no idea how to subtract the means and the stds; please can you show me how?

  39. Malik Elam June 8, 2020 at 4:23 am

    Hi Jason, thanks for this article.
    Do you recommend using a sigmoid function transformation of an input feature to bring outliers closer to the bulk of the other values, given that the outliers are extreme values that are not errors? I.e. -1 + 2/(1 + e^(-ax)); this would replace outliers and standardize the data.

    • Jason Brownlee June 8, 2020 at 6:19 am

      Not really. Perhaps try it and see if it is appropriate for your data/model/project.

  40. sarah June 10, 2020 at 2:13 am

    Hi Jason,

    Do you have a tutorial on how to normalize/scale multivariate time series data?

    Thanks for all of your valuable guides,

  41. Bilal July 13, 2020 at 1:35 pm

    Hi,
    Please tell me the formula behind preprocessing.normalize().

  42. Dmitry August 13, 2020 at 2:25 am

    What I don't understand is how to unscale the predicted dataset, since it has different dimensions than the training features dataset.

    • Jason Brownlee August 13, 2020 at 6:19 am

      Use the scaler object and call the inverse_transform() function and pass in the predictions.

      The scaler for the target takes one column for y or yhat – the same dimensions.
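
      A minimal sketch of that idea, assuming a separate MinMaxScaler was fit on the one-column target:

      import numpy as np
      from sklearn.preprocessing import MinMaxScaler

      # fit a dedicated scaler on the (hypothetical) one-column training target
      y_train = np.array([[50.0], [75.0], [100.0]])
      y_scaler = MinMaxScaler()
      y_scaler.fit(y_train)

      # a model trained on scaled targets produces scaled predictions (yhat)
      yhat_scaled = np.array([[0.4], [0.9]])  # hypothetical model output
      # invert the transform: yhat has the same single-column shape as y
      yhat = y_scaler.inverse_transform(yhat_scaled)
      print(yhat)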

  43. Jaydeep Chauhan October 9, 2020 at 11:16 am

    Hi all,
    Can you please help me to understand how the mean files are calculated in the following repository?
    I am getting different values for the same dataset.

    https://github.com/Veleslavia/EUSIPCO2017/tree/master/means

  44. Yao October 13, 2020 at 4:52 pm

    Hi Jason,
    Thank you so much for the post and I found it very helpful.

    I've got a question on how to use a scaler. For example, I'm working on a regression problem and I have my input X (10 columns, 100k rows) and target y (1 column, 100k rows) as my training dataset. I used two StandardScalers, X_scaler and y_scaler, to scale X and y respectively. Since I used the scaled X and y for model fitting, I would expect the prediction result y_hat to be scaled as well. For training I can easily invert my prediction to its original scale since I have the y_scaler. However, for real prediction, I only have my new dataset X'. If I scale my X' and put it into my model, I would expect to get a scaled prediction. But how can I invert my prediction to its original scale, since it may differ from the training y?

    Could you please help me with this? Thank you so much!

    • Jason Brownlee October 14, 2020 at 6:13 am

      You’re welcome.

      Yes, you must scale all new data in the same way as the training data, e.g. the input data. You can also scale the target in the training data, and the model will learn to predict scaled targets. You can then invert the transform on the predictions to get the original scale.

      You must ensure your training dataset is sufficiently representative of the data, so that the model learns the problem and the transform captures the scale of the data. If this is challenging, you can manage the scaling/transform of the data manually (e.g. clip values out of range in new data).

      Does that help?

  45. Fatima March 20, 2021 at 12:35 pm

    Hi Dr. Jason, here in the Data Normalization example, to normalize the data attributes:
    normalized_X = preprocessing.normalize(X)

    What function or rule does normalize apply to do the normalization (rescaling)? Can I get detailed information about this technique?

    Thanks a lot.

    • Jason Brownlee March 21, 2021 at 6:05 am

      Use normalization if it results in better model performance than not using it. That is the very best rule.

  46. San May 11, 2021 at 2:34 am

    I have a dataset that I have to normalize, as the data is on different scales. Now, is there a way to get back to the original data, i.e., kind of denormalizing and going back to the data with the different scales?

    • Jason Brownlee May 11, 2021 at 6:44 am

      Yes, you can invert the transform using the scaler objects directly, e.g. scaler.inverse_transform()

  47. Hammed May 13, 2021 at 10:06 am

    Good day, my brother and I love your book and how you simplified a lot of things in it.

    So my question is on normalization of a dataset. I have a large dataset, which has an amount column that goes literally from -574617714.32 to 600000000.0, and about 10 million transactions (and this is just the sample).

    I normalized the amount and also used a categorical encoder on some other features. But my problem arises when I want to predict the outcome of new data which has not been normalized or encoded. For encoding the data, I used a dictionary to store the encoded values, which I then use to swap out the values in the new data to be predicted with the corresponding values in the dictionary.

    If I am to normalize the amount with sklearn, it would only normalize the new data to be predicted, not taking into account the earlier min/max. If I add two rows to indicate the min and max, the results are still different.

    I tried using "MaxAbsScaler().fit(dataset[['AMOUNT']]).max_abs_" to retrieve the fit parameter and store it in either JSON or CSV, then use it to transform the amount column of the new data; that still did not work.

    My question: how do I store the parameters used in the normalization of the training data, so I can use them to transform the new data to be predicted?

    Or do I just use this, as you showed in one of your books:

    "dataset['AMOUNT'].apply(lambda x: (x - minmax[0]) / (minmax[1] - minmax[0]))"

    and then define min and max myself? I am not actually confident with this, as the change in the normalized amount is rather insignificant compared to that of MaxAbsScaler.

    Another question: as the column contains amounts and its contents vary from negative to positive, which is best for optimal results, normalization or standardization?

    Thanks.

  48. Hammed May 14, 2021 at 8:32 pm

    Thank you, would try it out.

  49. Ayesha July 14, 2021 at 5:11 am

    Hi, great article!
    I need a bit of guidance regarding scaling.
    I am training a neural net. I used scaling in this way:

    from sklearn import preprocessing
    scaler = preprocessing.StandardScaler().fit(x_train)
    X_train = scaler.transform(x_train)
    X_test=scaler.transform(Xv)
    scalerY = preprocessing.StandardScaler().fit(y_train.values.reshape(-1, 1))
    Y_train = scalerY.transform(y_train.values.reshape(-1, 1))
    Y_test=scalerY.transform(yv.values.reshape(-1, 1))

    But I am getting confused about whether this is the right or wrong technique. This way of scaling gives me better results, but when I scale the data using StandardScaler, I got much worse results. Can you please guide me in this regard? Thanks.
