Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation.

In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn.

Let’s get started.

**Update**: See this post for a more up to date set of examples.

## Data Rescaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.

Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.

## Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.

It is useful to scale the input attributes for a model that relies on the magnitude of values, such as distance measures used in k-nearest neighbors and in the preparation of coefficients in regression.

The example below demonstrates data normalization of the Iris flowers dataset.

```python
# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data from the target attributes
X = iris.data
y = iris.target
# normalize the data attributes
# (with the default axis=1, each row is rescaled to unit L2 norm)
normalized_X = preprocessing.normalize(X)
```

For more information see the normalize function in the API documentation.
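With its default arguments, `preprocessing.normalize` rescales each row to unit norm rather than mapping each attribute into the range 0 to 1. If the per-attribute 0 to 1 range described above is what you want, one option is scikit-learn's `MinMaxScaler`. The snippet below is a minimal sketch of that alternative, not the method used in the original example; the variable name `rescaled_X` is only illustrative.

```python
# Rescale each Iris attribute (column) into the range [0, 1].
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

scaler = MinMaxScaler()                # default feature_range is (0, 1)
rescaled_X = scaler.fit_transform(X)   # learns per-column min/max, then rescales

print(rescaled_X.min(axis=0))          # smallest value in each column
print(rescaled_X.max(axis=0))          # largest value in each column
```

Because the scaler object keeps the learned minimum and maximum of each column, the same object can later be reused on new data.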

## Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.

The example below demonstrates data standardization of the Iris flowers dataset.

```python
# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data and target attributes
X = iris.data
y = iris.target
# standardize the data attributes
standardized_X = preprocessing.scale(X)
```

For more information see the scale function in the API documentation.

## Tip: Which Method To Use

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.

A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data with given models, and which rescaling method may be worthy of further investigation.
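As a rough sketch of such a race, the snippet below compares the raw data, a min-max rescaled version and a standardized version with k-nearest neighbors under 5-fold cross-validation. The choice of algorithm, the number of folds and the names `candidates` and `results` are assumptions made for illustration; swap in your own test harness and models.

```python
# Spot check: does rescaling help k-nearest neighbors on the Iris dataset?
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)

# one pipeline per rescaling option, so the scaler is re-fit inside every fold
candidates = {
    "raw": make_pipeline(KNeighborsClassifier()),
    "min-max": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "standardized": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # accuracy per fold
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

On a small, well-behaved dataset like Iris the differences may be minor; the point is the comparison procedure, not the specific numbers.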

## Summary

Data rescaling is an important part of data preparation before applying machine learning algorithms.

In this post you discovered where data rescaling fits into the process of applied machine learning, along with two methods, normalization and standardization, that you can use to rescale your data in Python using the scikit-learn library.

Hi thank you for this nice article ! I have a quick question (maybe long to explain but I think the answer is short 🙂 )

I am using data standardization for a k-NN algorithm. I will have new instances and I’ll have to determine their class, their data won’t be standardized.

Should I :

1) Standardize my training set with the scale function. Then manually calculate the means and the std of my training set to standardize my new vector.

2) Add the new data to the training set and then standardize the set with the mean function.

3) Neither

Thank you !

Great question Matt.

Generally, it is a good idea to calculate the rescaling parameters using the training data and use those parameters to rescale the test dataset and all datasets that need predictions going forward.

You can and should use the StandardScaler() object that has the means and standard deviations inside.

See this post for a more up to date example:

http://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/
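A minimal sketch of that advice: fit a `StandardScaler` on the training data only, then reuse it on new instances. The tiny arrays below are made-up values used purely for illustration.

```python
# Fit the scaler on training data, then reuse it for new instances.
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up training data and one new, unscaled instance
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_new = np.array([[2.5, 250.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # learns mean and std from training data
X_new_std = scaler.transform(X_new)           # applies the stored training statistics

print(scaler.mean_)    # per-column means learned from X_train
print(scaler.scale_)   # per-column standard deviations learned from X_train
print(X_new_std)
```

Calling `transform` (not `fit_transform`) on the new data is what keeps the training statistics in use.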

I came across your website which is extremely helpful for studying machine learning. Thanks for the great effort.

Just a friendly reminder. The normalize function has an axis parameter with a default value of 1, so it runs on rows (samples) by default. For feature normalization, you need to set axis=0.

Thanks Tzu-Yen!

Hey Tzu-Yen, you saved my day…. Thanks a lot
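The difference between the two `axis` settings can be seen on a small made-up array. This is only a sketch to illustrate the point raised in this thread.

```python
# Compare row-wise (default) and column-wise normalization.
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

by_row = normalize(X)           # axis=1 (default): each row gets unit L2 norm
by_col = normalize(X, axis=0)   # axis=0: each column (feature) gets unit L2 norm

print(by_row)
print(by_col)
print(np.linalg.norm(by_row, axis=1))   # row norms after row-wise normalization
print(np.linalg.norm(by_col, axis=0))   # column norms after column-wise normalization
```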

Hi Jason,

May I know how to bring the data back to the original scale? I need my predictions in the original scale. I normalised my data and tried .inverse_transform(data) to get back my original data. But it gave me an error – AttributeError: ‘Normalizer’ object has no attribute ‘inverse_transform’

Any kind of help would be appreciated.

I would recommend using the MinMaxScaler instead:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
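A short sketch of the round trip with `MinMaxScaler`, using made-up values: the scaler stores the minimum and maximum it saw during fitting, which is what makes `inverse_transform` possible.

```python
# Scale data to [0, 1] and bring it back to the original scale.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[10.0], [20.0], [40.0]])   # made-up single-column data

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)          # values now lie in [0, 1]
restored = scaler.inverse_transform(scaled)  # back to the original units

print(scaled.ravel())
print(restored.ravel())
```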

Hi Jason,

I have one question.

Suppose today my data range is 5000 to 10000 which will be scaled between 0 to 1

And tomorrow a new data entry comes with 10500. If I use the same 5000 to 10000 range for fitting then it produces output X1, and if I specify a 5000 to 10500 range then it produces output X2, which is not equal to X1.

How to over come this? How to handle new data with old range?

Thanks & Regards

Sumit

You can estimate the expected range of data in the future and use those min/max values to scale.

Or, you can estimate a mean/stdev and standardize instead, if the data is Gaussian.
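To make the situation concrete, the sketch below fits a `MinMaxScaler` on the 5000 to 10000 range from the question and then transforms the new value 10500. Clipping the result back into [0, 1] with `np.clip` is just one possible way to handle it and is an assumption here, not a recommendation from the thread.

```python
# What happens when a new value falls outside the fitted min/max range?
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(np.array([[5000.0], [10000.0]]))   # range known at training time

new_value = scaler.transform(np.array([[10500.0]]))
print(new_value)   # slightly above 1, because 10500 exceeds the fitted maximum

# one option: clip scaled values into [0, 1] before passing them to the model
clipped = np.clip(new_value, 0.0, 1.0)
print(clipped)
```

The old data keeps the same scaled values either way, because the fitted range has not changed.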

Hi Jason, when I use standardization as suggested in the post, I see mean and standard deviation very close to zero and one, respectively…but not exactly. Wonder if such close-enough values are acceptable in the community?

count 7.68E+02

mean -6.48E-17

std 1.00E+00

min -1.14E+00

25% -8.45E-01

50% -2.51E-01

75% 6.40E-01

max 3.91E+00

Perhaps.
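A mean on the order of 1e-17 is floating-point rounding error rather than a real offset, so it is usually compared against zero with a tolerance. The sketch below shows one way to check this with `np.allclose` on the Iris data. Note also that pandas `describe()` reports the sample standard deviation (ddof=1), while scikit-learn's scaling uses the population standard deviation (ddof=0), so the two can differ slightly.

```python
# Check standardized data against 0 and 1 using a tolerance.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

standardized_X = scale(load_iris().data)

means = standardized_X.mean(axis=0)
stds = standardized_X.std(axis=0)   # population standard deviation (ddof=0)

print(means)                        # tiny values, not exactly zero
print(np.allclose(means, 0.0))      # compare with a tolerance instead of ==
print(np.allclose(stds, 1.0))
```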

Hi , can you tell me the formula behind preprocessing.scale()?

Sure, learn more here:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html

And here:

https://en.wikipedia.org/wiki/Standard_score
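As a sketch, the standard score z = (x - mean) / std can be computed by hand per column and compared with the output of `preprocessing.scale`. The variable name `manual` is only illustrative.

```python
# Compute the standard score by hand and compare with preprocessing.scale.
import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing

X = load_iris().data

# z = (x - mean) / std, computed separately for each column
manual = (X - X.mean(axis=0)) / X.std(axis=0)   # population std (ddof=0)

same = np.allclose(manual, preprocessing.scale(X))
print(same)
```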

While am training the data set am getting an accuracy around 100%.But when am testing am not getting the proper answer.what could be the reason?can you please help me with it

Sounds like you are overfitting the training dataset.

hai Mr. Jason Brownlee, I wanna ask about how to prints the values back in their original scale using normalizer

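You can invert a scaling transform by calling scaler.inverse_transform(). Note that this is available on scalers such as MinMaxScaler and StandardScaler, but not on the Normalizer.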

Hi Jason,

I tried to use standardized_X = preprocessing.scale(X).

I checked with print(np.mean(standardized_X[:,0])) but mean is not zero (for any of the four columns) although standard deviation is one. Am I doing something incorrectly.

Please suggest.

I’m surprised. I would expect it to be zero, perhaps confirm that you have not introduced a bug or typo?

If i have target values in different range for prediction using regression with deep neural network will it be helpful to get better accuracy by doing normalization of target values?If yes, then which technique i should use for that.

Yes, perhaps try normalize and standardize and see which results in better skill compared to no rescaling.

When to normalize and when to standardize ??

Normalize when the data variables have different units.

Standardize when a variable has a gaussian distribution.

i have a dummy question;

there were 2 lines in your code.

X = iris.data

y = iris.target

how does X and Y know what the dependent and independent variables are? Will this work on my own dataset, without me having to tell it ( doubt it).

Thanks

I am choosing the dependent/independent based on my knowledge of the problem.

Perhaps this post will help you with your problem:

http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Hi Jason,

First, I would like to say that I really appreciate your blog posts, they’re helping me a lot!

I have a problem with reproducing a normalization/standardization from the article. The authors wrote that “(…)Magnitudes are scaled logarithmically. The features are normalized per frequency band to zero mean and unit variance.” – do they mean standardize, so is it enought to simply use preprocessing.scale ?

To make it more confusing, I found another group which is reproducing the results from the first one. They wrote that “The logarithm of the normalized sum magnitude of the

filter bank energies is computed for each window. These features were normalized to range between 0 and 1 before feeding to the network input.” And for me that looks like normal normalization.

Is it the same process described? I doubt…

Thanks a lot for your help!

Hannah

Perhaps contact the authors directly and ask exactly what they did?

Unless academics release code used to produce the results, their papers are a waste of time in the best case or fraud in the worst case.

Hi Jason,

A quick question on standardization, lets say I have built a model on a selected sample data from entire population and standardized the values before running through a model. So, can I directly use these beta coefficients on the entire population or since I have found beta coefficients by standardizing the values should I standardize all the values in the entire population.

I recommend estimating the coefficients from the training set and using them on all data going forward.

Hey thanks for the reply, can you please elaborate on why should we be using estimates coefficients from training set and use them on all the data

If we estimate the distribution from all data, then evaluate the performance of the model on a subset of that data, we will be subject to data leakage and the results will be optimistic – we are using knowledge out side of the scope of the test.

More here:

https://machinelearningmastery.com/data-leakage-machine-learning/
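One common way to keep the scaling statistics confined to the training data is to wrap the scaler and the model in a scikit-learn `Pipeline`. The sketch below assumes the Iris data, a 70/30 split and a k-nearest neighbors model purely for illustration.

```python
# Keep scaling inside a Pipeline so it is fit on training data only.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
model.fit(X_train, y_train)              # scaler statistics come from X_train only
accuracy = model.score(X_test, y_test)   # test data is scaled with training statistics

# confirm the stored means match the training set, not the full dataset
train_only = np.allclose(model.named_steps["scaler"].mean_, X_train.mean(axis=0))
print(accuracy, train_only)
```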

I am trying to code LSTM for household_power_consumption_days.csv data by using Pytorch. So, should I normalize or rescale data?

Sorry, I don’t have any examples for Pytorch.

Scaling data prior to modeling is a good practice.

Hello Jason,

In order to train an SVM classifier, should the data be scaled to [0,1 ] or [-1, 1]?

Thanks in advance

The targets need to be {-1, 1} I believe. But sklearn will do this for you – from memory

Thanks Jason,

Do i need normalising/scaling if i only have 1 feature?

My data has x and y only, where y is the dependent variable. I am working on Random forest regression.

Perhaps try it and see?

How can I normalize a dataset with text values to numbers properly in sklearn? and my dataset have train and test part and Im getting different number of columns after normalization. How can I normalize them eqally?

Perhaps try a bag of words model:

https://machinelearningmastery.com/gentle-introduction-bag-words-model/

is normalization required for Decision Tree Algorithm

Typically not.

Thank you very much for writing this article. When to use normalisation and when to use Standardisation ?, I went through an article they said “We can use Normalisation if we want to rescale every observation of dataset, We can use standardisation if we want to rescale by features in data sets.

I’m happy that it helped.

Normalize generally, and use standardization when a variable is gaussian.

If in doubt, test both and use whichever results in a model with the best skill.

Hello Jason, thanks for all information, and, lets see if you can help me.

I do the scaling in my predictors, and when I do the prediction, the result comes out in scientific notation, researching I saw that there is a way to do an inverse_transform, reverse the scaling process, I tried, but I failed to successfully reverse, can you help me?

Below is the test code.

```python
import pandas as pd
import numpy as np

base = pd.read_csv('c://udemy//ia//bd//house-prices.csv')
X = base.iloc[:, 3:19].values
y = base.iloc[:, 2].values

from sklearn.model_selection import train_test_split
X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X, y,
                                                                  test_size = 0.3,
                                                                  random_state = 0)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_treinamento = scaler.fit_transform(X_treinamento)

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
X_treinamento_poly = poly.fit_transform(X_treinamento)
X_teste_poly = poly.transform(X_teste)

regressor = LinearRegression()
regressor.fit(X_treinamento_poly, y_treinamento)
score = regressor.score(X_treinamento_poly, y_treinamento)

previsoes = regressor.predict(X_teste_poly)
previsoes = scaler.inverse_transform(previsoes)  # this line is not working
```

Yes, there are examples here you can use as a starting point:

https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/

Hello Jason,

Thanks for the tip, reading what you told me, I saw that I can inverse_transform, but applying I have an error, which seems to be basic and easy to solve, but I’m not getting, it gives the error:

ValueError: operands could not be broadcast together with shapes (6484,) (16,) (6484,)

I tried to apply the reshape to the predictions variable, but it doesn’t work at all.

any suggestion?

The shape of the data and order of the columns in the data must be identical when calling transform() and inverse_transform().

Perhaps check this.
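The broadcast error above arises because a scaler fit on 16 input columns is asked to invert a one-dimensional vector of predictions. A sketch of one way around it is to keep a separate scaler for the target and reshape predictions to a single column before inverting. The synthetic data and the names `scaler_x` and `scaler_y` are assumptions for illustration.

```python
# Use separate scalers for inputs and target so predictions can be inverted.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# synthetic, noise-free data standing in for a real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 100.0 + X @ np.array([10.0, -5.0, 2.0])

scaler_x = StandardScaler().fit(X)
scaler_y = StandardScaler().fit(y.reshape(-1, 1))   # scalers expect 2D input

model = LinearRegression()
model.fit(scaler_x.transform(X),
          scaler_y.transform(y.reshape(-1, 1)).ravel())

# predictions are in the scaled target space; invert with the target's scaler
pred_scaled = model.predict(scaler_x.transform(X))
pred = scaler_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

print(pred[:3])
print(y[:3])
```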

Got It, now is working, thanks!

```python
import pandas as pd

base = pd.read_csv('c://udemy//ia//bd//house-prices.csv')
X = base.iloc[:, 3:19].values
y = base.iloc[:, 2:3].values

from sklearn.preprocessing import StandardScaler
scaler_x = StandardScaler()
X = scaler_x.fit_transform(X)
scaler_y = StandardScaler()
y = scaler_y.fit_transform(y)

from sklearn.model_selection import train_test_split
X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X, y,
                                                                  test_size = 0.3,
                                                                  random_state = 0)

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
X_treinamento_poly = poly.fit_transform(X_treinamento)
X_teste_poly = poly.transform(X_teste)

regressor = LinearRegression()
regressor.fit(X_treinamento_poly, y_treinamento)
score = regressor.score(X_treinamento_poly, y_treinamento)

# predictions with the scaling reversed
previsoes1 = scaler_y.inverse_transform(regressor.predict(X_teste_poly))
y_teste = scaler_y.inverse_transform(y_teste)

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_teste, previsoes1)

# testing with y_teste and the predictions still scaled
previsoes = regressor.predict(X_teste_poly)
scaler_teste = StandardScaler()
y_teste = scaler_teste.fit_transform(y_teste)

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_teste, previsoes)
```

Happy to hear that.

Sir how I get the mean=0 and standard deviation =1 for a given dataset in Python?

Use the StandardScaler

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Good article Jason. I learned from few points from your post and taken it forward by implementing ML model to compare the impact. here is the article https://medium.com/@vibhaas.kotwal/feature-scaling-8c92bdd080a1

Please let me know your comments.

Thanks.

Sorry, I don’t have the capacity to review your piece.

Sir,

I get intercept_ =3.1378, 93 support_vectors_ and 93 dual_coef_ then,

how can i get the hyperplane equation of polynomial SVM in python.

Thank you.

I believe you can retrieve all coefficients from the sklearn API.

Dear Jason,

I have a csv dataset which need to be normalized. However, I just need to normalize all columns except target, how can I perform it?

```python
import pandas as pd
from sklearn import preprocessing

data=pd.read_csv('File.csv')
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(data)
df_normalized = pd.DataFrame(np_scaled)
df_normalized = df_normalized.to_csv('Norm_File.csv',header=True, index=False)
```

Thanks a lot.

Not sure you can use the scaler directly on dataframes, perhaps extract the numpy array from them first?
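For what it is worth, scikit-learn scalers do accept pandas DataFrames as input and return NumPy arrays. A sketch of scaling every column except the target might look like the following; the small DataFrame and the column name `target` are made up for illustration.

```python
# Scale all columns except the target column of a DataFrame.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# made-up data standing in for the contents of a CSV file
data = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 65.0, 80.0],
    "target": [0, 1, 1],
})

# every column except the (hypothetical) target column
feature_cols = [c for c in data.columns if c != "target"]

df_normalized = data.copy()
df_normalized[feature_cols] = MinMaxScaler().fit_transform(data[feature_cols])

print(df_normalized)
```

The result keeps the original column names and leaves the target untouched, so it can be written out with `to_csv` afterwards.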

hi

Hi, do you have any questions about this tutorial?

hi

i have a dataset

inlude 3 columns

receney frequency monetary

so i can preprocessing date by Standardize or Normalize? (i use dataset for Kmeans )

pls help me. sorry my english not good

Thank u so much

Typically the date is removed from the data.

Perhaps this will help you with clustering:

https://machinelearningmastery.com/clustering-algorithms-with-python/

Thank u so much. really Your post helped me a lot. have a beautiful day for u

You’re welcome, I’m happy to hear that!

i am only start to learn

i wroten it

pls can u help me check my code is right or wrong

i did preprocessing

***********

```python
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt

sse = {}

#load our data from CSV
tx_user = pd.read_csv('rfm_data.csv', sep =',' , engine='python')
# display(tx_user[['M']].boxplot())

#PRE-PROCESSING -----------------------------------------------
col_names = ['R','F', 'M']

#Step 1: Rescale Data
#from sklearn.preprocessing import MinMaxScaler
#min_max_scaler = MinMaxScaler()
#tx_user[col_names] = min_max_scaler.fit_transform(tx_user[col_names])

#Step 2: Standardize Data
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
tx_user[col_names]=pd.read_csv('rfm_data.csv', sep =',' , engine='python')
tx_user[col_names] = standard_scaler.fit_transform(tx_user[col_names])

#Step 3: Normalize Data
#from sklearn.preprocessing import Normalizer
#normalizer = Normalizer()
#tx_user[col_names] = normalizer.fit_transform(tx_user[col_names])

#print('Descriptive statistic of preprocessed data: ')
#display(tx_user.describe())
#END OF PRE-PROCESSING -------------------------------------------

for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tx_user[['R','F', 'M']])
    tx_user["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_

print('\n \n Sum of squared distances of samples to their closest cluster center: \n')
df_sse = pd.DataFrame(sse.items(), columns = ['K Cluster','Sum of Squared Errors'])
display(df_sse)

keys = list(sse.keys())
values = list(sse.values())
plt.figure()
plt.plot(keys, values)
plt.xlabel("Number of cluster – Kmean on dataraw _ group by")

# Add title and axis names
plt.title('Within-Cluster-Sum of Squared Errors (WSS) for different values of k')
plt.xlabel('K cluster')
plt.ylabel('Sum of Squared Errors (WSS)')
plt.show()
```

This is a common question that I answer here:

https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code

I want to cluster customers. I have used 2 methods for the same data set. a method I code manually using RFM model in economics. Another method I use clustering using the Kmeans algorithm. Now I want to compare which method is better. But I had trouble. I still haven’t figured out which method will give better results

Select a metric, design a test harness, then apply both methods in the test harness to see which does better on your metric.

Thank u so much

You’re welcome.

Hi

After I have clustered “customer segmentation the clients, I want to visualization those clusters. What I have to do

Perhaps try pair-wise scatter plots?

yes yes yes.i want pair-wise scatter plots

i was code . but i got “AttributeError: ‘KMeans’ object has no attribute ‘labels’

“. but i still can not fix

```python
# Modules
import matplotlib.pyplot as plt
from matplotlib.image import imread
import pandas as pd
import seaborn as sns
from sklearn.datasets.samples_generator import (make_blobs,
                                                make_circles,
                                                make_moons)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score

%matplotlib inline
sns.set_context('notebook')
plt.style.use('fivethirtyeight')
from warnings import filterwarnings
filterwarnings('ignore')

# Import the data
df = pd.read_csv('Visualization_1.csv')

# Plot the data
plt.figure(figsize=(6, 6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1])
plt.xlabel('Eruption time in mins')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of raw data');

# Standardize the data
X_std = StandardScaler().fit_transform(df)

# Run local implementation of kmeans
def cluster(n_clusters):
    km = Kmeans(n_clusters=2, max_iter=100)
    km.fit(X_std)
    centroids = km.centroids

    # Plot the clustered data
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(X_std[km.labels == 0, 0], X_std[km.labels == 0, 1],
                c='green', label='cluster 1')
    plt.scatter(X_std[km.labels == 1, 0], X_std[km.labels == 1, 1],
                c='blue', label='cluster 2')
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=300,
                c='r', label='centroid')
    plt.legend()
    plt.xlim([-2, 2])
    plt.ylim([-2, 2])
    plt.xlabel('Eruption time in mins')
    plt.ylabel('Waiting time to next eruption')
    plt.title('Visualization of clustered data', fontweight='bold')
    ax.set_aspect('equal');
```

This is a common question that I answer here:

https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code

thank u so much so much

You’re welcome.

hello i am new to all these but I was given a task I am not sure how to do

Normalize data with pandas:

a. Subtract the mean value of each feature from the dataset.

b. After subtracting the mean, additionally scale (divide) the feature values by their

respective “standard deviations.”

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing

#step 1
col_names = ["Size","Bedrooms","Price"]#name cols

#importing data
df2 = pd.read_csv("dataset2.txt", header = None,skiprows=0, names= col_names)

#print first 5 elements of Dataframe
print(df2.head())
print(df2.describe())#show some stats
```

I have no idea how to subtract means and the std please can you show me how

This is a common question that I answer here:

https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
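For readers with the same task, the two steps described in the question can be sketched directly in pandas. The small DataFrame below uses made-up values in place of the reader's file, and the names `centered` and `standardized` are illustrative. Note that pandas computes the sample standard deviation (ddof=1) by default, which differs slightly from scikit-learn's scalers.

```python
# Subtract each column's mean, then divide by each column's standard deviation.
import pandas as pd

# made-up values standing in for the reader's dataset
df2 = pd.DataFrame({
    "Size": [1000.0, 1500.0, 2000.0, 2500.0],
    "Bedrooms": [2.0, 3.0, 3.0, 4.0],
    "Price": [200000.0, 260000.0, 310000.0, 400000.0],
})

# step a: subtract the mean value of each feature
centered = df2 - df2.mean()

# step b: divide by the respective standard deviations (pandas uses ddof=1)
standardized = centered / df2.std()

print(standardized.mean())   # approximately 0 for every column
print(standardized.std())    # approximately 1 for every column
```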

Hi Jason, thanks for this article.

Do you recommend using Sigmoid function transformation of input feature to get outliers closer to the bulk of other values, given that outliers are extreme values that are not errors. I.e. -1+2/(1+e^-ax); this would replace outliers and standardize data.

Not really. Perhaps try it and see if it is appropriate for your data/model/project.

Hi Jason,

Do you have a tutorial on how to normalize/scale multivariate time series data?

Thanks for all of your valuable guides,

You can scale each series separately.

This may help as a starting point:

https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/