Many machine learning algorithms make assumptions about your data.

It is often a very good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use.

In this post you will discover how to prepare your data for machine learning in Python using scikit-learn.

Let’s get started.

**Update March/2018**: Added alternate link to download the dataset as the original appears to have been taken down.

## Need For Data Preprocessing

You almost always need to preprocess your data. It is a required step.

A difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, even when you follow all of the rules and prepare your data, sometimes algorithms can deliver better results without the preprocessing.

Generally, I would recommend creating many different views and transforms of your data, then exercising a handful of algorithms on each view of your dataset. This will help you to flush out which data transforms might be better at exposing the structure of your problem in general.
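As a rough sketch of this idea, the snippet below evaluates two algorithms on four views of a dataset. The synthetic data from make_classification, the choice of models and the 5-fold cross-validation are all assumptions made for illustration, not part of the recipes in this post.

```python
# Sketch: compare several views of the data across a couple of algorithms
import numpy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
from sklearn.preprocessing import Normalizer, StandardScaler

# synthetic stand-in data (assumption: not the dataset used in this post)
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
# give the columns different scales
X = X * numpy.array([1, 10, 100, 1, 10, 100, 1, 10])

views = {
    'raw': FunctionTransformer(),  # identity transform, no preprocessing
    'rescaled': MinMaxScaler(),
    'standardized': StandardScaler(),
    'normalized': Normalizer(),
}
models = {
    'knn': KNeighborsClassifier(),
    'logistic': LogisticRegression(max_iter=1000),
}

# evaluate every (view, model) pair with 5-fold cross-validation
results = {}
for view_name, transform in views.items():
    for model_name, model in models.items():
        pipeline = make_pipeline(transform, model)
        scores = cross_val_score(pipeline, X, y, cv=5)
        results[(view_name, model_name)] = scores.mean()
        print('%s + %s: %.3f' % (view_name, model_name, scores.mean()))
```

Putting the transform inside a pipeline means it is fit on the training folds only, so each score reflects how the view would behave on unseen data.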

## Preprocessing Machine Learning Recipes

This section lists 4 different data preprocessing recipes for machine learning.

All of the recipes were designed to be complete and standalone.

You can copy and paste them directly into your project and start working.

The Pima Indian diabetes dataset is used in each recipe. This is a binary classification problem where all of the attributes are numeric and have different scales. It is a great example of a dataset that can benefit from pre-processing.

You can learn more about this data set on the UCI Machine Learning Repository webpage (update: download it from https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv).

Each recipe follows the same structure:

- Load the dataset from a URL.
- Split the dataset into the input and output variables for machine learning.
- Apply a preprocessing transform to the input variables.
- Summarize the data to show the change.

The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future.

The scikit-learn documentation has some information on how to use the various preprocessing methods. You can also review the API reference for the sklearn.preprocessing module.

### 1. Rescale Data

When your data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale.

Often this is referred to as normalization, and attributes are often rescaled into the range between 0 and 1. This is useful for optimization algorithms used at the core of machine learning algorithms, like gradient descent. It is also useful for algorithms that weight inputs, like regression and neural networks, and algorithms that use distance measures, like K-Nearest Neighbors.

You can rescale your data in scikit-learn with the MinMaxScaler class.

```python
# Rescale data (between 0 and 1)
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
```

After rescaling you can see that all of the values are in the range between 0 and 1.

```
[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
```
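To see what the rescaling does, it can help to compute it by hand. The sketch below uses a tiny made-up array (an assumption, not the diabetes data) and applies (x - min) / (max - min) to each column, then checks the result against MinMaxScaler.

```python
# Sketch: min-max rescaling by hand on a small made-up array
import numpy
from sklearn.preprocessing import MinMaxScaler

X = numpy.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [4.0, 800.0]])

# rescale each column: (x - min) / (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manualX = (X - col_min) / (col_max - col_min)

# the same calculation using scikit-learn
scaledX = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(manualX)
print(numpy.allclose(manualX, scaledX))
```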

### 2. Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.

You can standardize data using scikit-learn with the StandardScaler class.

```python
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
```

The values for each attribute now have a mean value of 0 and a standard deviation of 1.

```
[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
```
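You can confirm the mean and standard deviation yourself. This sketch uses a small made-up array (an assumption for illustration) and standardizes each column by hand with (x - mean) / stdev, then compares against StandardScaler.

```python
# Sketch: standardization by hand on a small made-up array
import numpy
from sklearn.preprocessing import StandardScaler

X = numpy.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [4.0, 800.0]])

# standardize each column: (x - mean) / stdev
manualX = (X - X.mean(axis=0)) / X.std(axis=0)

# the same calculation using scikit-learn
scaledX = StandardScaler().fit_transform(X)

# each column should now have a mean of 0 and a standard deviation of 1
print(scaledX.mean(axis=0))
print(scaledX.std(axis=0))
```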

### 3. Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

You can normalize data in Python with scikit-learn using the Normalizer class.

```python
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
```

The rows are normalized to length 1.

```
[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]
```
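A quick way to check the claim is to measure the length of each row after the transform. The sketch below uses a small made-up array (an assumption for illustration) whose row lengths are easy to work out by hand.

```python
# Sketch: confirm each normalized row has a length (L2 norm) of 1
import numpy
from sklearn.preprocessing import Normalizer

# made-up rows with lengths 3, 5 and 2
X = numpy.array([[1.0, 2.0, 2.0],
                 [0.0, 3.0, 4.0],
                 [2.0, 0.0, 0.0]])

normalizedX = Normalizer(norm='l2').fit_transform(X)

# length of each row after normalizing
row_lengths = numpy.linalg.norm(normalizedX, axis=1)
print(normalizedX)
print(row_lengths)
```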

### 4. Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing your data or thresholding your data. It can be useful when you have probabilities that you want to make crisp values. It is also useful in feature engineering when you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using scikit-learn with the Binarizer class.

```python
# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])
```

You can see that all values equal to or less than 0 are marked 0 and all of those above 0 are marked 1.

```
[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]
```
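The same class can turn probabilities into crisp values by choosing a different threshold. This is a minimal sketch with made-up probabilities and an assumed threshold of 0.5. Note that a value exactly equal to the threshold is marked 0.

```python
# Sketch: make probabilities crisp with a threshold of 0.5
import numpy
from sklearn.preprocessing import Binarizer

# made-up probabilities for illustration
probabilities = numpy.array([[0.1, 0.5, 0.51, 0.9]])

# values above 0.5 become 1, values equal to or below 0.5 become 0
crisp = Binarizer(threshold=0.5).fit_transform(probabilities)
print(crisp)
```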

## Summary

In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.

You now have recipes to:

- Rescale data.
- Standardize data.
- Normalize data.
- Binarize data.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocessing in scikit-learn.

Do you have any questions about data preprocessing in Python or this post? Ask in the comments and I will do my best to answer.

Hey Jason,

On normalizing, do you need to do this if you are planning on using Euclidean or cosine distance measures to find similar items in a dataframe?

e.g. you have a vector where each column has some attributes about the product, and you want to find other products that have similar attributes.

Keen to hear your thoughts

Thanks

SM

Excellent!

Thanks Ernest.

Hi Jason,

Thanks for the post and the website overall. It really explains a lot.

I have a question regarding preparing the data. If I am to normalize my input data, does the precision of the values have an effect? Will it make the weight matrix more sparse while training with higher precision if the amount of training data is not very high?

In that case should I be limiting the precision depending on the amount of training data?

I am interested in sequence classification for EEG. In my case I intend to try out an RNN. I was planning on normalizing the data since I wish the scaling to be performed on each individual input sequence.

Hoping to hear from you, thanks!

Great question Akshay.

I don’t have a clear answer for you. It may. I have not seen it have an effect, but I would not rule it out.

If you’re worried, I would recommend testing with samples of your data at different precisions and different transforms and evaluate the effect.

I expect the configuration of your model will be a much larger lever on performance.

Hi Jason, thank you for the reply.

I intend to build an RNN from scratch for an application similar to sentiment analysis (many to one). I am a bit confused about the final stage. While training, when I feed in a single sequence (belonging to one of the classes) from the training set, do I apply softmax to the last output of the network alone, compute the loss and leave the rest unattended?

Where exactly is the many to “ONE” represented?

Sorry Akshay, I don’t have an example of implementing an RNN from scratch.

My advice would be to peek into the source code for a standard deep learning library like Keras.

Should one normalize the test and train datasets separately, or normalize the whole dataset before splitting it?

Split first. Normalize the train dataset and use the min/max from train to normalize the test set.

In this case, min/max of test set might be smaller or bigger than min/max of the training set. If they are, would it cause a problem to the validation?

You should estimate them using domain knowledge if possible, otherwise, estimate from train and clip test data if they exceed the known bounds.
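The advice above can be sketched as follows. The random data, the 70/30 split and the use of numpy.clip to force test values back into the 0 to 1 range are assumptions chosen for illustration.

```python
# Sketch: fit the scaler on train only, then reuse it on test
import numpy
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# made-up data for illustration
rng = numpy.random.RandomState(7)
X = rng.uniform(0, 100, size=(50, 3))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=7)

# the min/max are estimated from the training data only
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# test values outside the training min/max fall outside 0..1, so clip them
X_test_clipped = numpy.clip(X_test_scaled, 0.0, 1.0)
print(X_test_scaled.min(), X_test_scaled.max())
print(X_test_clipped.min(), X_test_clipped.max())
```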

Hi Jason, I often read about people normalizing the input features, but not the output. Why?

Should we normalize the output features as well if the output has a wide range of scales too, from 1e-3 to 1e3?

BTW, it is for a regression problem.

You can normalize the output variable in regression too, but you will need to reverse the scaling of predictions in order to make use of them or quote error scores in a meaningful way (e.g. meaningful to your problem).
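A minimal sketch of scaling the output and reversing it on the predictions is shown below. The made-up linear data and the choice of LinearRegression are assumptions for illustration only.

```python
# Sketch: scale the output variable, then invert the scaling on predictions
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# made-up regression data with a wide output range
X = numpy.arange(10, dtype=float).reshape(-1, 1)
y = 100.0 * X[:, 0] + 5.0

# scalers expect a 2D array, so reshape the output to one column
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1))

# fit on the scaled output
model = LinearRegression().fit(X, y_scaled)

# predictions come back scaled, so reverse the transform before using them
pred_scaled = model.predict(X)
pred = y_scaler.inverse_transform(pred_scaled).ravel()
print(pred[:3])
```

Error scores such as MSE should be calculated on the inverted predictions if you want them in the original units of the problem.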

The MSE loss is very high (1e8) when I did not apply normalization to the output variable, and small (0.0xxx) when I applied normalization.

Is there something wrong in my implementation? Should I run a large number of epochs (maybe 50000?) when the output variable isn’t normalized? (I am currently running 500 epochs with normalization.)

Perhaps. Continue to try different things and see if you can learn more about your system/problem.

@Roy,

– if you don’t normalize and the features are not of similar scale, then gradient descent would take a very long time to converge [1]

– if the root MSE is much smaller than the mean/median value of the predicted vector, I think your model is good enough

[1] Stanford.edu: https://www.youtube.com/watch?v=yQci-wS0iMw&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=21

Normalizing input variables is intended to make the underlying relationship between input and output variables easier to model.

Your tutorials are awesome. 🙂

I have converted rescaledX to a dataframe and plotted histograms for rescaling, standardization and normalization. They all seem to be scaling down the magnitude of an attribute to a small range: 0 to 1 in the case of rescaling and normalization.

– are they doing a similar transformation, i.e. scaling down attributes so they become comparable?

– do you only apply one method in any given situation?

– which would be appropriate in which situation?

Thanking in advance.

Good questions.

This post explains how they work and when to use them:

http://machinelearningmastery.com/how-to-scale-data-for-long-short-term-memory-networks-in-python/

Hi Jason, I really like your posts. I was looking for some explanation on using power transformations on data for scaling, like using logarithms and exponents and stuff like that. I would really like to understand what it does to the data and how we as data scientists can be power users of such scaling techniques.

The best advice is to try a suite of transforms and see what results in the most predictive model.

I have a few posts on power transforms like log and boxcox, try the search feature.

Hi Jason, thanks for all your posts. I have a question related to the Multilayer Perceptron classification algorithm,

if we want to apply this algorithm to a mixed data set (numeric and nominal).

E.g. (23,125,75,black,green): this data represents the age, length, weight, hair color and eye color respectively.

For numeric attributes we will normalize the data to be in the same range.

what about nominal attributes?

Do we need to transform nominal attributes to binary attributes?

I would recommend either using an integer encoding or a one hot encoding.

It is common to use a one hot encoding.

I have many posts on the topic.
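As a small sketch of a one hot encoding, the snippet below uses pandas.get_dummies on a made-up dataframe. The first row is taken from the example in the question. The second row and the column names are assumptions added so that each nominal attribute has more than one value.

```python
# Sketch: one hot encode nominal attributes, leave numeric attributes alone
import pandas

# first row from the question, second row made up for illustration
dataframe = pandas.DataFrame({
    'age': [23, 35],
    'length': [125, 160],
    'weight': [75, 60],
    'hair': ['black', 'brown'],
    'eye': ['green', 'blue'],
})

# each nominal value becomes its own binary column
encoded = pandas.get_dummies(dataframe, columns=['hair', 'eye'])
print(encoded.columns.tolist())
print(encoded)
```

The numeric columns can then be rescaled as in the recipes above, while the new binary columns are already in the range 0 to 1.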

Hello Jason, great post

However,

I have a question (maybe is almost the same that Dimos).

What is the most common approach to preprocessing (I mean, which of the 4 explained here)?

Which values do you normalize?

all features (X), or

fit_transform the train features (X_train_std=model.fit_transform(X_train)) and then use that fit to transform X_test (X_test_std=model.transform(X_test))?

and then:

If we have to predict for new data that I get today (for example: 0,95,80,45,92,36.5,0.330,26,0 in the diabetes model),

do we have to preprocess it, or is that not necessary, so that we can predict without preprocessing?

Thank you for help

Any process used to prepare data for training the model must be performed when making predictions on new data with the final model.

This means coefficients used in scaling (e.g. min/max) are really part of the model and must be chosen mindfully.
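One way to keep the scaling coefficients together with the model is a scikit-learn Pipeline. The sketch below is only an illustration: the synthetic data, the StandardScaler and the LogisticRegression model are assumptions, not the method used in this post.

```python
# Sketch: bundle the scaler and the model so new data is prepared the same way
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption: not the diabetes dataset)
X, y = make_classification(n_samples=100, n_features=8, random_state=7)

# the scaler is fit on the training data as part of fitting the model
model = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
model.fit(X, y)

# a hypothetical new sample arrives later, in the original (unscaled) units
new_row = X[:1] + 0.1

# the pipeline applies the stored scaling before predicting
prediction = model.predict(new_row)
print(prediction)
```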

Thank you for your answer.

Hi Jason

I am applying normalization to network attacker data. I used min/max normalization, but in the real data there are some features that have large values. If I want to apply standard deviation normalization, should I apply only one normalization type? Or can I apply min/max to all the data and then apply standard deviation normalization to all the data? What is the sequence, and is it wrong if I apply standard deviation normalization only to the large-value features?

I would recommend trying both approaches and see what works best for your data and models.

I don’t understand the two commands.

X = dataset[:,0:8]

Y = dataset[:,8]

This is called array slicing, I will have a post on this topic on the blog tomorrow.

Here, we are selecting the columns 0-7 for input and 8 for output.
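A tiny made-up array (an assumption for illustration) makes the slicing easier to see. In array[:,0:8] the first part selects all rows and the second part selects columns 0 up to, but not including, 8.

```python
# Sketch: array slicing on a small made-up array with 9 columns
import numpy

# 3 rows, 9 columns, filled with 0..26
array = numpy.arange(27).reshape(3, 9)

# all rows, columns 0 to 7 (the end index 8 is excluded)
X = array[:, 0:8]
# all rows, column 8 only
Y = array[:, 8]

print(X.shape)
print(Y.shape)
print(Y)
```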

Dear Dr. Jason Brownlee, I have prepared my own dataset of handwriting from different people, and I prepared the images at 28×28 pixels. The problem is how I should prepare the training and testing data sets so that I can then write the code to recognize the data.

Sounds great.

My idea is: can you help me with how I have to do that? And how do I read my image data set and training data set using TensorFlow?

Perhaps this example would help:

http://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/

That is a great link that shows how to use the existing CIFAR-10, thank you for that. But as I tried to mention above, I have handwritten images prepared at 28×28 pixels, so how do I prepare the training set (how do I label my dataset)? It can be a .csv or .txt file. I need to know how to prepare the training set and access it in TensorFlow like MNIST.

The images will be in standard image formats like jpg or png. The labels will be in a csv file, perhaps associated with each filename.

Hi Jason. First of all, great work with the tutorials.

Here’s something I don’t understand though. What’s the difference between rescaling data and normalizing data? It seems like they’re both making sure that all values are between 0 and 1?

So what’s the difference?

Thanks.

Please email me the answers as well since i do not check this blog often

jourdandgo@gmail.com

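Normalizing is a type of rescaling. In this post, rescaling with MinMaxScaler works on each attribute (column) to map it into the range 0 to 1, whereas normalizing with Normalizer works on each observation (row) to give it a length of 1.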
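To make the difference concrete, here is a sketch that applies both classes from this post to the same small made-up array (an assumption for illustration). MinMaxScaler works down each column, whereas Normalizer works across each row, so the results differ.

```python
# Sketch: MinMaxScaler (per column) versus Normalizer (per row)
import numpy
from sklearn.preprocessing import MinMaxScaler, Normalizer

# made-up data: every row points in the same direction
X = numpy.array([[3.0, 4.0],
                 [6.0, 8.0],
                 [9.0, 12.0]])

# column-wise: each column is mapped to the range 0 to 1
rescaledX = MinMaxScaler().fit_transform(X)

# row-wise: each row is divided by its own length
normalizedX = Normalizer().fit_transform(X)

print(rescaledX)
print(normalizedX)
```

Here every normalized row comes out as 0.6 and 0.8 because the rows only differ in length, while the rescaled columns spread from 0 to 1.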

Hello Sir!! I am planning a research work which is about music genre classification. My work includes preparing the dataset for the type of music I want to use, as there are no public datasets for that music. My problem is I don’t know how to prepare a music dataset. I have read a lot about spectrograms. But what are the best mechanisms to prepare a music dataset? Is it only spectrograms I have to use, or do I have alternate choices?

Sorry, I cannot help you with music datasets. I hope to cover the topic in the future.

Hi, if I would like to scale an image with dim=3x64x64, how do I use StandardScaler() to do that? Thank you

Sorry, the StandardScaler is not for scaling images.

So, to improve the performance of training on images, which scaling method should we use? Or just divide the train set by a value, for example, train_x/255, …?

Try a suite of transforms and see what works best for your data.

Hi Jason, thanks for your posts.

I have a question about data preprocessing. Can we have multiple inputs with different shapes? For example, two different files, one including bit vectors and one including matrices?

If so, how can we use them for ML algorithms?

Basically, I want to add additional information to data, so classifier can use for better prediction.

For most algorithms the input data must be reshaped and padded to be homogeneous.

Thanks for your response. Yes, I understand that. This extra information is like metadata that gives information about the structure that generates the data. Therefore, it is a separate type that gives more information about the system. Is there any way to apply it to ML algorithms?

Sure, perhaps you could use a multiple input model or an ensemble of models.

Do you have any link/reference suggestion that I can read more about it? I could not find a good resource yet. Thanks in advance.

Hi Jason,

What is Y used for? I realize the comment and description say it’s the output column, but after slicing the ‘class’ column to it, I’m not seeing Y used for anything in the four examples. Commenting it out does not seem to have any effect. Is it just a placeholder for later? If so, why did we assign ‘class’ data to it instead of creating an empty array?

Thanks,

John

We are not preparing the class variable in these examples. It is needed during modeling.

Thanks for the great article. I would like to ask a question regarding using the simple nearest neighbors algorithm from the scikit-learn library with standard settings. I have a list of data columns from a Salesforce leads table giving a few metrics for total time spent on page and total emails opened, as well as alphabetical values such as the source of the lead (with values like signup, contact us, etc.) and country of origin information.

So far I have transformed all non-numerical data to numerical form in the simple way 0, 1, 2, 3, 4 for each unique value. With this approach scoring accuracy seems to reach 70% at its best. Now I want to go one step further and either normalize or standardize the data set, but can’t really decide which route to take. So far I have decided to go with the safest advice and standardize all data. But then I have worries about some scenarios. For example, certain fields will have long ranges of data, i.e. those representing each country, or those that show the number of emails sent. On the other hand, other fields like source will have numerical values 0, 1, 2, 3 and no more, but the field itself does have a very high correlation to the outcome of winning or losing the lead.

I would be very grateful if you could point me to the right direction and perhaps without too much diving into small details, what would be the common sense approach.

Also, is it possible to use both methods on a data set, i.e. standardize the data first, and then normalize?

Thanks,

Donatas

Good question.

The data preparation methods must scale with the data. Perhaps for counts you can estimate the largest possible/reasonable count that you can use to normalize the count by, or perhaps invert the count, e.g. 1/n.

Hi @jason, can you please tell me why the Normalizer result and the rescaling (0-1) result are different? Isn’t there a standard way of doing this which should give the same result irrespective of the class used (i.e. MinMaxScaler or Normalizer)?

I don’t follow, sorry. Can you give more context?

Hi Sir. I have a housing dataset whose target variable has a positively skewed distribution. So far that’s the only variable I have seen to be skewed, although I think there will be more. Now I have read that there is a need to make this distribution approximately normal using a log transformation. But the challenge I’m facing right now is how to perform a log transformation on the price feature in the housing dataset. I’d like to know if there is a scikit-learn library for this and, if not, how I should go about it. Moreover, I plan on using linear regression to predict housing prices for this dataset.

You can use a boxcox transform to fix the skew:

https://machinelearningmastery.com/power-transform-time-series-forecast-data-python/
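As a rough sketch, scikit-learn provides a PowerTransformer class in sklearn.preprocessing that supports a Box-Cox transform for strictly positive values. The prices below are randomly generated, skewed, made-up numbers (an assumption for illustration), not a real housing dataset.

```python
# Sketch: Box-Cox transform of a skewed, strictly positive variable
import numpy
from sklearn.preprocessing import PowerTransformer

# made-up, positively skewed "prices" (in thousands) for illustration
rng = numpy.random.RandomState(7)
prices = rng.lognormal(mean=5.0, sigma=0.5, size=200).reshape(-1, 1)

# fit the transform and apply it; standardize=False keeps only the power step
transformer = PowerTransformer(method='box-cox', standardize=False)
transformed = transformer.fit_transform(prices)

# the estimated lambda; a value near 0 behaves like a log transform
print(transformer.lambdas_)

# predictions made on the transformed scale can be mapped back
restored = transformer.inverse_transform(transformed)
print(numpy.allclose(restored, prices))
```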

Hi Jason

I am using the MinMaxScaler preprocessing technique to normalize my data. I have data for 200 patients, where each patient’s data for a single electrode is 20 seconds, i.e. 10240 samples. So the dimension of my data is 200*10240. I want to rescale my data row-wise, but MinMaxScaler scales the data column-wise, which may not be correct for my data, as I want to rescale each 1*10240 row on its own.

What changes are required in order to operate row wise independently of other electrode?

In general, each column of data represents a separate feature that may have a different scale or units from other columns.

For this reason, it is good practice to scale data column-wise.

Does that help?
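If each row really is one signal that should be scaled on its own, one possible sketch is to transpose the data so that rows become columns, scale, and transpose back. The small made-up array below stands in for the 200*10240 data in the question. Note that a scaler fit this way is tied to these specific rows and cannot be reused on new rows.

```python
# Sketch: row-wise min-max scaling by transposing before and after
import numpy
from sklearn.preprocessing import MinMaxScaler

# made-up data: 2 rows (signals), 4 samples each
X = numpy.array([[1.0, 2.0, 3.0, 5.0],
                 [10.0, 20.0, 40.0, 30.0]])

# MinMaxScaler works per column, so transpose, scale, transpose back
row_scaled = MinMaxScaler().fit_transform(X.T).T

# the same calculation by hand, using each row's own min and max
row_min = X.min(axis=1, keepdims=True)
row_max = X.max(axis=1, keepdims=True)
manual = (X - row_min) / (row_max - row_min)

print(row_scaled)
```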

Hello sir,

I have collected 1000 tweets on demonetization. Then I am extracting different features like POS-based, lexicon-based, morphological and n-gram features. So different feature vectors are created for each type and then they are stacked column-wise. I have divided the dataset of 1000 tweets into 80% for training and 20% for testing. I have trained an SVM classifier but accuracy is not more than 60%.

How should I improve accuracy, or which feature selection method do I need to use?

Thanks

Here are some ideas:

http://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

This is a big question for me. Why should we use random values for the weight and bias values?

Good question.

Neural nets use random initial values for the weights. This is by design. It allows the learning algorithm (batch/mini-batch/stochastic gradient descent) to explore the weight space from a different starting point each time the model is evaluated, removing bias in the training process.

It is why it is a good idea to evaluate a neural net by training multiple instances of the model:

https://machinelearningmastery.com/evaluate-skill-deep-learning-models/

Does that help?

X = array[:,0:8]

Y = array[:,8]

I have a doubt here: X is only the feature columns and Y is the target, the 8th column, right?

Confirm it by inspecting the data, it is correct.

Learn more about array slicing and ranges in python here:

https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-learning-python/

Thank you so much for all of your help – I have learned a ton from all of your posts!

I have a project where I have 54 input variables and 8 output variables. I have decent results from what I have learned from you. However, I have standardized all my input variables, and I think I could achieve better performance if I only standardize some of them. That is, 5 of the input columns are the same variable type as the outputs, and I think it would be better not to scale these. Additionally, one of the inputs is the month of the year, and I do not think that needs to be standardized either.

Does my thought process to do selective preprocessing make any sense? Is it possible to do this?

Thank you

You’re welcome Sam.

Perhaps. I would recommend designing and running careful experiments to test your idea. Let the results guide you rather than “best practice”.
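Selective preprocessing is possible in scikit-learn. One option is the ColumnTransformer class from sklearn.compose, sketched below on a small made-up array where the last column stands in for the month of the year. The data and the choice of which columns to standardize are assumptions for illustration.

```python
# Sketch: standardize some columns and pass the others through unchanged
import numpy
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# made-up data: two numeric columns and a month-of-year column
X = numpy.array([[1.0, 100.0, 1.0],
                 [2.0, 200.0, 6.0],
                 [3.0, 300.0, 12.0]])

# standardize columns 0 and 1 only, leave the remaining column as it is
selective = ColumnTransformer(
    [('standardize', StandardScaler(), [0, 1])],
    remainder='passthrough',
)
preparedX = selective.fit_transform(X)

# transformed columns come first, passed-through columns follow
print(preparedX)
```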

Hi, Jason,

Really helpful article. I tried to access the Pima Indian diabetes dataset and it’s no longer available to download at the link provided due to permission restrictions.

Note that I provided an alternate link in the post.

Here it is again:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

Somehow I missed that. Thanks! 🙂

No probs.

I have n-dimensional binary data. Suggest some good classifier for binary data-set.

Try a suite of methods and see what works best for your specific dataset.