How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

By Jason Brownlee on December 11, 2019 in Python Machine Learning 171

Many machine learning algorithms make assumptions about your data.

It is often a very good idea to prepare your data in such way to best expose the structure of the problem to the machine learning algorithms that you intend to use.

In this post you will discover how to prepare your data for machine learning in Python using scikit-learn.

Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down.

How To Prepare Your Data For Machine Learning in Python with Scikit-Learn
Photo by Vinoth Chandar, some rights reserved.

Need For Data Preprocessing

You almost always need to preprocess your data. It is a required step.

A difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, when you follow all of the rules and prepare your data, sometimes algorithms can deliver better results without the preprocessing.

Generally, I would recommend creating many different views and transforms of your data, then exercise a handful of algorithms on each view of your dataset. This will help you to flush out which data transforms might be better at exposing the structure of your problem in general.

Need help with Machine Learning in Python?

Take my free 2-week email course and discover data prep, algorithms and more (with code).

Click to sign-up now and also get a free PDF Ebook version of the course.

Preprocessing Machine Learning Recipes

This section lists 4 different data preprocessing recipes for machine learning.

All of the recipes were designed to be complete and standalone.

You can copy and paste them directly into your project and start working.

The Pima Indian diabetes dataset is used in each recipe. This is a binary classification problem where all of the attributes are numeric and have different scales. It is a great example of dataset that can benefit from pre-processing.

Each recipe follows the same structure:

Load the dataset from a URL.
Split the dataset into the input and output variables for machine learning.
Apply a preprocessing transform to the input variables.
Summarize the data to show the change.

The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future.

The scikit-learn documentation has some information on how to use various different preprocessing methods. You can review the preprocess API in scikit-learn here.

1. Rescale Data

When your data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale.

Often this is referred to as normalization and attributes are often rescaled into the range between 0 and 1. This is useful for optimization algorithms in used in the core of machine learning algorithms like gradient descent. It is also useful for algorithms that weight inputs like regression and neural networks and algorithms that use distance measures like K-Nearest Neighbors.

You can rescale your data using scikit-learn using the MinMaxScaler class.

# Rescale data (between 0 and 1)
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

# Rescale data (between 0 and 1)

import pandas

import scipy

import numpy

from sklearn.preprocessing import MinMaxScaler

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components

X = array[:,0:8]

Y = array[:,8]

scaler = MinMaxScaler(feature_range=(0, 1))

rescaledX = scaler.fit_transform(X)

# summarize transformed data

numpy.set_printoptions(precision=3)

print(rescaledX[0:5,:])

After rescaling you can see that all of the values are in the range between 0 and 1.

[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]

[[ 0.353 0.744 0.59 0.354 0. 0.501 0.234 0.483]

[ 0.059 0.427 0.541 0.293 0. 0.396 0.117 0.167]

[ 0.471 0.92 0.525 0. 0. 0.347 0.254 0.183]

[ 0.059 0.447 0.541 0.232 0.111 0.419 0.038 0. ]

[ 0. 0.688 0.328 0.354 0.199 0.642 0.944 0.2 ]]

2. Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminate analysis.

You can standardize data using scikit-learn with the StandardScaler class.

# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

# Standardize data (0 mean, 1 stdev)

from sklearn.preprocessing import StandardScaler

import pandas

import numpy

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components

X = array[:,0:8]

Y = array[:,8]

scaler = StandardScaler().fit(X)

rescaledX = scaler.transform(X)

# summarize transformed data

numpy.set_printoptions(precision=3)

print(rescaledX[0:5,:])

The values for each attribute now have a mean value of 0 and a standard deviation of 1.

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]

[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]

[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]

[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]

[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]

[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]

3. Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

You can normalize data in Python with scikit-learn using the Normalizer class.

# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])

# Normalize data (length of 1)

from sklearn.preprocessing import Normalizer

import pandas

import numpy

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components

X = array[:,0:8]

Y = array[:,8]

scaler = Normalizer().fit(X)

normalizedX = scaler.transform(X)

# summarize transformed data

numpy.set_printoptions(precision=3)

print(normalizedX[0:5,:])

The rows are normalized to length 1.

[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]

[[ 0.034 0.828 0.403 0.196 0. 0.188 0.004 0.28 ]

[ 0.008 0.716 0.556 0.244 0. 0.224 0.003 0.261]

[ 0.04 0.924 0.323 0. 0. 0.118 0.003 0.162]

[ 0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]

[ 0. 0.596 0.174 0.152 0.731 0.188 0.01 0.144]]

4. Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing your data or threshold your data. It can be useful when you have probabilities that you want to make crisp values. It is also useful when feature engineering and you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using scikit-learn with the Binarizer class.

# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

# binarization

from sklearn.preprocessing import Binarizer

import pandas

import numpy

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components

X = array[:,0:8]

Y = array[:,8]

binarizer = Binarizer(threshold=0.0).fit(X)

binaryX = binarizer.transform(X)

# summarize transformed data

numpy.set_printoptions(precision=3)

print(binaryX[0:5,:])

You can see that all values equal or less than 0 are marked 0 and all of those above 0 are marked 1.

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]

[[ 1. 1. 1. 1. 0. 1. 1. 1.]

[ 1. 1. 1. 1. 0. 1. 1. 1.]

[ 1. 1. 1. 0. 0. 1. 1. 1.]

[ 1. 1. 1. 1. 1. 1. 1. 1.]

[ 0. 1. 1. 1. 1. 1. 1. 1.]]

Summary

In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.

You now have recipes to:

Rescale data.
Standardize data.
Normalize data.
Binarize data.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocesing in scikit-learn.

Do you have any questions about data preprocessing in Python or this post? Ask in the comments and I will do my best to answer.

171 Responses to How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

suhel May 18, 2016 at 2:40 pm #

Hey Jason,

On Normalizing, do you need to do this if you are planning on using euclidean, or cosine distance measures to find similar items in a dataframe?

e.g. you have a vector where each column has some attributes about the product, and you want to find other products that have similar attributes.

Keen to hear your thoughts

Thanks
SM

Reply
Ernest Quisbert December 20, 2016 at 9:41 pm #

Excellent!

Reply
- Jason Brownlee December 21, 2016 at 8:37 am #
  
  Thanks Ernest.
  
  Reply
Akshay January 5, 2017 at 11:44 pm #

Hi Jason,
Thanks for the post and the website overall. It really explains a lot.
I have a question regarding preparing the data ,if I am to normalize my Input data, does the precision of the values have an effect ? Will it make the weight matrix more sparse while training with higher precision if the training data is not very high?

In that case should I be limiting the precision depending on the amount of training data?

I am interested in sequence classification for EEG, In my case I intend to try out RNN . I was planning on normalizing the data since I wish the scaling to be performed on each individual input sequence.

Hoping to hear from you,thanks !

Reply
- Jason Brownlee January 6, 2017 at 9:12 am #
  
  Great question Akshay.
  
  I don’t have a clear answer for you. It may. I have not seen it have an effect, but I would not rule it out.
  
  If you’re worried, I would recommend testing with samples of your data at different precisions and different transforms and evaluate the effect.
  
  I expect the configuration of your model will be a much larger leaver on performance.
  
  Reply
  - Akshay January 6, 2017 at 1:39 pm #
    
    Hi Jason,Thank you for the reply.
    
    I intend to build an RNN from scratch for the application similar to sentiment analysis (Many to one). I am a bit confused about the final stage. while training, when I feed a single sequence(belong to one of the class) to the training set , do I apply softmax to the last output of the network alone and compute the loss and leave the rest unattended?
    Where exactly is the many to “ONE” represented?
    
    Reply
    - Jason Brownlee January 7, 2017 at 8:22 am #
      
      Sorry Akshay, I don’t have example of implementing an RNN from scratch.
      
      My advice would be to peek into the source code for standard deep learning library like Keras.
      
      Reply
DImos May 24, 2017 at 8:04 am #

Should one normalize the test and train datasets separately? or does he have to normalize the whole dataset, before splitting it?

Reply
- Jason Brownlee June 2, 2017 at 11:30 am #
  
  Yes. Normalize the train dataset and use the min/max from train to normalize the test set.
  
  Reply
  - Rizwan Mian, PhD August 1, 2017 at 1:25 pm #
    
    In this case, min/max of test set might be smaller or bigger than min/max of the training set. If they are, would it cause a problem to the validation?
    
    Reply
    - Jason Brownlee August 2, 2017 at 7:43 am #
      
      You should estimate them using domain knowledge if possible, otherwise, estimate from train and clip test data if they exceed the known bounds.
      
      Reply
Roy July 10, 2017 at 7:08 am #

Hi Jason, I often read about people normalize on the input features, but not on output, why?

Should we normalize on the output features as well if the output have a wide range of scale too? from 1e-3 to 1e3

Reply
- Roy July 10, 2017 at 7:11 am #
  
  BTW, it is for a regression problem.
  
  Reply
  - Jason Brownlee July 11, 2017 at 10:22 am #
    
    You can normalize the output variable in regression too, but you will need to reverse the scaling of predictions in order to make use of them or quote error scores in a meaningful way (e.g. meaningful to your problem).
    
    Reply
    - Roy July 20, 2017 at 12:25 pm #
      
      The MSE loss is very high (1e8) when I didn’t applied normalization on the output variable, and small MSE loss (0.0xxx) when I applied normalization.
      
      Is there something wrong in my implementation? Should I run large epochs(maybe 50000?) when the output variable isn’t normalized? (currently running 500 epochs with normalization).
      
      Reply
      - Jason Brownlee July 21, 2017 at 9:27 am #
        
        Perhaps. Continue to try different things and see if you can learn more about your system/problem.
      - Rizwan Mian, PhD August 1, 2017 at 1:32 pm #
        
        @Roy,
        – if you don’t normalize and the features are not of similar scale, then the gradient descent would take a very long time to converge [1]
        – if Root MSE is much much smaller than the mean/median value of the predicted vector, I think your model is good enough
        
        [1] Stranford.edu: https://www.youtube.com/watch?v=yQci-wS0iMw&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=21https://www.youtube.com/watch?v=yQci-wS0iMw&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=21
- Jason Brownlee July 11, 2017 at 10:21 am #
  
  Normalizing input variables is an intent to make the underlying relationship between input and output variables easier to model.
  
  Reply
Rizwan Mian, PhD August 1, 2017 at 1:23 pm #

Your tutorials are awesome. 🙂

I have converted rescaledX to a dataframe and plotted histogram for rescaling, standardization and normalization. They all seem to be scaling down the magnitude of an attribute to a small range — 0 to 1 in case of rescaling and normalization.
– are they doing similar transformation i.e. scaling down attributes so they become comparable?
– do you only apply one method in any given situation?
– which would be appropriate in which situation?

Thanking in advance.

Reply
- Jason Brownlee August 2, 2017 at 7:43 am #
  
  Good questions.
  
  This post explains how they work and when to use them:
  https://machinelearningmastery.com/how-to-scale-data-for-long-short-term-memory-networks-in-python/
  
  Reply
Vijay Jayaraman August 1, 2017 at 6:24 pm #

Hi Jason, I really like your posts. I was looking for some explanation on using power transformations on data for scaling. Like using logarithms and exponents and stuff like that. I would really like to understand what it does to the data and how we as data scientist can be power users of such scaling techniques

Reply
- Jason Brownlee August 2, 2017 at 7:50 am #
  
  The best advice is to try a suite of transforms and see what results in the more predictive model.
  
  I have a few posts on power transforms like log and boxcox, try the search feature.
  
  Reply
Noor August 15, 2017 at 7:39 pm #

Hi Jason,thanks for your all posts , I have question related to Multilayer Perceptron classification algorithm

if we want to apply this algorithm on mixed data set (numeric and nominal).

EX (23,125,75,black,green) this data presents the age ,length,weight ,Hair color, Eye color Respectively.

For numeric attributes we will normalize the data to be in the same range.
what about nominal attributes?
Do we need to transform nominal attributes to binary attributes?

Reply
- Jason Brownlee August 16, 2017 at 6:32 am #
  
  I would recommend either using an integer encoding or a one hot encoding.
  
  It is common to use a one hot encoding.
  
  I have many posts on the topic.
  
  Reply
mitillo September 2, 2017 at 9:33 pm #

Hello Jason, great post
However,
I have a question (maybe is almost the same that Dimos).
What is the most often approach to preprocess (I mean use 1 of 4 explained)
How values you normalize?
all features (X)
fit_transform train features(X_train_std=model.fit_trainsform(X_train)) and from them transform X_test (X_test_std=model.transform(X_test))

and then:
If we have to predict new features that I get today(for example: 0,95,80,45,92,36.5,0.330,26,0 in diabetes model)

we have to preprocess that feature or is not necessary relevant and predict it without preprocess:

Thank you for help

Reply
- Jason Brownlee September 3, 2017 at 5:43 am #
  
  Any process used to prepare data for training the model must be performed when making predictions on new data with the final model.
  
  This means coefficients used in scaling (e.g. min/max) are really part of the model and must be chosen mindfully.
  
  Reply
  - mitillo September 10, 2017 at 11:04 am #
    
    thak you for your answer
    
    Reply
Asmaa October 4, 2017 at 12:29 pm #

Hi Jason

I am applying normalization for network attacker data. i used min/max normalization. but in the real data there is some features have a large values. if i want to apply standard deviation normalization. should i apply only one normalization type? or can i apply min/max for all data and then apply standard deviation for all data. what is the sequence and is it wrong if i apply standard deviation normalization only on the large value features?

Reply
- Jason Brownlee October 4, 2017 at 3:38 pm #
  
  I would recommend trying both approaches and see what works best for your data and models.
  
  Reply
Peter October 23, 2017 at 7:17 pm #

I don’t understand the two comands.
X = dataset[:,0:8]
Y = dataset[:,8]

Reply
- Jason Brownlee October 24, 2017 at 5:29 am #
  
  This is called array slicing, I will have a post on this topic on the blog tomorrow.
  
  Here, we are selecting the columns 0-7 for input and 8 for output.
  
  Reply
Wisdom November 7, 2017 at 10:27 pm #

Dear Dr. Jason Brownlee, i have prepared my own dataset on hand writing from different people, and i prepared the images in 28X28 pixel so the problem is how i am going to prepare the training and testing data set so as i will then write the code to recognize the data?

Reply
- Jason Brownlee November 8, 2017 at 9:22 am #
  
  Sounds great.
  
  Reply
  - Wisdom November 24, 2017 at 12:48 pm #
    
    my idea is can you help me how i have to do that? and how i have to read my images data set and training data set using tensorflow?
    
    Reply
    - Jason Brownlee November 24, 2017 at 3:09 pm #
      
      Perhaps this example would help:
      https://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/
      
      Reply
Wisdom November 26, 2017 at 10:09 pm #

That is a great link that shows how to use the existing CIFAR-10, thank you for that, but as i tried to mention it above, i have handwritten images prepared in 28×28 pixels, so how i have to prepare the training set (how to label my dataset)? it can be .csv or .txt file, i need the way how i have to prepare training set and access in tensorflow like MNIST?

Reply
- Jason Brownlee November 27, 2017 at 5:51 am #
  
  The images will be standard image formats like jpg or png, the labels will be in a csv file, perhaps associated with each filename.
  
  Reply
Jourdan Go December 26, 2017 at 7:38 pm #

Hi Jason. First of all, great work with the tutorials.

Here’s something I don’t understand though. What’s the difference between rescaling data and normalizing data? It seems like they’re both making sure that all values are between 0 and 1 ?
So what’s the difference?

Thanks.
Please email me the answers as well since i do not check this blog often

jourdandgo@gmail.com

Reply
- Jason Brownlee December 27, 2017 at 5:17 am #
  
  Normalizing is a type of rescaling.
  
  Reply
Andualem Alemu January 2, 2018 at 11:22 pm #

Hello Sir!! I am planning a research work which is about music genera classification. My work includes preparing the dataset for the type of music I want to use as there are no public dataset for those music. My problem is I don’t know how to prepare music dataset. I have red a lot about spectrogram. But, what are the best mechanisms to prepare music dataset? Is it only spectrogram I have to use or I have alternate choices?

Reply
- Jason Brownlee January 3, 2018 at 5:37 am #
  
  Sorry, I cannot help you with music datasets. I hope to cover the topic in the future.
  
  Reply
Hai Nguyen January 9, 2018 at 7:22 am #

Hi, if I would like to scale a image with dim=3x64x64. How to use StandardScaler() to do that? Thank you

Reply
- Jason Brownlee January 9, 2018 at 3:17 pm #
  
  Sorry, the standard scalar is not for scaling images.
  
  Reply
  - Hai Nguyen January 10, 2018 at 3:35 am #
    
    so, to improve the performance of training images, which scale method we should use? Or, just divide train set to a value, for example, train_x/255, …?
    
    Reply
    - Jason Brownlee January 10, 2018 at 5:30 am #
      
      Try a suite of transforms and see what works best for your data.
      
      Reply
Shabnam January 31, 2018 at 5:19 am #

Hi Jason, thanks for your posts.
I have a question about data preprocessing. Can we have multiple inputs with different shape? for example two different files, one including bit vectors, one including matrixes?
If so, how can we use them for ML algorithms?

Reply
- Shabnam January 31, 2018 at 7:30 am #
  
  Basically, I want to add additional information to data, so classifier can use for better prediction.
  
  Reply
- Jason Brownlee January 31, 2018 at 9:49 am #
  
  For most algorithms the input data must be reshaped and padded to be homogeneous.
  
  Reply
  - Shabnam February 1, 2018 at 4:21 am #
    
    Thanks for your response. Yes, I understand that. This extra information is like a metadata that gives information about the structure that generates the data. Therefore, it is a separate type that gives mores information about the system. Is there any way to apply it to ML algorithms?
    
    Reply
    - Jason Brownlee February 1, 2018 at 7:25 am #
      
      Sure, perhaps you could use a multiple input model or an ensemble of models.
      
      Reply
      - Shabnam February 4, 2018 at 6:02 pm #
        
        Do you have any link/reference suggestion that I can read more about it? I could not find a good resource yet. Thanks in advance.
John Reed February 25, 2018 at 6:10 am #

Hi Jason,

What is Y used for? I realize the comment and description say it’s the output column, but after slicing the ‘class’ column to it, I’m not seeing Y used for anything in the four examples. Commenting it out does not seem to have any effect. Is it just a placeholder for later? If so, why did we assign ‘class’ data to it instead of creating an empty array?

Thanks,

John

Reply
- Jason Brownlee February 25, 2018 at 7:46 am #
  
  We are not preparing the class variable in these examples. It is needed during modeling.
  
  Reply
Donatas February 26, 2018 at 2:06 am #

Thanks for great article. I would like to ask a question regarding using simple nearest neighbors algorithm from scikit learn library with standard settings. I have a list of data columns from salesforce leads table giving few metrics for total time spent on page, total emails opened, as well as alphabetical values such as – source of the lead with values signup, contact us etc., as well as country of origin information.

So far I have transformed all non-numerical data to numerical form in the simple way 0, 1, 2, 3, 4 for each unique value. With this approach scoring accuracy seams to reach 70% at its best. Now I want to go one step further and either normalize or standardize the data set, but can’t really decide which route to take. So far I have decided to go with safest advice and standardize all data. But then I have worries about some scenarios, for example certain fields will have long ranges of data, i.e. those representing each country, or those that show number of emails sent. On another hands other fields like source, will have numerical values 0, 1, 2, 3 and no more, but the field itself does have very high correlation to the outcome of winning lead or loosing lead.

I would be very grateful if you could point me to the right direction and perhaps without too much diving into small details, what would be the common sense approach.

Also, is it possible to use both methods for data set, i.e. standardize data first, and then normalize.

Thanks,
Donatas

Reply
- Jason Brownlee February 26, 2018 at 6:07 am #
  
  Good question.
  
  The data preparation methods must scale with the data. Perhaps for counts you can estimate the largest possible/reasonable count that you can use to normalize the count by, or perhaps invert the count, e.g. 1/n.
  
  Reply
Diehumblex March 12, 2018 at 5:47 am #

Hi @jason can you please tell why normalizer result and rescaling (0-1) results are different. isn’t there a standard way of doing so which should give the same result irrespective of the class used (i.e MinMaxScaler or normalizer).

Reply
- Jason Brownlee March 12, 2018 at 6:34 am #
  
  I don’t follow, sorry. Can you give more context?
  
  Reply
Tunde March 27, 2018 at 6:21 pm #

Hi Sír. I have a housing datasets whose target variable is a positively skewed distribution. So far that’s the only variable I have seen to be skewed although I think there will be more. Now I have read that there is need to make this distribution approximately a normal distribution using log transformation. But the challenge I’m facing right now is how to perform log transformation on the price feature in the housing dataset. I’d like to if there is a scikit-learn library for this and if not how should I go about it? More so I plan on using linear regression to predict housing prices for this dataset.

Reply
- Jason Brownlee March 28, 2018 at 6:24 am #
  
  You can use a boxcox transform to fix the skew:
  https://machinelearningmastery.com/power-transform-time-series-forecast-data-python/
  
  Reply
Payal April 3, 2018 at 12:22 am #

Hi Jason
I am using MinMaxScaler preprocessing technique to normalize my data. I have data of 200 patients, where each patient data for single electrode is 20 seconds i.e. 10240 sample.Then, the dimension of my data is 200*10240. I want to rescale my data row-wise but MinMaxScaler scale the data column wise which may not be correct for my data as i want to rescale my data accordingly 1*10240.
What changes are required in order to operate row wise independently of other electrode?

Reply
- Jason Brownlee April 3, 2018 at 6:37 am #
  
  In general, each column of data represents a separate feature that may have a different scale or units from other columns.
  
  For this reason, it is good practice to scale data column-wise.
  
  Does that help?
  
  Reply
itisha April 3, 2018 at 10:24 pm #

HEllo sir,
I have colleted 1000 tweets on demonetization. Then i am extracting different features like pos based, lexiocn based fetaures, morphological features, ngram features.So different feature vectors are created for each type and then they are stacked column wise. I have divided dataset of 1000 tweets into 80% as training and 20% as testing. I have trained svm classifier but accuracy is not more than 60%.
How should i improve accuracy or which feature selection should i need to use?

Thanks

Reply
- Jason Brownlee April 4, 2018 at 6:13 am #
  
  Here are some ideas:
  https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
  
  Reply
Md Jewele Islam April 16, 2018 at 4:54 am #

Big Question is for me. Why should we use random values for weight and Bias value?

Reply
- Jason Brownlee April 16, 2018 at 6:13 am #
  
  Good question.
  
  Neural nets use random initial values for the weights. This is by design. It allows the learning algorithn (batch/mini-batch/stochastic gradient descent) to explore the weight space from a different starting point each time the model is evaluated, removing bias in the training process.
  
  It is why it is a good idea to evaluate a neural net by training multiple instances of the model:
  https://machinelearningmastery.com/evaluate-skill-deep-learning-models/
  
  Does that help?
  
  Reply
kumar April 20, 2018 at 3:27 pm #

X = array[:,0:8]
Y = array[:,8] I have doubt here X is only 0 to right feature and Y is target 8th column right .

Reply
- Jason Brownlee April 21, 2018 at 6:41 am #
  
  Confirm it by inspecting the data, it is correct.
  
  Learn more about array slicing and ranges in python here:
  https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-learning-python/
  
  Reply
Sam May 4, 2018 at 4:43 am #

Thank you so much for all of your help – I have learned a ton from all of your posts!

I have a project where I have 54 input variables, and 8 output variables. I have decent results from what I have learned from you. However, I have standardized all my input variables, and I think I could achieve better performance if I only standardize some of them. Meaning, 5 of the input columns are the same variable type as the outputs, I think it would be better not to scaler this. Additionally, I one of the inputs in the month of the year – I do not think that that needs to be standardized either.

Does my thought process to do selective preprocessing make any sense? Is it possible to do this?

Thank you

Reply
- Jason Brownlee May 4, 2018 at 7:49 am #
  
  You’re welcome Sam.
  
  Perhaps. I would recommend designing and running careful experiments to test your idea. Let the results guide you rather than “best practice”.
  
  Reply
C. Blaise Mitsutama May 12, 2018 at 1:07 am #

Hi, Jason,

Really helpful article. I tried to access the Pima Indian diabetes dataset and it’s no longer available to download at the link provided due to permission restrictions.

Reply
- Jason Brownlee May 12, 2018 at 6:34 am #
  
  Note that I provided an alternate link in the post.
  
  Here it is again:
  https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
  
  Reply
  - C. Blaise Mitsutama May 12, 2018 at 3:31 pm #
    
    Somehow I missed that. Thanks! 🙂
    
    Reply
    - Jason Brownlee May 13, 2018 at 6:34 am #
      
      No probs.
      
      Reply
Vaibhav July 18, 2018 at 9:35 pm #

I have n-dimensional binary data. Suggest some good classifier for binary data-set.

Reply
- Jason Brownlee July 19, 2018 at 7:51 am #
  
  Try a suite of methods and see what works best for your specific dataset.
  
  Reply
Suprasad Kamath July 25, 2018 at 12:43 pm #

Hello Jason, I follow your posts very closely as I am studying machine learning on my own. With respect to scaling/normalizing data, I always have a dilemma. When do I use what? Is there any way to know beforehand which regression/classification models will benefit from scaling or normalizing data? For which models its not required to scale or normalize data?

Reply
- Jason Brownlee July 25, 2018 at 2:40 pm #
  
  Good question, I answer it here:
  https://machinelearningmastery.com/faq/single-faq/when-should-i-standardize-and-normalize-data
  
  Reply
Swati August 1, 2018 at 3:14 am #

Hi Jason,

Thank you for this post. It was very helpful. I have a question on the normalization/standardization approach,when the dataset contains both numeric and categorical features. I am converting the categorical features into dummy values (contains 0 or 1). Should the numeric features be standardized along with the dummy variables?
or, 2) Should the numeric features be only be standardized?

Please provide your thoughts.

Reply
- Jason Brownlee August 1, 2018 at 7:49 am #
  
  No need to scale dummy variables.
  
  Reply
Rowena Low August 8, 2018 at 10:51 am #

Hello Jason,

Thank you for sharing your expertise! I am a complete newbie to Python but have programmed before in stats software like EViews. Are the datasets in sklearn.database formatted differently? I tried to run the following code:

# Load the data
from sklearn.datasets import load_iris
iris = load_iris()

from matplotlib import pyplot as plt

# The indices of the features that we are plotting
x_index = 0
y_index = 1

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

plt.figure(figsize=(5, 4))
plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])

plt.tight_layout()
plt.show()

It’s fine with the dataset from sklearn. But once I use pandas.read_csv to load the iris dataset from a url and then run the code, it just gives me tons of angry text. If I were to use pandas.read_csv, how should I format and store the data such that the aforementioned scipy code would work? Thank you so much!

Reply
- Jason Brownlee August 8, 2018 at 2:17 pm #
  
  Perhaps try this tutorial to get a handle on loading CVS files:
  https://machinelearningmastery.com/load-machine-learning-data-python/
  
  Reply
Pradeep Singh August 9, 2018 at 5:23 am #

Hi! Jason , thanks for the great tutorial. i have a question, would like to hear from you.
from the knowledge statistics to normalize the data we use the formulae

[ X – min(X) ]/( [max(X) – min(X) ] ).

by using this formulae when i am trying to compute the normalised data the answer i am getting is different from that obtained by the Normalizer class. Why is it so.

Reply
- Jason Brownlee August 9, 2018 at 7:45 am #
  
  Normalizer is normalizing the vector length I believe.
  
  Use the MinMax scaler instead.
  
  Reply
Mona August 23, 2018 at 4:32 am #

How to build a single module which can Rescale data,Standardize data,Normalizedata,Binarize data. Please explain

Reply
- Jason Brownlee August 23, 2018 at 6:18 am #
  
  You can create a Pipeline:
  https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/
  
  Reply
munaza August 30, 2018 at 9:33 pm #

Thank you so much

Reply
- Jason Brownlee August 31, 2018 at 8:10 am #
  
  I’m happy it helped.
  
  Reply
Neel September 1, 2018 at 7:35 pm #

I am a new learner with Python ,your comments are of great help Jason .

May you guide how i can comment on class distribution whether ot is balanced or unbalanced and can print the total number of 1’s and 0’s in the training and testing data.

Reply
- Jason Brownlee September 2, 2018 at 5:30 am #
  
  I have advice on working with imbalanced data here:
  https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
  
  Reply
Simeon September 10, 2018 at 3:21 pm #

Hi Jason,
your articles are awesome, thank you very much I am subscribed for forever.
I have a question after scaling the input for my regression model and created the model I need to scale again my input data when I use this ckpt file, how can I pass this scaling to the place where I will use the model? Via the TF session?

Maybe more clearly what I want to do

when I train I do (my data is multiple samples in a csv file):
scaler = StandardScaler().fit(X_train)
X_standard = scaler.transform(X_train)

when I validate I do (my data is multiple samples in a csv file):
scaler = StandardScaler().fit(X_validate)
X_standard = scaler.transform(X_validate)

But here it comes the problem using the saved the model I want to restore the model with a single sample as an input:

X_scaler = preprocessing.StandardScaler()
X_test_test = X_scaler.fit_transform(X_test)

But it gives me output [[0,0,0,0,0,]] and because of that I cannot predict anything with the model object. Can you tell me where I am making mistake?

Reply
- Jason Brownlee September 11, 2018 at 6:26 am #
  
  I’m not sure I follow the question, sorry. What are you trying to do?
  
  Reply
  - Simeon September 11, 2018 at 2:46 pm #
    
    I am making a model using TensorFlow and before batching the data for training I am scaling it using StandardScaler.
    After my model is created I want to restore the checkpoint and make a prediction, but the data that I am inputting is not scaled.
    So my question is how to scale the same way the data when restoring a model?
    Because when I am training the data I am scaling among all the data, the whole data.csv file but later when restoring the model my input is a single sample.
    
    I hope is more clear now.
    
    Reply
    - Jason Brownlee September 12, 2018 at 8:10 am #
      
      You must save the scaler as well, or the coefficients used by the scaler.
      
      Reply
      - Simeon September 12, 2018 at 11:45 am #
        
        Can you show me an example? I have done this serializing in using pickle and then using it again but does not look like the most pretty option.
      - Jason Brownlee September 12, 2018 at 2:38 pm #
        
        You can use pickle.
        
        I think I have some examples of saving scalers and tokenizers with some of the NLP tutorials.
Shrutimoy November 5, 2018 at 12:45 am #

Hey Jason,

I am preprocessing CIFAR-10 data by the sklearn StandardScaler(copy = False, with_mean = True, with_std= True). Then I am doing dimensionality reduction by pca followed by lda on the principal components. The problem is that if I do dimensionality reduction without preprocessing everything works fine, however if I do so after preprocessing I am getting “Memory Error”. I am using svd solver for pca and and eigen solver with auto shrinkage for lda(linear discriminants). Do you have any idea about what might be the cause of this problem?
** I tried min max scaling without calling the library function. Even then I am getting the same error.
Thank You

Reply
- Jason Brownlee November 5, 2018 at 6:16 am #
  
  No idea, sorry. Perhaps post your code and error to stackoverflow?
  
  Reply
Najam ul Qadir November 12, 2018 at 9:42 pm #

Hi Jason. I have recently derived the exact Hessian and gradient for the LM algorithm to train feedforward neural networks via direct differentiation. I have also applied data preprocessing to the input data via two of the techniques you have suggested – scaling and normalization. In some datasets, I have observed faster convergence than the neural network toolobox in MATLAB. You can also view the published paper at:
https://www.omicsonline.org/open-access/direct-differentiation-based-hessian-formulation-for-training-multilayerfeed-forward-neural-networks-using-the-lm-algorithmperform-2229-8711-1000223.pdf

Is it possible I can send you my currently ongoing papers for review once it is finished.
Thanks and best regards,
Najam

Reply
- Jason Brownlee November 13, 2018 at 5:45 am #
  
  Nice work.
  
  Sorry, I don’t have the capacity to review your paper.
  
  Reply
Maxim November 27, 2018 at 4:02 am #

Great article! Do you have insights can multiple scalers/normalizers be used consequently? I mean applying them after each other to the original dataset.

Reply
- Jason Brownlee November 27, 2018 at 6:39 am #
  
  I have some advice for their order here:
  https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/
  
  Reply
Shwethaa January 4, 2019 at 7:45 pm #

Is there any ways to report ML result set via mails?

Reply
- Jason Brownlee January 5, 2019 at 6:52 am #
  
  What do you mean exactly?
  
  You can write code to send emails, but that is an engineering question, not machine learning.
  
  Reply
Ahmad Allawi March 1, 2019 at 8:19 pm #

Dear Dr. Brownlee esq.
You are kindly requested to reply to my, rather simple question:
In ANN, the input data is scaled and the output of the ANN is between (0) and (1). How to convert the output back to real numbers, like power, impedance, voltages, currents etc.
Your reply is very highly appreciated,
Kind Regards
Ahmad Allawi
Consultant Electrical Engineer
PTC Community Member-Mathcad Div.,

Reply
- Jason Brownlee March 2, 2019 at 9:32 am #
  
  In scikit-learn you can achieve this with: encoder.inverse_transform()
  
  Reply
maunish March 26, 2019 at 5:51 pm #

Hey jason thanks for the turorial.

I tried all the given preprocessing technique and then trained the data using SVC

i found that MinMaxScale is giving the higest acuracy

so my question is wether we should always go for preprocessing technique which gives higest

accuracy ?

Reply
- Jason Brownlee March 27, 2019 at 8:57 am #
  
  Yes, exactly.
  
  Use pre-processing techniques that give the best model performance.
  
  Reply
danial April 3, 2019 at 7:28 pm #

hi jason
can i do this in prediction problem?
this in python
data_set = data_set_ / 100000
before prediction i did this
after i did predict*100000
is that right

Reply
- Jason Brownlee April 4, 2019 at 7:50 am #
  
  Yes, if the largest value ever to be seen is 100K.
  
  Reply
  - Danial April 4, 2019 at 8:31 am #
    
    It Means it’s OK if 1st divide by 100000 than multiply at the end with prediction result? . 1st scale the data and than resclae at the end. I didn’t get meaning about to see data 100K
    Can you explain or share some link. Thanks
    
    Reply
    - Danial April 4, 2019 at 8:33 am #
      
      Without scale and resclae error is big. I was thinking if I don’t sclae and rescale result should be same but not the error is big.
      
      Reply
    - Jason Brownlee April 4, 2019 at 2:07 pm #
      
      Yes, that sounds fine.
      
      Reply
Kayal April 14, 2019 at 3:56 pm #

What is the difference between rescaling and normalising data?

Reply
- Jason Brownlee April 15, 2019 at 7:50 am #
  
  Normalizing is a type of rescaling.
  
  Reply
Sang-Ho Kim April 15, 2019 at 10:54 am #

Hi, Jason,

one quick question. Should I standardscaler both input and labels? or only input should work fine. Any difference between them? which one do you recommend? Thanks.

Sang-Ho

Reply
- Sang-Ho Kim April 15, 2019 at 11:57 am #
  
  Oops! the similar question was already answered above. please ignore this question. Thanks
  
  Reply
  - Jason Brownlee April 15, 2019 at 2:39 pm #
    
    No problem.
    
    Reply
- Jason Brownlee April 15, 2019 at 2:39 pm #
  
  For regression, yes this can be a good idea.
  
  I recommend testing with and without scaling inputs and outputs to confirm it lifts model skill.
  
  Reply
Palak Halvadia April 20, 2019 at 5:01 am #

Most of the reference articles I have followed used some kind of numeric data in their data set. How do I work with the non numeric data for making a recommendation system. I look forward very eagerly for your help. Thank you in advance.

Reply
- Jason Brownlee April 20, 2019 at 7:43 am #
  
  If the data is categorical, you can encode it as integer, one hot encoding or use an embedding.
  
  Does that help?
  
  Reply
  - Palak Halvadia April 21, 2019 at 6:55 am #
    
    Can you share an example for this which includes the creation of recommendation system also?
    
    Reply
    - Jason Brownlee April 21, 2019 at 8:29 am #
      
      Thanks for the suggestion, I hope to cover the topic in the future.
      
      Reply
Ain April 28, 2019 at 2:20 pm #

Is there any chance/situation, we got the higher value of rating prediction than actual rating? Let’s say i use the data set contain of 1-5 rating, and once i apply the collaborative filtering techniques, i got the prediction rating for some items is 5.349++ ?

Reply
- Jason Brownlee April 29, 2019 at 8:14 am #
  
  Perhaps, it depends on your choice of model.
  
  If your model does this, e.g. a neural net, you might want write some code to help interpret the prediction, e.g. round it down.
  
  Reply
Ain April 29, 2019 at 1:16 pm #

I thought at first it could be wrong to have the result of higher prediction value in recommender system. Thank you so much for the explanation.

Reply
danial April 30, 2019 at 5:04 pm #

maxVal = max(data_set[0:num, 0]).astype(‘float32’)

predict, len(test)+1-i) – maxVal

hi jason, can i ask you is that right above 2 lines of code if i do?beause if i dont minus maxval true and predicted have big difference,,one curve near to maximum value other is near to low values

Reply
- Jason Brownlee May 1, 2019 at 7:00 am #
  
  Perhaps post to stackoverflow?
  
  Reply
  - Danial May 1, 2019 at 2:09 pm #
    
    I think so yes. Because the maximum value in data set is 55000 and without minus of maxVal function maximum predicted value goes to 10000 and shows predicted near to 10000 and true near to 55000. And Kindly suggest some link how to use one week or one year predicted values to predict 24 hours ahead or one week ahead in CNN?
    
    Reply
Patrick Sibanda May 25, 2019 at 8:01 pm #

Is there a way I can detect and ignore outliers within the process of re-sampling a data-set using

df1= df.resample(‘3T’).agg(dict(Tme=’first’, Vtc=’mean’, Stc=’mean’))

This line groups the data by “Tme” and computes the mean of the “Vtc” and “Stc” using all the values of the “Vtc” that fell at a particular “Tme”. But some of these data points are outliers. Is there anything I cann do within the .agg() so that I can ignore the outliers whn evaluating the mean

see my problem on stackoverflow here

https://stackoverflow.com/questions/56276509/detect-and-exclude-outliers-when-resampling-by-mean-pandas

Reply
- Jason Brownlee May 26, 2019 at 6:44 am #
  
  Perhaps use a statistical method after gathering the sample?
  
  Reply
Kuldeep May 26, 2019 at 12:59 am #

While Preparing data for machine learning, we have 3 options: we should either Rescale data,Standardize data or Normalize data. In case of Pima Indian diabetes dataset, which option should i select ? because the data visualization shows that some attributes have nearly Gaussian distribution, some have exponential distribution. So when attributes have mixed distribution, if i opt for standardization it will be good for those attributes which have Gaussian distribution and not for exponential distribution attributes as these attributes we should go for log or cube root . So how do we handle this??

Reply
- Jason Brownlee May 26, 2019 at 6:46 am #
  
  It depends on the models used.
  
  I recommend testing different data preparation methods with different models to rapidly find a combination that works well.
  
  Reply
John May 29, 2019 at 1:51 am #

This is neat, but it’s all applied to the training data. What if we create this model and are happy with it, but we then get a new batch of data every night that we want to score? Is there a way to save ALL of these pre-processing treatments in such a fashion that they can be applied directly to a new set of data, hopefully in just a couple lines of code. (e.g. the vtreat package in R can do some data treatments and you can save the “treatment plan” then apply it at scoring time of any new data).

Is there an easy recipe for doing that in Python?

Reply
- Jason Brownlee May 29, 2019 at 8:47 am #
  
  Yes, you can use a pipeline:
  https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/
  
  You could re-train a model each night and compare it to the old model to see what works best.
  
  Reply
Abhi June 12, 2019 at 9:55 pm #

Hi Mr.Jason,
I am working on dataset containing ‘javascript code samples’.I have to preprocess the dataset so that it will be available for further model.For this type of dataset what type of encoding should i use.

Reply
- Jason Brownlee June 13, 2019 at 6:17 am #
  
  Perhaps compare a bag of words and a word embedding?
  
  Reply
Abhi June 13, 2019 at 3:50 pm #

Thanks for your reply.I think, since dataset is not for document analysis so i am in doubt whether BOW will work or not.May you please clearify a bit.

Reply
- Jason Brownlee June 14, 2019 at 6:36 am #
  
  Perhaps try it and see?
  
  Reply
Shahzaib July 31, 2019 at 6:15 pm #

ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’).

I need your help to resolve this error I am beginner.

Reply
- Jason Brownlee August 1, 2019 at 6:46 am #
  
  Perhaps try removing the NaN values from your data prior to modeling?
  
  Reply
krs reddy November 6, 2019 at 5:09 pm #

you can also include RobustScaler() which scales features using statistics that are robust to outliers.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

Reply
- Jason Brownlee November 7, 2019 at 6:34 am #
  
  Great suggestion!
  
  Reply
Melkamu November 16, 2019 at 12:53 am #

Thank you so much, sir. get a lot of knowledge from your post. but my question will be what is the benefit of data reprocessing for machine learning to make prediction model

Reply
- Jason Brownlee November 16, 2019 at 7:26 am #
  
  Thanks.
  
  To better expose the structure of the problem to the learning algorithms.
  
  Reply
kanni December 10, 2019 at 10:05 pm #

i have a data 400001×16 different types of sensors data how to choose input shape and classify the one class neural network??
training and testing different data …..
these are different types of sensors data how to put the input shape??
how to classify in neural network??

Reply
- Jason Brownlee December 11, 2019 at 6:58 am #
  
  If these are time series, then this will show you how to reshape the data:
  https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input
  
  Reply
  - kanni December 11, 2019 at 5:12 pm #
    
    these are unlabels data so how to possable classification??
    
    Reply
    - Jason Brownlee December 12, 2019 at 6:11 am #
      
      If you want to model your data as classification, you must have labels to learn from.
      
      Reply
Shubham February 27, 2020 at 7:58 pm #

How to create a data set a for a particular object automatically without manual entry

Reply
- Jason Brownlee February 28, 2020 at 6:05 am #
  
  I don’t know.
  
  Reply
Rachel March 3, 2020 at 2:47 am #

Which method should I use if I have too many 0’s and 1’s in my dataset.

Reply
- Jason Brownlee March 3, 2020 at 6:02 am #
  
  I recommend testing a suite of data preparation and modeling algorithms in order to discover what works best for your specific dataset.
  
  Reply
Nhu April 22, 2020 at 5:59 am #

Hi
i like ur post to much. i have a question. Pls help me
i have a dataset. i want clustering customers by Kmeans. So i have preprocess my data
Revency Frequency Monetary
302 312 5288.63
31 196 3079.1
2 251 7187.34
95 28 948.25
Data Rescaling, Data Normalization,Data Standardization. of the three methods above. Which method should I choose to preprocess for my dataset

Reply
- Jason Brownlee April 22, 2020 at 6:10 am #
  
  Perhaps start with normalization.
  
  Reply
Nhu April 22, 2020 at 3:11 pm #

Thank you so much so much so much. i read to many ur post . really very very great

Reply
- Jason Brownlee April 23, 2020 at 5:56 am #
  
  Thanks!
  
  Reply
Nhu April 25, 2020 at 2:33 am #

Hi
pls help me
i have a dataset. i want clustering customers by Kmeans. So i have preprocess my data. i used ” Normalization method” then Kmeans on data just preprocessing. but result differents to much with a other method clustering. so i dont sure . my code is wrong or right when i make preprocess by ” Normalization method”.So can u see my code and tell for me. it right or wrong (i only start study in Python and sorry for my english is not good)
Revency Frequency Monetary
302 312 5288.63
31 196 3079.1
2 251 7187.34
95 28 948.25

#PRE-PROCESSING ———————————————–
col_names = [‘R’,’F’, ‘M’]
#Step 3: Normalize Data
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
tx_user[col_names] = normalizer.fit_transform(tx_user[col_names])

Reply
- Jason Brownlee April 25, 2020 at 7:01 am #
  
  Yes, results will be dependent upon the scale of the input data.
  
  Reply
Nhu April 25, 2020 at 3:12 pm #

Thanks. have a beautifull day to u

Reply
- Jason Brownlee April 26, 2020 at 6:04 am #
  
  You’re welcome.
  
  Reply
Dipen Thakrar June 17, 2020 at 4:40 pm #

Greate Job…

Reply
- Jason Brownlee June 18, 2020 at 6:21 am #
  
  Thanks!
  
  Reply
Abhi Bhagat August 30, 2020 at 5:27 pm #

Can u explain the difference between Normalizing data
and Re-scaling data ?

My doubt arises when we rescaled between 0 to 1. Why arent the 2 same?

Also

“Normalizing in scikit-learn refers to rescaling EACH observation (row) to have a length of 1 (called a unit norm in linear algebra).”
“The rows are normalized to length 1.”

I didnt got these lines.
please explain.

Reply
- Jason Brownlee August 31, 2020 at 6:11 am #
  
  Normalization is a specific type of rescaling. Rescaling is a broader term and might include standardization and other techniques.
  
  Normalization as a term is confusing. In linear algebra it means making the magnitude of the vector 1, this is where sklearn get’s the name and refers to scaling each column as “minmaxscaler”.
  
  In statistics, we refer to normalizing a feature (column) as normalziation. This is the common name that I use for the minmaxscaler.
  
  Reply
Qizal Ashfaq November 26, 2020 at 3:15 am #

How to invert values after using “NORMALIZER”, I mean how to get original values back?

Reply
- Jason Brownlee November 26, 2020 at 6:35 am #
  
  Use:
  
  ... result = transform.inverse_transform(...)
  
  1
  2
  
  ...
  result = transform.inverse_transform(...)
  
  Reply
  - Qizal Ashfaq November 27, 2020 at 1:18 am #
    
    ok. Thank you.
    
    Reply
No Name January 29, 2021 at 1:56 am #

How can I visualize data that is words. For example, I have a csv file with words, but I cant visualize it.

Reply
- Jason Brownlee January 29, 2021 at 6:06 am #
  
  I don’t know about visualizing words, sorry.
  
  Reply
Abhishek August 18, 2022 at 1:35 pm #

Hi Jason, i have normalized my data which i sampled with LHS so it’s histogram looks like uniformly distributed not normally so there is any need of data transformation, if yes then what transformation should i do for my problem.

Reply
- James Carmichael August 19, 2022 at 7:36 am #
  
  Hi Abhishek…A survey of various transformations can be found in the following resource:
  
  https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/
  
  Reply

Navigation

How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

Need For Data Preprocessing

Need help with Machine Learning in Python?

Preprocessing Machine Learning Recipes

1. Rescale Data

2. Standardize Data

3. Normalize Data

4. Binarize Data (Make Binary)

Summary

Discover Fast Machine Learning in Python!

Develop Your Own Models in Minutes

Finally Bring Machine Learning To
Your Own Projects

More On This Topic

171 Responses to How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

Leave a Reply Click here to cancel reply.

Navigation

Need For Data Preprocessing

Need help with Machine Learning in Python?

Preprocessing Machine Learning Recipes

1. Rescale Data

2. Standardize Data

3. Normalize Data

4. Binarize Data (Make Binary)

Summary

Discover Fast Machine Learning in Python!

Develop Your Own Models in Minutes

Finally Bring Machine Learning To Your Own Projects

More On This Topic

171 Responses to How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

Leave a Reply Click here to cancel reply.

Finally Bring Machine Learning To
Your Own Projects