How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

Many machine learning algorithms make assumptions about your data.

It is often a very good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use.

In this post you will discover how to prepare your data for machine learning in Python using scikit-learn.

Let’s get started.

Need For Data Preprocessing

You almost always need to preprocess your data. It is a required step.

A difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, even when you follow all of the rules and prepare your data, algorithms can sometimes deliver better results without the preprocessing.

Generally, I would recommend creating many different views and transforms of your data, then exercising a handful of algorithms on each view of your dataset. This will help you to flush out which data transforms might be better at exposing the structure of your problem in general.
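As a rough sketch of that workflow, the snippet below scores two algorithms on several views of one dataset using cross-validation. The synthetic data from make_classification, the choice of logistic regression and k-nearest neighbors, and the 5-fold setup are all assumptions made for illustration; they are not part of the recipes in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 8 numeric inputs, binary output.
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

# Each "view" is an optional transform applied before the model.
views = {
    "raw": None,
    "rescaled": MinMaxScaler(),
    "standardized": StandardScaler(),
    "normalized": Normalizer(),
}
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
}

results = {}
for view_name, transform in views.items():
    for model_name, model in models.items():
        steps = [("model", model)]
        if transform is not None:
            steps.insert(0, ("prep", transform))
        # The pipeline refits the transform on the training part of each fold.
        scores = cross_val_score(Pipeline(steps), X, y, cv=5)
        results[(view_name, model_name)] = scores.mean()

# Report mean accuracy for every view/algorithm pair.
for (view_name, model_name), accuracy in sorted(results.items()):
    print(f"{view_name:>12} + {model_name}: {accuracy:.3f}")
```

Wrapping the transform and the model in one Pipeline keeps the comparison fair, because the transform never sees the held-out fold while it is being fitted.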

Preprocessing Machine Learning Recipes

This section lists 4 different data preprocessing recipes for machine learning.

All of the recipes were designed to be complete and standalone.

You can copy and paste them directly into your project and start working.

The Pima Indian diabetes dataset is used in each recipe. This is a binary classification problem where all of the attributes are numeric and have different scales. It is a great example of a dataset that can benefit from preprocessing.

You can learn more about this data set on the UCI Machine Learning Repository webpage.

Each recipe follows the same structure:

  1. Load the dataset from a URL.
  2. Split the dataset into the input and output variables for machine learning.
  3. Apply a preprocessing transform to the input variables.
  4. Summarize the data to show the change.
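The four steps above can be sketched as follows. To keep the example runnable offline, a few rows typed inline in the dataset's nine-column layout stand in for the download; treat the values as illustrative. With the real file you would instead pass its URL or local path to a CSV reader such as pandas.read_csv. MinMaxScaler is only a placeholder for step 3.

```python
import io

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# 1. Load the dataset. An inline sample stands in for the download here
#    (assumption): eight numeric inputs followed by a 0/1 class label.
sample_csv = """\
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""
data = np.loadtxt(io.StringIO(sample_csv), delimiter=",")

# 2. Split into input (X) and output (Y) variables.
X = data[:, 0:8]
Y = data[:, 8]

# 3. Apply a preprocessing transform to the inputs only.
transform = MinMaxScaler()  # placeholder; any scikit-learn transformer fits here
transformedX = transform.fit_transform(X)

# 4. Summarize the data to show the change.
np.set_printoptions(precision=3, suppress=True)
print(X[:3])
print(transformedX[:3])
```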

The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future.
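A minimal sketch of that idea, assuming a tiny made-up training set and two made-up future samples: the scaler learns its parameters from the training data with fit(), and the same parameters are reused on later data with transform().

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data: two attributes on different scales.
train = np.array([[1.0, 10.0],
                  [3.0, 30.0],
                  [5.0, 50.0]])

# Hypothetical samples that arrive later.
future = np.array([[2.0, 20.0],
                   [6.0, 60.0]])

# Learn the per-column min and max from the training data only.
scaler = MinMaxScaler().fit(train)

# Reuse the learned min/max on both the training data and the new samples.
train_scaled = scaler.transform(train)
future_scaled = scaler.transform(future)

print(train_scaled)
print(future_scaled)
```

Note that the second future sample lies outside the training range, so its scaled values come out above 1. That is expected when the parameters come from the training data alone.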

The scikit-learn documentation has some information on how to use the various preprocessing methods. You can review the preprocessing API in the reference for the sklearn.preprocessing module.

1. Rescale Data

When your data is made up of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale.

Often this is referred to as normalization, and attributes are often rescaled into the range between 0 and 1. This is useful for the optimization algorithms used at the core of machine learning algorithms, like gradient descent. It is also useful for algorithms that weight inputs, like regression and neural networks, and algorithms that use distance measures, like K-Nearest Neighbors.

You can rescale your data with scikit-learn using the MinMaxScaler class.
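A minimal sketch of this recipe is shown below. It assumes a few inline rows in the dataset's eight-input layout in place of the downloaded file, so the values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Inline stand-in for the input columns of the dataset (assumption).
X = np.array([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
])

# Rescale every column independently into the range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# Summarize the transformed data.
np.set_printoptions(precision=3, suppress=True)
print(rescaledX)
```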

After rescaling you can see that all of the values are in the range between 0 and 1.

2. Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.

You can standardize data using scikit-learn with the StandardScaler class.
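A minimal sketch of this recipe follows, again assuming a few inline rows in the dataset's eight-input layout rather than the downloaded file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Inline stand-in for the input columns of the dataset (assumption).
X = np.array([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
])

# Learn each column's mean and standard deviation, then center and scale.
scaler = StandardScaler().fit(X)
standardizedX = scaler.transform(X)

# Summarize the transformed data.
np.set_printoptions(precision=3, suppress=True)
print(standardizedX)
```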

The values for each attribute now have a mean value of 0 and a standard deviation of 1.

3. Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

You can normalize data in Python with scikit-learn using the Normalizer class.
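A minimal sketch of this recipe follows, assuming a few inline rows in the dataset's eight-input layout and the Normalizer default of the L2 (Euclidean) norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Inline stand-in for the input columns of the dataset (assumption).
X = np.array([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
])

# Scale each row (not each column) so that its L2 norm equals 1.
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# Summarize the transformed data and the resulting row lengths.
np.set_printoptions(precision=3, suppress=True)
print(normalizedX)
print(np.linalg.norm(normalizedX, axis=1))
```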

The rows are normalized to length 1.

4. Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing your data or thresholding your data. It can be useful when you have probabilities that you want to make crisp values. It is also useful when feature engineering and you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using scikit-learn with the Binarizer class.
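A minimal sketch of this recipe follows, assuming a few inline rows in the dataset's eight-input layout and a threshold of 0.0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Inline stand-in for the input columns of the dataset (assumption).
X = np.array([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
])

# Values strictly above the threshold become 1; all others become 0.
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

# Summarize the transformed data.
print(binaryX)
```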

You can see that all values equal to or less than 0 are marked 0 and all of those above 0 are marked 1.

Summary

In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.

You now have recipes to:

  • Rescale data.
  • Standardize data.
  • Normalize data.
  • Binarize data.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocessing in scikit-learn.

Do you have any questions about data preprocessing in Python or this post? Ask in the comments and I will do my best to answer.



24 Responses to How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

  1. suhel May 18, 2016 at 2:40 pm #

    Hey Jason,

    On Normalizing, do you need to do this if you are planning on using euclidean, or cosine distance measures to find similar items in a dataframe?

    e.g. you have a vector where each column has some attributes about the product, and you want to find other products that have similar attributes.

    Keen to hear your thoughts

    Thanks
    SM

  2. Ernest Quisbert December 20, 2016 at 9:41 pm #

    Excellent!

  3. Akshay January 5, 2017 at 11:44 pm #

    Hi Jason,
    Thanks for the post and the website overall. It really explains a lot.
    I have a question regarding preparing the data ,if I am to normalize my Input data, does the precision of the values have an effect ? Will it make the weight matrix more sparse while training with higher precision if the training data is not very high?

    In that case should I be limiting the precision depending on the amount of training data?

    I am interested in sequence classification for EEG, In my case I intend to try out RNN . I was planning on normalizing the data since I wish the scaling to be performed on each individual input sequence.

    Hoping to hear from you,thanks !

    • Jason Brownlee January 6, 2017 at 9:12 am #

      Great question Akshay.

      I don’t have a clear answer for you. It may. I have not seen it have an effect, but I would not rule it out.

      If you’re worried, I would recommend testing with samples of your data at different precisions and different transforms and evaluate the effect.

      I expect the configuration of your model will be a much larger lever on performance.

      • Akshay January 6, 2017 at 1:39 pm #

        Hi Jason,Thank you for the reply.

        I intend to build an RNN from scratch for the application similar to sentiment analysis (Many to one). I am a bit confused about the final stage. while training, when I feed a single sequence(belong to one of the class) to the training set , do I apply softmax to the last output of the network alone and compute the loss and leave the rest unattended?
        Where exactly is the many to “ONE” represented?

        • Jason Brownlee January 7, 2017 at 8:22 am #

          Sorry Akshay, I don’t have an example of implementing an RNN from scratch.

          My advice would be to peek into the source code for a standard deep learning library like Keras.

  4. DImos May 24, 2017 at 8:04 am #

    Should one normalize the test and train datasets separately? or does he have to normalize the whole dataset, before splitting it?

    • Jason Brownlee June 2, 2017 at 11:30 am #

      Yes. Normalize the train dataset and use the min/max from train to normalize the test set.

      • Rizwan Mian, PhD August 1, 2017 at 1:25 pm #

        In this case, min/max of test set might be smaller or bigger than min/max of the training set. If they are, would it cause a problem to the validation?

        • Jason Brownlee August 2, 2017 at 7:43 am #

          You should estimate them using domain knowledge if possible, otherwise, estimate from train and clip test data if they exceed the known bounds.

  5. Roy July 10, 2017 at 7:08 am #

    Hi Jason, I often read about people normalize on the input features, but not on output, why?

    Should we normalize on the output features as well if the output have a wide range of scale too? from 1e-3 to 1e3

    • Roy July 10, 2017 at 7:11 am #

      BTW, it is for a regression problem.

      • Jason Brownlee July 11, 2017 at 10:22 am #

        You can normalize the output variable in regression too, but you will need to reverse the scaling of predictions in order to make use of them or quote error scores in a meaningful way (e.g. meaningful to your problem).

    • Jason Brownlee July 11, 2017 at 10:21 am #

      Normalizing input variables is an intent to make the underlying relationship between input and output variables easier to model.

  6. Rizwan Mian, PhD August 1, 2017 at 1:23 pm #

    Your tutorials are awesome. 🙂

    I have converted rescaledX to a dataframe and plotted histogram for rescaling, standardization and normalization. They all seem to be scaling down the magnitude of an attribute to a small range — 0 to 1 in case of rescaling and normalization.
    – are they doing similar transformation i.e. scaling down attributes so they become comparable?
    – do you only apply one method in any given situation?
    – which would be appropriate in which situation?

    Thanking in advance.

  7. Vijay Jayaraman August 1, 2017 at 6:24 pm #

    Hi Jason, I really like your posts. I was looking for some explanation on using power transformations on data for scaling. Like using logarithms and exponents and stuff like that. I would really like to understand what it does to the data and how we as data scientist can be power users of such scaling techniques

    • Jason Brownlee August 2, 2017 at 7:50 am #

      The best advice is to try a suite of transforms and see what results in the more predictive model.

      I have a few posts on power transforms like log and boxcox, try the search feature.

  8. Noor August 15, 2017 at 7:39 pm #

    Hi Jason,thanks for your all posts , I have question related to Multilayer Perceptron classification algorithm

    if we want to apply this algorithm on mixed data set (numeric and nominal).

    EX (23,125,75,black,green) this data presents the age ,length,weight ,Hair color, Eye color Respectively.

    For numeric attributes we will normalize the data to be in the same range.
    what about nominal attributes?
    Do we need to transform nominal attributes to binary attributes?

    • Jason Brownlee August 16, 2017 at 6:32 am #

      I would recommend either using an integer encoding or a one hot encoding.

      It is common to use a one hot encoding.

      I have many posts on the topic.
