Automate Machine Learning Workflows with Pipelines in Python and scikit-learn

There are standard workflows in a machine learning project that can be automated.

In Python scikit-learn, Pipelines help you clearly define and automate these workflows.

In this post you will discover Pipelines in scikit-learn and how you can automate common machine learning workflows.

Let’s get started.

  • Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
Photo by Brian Cantoni, some rights reserved.

Pipelines for Automating Machine Learning Workflows

There are standard workflows in applied machine learning. Standard because they overcome common problems like data leakage in your test harness.

Python scikit-learn provides a Pipeline utility to help automate machine learning workflows.

Pipelines work by chaining a linear sequence of data transforms together, culminating in a modeling process that can be evaluated as a single unit.

The goal is to ensure that all of the steps in the pipeline are constrained to the data available for the evaluation, such as the training dataset or each fold of the cross validation procedure.

You can learn more about Pipelines in scikit-learn by reading the Pipeline section of the user guide. You can also review the API documentation for the Pipeline and FeatureUnion classes in the pipeline module.


Pipeline 1: Data Preparation and Modeling

An easy trap to fall into in applied machine learning is leaking data from your training dataset to your test dataset.

To avoid this trap you need a robust test harness with strong separation of training and testing. This includes data preparation.

Data preparation is one easy way to leak knowledge of the whole dataset to the algorithm. For example, applying normalization or standardization to the entire dataset before learning would not be a valid test, because the scaling parameters would have been influenced by the data in the test set.

Pipelines help you prevent data leakage in your test harness by ensuring that data preparation like standardization is constrained to each fold of your cross validation procedure.
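To make the difference concrete, the sketch below contrasts scaling the whole dataset up front (leaky) with scaling inside a Pipeline (leak-free). A synthetic dataset stands in for real data here, so the exact scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset for illustration.
X, y = make_classification(n_samples=200, random_state=7)

# Leaky: the scaler is fit on every row, including rows that will
# later serve as test folds during cross validation.
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(), X_scaled, y, cv=10)

# Leak-free: the scaler is re-fit on only the training rows of
# each fold, because it lives inside the pipeline.
model = Pipeline([('standardize', StandardScaler()),
                  ('model', LogisticRegression())])
safe = cross_val_score(model, X, y, cv=10)

print(leaky.mean(), safe.mean())
```

The two mean scores are often close, but only the second is an honest estimate of performance on unseen data.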

The example below demonstrates this important data preparation and model evaluation workflow. The pipeline is defined with two steps:

  1. Standardize the data.
  2. Learn a Linear Discriminant Analysis model.

The pipeline is then evaluated using 10-fold cross validation.
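A sketch of this workflow is below. A built-in scikit-learn dataset is used as a stand-in for the original data, so treat the numbers as illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

# Stand-in binary classification dataset.
X, y = load_breast_cancer(return_X_y=True)

# Step 1: standardize the data; Step 2: fit an LDA model.
# Chaining them ensures scaling is fit on the training split of
# each fold only.
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)

# Evaluate the whole pipeline with 10-fold cross validation.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())
```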

Running the example provides a summary of accuracy of the setup on the dataset.

Pipeline 2: Feature Extraction and Modeling

Feature extraction is another procedure that is susceptible to data leakage.

Like data preparation, feature extraction procedures must be restricted to the data in your training dataset.

scikit-learn provides a handy tool called FeatureUnion, which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained. Importantly, all of the feature extraction and the feature union occur within each fold of the cross validation procedure.

The example below demonstrates the pipeline defined with four steps:

  1. Feature Extraction with Principal Component Analysis (3 features)
  2. Feature Extraction with Statistical Selection (6 features)
  3. Feature Union
  4. Learn a Logistic Regression Model

The pipeline is then evaluated using 10-fold cross validation.
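The sketch below follows those four steps, again using a built-in dataset as a stand-in for the original data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in binary classification dataset.
X, y = load_breast_cancer(return_X_y=True)

# Steps 1-3: extract 3 PCA components and the 6 best features by
# univariate statistics, then union them into one 9-column dataset.
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

# Step 4: learn a logistic regression model on the combined features.
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression(max_iter=10000)))
model = Pipeline(estimators)

# Evaluate the whole pipeline with 10-fold cross validation.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())
```

Note that the union simply concatenates the outputs of each extractor as new columns, which answers a common question about how the combined dataset is formed.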

Running the example provides a summary of accuracy of the pipeline on the dataset.

Summary

In this post you discovered the difficulties of data leakage in applied machine learning.

You discovered the Pipeline utilities in Python scikit-learn and how they can be used to automate standard applied machine learning workflows.

You learned how to use Pipelines in two important use cases:

  1. Data preparation and modeling constrained to each fold of the cross validation procedure.
  2. Feature extraction and feature union constrained to each fold of the cross validation procedure.

Do you have any questions about data leakage, Pipelines or this post? Ask your questions in the comments and I will do my best to answer.




15 Responses to Automate Machine Learning Workflows with Pipelines in Python and scikit-learn

  1. Ebrahimi September 23, 2016 at 6:47 am #

    Hi
    thanks for your good post!
When we use a pipeline, the train data (9 folds) will be normalized, and then the parameters used to normalize the train data are used to normalize the test data? This post suggests doing this:
    http://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-I

    Thanks

  2. Ebrahimi September 23, 2016 at 6:52 am #

    Could you please provide us an example where pipeline is used for data preparation and feature selection? Oversampling (ADASYN) should also be done just on train data. Is it possible to do these three altogether?

    Thank you very much

  3. Ebrahimi September 23, 2016 at 7:15 am #

    thanks
    I found my answer here:
    http://stats.stackexchange.com/questions/228774/cross-validation-of-a-machine-learning-pipeline

  4. Chaks October 5, 2016 at 6:48 pm #

    what happens if Y has more than one column of data ( replacing Y = array[:,8]) and how to do it using your method? Thanks & regards, Chaks

    • Jason Brownlee October 6, 2016 at 9:32 am #

      Great question, the idea of predicting multiple outputs.

      I do not have any examples and I’m unsure of whether sklearn supports this behavior. I know that individual algorithms do support this, such as neural networks.

  5. Chaks October 6, 2016 at 12:05 am #

    How to get precision, recall values with pipelines?

  6. Pratik Patil February 1, 2017 at 9:39 pm #

    Hi Jason,
After getting a best fit through the Keras/scikit-learn wrapper with a Pipeline to standardize, how do I access the weights of the Keras regressor/classifier present in the Pipeline? I want to print those weights to text or csv, but I am having difficulty accessing them.

    • Jason Brownlee February 2, 2017 at 1:56 pm #

      I’m not sure Pratik.

      You may want to transform the data separately and use a standalone Keras model where you can access the weights directly.

  7. Dan March 6, 2017 at 1:17 am #

    Thanks for the great Post Jason.

For my understanding: the FeatureUnion allows us to put two feature extraction methods into the pipeline which remain independent, right? Because using PCA to get 3 features and then selecting the best 6 ones would make no sense. But how exactly are they combined? Can you elaborate on that or recommend a good source?

ps: Keras vs tflearn vs Tensorflow? I find Keras the easiest one for a beginner like me.
    Thanks

    • Jason Brownlee March 6, 2017 at 10:59 am #

      Hi Dan,

      The union of features just adds them to one large dataset as new columns for you to work on/use later in the pipeline.

  8. Dror Atariah July 24, 2017 at 4:39 pm #

    In the second example, why don’t you add to the pipeline a normalization step; something like

    estimators.append((‘standardize’, StandardScaler()))

    from the first example?

    • Jason Brownlee July 25, 2017 at 9:34 am #

      For sure you could. In the second example, I was trying to demonstrate something else besides scaling.

  9. Jan September 8, 2017 at 7:09 pm #

    Hi Jason!

    Thx for this article. Once again, it is the missing part of the puzzle!

    You write “preparing your data using normalization or standardization on the entire training dataset before learning would not be a valid test”. You refer to standardizing the entire data set before splitting into train/validation and independent test set, yes?

    Is there any more information on when and where to standardize the data in supervised learning tasks – some kind of flow chart on how to avoid data leakage for the most common workflows?

    Regards!
