Automate Machine Learning Workflows with Pipelines in Python and scikit-learn

By Jason Brownlee on August 28, 2020 in Python Machine Learning

There are standard workflows in a machine learning project that can be automated.

In Python scikit-learn, Pipelines help to clearly define and automate these workflows.

In this post you will discover Pipelines in scikit-learn and how you can automate common machine learning workflows.


Let’s get started.

Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.

Photo by Brian Cantoni, some rights reserved.

Pipelines for Automating Machine Learning Workflows

There are standard workflows in applied machine learning. They are standard because they overcome common problems, such as data leakage in your test harness.

Python scikit-learn provides a Pipeline utility to help automate machine learning workflows.

A Pipeline chains a linear sequence of data transforms together, ending in a model, so that the whole sequence can be fit and evaluated as a single estimator.

The goal is to ensure that all of the steps in the pipeline are constrained to the data available for the evaluation, such as the training dataset or each fold of the cross validation procedure.
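As a rough illustration of this idea, the sketch below fits a two-step pipeline on a training split and checks that the scaler's statistics come from the training rows alone. The synthetic data from make_classification, the split size and the variable names are assumptions made for this example; they are not part of the tutorial's dataset.

```python
# Sketch: a pipeline learns its data transform from the training rows only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a real dataset (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7)

model = Pipeline([
    ('standardize', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
])
# fit() fits the scaler on X_train, transforms X_train, then fits the model
model.fit(X_train, y_train)

# the fitted scaler holds the training means, not the means of all rows
scaler = model.named_steps['standardize']
fitted_on_train = np.allclose(scaler.mean_, X_train.mean(axis=0))
matches_full_data = np.allclose(scaler.mean_, X.mean(axis=0))
print(fitted_on_train, matches_full_data)

# score() reuses the training-set scaling when transforming X_test
accuracy = model.score(X_test, y_test)
print(accuracy)
```

The same fitted pipeline object can be passed to predict() on new rows, and the stored training-set scaling is applied automatically.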

You can learn more about Pipelines in scikit-learn by reading the Pipeline section of the user guide. You can also review the API documentation for the Pipeline and FeatureUnion classes in the pipeline module.

Pipeline 1: Data Preparation and Modeling

An easy trap to fall into in applied machine learning is leaking information from your test dataset into your training dataset.

To avoid this trap you need a robust test harness with strong separation of training and testing. This includes data preparation.

Data preparation is one easy way to leak knowledge of the whole dataset to the algorithm. For example, preparing your data using normalization or standardization on the entire dataset before learning would not be a valid test because the training dataset would have been influenced by the scale of the data in the test set.

Pipelines help you prevent data leakage in your test harness by ensuring that data preparation like standardization is constrained to each fold of your cross validation procedure.

The example below demonstrates this important data preparation and model evaluation workflow. The pipeline is defined with two steps:

Standardize the data.
Learn a Linear Discriminant Analysis model.

The pipeline is then evaluated using 10-fold cross validation.

# Create a pipeline that standardizes the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)
# evaluate pipeline
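# no shuffling is used, so no random_state is needed; recent scikit-learn
# versions raise an error if random_state is set while shuffle is False
kfold = KFold(n_splits=10)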
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example prints the mean classification accuracy of the pipeline across the 10 folds.

0.773462064252
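For comparison, the work the pipeline does inside each fold can be written out by hand. The sketch below is an illustration under assumptions: it uses synthetic data from make_classification rather than the diabetes dataset, and the loop and variable names are hypothetical. It fits the scaler and the model on the training rows of each fold and scores on the held-out rows, then checks that cross_val_score on the pipeline gives the same per-fold accuracies.

```python
# Sketch: manual per-fold scaling and modeling versus a Pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a real dataset (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=7)
kfold = KFold(n_splits=10)

# manual version: the scaler only ever sees the training rows of the fold
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    scaler = StandardScaler().fit(X[train_idx])
    lda = LinearDiscriminantAnalysis()
    lda.fit(scaler.transform(X[train_idx]), y[train_idx])
    manual_scores.append(lda.score(scaler.transform(X[test_idx]), y[test_idx]))

# pipeline version: cross_val_score refits every step inside each fold
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
])
pipeline_scores = cross_val_score(pipeline, X, y, cv=kfold)

print(np.mean(manual_scores), pipeline_scores.mean())
```

The pipeline version is shorter and harder to get wrong, because there is no opportunity to call fit on the held-out rows by mistake.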

Pipeline 2: Feature Extraction and Modeling

Feature extraction is another procedure that is susceptible to data leakage.

Like data preparation, feature extraction procedures must be restricted to the data in your training dataset.

The pipeline module provides a handy tool called FeatureUnion, which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained. Importantly, all of the feature extraction and the feature union occur within each fold of the cross validation procedure.

The example below demonstrates the pipeline defined with four steps:

Feature Extraction with Principal Component Analysis (3 features)
Feature Extraction with Statistical Selection (6 features)
Feature Union
Learn a Logistic Regression Model

The pipeline is then evaluated using 10-fold cross validation.

# Create a pipeline that extracts features from the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
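# liblinear was the default solver in older scikit-learn releases; the current
# default (lbfgs) may warn about convergence on this unscaled data
estimators.append(('logistic', LogisticRegression(solver='liblinear')))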
model = Pipeline(estimators)
# evaluate pipeline
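# no shuffling is used, so no random_state is needed; recent scikit-learn
# versions raise an error if random_state is set while shuffle is False
kfold = KFold(n_splits=10)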
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Running the example provides a summary of accuracy of the pipeline on the dataset.

0.776042378674
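To see what the feature union step hands to the model, the sketch below applies the same two extractors to synthetic data and inspects the result. The data from make_classification and the variable names are assumptions for illustration. Under those assumptions, the union output is simply the 3 PCA columns followed by the 6 selected columns, side by side.

```python
# Sketch: FeatureUnion stacks the outputs of its transformers as columns.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# synthetic stand-in with 8 input features (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=7)

feature_union = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6)),
])
combined = feature_union.fit_transform(X, y)
print(combined.shape)

# each extractor applied on its own, then joined column-wise
pca_only = PCA(n_components=3).fit_transform(X)
kbest_only = SelectKBest(k=6).fit_transform(X, y)
stacked = np.hstack([pca_only, kbest_only])
same_as_hstack = np.allclose(combined, stacked)
print(same_as_hstack)
```

Each transformer sees the original 8 input columns independently; PCA is not applied to the selected features, and selection is not applied to the principal components.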

Summary

In this post you discovered the difficulties of data leakage in applied machine learning.

You discovered the Pipeline utilities in Python scikit-learn and how they can be used to automate standard applied machine learning workflows.

You learned how to use Pipelines in two important use cases:

Data preparation and modeling constrained to each fold of the cross validation procedure.
Feature extraction and feature union constrained to each fold of the cross validation procedure.

Do you have any questions about data leakage, Pipelines or this post? Ask your questions in the comments and I will do my best to answer.

84 Responses to Automate Machine Learning Workflows with Pipelines in Python and scikit-learn

Ebrahimi September 23, 2016 at 6:47 am #

Hi
thanks for your good post!
When we use pipeline, train data (9 folds) will be normalized and then the parameters used to normalize train data, is used to normalize test data? This post suggest do this:
http://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-I

Thanks

Ebrahimi September 23, 2016 at 6:52 am #

Could you please provide us an example where pipeline is used for data preparation and feature selection? Oversampling (ADASYN) should also be done just on train data. Is it possible to do these three altogether?

Thank you very much

Ebrahimi September 23, 2016 at 7:15 am #

thanks
I found my answer here:
http://stats.stackexchange.com/questions/228774/cross-validation-of-a-machine-learning-pipeline

Chaks October 5, 2016 at 6:48 pm #

what happens if Y has more than one column of data ( replacing Y = array[:,8]) and how to do it using your method? Thanks & regards, Chaks

- Jason Brownlee October 6, 2016 at 9:32 am #
  
  Great question, the idea of predicting multiple outputs.
  
  I do not have any examples and I’m unsure of whether sklearn supports this behavior. I know that individual algorithms do support this, such as neural networks.
  
Chaks October 6, 2016 at 12:05 am #

How to get precision, recall values with pipelines?

- Jason Brownlee October 6, 2016 at 9:38 am #
  
  Hi Chaks,
  
  Just as you would for any classifier.
  
  This example may help:
  http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
  
Pratik Patil February 1, 2017 at 9:39 pm #

Hi Jason,
After getting a best fit in through Keras/Scikit-wrapper with Pipeline to standardize; how do I access the weights of keras regressor/classifier present in Pipeline? I want to print those weights to text or csv. But I am having difficulty accesing those weights.

- Jason Brownlee February 2, 2017 at 1:56 pm #
  
  I’m not sure Pratik.
  
  You may want to transform the data separately and use a standalone Keras model where you can access the weights directly.
  
Dan March 6, 2017 at 1:17 am #

Thanks for the great Post Jason.

For my understanding. The Feature Union allows us to put to feature extraction methonds into the pipeline which remain independent, right. Because using PCA to get 3 features and then selecting the best 6 ones would make no sense. But how exactly are they combined? Can you elaborate on that or recommend a good source?

ps: Keras vs tflearn vs Tensorflow? I find Keras the most easiest one for a beginner like me.
Thanks

- Jason Brownlee March 6, 2017 at 10:59 am #
  
  Hi Dan,
  
  The union of features just adds them to one large dataset as new columns for you to work on/use later in the pipeline.
  
Dror Atariah July 24, 2017 at 4:39 pm #

In the second example, why don’t you add to the pipeline a normalization step; something like

estimators.append(('standardize', StandardScaler()))

from the first example?

- Jason Brownlee July 25, 2017 at 9:34 am #
  
  For sure you could. In the second example, I was trying to demonstrate something else besides scaling.
  
  - Cray July 5, 2018 at 10:09 pm #
    
    scaler would come after feature union?
    
    - Jason Brownlee July 6, 2018 at 6:42 am #
      
      Sounds good.
      
Jan September 8, 2017 at 7:09 pm #

Hi Jason!

Thx for this article. Once again, it is the missing part of the puzzle!

You write “preparing your data using normalization or standardization on the entire training dataset before learning would not be a valid test”. You refer to standardizing the entire data set before splitting into train/validation and independent test set, yes?

Is there any more information on when and where to standardize the data in supervised learning tasks – some kind of flow chart on how to avoid data leakage for the most common workflows?

Regards!

- Jason Brownlee September 9, 2017 at 11:54 am #
  
  Ideally, all data prep should happen on or from the training dataset only.
  
  I have more notes here:
  https://machinelearningmastery.com/data-leakage-machine-learning/
  
Franco Arda November 1, 2017 at 6:39 pm #

Another awesome post Jason! “Data leakage” during pre-processing or feature extraction is a nasty trap that’s rarely being covered in ML courses ….

- Jason Brownlee November 2, 2017 at 5:08 am #
  
  Thanks Franco.
  
Duccio A November 28, 2017 at 6:10 am #

Thank you for the great post and the book!
If you were to try multiple models (say LinearRegression, Lasso and Ridge), would you repeat line 16-24 from the first example for each model you want to test?

Thank you

- Jason Brownlee November 28, 2017 at 8:42 am #
  
  Yes, exactly.
  
Leon C February 7, 2018 at 8:51 am #

I’m confused by combining PCs and selected best features together for prediction. Principal components are combinations of the original features. Does it make sense to do feature union on PCA and kernel PCA, or in some other case, feature union on stepwise and backwards?

- Jason Brownlee February 7, 2018 at 9:33 am #
  
  It is just an example. It may not make sense for your data.
  
Laura February 15, 2018 at 9:42 pm #

I love you ! (Seriously, I’m a beginner and everytime I look for something your blog pops up and I find what I’m looking for in an incredibly clear way !)

- Jason Brownlee February 16, 2018 at 8:33 am #
  
  I’m glad to hear the material helps Laura.
  
Marko Dinic March 3, 2018 at 12:34 am #

Hi Jason, great article. I have a couple of questions, though.

1) Is this true in case of text classification as well? For example, creating bag of words, or better, tf-idf features depends highly on all the documents present in the corpus. My understanding is that we would fit something like tf-idf transformer on the training set, ‘learn’ idf based on training data and use the same transformer to transform the test data (now using tf from the concrete test document and ‘learned’ idf from the training corpus) to determine the accuracy. This also holds for new data, coming in real time. Is my understanding correct?

2) If you have train/test/validation splitting, do you determine transformation parameters only on train dataset and use it on test and validation in the same manner?

3) How would you combine k-fold validation with concept of train/test/validation? Would you do cross-validation only on train dataset and have ‘one dataset less’, something like train (80%) on which you do k-fold and test (20%) which you only check at the end? Or you would do train (60%) and do k-fold on it, validation (20%) for hyperparameters and test (20%) for final check?

I hope questions are clear enough and that there’s not too much of them 🙂 Once again, great article.

- Jason Brownlee March 3, 2018 at 8:15 am #
  
  Yes, the same transforms must be used when fitting a model and making predictions on new data.
  
  Yes, transforms are only prepared on data used to fit the model.
  
  Perhaps checkout this post on how to evaluate models:
  https://machinelearningmastery.com/difference-test-validation-datasets/
  
  And this post:
  https://machinelearningmastery.com/evaluate-skill-deep-learning-models/
  
  I hope that gives you some solid ideas.
  
- Daniel Penalva January 21, 2019 at 10:27 am #
  
  Hi, why you split test/validation ?
  
  Seems to be no sense to me. They are not the same thing ?
  
  - Jason Brownlee January 21, 2019 at 12:02 pm #
    
    No, they are different:
    https://machinelearningmastery.com/difference-test-validation-datasets/
    
Navin March 21, 2018 at 7:12 am #

hi great article,
what is the best way to include model summary (coeffecients, intercepts etc) in a pipeline.
even if its not included in the pipeline, how can you access individual pipeline elements to extract relevant information

- Jason Brownlee March 21, 2018 at 3:04 pm #
  
  Good question.
  
  You can keep a ref to the model prior to adding it to the list used in the pipeline.
  
  - Venkatesh Gandi January 20, 2020 at 5:09 pm #
    
    Can you give an example for the better understanding please?
    
    - Jason Brownlee January 21, 2020 at 7:07 am #
      
      No, sorry.
      
Neha May 1, 2018 at 8:11 am #

Hi Jason,

Thanks again for all your posts. They have become such an indispensable resource for me. I had a question on Pipeline 2: Feature Extraction and Modeling line no 19; are we missing mentioning a seed for the random state here? Should I replace this line by :
features.append(('pca', PCA(n_components=3, random_state=7)))

Abhilash Srivastava May 16, 2018 at 11:00 am #

Wonderfully written Jason. Thanks!

- Jason Brownlee May 17, 2018 at 6:22 am #
  
  Thanks, I’m glad it helped.
  
kern August 20, 2018 at 5:55 pm #

Hi Jason

I’m confused about the scoring aspect. How would one score new data set after pipeline + cross validation?

here’s my e.g.

estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_large_model, nb_epoch=250,\
validation_split=0.15,batch_size=25, verbose=2)))

pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=108)
results = cross_val_score(pipeline, X_train, y_train, cv=kfold)

which trains the model fine but if I run the following

print(accuracy_score(y_train, pipeline.predict(X_train)))

I get an error
“NotFittedError: This StandardScaler instance is not fitted yet. Call ‘fit’ with appropriate arguments before using this method.”

- Jason Brownlee August 21, 2018 at 6:13 am #
  
  We do not score new data.
  
  Once we have chosen a model, we can fit a new model on all training data then use the model to make predictions on new data.
  
  Perhaps this post will help:
  https://machinelearningmastery.com/make-predictions-scikit-learn/
  
  - Kern August 21, 2018 at 9:17 pm #
    
    thanks for the quick reply. I used the “score” word incorrectly. I meant, use the cross validated model on the training set to get predictions on new data.
    
    As I understand pipelining currently, any pre processing step(s) such as scaling / normalising etc use the fit_transform method on the training data and save the transformation parameters so that they can be re-used when predicting new data (with the transform method) or have I misunderstood it altogether?
    
    My confusion stems from the point that, when I’ve used some pre-processing on the training data followed by cross validation in a pipeline, the model weights or parameters will be available in the “pipeline” object in my example above, hence they could be used further.
    
    If I remove the cross val step, I can use the pipeline.predict to get predictions on new data set.
    
    - Jason Brownlee August 22, 2018 at 6:13 am #
      
      All models prepared during CV are discarded. CV is a process for estimating the performance of the model on unseen data.
      
      Once estimated, and if we are happy, we can fit a final model and start making predictions. This includes any data transforms.
      
      Perhaps this post will help:
      https://machinelearningmastery.com/train-final-machine-learning-model/
      
Kyriacos November 7, 2018 at 2:11 am #

Hi Jason. First off, thank you for the informative blog post, I have a question though.

Say we in our workflow we have some feature scaling. We do cross-validation using the pipeline as described by you in this post. Finally, I find a model I’m happy with and I train it with all the train data.

Now let’s say I have new data coming in, which I’ve never seen before, and I’d like to make some predictions. Prior to making predictions, I have to do feature scaling on the new data.

In order to scale the new data, which data should I use in my scaler? Only data from the new data? Data from train & new? Or only from train data?

- Jason Brownlee November 7, 2018 at 6:09 am #
  
  It is best to perform feature scaling within each fold of the cross validation process.
  
  Once you have chosen a model, a final model is fit on all available data, including preparing a scale transform on all available data. When new data comes in the same transform object/coefficients can be used.
  
Daniel Penalva January 21, 2019 at 5:36 am #

Suggestion:

Edit to substitute things like ” … entire training dataset … ” for “entire dataset” or it will be confusing people to think that train dataset is equal to the entire dataset, what is not true, the former has test dataset altogheter

- Daniel Penalva January 21, 2019 at 5:42 am #
  
  srry i mean the last one has the test dataset too.
  
- Jason Brownlee January 21, 2019 at 11:57 am #
  
  Thanks for the suggestion Daniel.
  
Sneha February 11, 2019 at 2:55 pm #

Hi,
Great post Json.
I would like to know what is the purpose of pipelines, if we can do train and test split first on the entire dataset,then apply preprocessing steps on the train set, resulting encoder or scaler objects generated can be pickled, which can then be unpickled for the test data set ?

- Sneha February 11, 2019 at 3:00 pm #
  
  *Jason, extremely sorry for the typo.
  
- Jason Brownlee February 12, 2019 at 7:51 am #
  
  To automate the correct order/application of data transforms to data prior to modeling.
  
  You can do it manually if you wish.
  
Sagar Gori February 12, 2019 at 4:59 pm #

Jason , As you mentioned “Importantly, all the feature extraction and the feature union occurs within each fold of the cross validation procedure.”

Can you please elaborate on this point with some explanation using examples. Since I am trying to know how feature extraction and feature union will take place if I use 5 fold cross validation approach.

Does it mean that from each iteration of K – fold cross validation you get list of features on training data. This process continues with remaining iterations and then you combine all features from each iteration and final list would be union of all features ?

- Jason Brownlee February 13, 2019 at 7:53 am #
  
  It means that each train/test split for each fold is separate and that data preparation is performed in a way that prevents data leakage:
  https://machinelearningmastery.com/data-leakage-machine-learning/
  
Nandini Nuthalapati March 31, 2019 at 6:38 am #

Hi Jason, great post! Can we still include StandardScalar in the pipeline if we have some categorical features in the data or we don’t want to Standardize some of our numerical features? or we need to do standardize the required features separately?

- Jason Brownlee March 31, 2019 at 9:32 am #
  
  It might be better to handle them separately.
  
Prem Alphonse June 7, 2019 at 5:32 pm #

Hi Jason, If we save the pipeline (with preprocess + normalisation + model) , whether it can be used on single test record in future to flow through the same steps.

- Jason Brownlee June 8, 2019 at 6:48 am #
  
  Yes.
  
  Learn how here:
  https://machinelearningmastery.com/make-predictions-scikit-learn/
  
Prem June 12, 2019 at 8:05 am #

Hi Jason,
I did the train test split in raw data,

Using Xtrain I did data preprocessing and built model and saved the pipeline,

When I pass Xtest to pipeline, showing error as not all categories in train set columns where present in test set.

Do the preprocessing to be done first then build the pipeline?

Thanks

- Jason Brownlee June 12, 2019 at 8:09 am #
  
  Perhaps your test set differs from the training dataset?
  
  You must ensure that the training dataset contains all cases to be expected by the model in test/prediction.
  
  - Prem June 12, 2019 at 8:17 am #
    
    Thanks Jason, much appreciated for the quick reply.
    
Prem Alphonse June 12, 2019 at 9:10 am #

Hi Jason,

In a category column, say marital status we have levels (married, single, divorced, separated), when I do the get_dummies I get 4 dummy columns, with this I build data preprocessing and model pipeline from X_train.

Say in X_test if the marital status have only 2 (married, single), when I pass this to same above pipeline, while data preprocessing the get_dummies create only 2 columns, so the model showing shape error (as we don’t have other two column categories)

May I know how to deal this please?

Thanks

- Jason Brownlee June 12, 2019 at 2:22 pm #
  
  I recommend using a label encoder or a one hot encoder and fitting the encoder on the training dataset.
  
bob zigon July 19, 2019 at 9:23 am #

Jason
I one of your other lessons, you talk about scaling the inputs and targets when performing linear regression. I can imagine putting 2 scalers in the pipeline, but how does one scaler get applied to the inputs while the other is applied to the target?

- Jason Brownlee July 19, 2019 at 9:28 am #
  
  Great question. They are all or nothing I believe.
  
  You would have to do the operations manually to subsets of the data.
  
Venkatesh Gandi January 20, 2020 at 5:25 pm #

Hi Jason, Thanks for the detailed explanation.

Can you please let me know what is the difference between ColumnTransformer() and the FeatureUnion() methods?

- Jason Brownlee January 21, 2020 at 7:08 am #
  
  FeatureUnion combines columns, like an hstack.
  
  Column transformer will apply any arbitrary operations to subsets of features, then hstack the results.
  
Venkatesh Gandi January 22, 2020 at 6:25 am #

OK, Thanks for the clarification.

- Jason Brownlee January 22, 2020 at 6:31 am #
  
  You’re welcome.
  
MF March 13, 2020 at 1:04 am #

Hi Jason,

Is it possible to use the pipeline to create a first step to import & load dataset from the URL?
Please advise. Thanks.

- Jason Brownlee March 13, 2020 at 8:18 am #
  
  Maybe, if you code it yourself.
  
  - MF March 13, 2020 at 11:39 pm #
    
    Hi Jason,
    
    Thanks for the reply.
    Can you show me how to write that in pipeline?
    Appreciate your help. Really.
    
    - Jason Brownlee March 14, 2020 at 8:12 am #
      
      Sorry, no.
      
      - MF March 14, 2020 at 3:37 pm #
        
        Do you have ideas of any URL I can refer to for guidance? Thanks.
      - Jason Brownlee March 15, 2020 at 6:11 am #
        
        Perhaps try posting your question on stackoverflow.
Shiyun Jiang April 30, 2020 at 2:20 am #

Hi Jason

I am not quite understand why you need to use two feature extraction in the FeatureUnion? What about the 3 components from PCA are the same as 3 of the 6 selected features in SelectKBest? Does that mean we are duplicating the work? Or the model will train on the PCA first and then train on SelectKBest.After they compare each other and see which is better? Also, why not choose equal numbers when you apply the number of components in the selection method?Thanks

- Jason Brownlee April 30, 2020 at 6:50 am #
  
  We don’t, it’s just an example of how to use the pipeline.
  
  - Shiyun Jiang April 30, 2020 at 7:16 am #
    
    So normally should we choose equal number?
    
    - Jason Brownlee April 30, 2020 at 11:35 am #
      
      No, you should choose the number of features that result in the best performance on your test harness.
      
      - Shiyun Jiang May 1, 2020 at 9:30 pm #
        
        ok great. So potentially can be 6 features in SelectKBest is good and 3 features in PCA is good?
      - Jason Brownlee May 2, 2020 at 5:44 am #
        
        You must use controlled experiments to discover what works best for a given dataset.
JG April 19, 2021 at 4:56 am #

Hi Jason,

Definitively Sklearn Pipeline is a powerful module!. thanks for this intro.

performing some experiment on your code, here I share my results:

– I do not see necessary to use FeatureUnion. The already features list can be used directly.

– Eliminating featureUnion from Pipeline steps list, get the same results on the logisticRegression model score. So, I do not see the purpose to include this features union …

– if shuffle it is not set up to True in KFold function (as it is your case ) the random_state has not any meaning as argument of the function

thank you very much for this tutorial

my question:

1) Pipeline is to be used as a whole ensemble of process and ready to implement kfold, cross_val_score, etc. OK.
But if I am interested on knowing partial results of any of the steps (PCA, SelectKbest, ColumnTransformer… in terms of their specific results, I do not know how to extract them from the pipeline, I guess it is not possible

2) working with keras models, in the case of a single model for regression and classification, I do not see how to apply the keras wrapper over SkLearn to use e.g. Pipeline

about your code :
I

- Jason Brownlee April 19, 2021 at 5:54 am #
  
  You would not use a pipeline if you were interested in the intermediate outputs from each transform.
  
  The keras model in a wrapper could be used as the last step in a pipeline.
  
JG April 19, 2021 at 5:19 pm #

thanks Jason

- Jason Brownlee April 20, 2021 at 5:54 am #
  
  You’re welcome.
  
Muhammad Usama Zahid May 2, 2022 at 5:01 am #

You are an inspiration.Explained the topic so elegantly and now the concept is crystal clear.Thanks Jason!

Hasti January 9, 2023 at 3:40 am #

Hello everyone, I used this code for pipeline and I see this error, please help me

pipe = Pipeline([('scaler', StandardScaler()),
('Logestic',LogisticRegression()),
('SVM',SVC())])
Error:
All intermediate steps should be transformers and implement fit and transform or be the string ‘passthrough’ ‘LogisticRegression()’ (type ) doesn’t

- James Carmichael January 9, 2023 at 8:24 am #
  
  Hi Hasti…In researching this error, I came across the following resource:
  
  https://stackoverflow.com/questions/48758383/all-intermediate-steps-should-be-transformers-and-implement-fit-and-transform
  
