How to Use the ColumnTransformer for Data Preparation

You must prepare your raw data using data transforms prior to fitting a machine learning model.

This is required to ensure that you best expose the structure of your predictive modeling problem to the learning algorithms.

Applying data transforms like scaling or encoding categorical variables is straightforward when all input variables are the same type. It can be challenging when you have a dataset with mixed types and you want to selectively apply data transforms to some, but not all, input features.

Thankfully, the scikit-learn Python machine learning library provides the ColumnTransformer that allows you to selectively apply data transforms to different columns in your dataset.

In this tutorial, you will discover how to use the ColumnTransformer to selectively apply data transforms to columns in a dataset with mixed data types.

After completing this tutorial, you will know:

  • The challenge of using data transformations with datasets that have mixed data types.
  • How to define, fit, and use the ColumnTransformer to selectively apply data transforms to columns.
  • How to work through a real dataset with mixed data types and use the ColumnTransformer to apply different transforms to categorical and numerical data columns.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

Use the ColumnTransformer for Numerical and Categorical Data in Python
Photo by Kari, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Challenge of Transforming Different Data Types
  2. How to Use the ColumnTransformer
  3. Data Preparation for the Abalone Regression Dataset

Challenge of Transforming Different Data Types

It is important to prepare data prior to modeling.

This may involve replacing missing values, scaling numerical values, and one hot encoding categorical data.

Data transforms can be performed using the scikit-learn library; for example, the SimpleImputer class can be used to replace missing values, the MinMaxScaler class can be used to scale numerical values, and the OneHotEncoder class can be used to encode categorical variables.

For example:
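A minimal sketch of defining each of these transforms (the imputation strategy and default configurations are illustrative):

# define example transforms for imputing, scaling, and encoding
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
# replace missing values with the median of each column
imputer = SimpleImputer(strategy='median')
# scale numerical values to the range [0, 1]
scaler = MinMaxScaler()
# convert categorical values to a one hot encoding
encoder = OneHotEncoder()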

Sequences of different transforms can also be chained together using the Pipeline, such as imputing missing values, then scaling numerical values.

For example:
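A minimal sketch of such a chain (the step names 'i' and 's' are arbitrary):

# chain an imputer and a scaler together with a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
# first impute missing values with the median, then scale to [0, 1]
pipeline = Pipeline(steps=[('i', SimpleImputer(strategy='median')), ('s', MinMaxScaler())])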

It is very common to want to perform different data preparation techniques on different columns in your input data.

For example, you may want to impute missing numerical values with the median and then scale them, while imputing missing categorical values with the most frequent value and one hot encoding the categories.

Traditionally, this would require you to separate the numerical and categorical data and then manually apply the transforms on those groups of features before combining the columns back together in order to fit and evaluate a model.

Now, you can use the ColumnTransformer to perform this operation for you.

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

How to Use the ColumnTransformer

The ColumnTransformer is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms.

For example, it allows you to apply a specific transform or sequence of transforms to just the numerical columns, and a separate sequence of transforms to just the categorical columns.

To use the ColumnTransformer, you must specify a list of transformers.

Each transformer is a three-element tuple that defines the name of the transformer, the transform to apply, and the column indices to apply it to. For example:

  • (Name, Object, Columns)

For example, the ColumnTransformer below applies a OneHotEncoder to columns 0 and 1.
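A minimal sketch of that configuration (the transformer name 'cat' is arbitrary):

# apply a one hot encoding to columns 0 and 1 only
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])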

The example below applies a SimpleImputer with median imputation to numerical columns 0 and 1, and a SimpleImputer with most frequent imputation to categorical columns 2 and 3.
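A sketch of that configuration (the transformer names 'num' and 'cat' are arbitrary):

# impute numerical columns with the median and categorical columns with the most frequent value
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
t = [('num', SimpleImputer(strategy='median'), [0, 1]),
     ('cat', SimpleImputer(strategy='most_frequent'), [2, 3])]
transformer = ColumnTransformer(transformers=t)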

Any columns not specified in the list of “transformers” are dropped from the dataset by default; this can be changed by setting the “remainder” argument.

Setting remainder='passthrough' will mean that all columns not specified in the list of “transformers” will be passed through without transformation, instead of being dropped.

For example, if columns 0 and 1 were numerical and columns 2 and 3 were categorical and we wanted to just transform the categorical data and pass through the numerical columns unchanged, we could define the ColumnTransformer as follows:
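A sketch of that configuration:

# one hot encode categorical columns 2 and 3 and pass through numerical columns 0 and 1 unchanged
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [2, 3])], remainder='passthrough')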

Once the transformer is defined, it can be used to transform a dataset.

For example:
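A sketch, assuming X holds the raw input data and transformer is defined as above:

# fit the transformer on the dataset and apply it in one step
transformed_X = transformer.fit_transform(X)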

A ColumnTransformer can also be used in a Pipeline to selectively prepare the columns of your dataset before fitting a model on the transformed data.

This is the most likely use case as it ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a test dataset via cross-validation or making predictions on new data in the future.

For example:
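A sketch of such a pipeline, assuming transformer is defined as above; LogisticRegression and the train/test variables stand in for whatever model and data you are using:

# use the ColumnTransformer as the data preparation step of a modeling pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline(steps=[('prep', transformer), ('m', LogisticRegression())])
# the whole pipeline is fit and used like any other model
pipeline.fit(train_X, train_y)
yhat = pipeline.predict(test_X)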

Now that we are familiar with how to configure and use the ColumnTransformer in general, let’s look at a worked example.

Data Preparation for the Abalone Regression Dataset

The abalone dataset is a standard machine learning problem that involves predicting the age of an abalone from physical measurements.

You can download the dataset and learn more about it here:
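  • Abalone Dataset, UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Abalone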

The dataset has 4,177 examples, 8 input variables, and the target variable is an integer.

A naive model can achieve a mean absolute error (MAE) of about 2.363 (std 0.092) by predicting the mean value, evaluated via 10-fold cross-validation.

We can model this as a regression predictive modeling problem with a support vector regression (SVR) model.

Reviewing the data, you can see the first few rows as follows:
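M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
...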

We can see that the first column is categorical and the remainder of the columns are numerical.

We may want to one hot encode the first column and normalize the remaining numerical columns, and this can be achieved using the ColumnTransformer.

First, we need to load the dataset. We can load the dataset directly from the URL using the read_csv() Pandas function, then split the data into two data frames: one for input and one for the output.

The complete example of loading the dataset is listed below.
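A minimal sketch, assuming a hosted copy of the CSV at the URL below:

# load and summarize the abalone dataset
from pandas import read_csv
# load the dataset from a URL (the URL is an assumed hosted copy of the CSV)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
# split into input (X) and output (y) columns
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
print(X.shape, y.shape)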

Note: if you have trouble loading the dataset from a URL, you can download the CSV file with the name 'abalone.csv' and place it in the same directory as your Python file, then change the call to read_csv() as follows:
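# load the dataset from the local file instead of the URL
dataframe = read_csv('abalone.csv', header=None)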

Running the example, we can see that the dataset is loaded correctly and split into eight input columns and one target column.

Next, we can use the select_dtypes() function to select the column indexes that match different data types.

We are interested in a list of columns that are numerical columns marked as 'float64' or 'int64' in Pandas, and a list of categorical columns, marked as 'object' or 'bool' type in Pandas.
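A sketch, assuming X is the input data frame from the loading step:

# determine the categorical and numerical column indexes
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns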

We can then use these lists in the ColumnTransformer to one hot encode the categorical variables, which should just be the first column.

We can also use the list of numerical columns to normalize the remaining data.
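A sketch of the resulting transformer (the names 'cat' and 'num' are arbitrary; MinMaxScaler is used to normalize the numerical columns):

# one hot encode the categorical columns and normalize the numerical columns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
t = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]
col_transform = ColumnTransformer(transformers=t)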

Next, we can define our SVR model and define a Pipeline that first uses the ColumnTransformer, then fits the model on the prepared dataset.
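A sketch of the model and pipeline (the SVR hyperparameters here are illustrative, not tuned):

# define the model and the data preparation and modeling pipeline
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
model = SVR(kernel='rbf', gamma='scale', C=100)
pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)])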

Finally, we can evaluate the model using 10-fold cross-validation and calculate the mean absolute error, averaged across all 10 evaluations of the pipeline.
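A sketch of the evaluation (scikit-learn reports MAE as a negative score, so the sign is inverted before reporting):

# evaluate the pipeline with 10-fold cross-validation using mean absolute error
from numpy import mean, std, absolute
from sklearn.model_selection import cross_val_score, KFold
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# convert the negated MAE scores to positive values and summarize
scores = absolute(scores)
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))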

Tying this all together, the complete example is listed below.
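A complete sketch under the same assumptions as above (the hosted dataset URL and the SVR hyperparameters are assumptions):

# example of using the ColumnTransformer for the abalone regression dataset
from numpy import mean, std, absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score, KFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.svm import SVR
# load the dataset (the URL is an assumed hosted copy of the CSV)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
# split into input (X) and output (y) columns
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
# determine the categorical and numerical column indexes
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns
# define the data preparation for the columns
t = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]
col_transform = ColumnTransformer(transformers=t)
# define the model (hyperparameters are illustrative)
model = SVR(kernel='rbf', gamma='scale', C=100)
# define the data preparation and modeling pipeline
pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)])
# evaluate the pipeline using 10-fold cross-validation and MAE
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# convert the negated MAE scores to positive values and summarize
scores = absolute(scores)
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))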

Running the example evaluates the data preparation pipeline using 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure, as well as differences in library versions.

In this case, we achieve an average MAE of about 1.4, which is better than the baseline score of 2.3.

You now have a template for using the ColumnTransformer on a dataset with mixed data types that you can use and adapt for your own projects in the future.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

API
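  • ColumnTransformer API: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
  • Pipeline API: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html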

Summary

In this tutorial, you discovered how to use the ColumnTransformer to selectively apply data transforms to columns in datasets with mixed data types.

Specifically, you learned:

  • The challenge of using data transformations with datasets that have mixed data types.
  • How to define, fit, and use the ColumnTransformer to selectively apply data transforms to columns.
  • How to work through a real dataset with mixed data types and use the ColumnTransformer to apply different transforms to categorical and numerical data columns.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Modern Data Preparation!

Data Preparation for Machine Learning

Prepare Your Machine Learning Data in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Data Preparation for Machine Learning

It provides self-study tutorials with full working code on:
Feature Selection, RFE, Data Cleaning, Data Transforms, Scaling, Dimensionality Reduction, and much more...

Bring Modern Data Preparation Techniques to
Your Machine Learning Projects


See What's Inside

20 Responses to How to Use the ColumnTransformer for Data Preparation

  1. Venkatesh Gandi January 20, 2020 at 7:06 am #

    Wow, this function with the pipeline seems to be magical. I really wanted to know what pipelines can do (the power of pipelines). I have seen the documentation of Pipeline(); it’s not as simple as you explain 🙂 I really like your way of explanation. Have you written any blogs introducing pipelines, starting simple and building up to complicated pipelines? If so, please share the links.

    If not, can you please share some references where we can learn more about the sklearn pipeline?

  2. rahul malik April 22, 2020 at 8:18 am #

    I have followed your example. I have a dataset with 3 columns as features and 1 as the label, and all are categorical, no numeric values. I have printed X and y and they seem correct, but the output is coming out as MAE: nan (nan). Can you please suggest what I am doing wrong?

    • Jason Brownlee April 22, 2020 at 10:12 am #

      Perhaps check if you have a nan in your input data?

  3. Grzegorz Kępisty April 30, 2020 at 4:47 pm #

    Very useful utility from sklearn!

    I was wondering how to get the full list of transformations that can be applied to ColumnTransformer and the best reference I found is below:
    https://scikit-learn.org/stable/modules/preprocessing.html#
    Do you know maybe some broader source for this topic?

    Regards!

    • Jason Brownlee May 1, 2020 at 6:32 am #

      That is a great start.

      No, but I have a ton of tutorials on this theme written and scheduled. Get ready!

  4. ashar138 May 19, 2020 at 2:57 pm #

    Love this. Makes you realize how ColumnTransformer can make your life radically easy.
    But what I didn’t get is: are the transformations applied in series?

    Cause I want to use an Imputer first and then normalize/onehot it.


    transformers =[ (imputer cat) , (imputer num) , (normalize num), (onehot cat)]

    I assume this can be done ?

    • Jason Brownlee May 20, 2020 at 6:19 am #

      It sure does!

      Yes, you can apply different sequences to different subsets of features.

      • ashar138 May 20, 2020 at 5:58 pm #

        Turns out you can't :\
        It throws an error when dealing with NaN values, even though the imputer is the first transformation.
        Need to use the Imputer first and then MinMax again.

        • Jason Brownlee May 21, 2020 at 6:12 am #

          I give an example of imputing missing values and transforming sequentially using the columntransformer here:
          https://machinelearningmastery.com/results-for-standard-classification-and-regression-machine-learning-datasets/

          Specifically the Auto Imports example.

          • ashar138 May 26, 2020 at 1:12 pm #

            Okay I am a little confused now..

            So what we do is we create a Pipeline, feed it to a ColumnTransformer and we feed that to yet another Pipeline.

            But then why do we even need the Pipeline? For example,

            trans1 = ColumnTransformer([('catimp', SimpleImputer(strategy='most_frequent'), cat_var),
            ('enc', OneHotEncoder(handle_unknown='ignore'), cat_var),
            ('imp', SimpleImputer(strategy= 'median'),num_var )],
            remainder='passthrough')

            steps = [('c', Pipeline(steps=[('s',SimpleImputer(strategy='most_frequent')),
            ('oe',OneHotEncoder(handle_unknown='ignore'))]), cat_var),
            ('n', SimpleImputer(strategy='median'), num_var)]

            trans2 = ColumnTransformer(transformers=steps, remainder='passthrough')

            Why are trans1 and trans2 different ?

          • Jason Brownlee May 26, 2020 at 1:24 pm #

            Good question. You don’t have to use it, it’s just another tool we have available.

            Each group of features can be prepared with a pipeline.

            We can also have a master pipeline with data prep and modeling – then use cross-validation to evaluate it.

          • ashar138 May 26, 2020 at 1:45 pm #

            I see.. Think I need more time with it..

            Actually the example I shared below produces 2 different results. The one without the pipeline gives a NaN found error. Perhaps the Pipeline is needed to execute in a sequence, while ColumnTransformer doesn’t do that?

            Another issue I observed was while doing transformations of train and valid datasets.
            The resultant train dataset returned a scipy.sparse.csr.csr_matrix while the valid data just returned an ndarray.

            I reduced the sparse_threshold but that results in a feature mismatch while predicting on the validation dataset.

            Anyways I have bothered you enough already, I will figure it out somehow 🙂

          • Jason Brownlee May 27, 2020 at 7:41 am #

            Transforms like one hot encoding will create a sparse matrix by default, you can set an argument to force them to create a dense matrix instead.

  5. Manideep May 27, 2020 at 4:45 am #

    If I have to use both SimpleImputer and OneHotEncoder for a set of categorical columns, could you please tell me what I should do?

  6. MS July 1, 2020 at 1:47 am #

    numerical_x = reduced_df.select_dtypes(include=['int64', 'float64']).columns
    categorical_x = reduced_df.select_dtypes(include=['object', 'bool']).columns

    t = [(('le',LabelEncoder(),categorical_x),('ohe',OneHotEncoder(),categorical_x),
    ('catimp', SimpleImputer(strategy='most_frequent'),categorical_x)),
    (('num',SimpleImputer(strategy='median'),numerical_x),('sts',StandardScaler(), numerical_x))]

    col_transform = ColumnTransformer(transformers=t)
    dt = DecisionTreeClassifier()
    pl = Pipeline(steps=[('prep',col_transform), ('dt', dt)])
    pl.fit(reduced_df,y_train)
    pl.score(reduced_df,y_train)

    The above code gives this error:

    names, transformers, _ = zip(*self.transformers)
    ValueError: not enough values to unpack (expected 3, got 2)

    Can you please help me?

  7. Volkan Yurtseven July 1, 2020 at 5:50 am #

    In the example just after the paragraph “This is the most likely use case as it ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a test dataset via cross-validation or making predictions on new data in the future.” you say
    model.fit(train_X, train_y),
    but I think it should be:
    pipeline.fit(train_X, train_y)
