3 Ways to Encode Categorical Variables for Deep Learning

Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an integer encoding and a one hot encoding, although a newer technique called learned embedding may provide a useful middle ground between these two methods.

In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras.

After completing this tutorial, you will know:

  • The challenge of working with categorical data when using machine learning and deep learning models.
  • How to integer encode and one hot encode categorical variables for modeling.
  • How to learn an embedding distributed representation as part of a neural network for categorical variables.

Let’s get started.

How to Encode Categorical Data for Deep Learning in Keras

Photo by Ken Dixon, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. The Challenge With Categorical Data
  2. Breast Cancer Categorical Dataset
  3. How to Ordinal Encode Categorical Data
  4. How to One Hot Encode Categorical Data
  5. How to Use a Learned Embedding for Categorical Data

The Challenge With Categorical Data

A categorical variable is a variable whose values are labels.

For example, the variable may be “color” and may take on the values “red,” “green,” and “blue.”

Sometimes, the categorical data may have an ordered relationship between the categories, such as “first,” “second,” and “third.” This type of categorical data is referred to as ordinal and the additional ordering information can be useful.

Machine learning algorithms and deep learning neural networks require that input and output variables are numbers.

This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model.

There are many ways to encode categorical variables for modeling, although the three most common are as follows:

  1. Integer Encoding: Where each unique label is mapped to an integer.
  2. One Hot Encoding: Where each label is mapped to a binary vector.
  3. Learned Embedding: Where a distributed representation of the categories is learned.

We will take a closer look at how to encode categorical data for training a deep learning neural network in Keras using each one of these methods.

Breast Cancer Categorical Dataset

As the basis of this tutorial, we will use the so-called “Breast cancer” dataset that has been widely studied in machine learning since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68% and 73%. We will aim for this region, but note that the models in this tutorial are not optimized: they are designed to demonstrate encoding schemes.

You can download the dataset and save the file as “breast-cancer.csv” in your current working directory.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings; some are ordinal and some are not.

We can load this dataset into memory using the Pandas library.

Once loaded, we can split the columns into input (X) and output (y) for modeling.

Finally, we can force all fields in the input data to be strings, just in case Pandas tried to map some automatically to numbers (it does try).

We can also reshape the output variable to be one column (e.g. a 2D shape).

We can tie all of this together into a helpful function that we can reuse later.
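A minimal sketch of such a function is shown below. It assumes the CSV file has no header row and that the target is the last column. So that the sketch runs without the downloaded file, it reads three made-up rows in the same nine-inputs-plus-target layout from an in-memory buffer; in practice you would pass 'breast-cancer.csv' instead.

```python
# sketch: load a categorical dataset and split it into inputs and output
from io import StringIO
from pandas import read_csv

def load_dataset(filename):
    # load the data as a DataFrame (assumption: the file has no header row)
    data = read_csv(filename, header=None)
    # retrieve the underlying numpy array
    dataset = data.values
    # split into input (X) and output (y) columns (assumption: target is last)
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # force all input fields to be strings
    X = X.astype(str)
    # reshape the target to be a 2D array with one column
    y = y.reshape((len(y), 1))
    return X, y

# made-up rows used only for illustration; replace with 'breast-cancer.csv'
sample = StringIO(
    "'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'\n"
    "'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'\n"
    "'30-39','premeno','35-39','0-2','no','2','left','left_low','no','no-recurrence-events'\n"
)
X, y = load_dataset(sample)
print(X.shape, y.shape)
```

Because the values are wrapped in single quotes, Pandas keeps them as strings, quotes included, which is harmless here because every label is treated as an opaque category.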

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a deep learning model.

We will use the train_test_split() function from scikit-learn and use 67% of the data for training and 33% for testing.

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 191 examples for training and 95 for testing.
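The split sizes can be checked without the real data. The sketch below uses placeholder arrays of zeros with the same shape as the dataset (286 rows, nine inputs); the random_state value is an arbitrary choice made here for repeatability.

```python
# sketch: confirm the train/test sizes for a 67%/33% split of 286 rows
from numpy import zeros
from sklearn.model_selection import train_test_split

# placeholder arrays with the same shape as the dataset
X = zeros((286, 9))
y = zeros((286, 1))
# hold back 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shapes
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
```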

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

How to Ordinal Encode Categorical Data

An ordinal encoding involves mapping each unique label to an integer value.

As such, it is sometimes referred to simply as an integer encoding.

This type of encoding is really only appropriate if there is a known ordered relationship between the categories.

This relationship does exist for some of the variables in the dataset, and ideally, this should be harnessed when preparing the data.

In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.

We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

Note: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.
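As a starting point for that exercise, the sketch below shows how the order can be passed via the categories argument. The labels “low,” “medium,” and “high” are made up for illustration and are not taken from the breast cancer dataset.

```python
# sketch: default (alphabetical) order vs. an explicitly specified order
from numpy import array
from sklearn.preprocessing import OrdinalEncoder

# a single made-up ordinal variable
X = array([['medium'], ['low'], ['high'], ['low']])
# by default, categories are sorted, so high=0, low=1, medium=2
default_enc = OrdinalEncoder().fit_transform(X)
# with an explicit order, low=0, medium=1, high=2
ordered_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']]).fit_transform(X)
print(default_enc.ravel())
print(ordered_enc.ravel())
```

The categories argument takes one list per column, so a dataset with several columns needs one list of labels for each of them.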

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

The function below, named prepare_inputs(), takes the input data for the train and test sets and encodes it using an ordinal encoding.
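A minimal sketch of this function follows. The two-column sample arrays are made up so that the sketch runs on its own; with the real data you would pass the train and test inputs from the split.

```python
# sketch: ordinal encode inputs, fitting the encoder on the training set only
from numpy import array
from sklearn.preprocessing import OrdinalEncoder

def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    # learn the label-to-integer mapping from the training data only
    oe.fit(X_train)
    # apply the same mapping to both datasets
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# made-up sample data with two categorical columns
X_train = array([['red', 'small'], ['green', 'large'], ['blue', 'small']])
X_test = array([['green', 'small']])
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
print(X_train_enc)
print(X_test_enc)
```

Note that the default settings raise an error if the test set contains a label that was not seen in the training set.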

We also need to prepare the target variable.

It is a binary classification problem, so we need to map the two class labels to 0 and 1.

This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose. We could just as easily use the OrdinalEncoder and achieve the same result, although the LabelEncoder is designed for encoding a single variable.

The prepare_targets() function integer encodes the output data for the train and test sets.
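A sketch of this function is shown below. Flattening the single-column targets with ravel() before encoding is a choice made here because the LabelEncoder works on a 1D array; the sample labels are for illustration only.

```python
# sketch: label encode the target variable
from numpy import array
from sklearn.preprocessing import LabelEncoder

def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    # LabelEncoder expects a 1D array, so flatten the one-column targets
    le.fit(y_train.ravel())
    y_train_enc = le.transform(y_train.ravel())
    y_test_enc = le.transform(y_test.ravel())
    return y_train_enc, y_test_enc

# sample targets stored as one column, for illustration
y_train = array([['no-recurrence-events'], ['recurrence-events'], ['no-recurrence-events']])
y_test = array([['recurrence-events']])
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
print(y_train_enc, y_test_enc)
```

The classes are sorted before being numbered, so in this sketch 'no-recurrence-events' maps to 0 and 'recurrence-events' maps to 1.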

We can call these functions to prepare our data.

We can now define a neural network model.

We will use the same general model in all of these examples. Specifically, a MultiLayer Perceptron (MLP) neural network with one hidden layer with 10 nodes, and one node in the output layer for making binary classifications.

Without going into too much detail, the code below defines the model, fits it on the training dataset, and then evaluates it on the test dataset.
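The sketch below shows one way this might look. The random placeholder arrays stand in for the encoded dataset, and the optimizer, number of epochs, and batch size are assumptions chosen to keep the sketch fast rather than tuned settings.

```python
# sketch: define, fit, and evaluate a small MLP on placeholder data
from numpy.random import default_rng
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# placeholder data standing in for nine integer-encoded input columns
rng = default_rng(1)
X_train_enc = rng.integers(0, 5, size=(120, 9)).astype('float32')
y_train_enc = rng.integers(0, 2, size=(120, 1))
X_test_enc = rng.integers(0, 5, size=(60, 9)).astype('float32')
y_test_enc = rng.integers(0, 2, size=(60, 1))

# one hidden layer with 10 nodes and one output node for binary classification
model = Sequential()
model.add(Input(shape=(X_train_enc.shape[1],)))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the model (assumption: adam optimizer)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit on the training set (assumption: a handful of epochs for speed)
model.fit(X_train_enc, y_train_enc, epochs=10, batch_size=16, verbose=0)
# evaluate on the test set
_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy * 100))
```

Because the placeholder labels are random, the reported accuracy is not meaningful; the point is only the shape of the code.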

If you are new to developing neural networks in Keras, I recommend this tutorial:

Tying all of this together, the complete example of preparing the data with an ordinal encoding and fitting and evaluating a neural network on the data is listed below.

Running the example will fit the model in just a few seconds on any modern hardware (no GPU required).

The loss and the accuracy of the model are reported at the end of each training epoch, and finally, the accuracy of the model on the test dataset is reported.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved an accuracy of about 70% on the test dataset.

Not bad, given that an ordinal relationship only exists for some of the input variables, and for those where it does, it was not honored in the encoding.

This provides a good starting point when working with categorical data.

A better and more general approach is to use a one hot encoding.

How to One Hot Encode Categorical Data

A one hot encoding is appropriate for categorical data where no relationship exists between categories.

It involves representing each categorical variable with a binary vector that has one element for each unique label and marking the class label with a 1 and all other elements 0.

For example, if our variable was “color” and the labels were “red,” “green,” and “blue,” we would encode each of these labels as a three-element binary vector as follows:

  • Red: [1, 0, 0]
  • Green: [0, 1, 0]
  • Blue: [0, 0, 1]
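The mapping above can be reproduced in a few lines of plain Python. This is only a sketch to make the idea concrete; the order of the labels in the list decides which position is set to 1.

```python
# sketch: build the one hot vectors for the "color" example by hand
labels = ['red', 'green', 'blue']
# each label gets a vector with a 1 at its own position and 0 elsewhere
one_hot = {label: [int(i == j) for j in range(len(labels))]
           for i, label in enumerate(labels)}
for label, vector in one_hot.items():
    print(label, vector)
```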

Then each label in the dataset would be replaced with a vector (one column becomes three). This is done for all categorical variables so that our nine input variables or columns become 43 in the case of the breast cancer dataset.

The scikit-learn library provides the OneHotEncoder to automatically one hot encode one or more variables.

The prepare_inputs() function below provides a drop-in replacement function for the example in the previous section. Instead of using an OrdinalEncoder, it uses a OneHotEncoder.
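A minimal sketch of the replacement function follows. The call to toarray() is an addition made here to convert the sparse matrix returned by default into a dense array; the sample data is made up for illustration.

```python
# sketch: one hot encode inputs, fitting the encoder on the training set only
from numpy import array
from sklearn.preprocessing import OneHotEncoder

def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder()
    # learn the categories of each column from the training data only
    ohe.fit(X_train)
    # transform both datasets and convert the sparse result to dense arrays
    X_train_enc = ohe.transform(X_train).toarray()
    X_test_enc = ohe.transform(X_test).toarray()
    return X_train_enc, X_test_enc

# made-up sample data: 3 colors + 2 sizes gives 5 encoded columns
X_train = array([['red', 'small'], ['green', 'large'], ['blue', 'small']])
X_test = array([['green', 'small']])
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
print(X_train_enc.shape)
print(X_test_enc)
```

The categories within each column are sorted, so the column order in this sketch is blue, green, red, large, small.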

Tying this together, the complete example of one hot encoding the breast cancer categorical dataset and modeling it with a neural network is listed below.

The example one hot encodes the input categorical data, and also label encodes the target variable as we did in the previous section. The same neural network model is then fit on the prepared dataset.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, the model performs reasonably well, achieving an accuracy of about 72%, close to what was seen in the previous section.

A fairer comparison would be to run each configuration 10 or 30 times and compare performance using the mean accuracy. Recall that we are more focused on how to encode categorical data in this tutorial than on getting the best score on this specific dataset.

Ordinal and one hot encoding are perhaps the two most popular methods.

A newer technique, called a learned embedding, is similar to one hot encoding and was designed for use with neural networks.

How to Use a Learned Embedding for Categorical Data

A learned embedding, or simply an “embedding,” is a distributed representation for categorical data.

Each category is mapped to a distinct vector, and the properties of the vector are adapted or learned while training a neural network. The vector space provides a projection of the categories, allowing those categories that are close or related to cluster together naturally.

This provides the benefits of both an ordinal encoding, by allowing any such relationships to be learned from data, and a one hot encoding, by providing a vector representation for each category. Unlike one hot encoding, the input vectors are not sparse (do not have lots of zeros). The downside is that it requires learning as part of the model and the creation of many more input variables (columns).

The technique was originally developed to provide a distributed representation for words, e.g. allowing similar words to have similar vector representations. As such, the technique is often referred to as a word embedding, and in the case of text data, algorithms have been developed to learn a representation independent of a neural network. For more on this topic, see the post:

An additional benefit of using an embedding is that the learned vectors that each category is mapped to can be fit in a model that has modest skill, but the vectors can be extracted and used generally as input for the category on a range of different models and applications. That is, they can be learned and reused.

Embeddings can be used in Keras via the Embedding layer.

For an example of learning word embeddings for text data in Keras, see the post:

One embedding layer is required for each categorical variable, and the embedding expects the categories to be ordinal encoded, although no relationship between the categories is assumed.

Each embedding also requires the number of dimensions to use for the distributed representation (vector space). It is common in natural language applications to use 50, 100, or 300 dimensions. For our small example, we will fix the number of dimensions at 10, but this is arbitrary; you should experiment with other values.

First, we can prepare the input data using an ordinal encoding.

The model we will develop will have one separate embedding for each input variable. Therefore, the model will take nine different input datasets. As such, we will split the input variables and ordinal encode (integer encoding) each separately using the LabelEncoder and return a list of separate prepared train and test input datasets.

The prepare_inputs() function below implements this, enumerating over each input variable, integer encoding each correctly using best practices, and returning lists of encoded train and test variables (or one-variable datasets) that can be used as input for our model later.
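A sketch of this version of the function is shown below, again using made-up two-column sample data so that it runs on its own.

```python
# sketch: integer encode each input column separately for use with embeddings
from numpy import array
from sklearn.preprocessing import LabelEncoder

def prepare_inputs(X_train, X_test):
    X_train_enc, X_test_enc = list(), list()
    # encode each column with its own LabelEncoder
    for i in range(X_train.shape[1]):
        le = LabelEncoder()
        # fit on the training column only
        le.fit(X_train[:, i])
        # encode the train and test columns with the same mapping
        train_enc = le.transform(X_train[:, i])
        test_enc = le.transform(X_test[:, i])
        # store one array per input variable
        X_train_enc.append(train_enc)
        X_test_enc.append(test_enc)
    return X_train_enc, X_test_enc

# made-up sample data with two categorical columns
X_train = array([['red', 'small'], ['green', 'large'], ['blue', 'small']])
X_test = array([['green', 'small']])
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
print(len(X_train_enc), X_train_enc[0], X_train_enc[1])
```

The result is a list with one array per input variable rather than a single 2D array.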

Now we can construct the model.

We must construct the model differently in this case because we will have nine input layers and nine embeddings, the outputs of which (the nine different 10-element vectors) need to be concatenated into one long vector before being passed as input to the dense layers.

We can achieve this using the functional Keras API. If you are new to the Keras functional API, see the post:

First, we can enumerate each variable and construct an input layer and connect it to an embedding layer, and store both layers in lists. We need a reference to all of the input layers when defining the model, and we need a reference to each embedding layer to concatenate them with a merge layer.

We can then merge all of the embedding layers, define the hidden layer and output layer, then define the model.

When using a model with multiple inputs, we will need to specify a list that has one dataset for each input, e.g. a list of nine arrays each with one column in the case of our dataset. Thankfully, this is the format we returned from our prepare_inputs() function.

Therefore, fitting and evaluating the model looks like it does in the previous section.
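The sketch below puts these steps together on placeholder data. The random integer arrays stand in for the nine encoded inputs, and the optimizer, number of epochs, and batch size are assumptions chosen for speed. Because no flatten layer is used, each embedding outputs a 3D tensor and so does the model, so the targets are reshaped to 3D to match.

```python
# sketch: one input and one embedding per categorical variable, merged into an MLP
from numpy import arange, unique
from numpy.random import default_rng
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, concatenate

# placeholder data: nine one-column integer-encoded inputs (assumption: 4 labels each)
rng = default_rng(1)
n_train, n_test, n_vars, n_cats = 60, 20, 9, 4
X_train_enc = [rng.permutation(arange(n_train) % n_cats).reshape((n_train, 1)) for _ in range(n_vars)]
X_test_enc = [rng.integers(0, n_cats, size=(n_test, 1)) for _ in range(n_vars)]
# targets reshaped to 3D to match the 3D model output
y_train_enc = rng.integers(0, 2, size=(n_train, 1, 1))
y_test_enc = rng.integers(0, 2, size=(n_test, 1, 1))

in_layers, em_layers = list(), list()
for i in range(len(X_train_enc)):
    # calculate the number of unique labels for this variable
    n_labels = len(unique(X_train_enc[i]))
    # define the input layer and connect it to a 10-dimension embedding
    in_layer = Input(shape=(1,))
    em_layer = Embedding(n_labels, 10)(in_layer)
    # store references to both layers
    in_layers.append(in_layer)
    em_layers.append(em_layer)
# concatenate the nine 10-element vectors into one 90-element vector
merge = concatenate(em_layers)
dense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)
output = Dense(1, activation='sigmoid')(dense)
model = Model(inputs=in_layers, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# the inputs are passed as a list with one array per input layer
model.fit(X_train_enc, y_train_enc, epochs=5, batch_size=16, verbose=0)
_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy * 100))
```

As the labels here are random, the accuracy is not meaningful; the sketch only demonstrates how the layers are wired and how the list of inputs is passed.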

Additionally, we will plot the model by calling the plot_model() function and save it to file. This requires that pydot and Graphviz are installed, which can be a pain on some systems. If you have trouble, just comment out the import statement and call to plot_model().

Tying this all together, the complete example of using a separate embedding for each categorical input variable in a multi-input layer model is listed below.

Running the example prepares the data as described above, fits the model, and reports the performance.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, the model performs reasonably well, matching what we saw for the one hot encoding in the previous section.

As the learned vectors were trained in a skilled model, it is possible to save them and use them as a general representation for these variables in other models that operate on the same data. This is a useful and compelling reason to explore this encoding.

To confirm our understanding of the model, a plot is created and saved to the file embeddings.png in the current working directory.

The plot shows the nine inputs each mapped to a 10-element vector, meaning that the actual input to the model is a 90-element vector.


Plot of the Model Architecture With Separate Inputs and Embeddings for each Categorical Variable


Common Questions

This section lists some common questions and answers when encoding categorical data.

Q. What if I have a mixture of numeric and categorical data?

Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.
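As a rough sketch of this idea, the example below one hot encodes a made-up categorical column and joins it back onto a made-up numeric column with hstack(). The column contents are hypothetical and only illustrate the mechanics.

```python
# sketch: encode a categorical column separately, then join it to a numeric column
from numpy import array, hstack
from sklearn.preprocessing import OneHotEncoder

# made-up data: one numeric column and one categorical column
X_num = array([[35.0], [52.0], [41.0]])
X_cat = array([['red'], ['green'], ['red']])
# one hot encode the categorical column (sorted categories: green, red)
ohe = OneHotEncoder()
X_cat_enc = ohe.fit_transform(X_cat).toarray()
# concatenate the prepared columns into a single input array
X = hstack([X_num, X_cat_enc])
print(X.shape)
print(X)
```

With a train/test split, the encoder would be fit on the training rows only and then applied to both sets, as in the earlier sections.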

Q. What if I have hundreds of categories?

Or, what if I concatenate many one hot encoded vectors to create a many thousand element input vector?

You can use a one hot encoding up to thousands and tens of thousands of categories. Also, having large vectors as input sounds intimidating, but the models can generally handle it.

Try an embedding; it offers the benefit of a smaller vector space (a projection) and the representation can have more meaning.

Q. What encoding technique is the best?

This is unknowable.

Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.


Summary

In this tutorial, you discovered how to encode categorical data when developing neural network models in Keras.

Specifically, you learned:

  • The challenge of working with categorical data when using machine learning and deep learning models.
  • How to integer encode and one hot encode categorical variables for modeling.
  • How to learn an embedding distributed representation as part of a neural network for categorical variables.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


60 Responses to 3 Ways to Encode Categorical Variables for Deep Learning

  1. Rahil Shaikh November 22, 2019 at 5:28 am #

    It’s midnight over here in India and My eyes are just shutting off … But the moment I read the subject of your mail… I knew I had to read to this!!
    Y u ask?
    For the embedding explanation . Although I’ll have my doubts lined up as and when I try it out…
    … I wish to express my gratitude towards the Amazing knowledge you share with the world!

    • Jason Brownlee November 22, 2019 at 6:13 am #

      Thanks, I hope it helps on your next project!

    • ANTHONY LINS February 27, 2020 at 12:02 pm #

      Hi Jason,

      I would like to know How to handle with multi-class and multi-label in the same dataset, for example hair color, skin color and eye color (multi-output) defined by the set of attributes for genetic markups as genes (CC, TC, etc).
      If you have any kind of reference for a problem like that, I appreciate your help.

      Thanks in advance.

    • Akash Saha April 6, 2020 at 4:37 pm #

      Hi sir,
      Recently i am facing some doubt regarding how to encode a categorical data that is given in the form of bins.Actually i want to use “Age” data to build my decision tree.In some blogs it is recommended to convert continous numerical data into categorical bin data before using DT.
      My doubt is after i created the bins using Discretization , how to use this bin data to build my decision Tree?? Thank in advance!

      • Jason Brownlee April 7, 2020 at 5:39 am #

        The transformed data is then used as input to the model.

  2. martin November 22, 2019 at 9:33 am #

    Hi, Jason: Regarding “the embedding expects the categories to be ordinal encoded”, is this true for any type of entity embedding? That is, the ‘ordinal’ encoding is a must.

    • Jason Brownlee November 22, 2019 at 2:07 pm #

      Excellent question!

      Yes, but the mapping from labels to integers does not have to be meaningful.

      I should probably have said “label encoded” or “integer encoded” to sound less scary. Sorry.

  3. martin November 22, 2019 at 10:39 am #

    Although it says “expects the categories to be ordinal encoded”, the inputs are still prepared with LabelEncoder(), not OrdinalEncoder(). Why is that?

    • Jason Brownlee November 22, 2019 at 2:08 pm #

      Another top question, thanks!

      They both do the same thing.

      Label encoder is for one column – explicitly. Ordinal encoder is for a variable number of columns.

  4. martin November 22, 2019 at 5:30 pm #

    Another question is why two Embedding objects are of different types? One is from this tutorial, and its type is “Tensor(“embedding_1/embedding_lookup/Identity_2:0″, shape=(None, 1, 5), dtype=float32)” due to functional api, and its shape is (None, 1, 5). In the other tutorial, https://machinelearningmastery.com/what-are-word-embeddings/, the Embedding object is “”, and it doesn’t even have the ‘_shape’ variable. The reason I am asking this is because in the 2nd tutorial it must use a Flatten() layer, but in 1st tutorial, it doesn’t use it. Both are embedding objects, and why their internal attributes are different?

    • Jason Brownlee November 23, 2019 at 6:45 am #

      Often embeddings are used with sequences of words as input.

      Here, we have one embedding for one category. No flatten required.

  5. eppane November 23, 2019 at 1:06 am #

    Hello! Very helpful article, especially the Embedding part. One question in which I ran into during my own application:

    Do you have any suggestions how to deal with a situation where the test set has unseen labels when compared to the training set? For an example below, as we fit the LabelEncoder with training data,

    le.fit(X_train[:, i])
    # encode
    train_enc = le.transform(X_train[:, i])
    —> test_enc = le.transform(X_test[:, i])

    The last line would throw value error about the test set containing previously unseen labels. One option would be fitting the LabelEncoder with all data but that results into information leak which is undesirable. In reality the test set (or validation set) can certainly have previously unseen labels.

    Cheers!

    • Jason Brownlee November 23, 2019 at 6:53 am #

      Thanks!

      Excellent question!!!

      Yes, you can remove rows with unknown labels (painful), or map unknown labels to an “unknown” vector in the embedding, typically vector at index 0 can be reserved for this – in NLP applications.

      This will require more careful encoding of labels to integers, might be best to write a custom function to ensure it is consistent.

      Does that help?

      • eppane November 25, 2019 at 8:07 am #

        Hello!

        Thank you for the quick and helpful response! Removing rows with unknown labels is unfortunately out of the question. I was afraid that it will require custom encoding. But it is an intriguing problem and might be crucial for my application, where I am analyzing flow-based network data and trying to encode IP-addresses and ports. An ideal solution would be, while new data points are introduced, updating the labels dynamically in a way that they can be fed to the neural net (autoencoder in this case).

        Cheers, keep up the awesome work!

      • Mikkel Hansen December 9, 2019 at 9:37 pm #

        Do you happen to have a solution for this for an XGBoost model (I am using Sklearn’s OrdinalEncoder).

  6. Zineb_Morocco November 24, 2019 at 5:22 am #

    Hi,

    The article is very helpful to understand the embedding technique. I recommend you to follow and run the examples to obtain deep perception of it.
    I would like just to add that this technique is now widely used in many fields such as NLP, the biological field, image processing, especially where the data are structured as a graph by simulating the node to a word and the edge to a sentence.

    Thank you Jason.

  7. Niall Xie November 27, 2019 at 2:59 am #

    hello, i’m getting an error when I try to use the last code with my dataset breast cancer. how can I fix this error? Thanks ValueError: y contains previously unseen labels: [‘clump_thickness’]

    • Jason Brownlee November 27, 2019 at 6:12 am #

      Sorry to hear that.

      If you are experiencing this issue with a OneHotEncoder, you can set handle_unknown to ‘ignore’, e.g.

  8. Markus December 6, 2019 at 6:58 am #

    Hi

    In this article it says:

    Note: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.

    I tried that, and the model didn’t perform better than around 70% accuracy. Have you also tried that out? No improvement in accuracy is also what you would expect?

    Another question: It’s not possible to specifying the order ONLY for those variables that have a natural ordering, you either need to specify it for all or none of them, or am I wrong? I used the categories parameter for that purpose, see:

    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

    Thanks

    • Jason Brownlee December 6, 2019 at 1:37 pm #

      Very cool, thanks for trying.

      No, I have not tried.

      No, you can process each variable separately and union / vstack them back together prior to modeling. It’s a pain and the reason why I left it as an exercise.

  9. Ken December 10, 2019 at 8:34 pm #

    how to encode a column with 10 thousand unique values stating the column to be city names ,
    for unsupervised clustering thechniques

    • Jason Brownlee December 11, 2019 at 6:52 am #

      Great question!

      I would recommend a learned embedding with a neural net, then compare results to other methods, like one hot, hashing, etc.

  10. Ken December 12, 2019 at 12:26 am #

    Thank you,

    • Ken December 12, 2019 at 12:36 am #

      Any resource or solution available on Neural net embeddings?

      • Jason Brownlee December 12, 2019 at 6:26 am #

        The above tutorial shows how.

        Also, I have many other examples on the blog, use the search box.

        What are you having trouble with exactly?

    • Jason Brownlee December 12, 2019 at 6:25 am #

      You’re welcome.

    • Hussain December 24, 2019 at 5:48 pm #

      Have you attended the NIPS conference ?

  11. mskilic January 12, 2020 at 1:43 am #

    Thanks for this excellent and very useful article.

    But I want to ask a question about the encoding phase. Do train and test data encoding separately is logical? For example, if a feature has different categorical values in test and train data, is it possible to trust the model? Or the model runs correctly? Maybe it will be more optimum, splitting the data as train and test after encoding…

    Best regards,

    • Jason Brownlee January 12, 2020 at 8:06 am #

      You’re welcome.

      The training set must be sufficiently representative of the problem – e.g. contain one example of each variable.

  12. scott January 25, 2020 at 6:03 am #

    Hi Jason,

    I am trying to apply your embedding code to some of my data. All y and x variables in the dataset(s) are string data types.

    When I run the section:

    in_layers = list()
    em_layers = list()
    for i in range(len(X_train_enc)):
        # calculate the number of unique inputs
        n_labels = len(unique(X_train_enc[i]))
        # define input layer
        in_layer = Input(shape=(1,))
        # define embedding layer
        em_layer = Embedding(n_labels, 10)(in_layer)
        # store layers
        in_layers.append(in_layer)
        em_layers.append(em_layer)

    I get the error:

    —————————————————————————
    NameError Traceback (most recent call last)
    in
    2 in_layers = list()
    3 em_layers = list()
    —-> 4 for i in range(len(X_train_enc)):
    5 # calculate the number of unique inputs
    6 n_labels = len(unique(X_train_enc[i]))

    NameError: name ‘X_train_enc’ is not defined

    Do have any ideas as to what the problem may be?

    Thanks.

    scott

    • Jason Brownlee January 25, 2020 at 8:45 am #

      You may have skipped some lines.

      Perhaps start with the working example and slowly adapt it to use your own dataset.

  13. scott January 28, 2020 at 1:52 am #

    Hi Jason,

    I have been trying to adapt your embedding code to a multi-label classification, but have been unsuccessful. I am trying to predict 14 binary labels, using 89 categorical predictors. How would I have to change your code to account for a multi-label problem? Thank you.

    • Jason Brownlee January 28, 2020 at 7:57 am #

      Sounds great.

      All of the encoding schemes are for the input variables. No change required really, you can use them directly.

      • scott January 29, 2020 at 1:07 am #

        Thanks for your prompt reply Jason. I am not sure exactly what you mean by “encoding schemes are for the input variables”.

        Regardless, I have gone through your tutorial again, but still have not been able to figure out how to adapt it to a multi-label case.

        As I mentioned, I am trying to predict 14 binary labels, with 89 features. So instead of a single outcome vector/matrix outcome nx1, I have a matrix of nx14. I am not sure how to account for this in your code.

        I believe that I may be encountering problems in a few areas of your code while trying to adapt it to multi-label classification. I will try to explain these below.

        I am not sure why you need to convert to 2d array in your example code below, and how I should change this if I am predicting 14, not 1, label.

        – # reshape target to be a 2d array
        – y = y.reshape((len(y), 1))

        Also, my 14 targets are already in binary form and have a string datatype – 14 individual columns coded with 0s and 1s – so I’m not sure if I even need to include the following code. I believe I still need to generate the y_train_enc and y_test_enc, but am not sure of what the format they should be.

        – # prepare target
        – def prepare_targets(y_train, y_test):
        – le = LabelEncoder()
        – le.fit(y_train)
        – y_train_enc = le.transform(y_train)
        – y_test_enc = le.transform(y_test)
        – return y_train_enc, y_test_enc

        Also, this part confuses me. I don’t know exactly why you format the output as 3d (?array?). How would the following code be changed to account for 14 label outcomes/predictions?

        – # make output 3d
        – y_train_enc = y_train_enc.reshape((len(y_train_enc), 1, 1))
        – y_test_enc = y_test_enc.reshape((len(y_test_enc), 1, 1))

        Also, in your example you had 9 categorical features and 1 binary outcome.
        I have 89 features and 14 binary outcomes. I am confused as to why you specify “10” in the code snippet below. In my case, would I have to specify 103 (89+14)?

        – # define embedding layer
        – em_layer = Embedding(n_labels, 10)(in_layer)

        Also, what does “n_labels” actually represent
        (i.e., not sure what this means – len(unique(X_train_enc[i]))),
        and in the case of your example what the number would actually be?

        Finally, is the “10” in your following code dictated because your example includes a total of 10 variables (9 features and 1 outcome)?

        – dense = Dense(10, activation=’relu’, kernel_initializer=’he_normal’)(merge)

        In the end, I also want to be able to produce an nx14 pd dataframe that contains the class (0/1, not probabilities) predictions for all 14 labels, from which I can use scikit learn to produce multi-label performance metrics.

        I have not been able to find another example that comes close to doing what I need to do – namely use categorical feature embedding in a multi-label classification case.
        I greatly appreciate you taking the time to answer my questions. I have always found your tutorials and posts extremely valuable.

        Scott

        • Jason Brownlee January 29, 2020 at 6:43 am #

          That is a huge comment, I cannot read/process it all.

          It sounds like you are trying to encode the target variable rather than the input variable. This tutorial is focused on encoding the input variables.

          If you want to encode a target variable with n classes for multi-label classification, you must use this:
          https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
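As a small illustration of the class linked above, the sketch below turns sets of label names into a binary indicator matrix with one column per label. The sample data is made up for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data: each sample carries a (possibly empty) set of label names.
samples = [("fever", "cough"), ("cough",), (), ("fever", "rash")]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(samples)  # shape: (n_samples, n_classes)

print(mlb.classes_)  # label names in column order (sorted)
print(y)             # one row per sample, one 0/1 column per label
```

If the targets are already stored as separate 0/1 columns, this step is not needed and the columns can be used directly.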

          • scott January 29, 2020 at 11:06 pm #

            Jason,

            Sorry for the long comment.

            I am not trying to encode the target – I have 14 binary target columns – 14 labels.

            Perhaps you can answer just this one question.

            Why do you format the output as a 3d array? How would the following code from your example be changed to handle 14 binary label outcomes/predictions (as opposed to your example, which is classification of 1 binary label)?

            – # make output 3d
            – y_train_enc = y_train_enc.reshape((len(y_train_enc), 1, 1))
            – y_test_enc = y_test_enc.reshape((len(y_test_enc), 1, 1))

            My guess is that I would change the last 1 to 14:

            – y_train_enc = y_train_enc.reshape((len(y_train_enc), 1, 14))
            – y_test_enc = y_test_enc.reshape((len(y_test_enc), 1, 14))

            Am I correct?

            Thanks again.

            Scott

          • Jason Brownlee January 30, 2020 at 6:52 am #

            We don’t make any output arrays 3d in this post. Are you referring to a different post perhaps?

            For some models, like encoder-decoder models we need to have a 3d output, e.g. an output sequence for each input sequence. I think this is what you are referring to.

            If so, you would have n samples, t time steps, and f features, where f features would be the 14 labels.
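The (samples, time steps, features) layout described in the reply above can be sketched with NumPy alone. The target matrix here is made up; the only point is the shape change from 2D to 3D with a single time step and 14 label features.

```python
import numpy as np

n_samples, n_labels = 5, 14

# Hypothetical 2D target matrix: one row per sample, one 0/1 column per label.
y = (np.arange(n_samples * n_labels).reshape(n_samples, n_labels) % 2).astype(int)

# Insert a singleton time-step axis: (samples, time steps, features).
y_3d = y.reshape((n_samples, 1, n_labels))

print(y.shape)     # (5, 14)
print(y_3d.shape)  # (5, 1, 14)
```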

  14. scott January 30, 2020 at 11:53 pm #

    Jason.

    Then what do lines 59-61 in your full embedding code do?

    They seem to indicate converting 2d to 3d. Below is the code lines 59-61 in your embedding example.

    # make output 3d
    y_train_enc = y_train_enc.reshape((len(y_train_enc), 1, 1))
    y_test_enc = y_test_enc.reshape((len(y_test_enc), 1, 1))

    Thanks again.

    • Jason Brownlee January 31, 2020 at 7:53 am #

      Ah I see. Thanks, I missed that.

      Umm, I think the embedding should be flattened before going into the dense layer. We don’t, so the structure stays 3d all the way to the output. It’s not really 3d, just 1 time step and 1 feature per sample.

      You could try to wrestle with the 3d output or try adding a flatten layer after the concatenated embeddings. I think that would do the trick, off the top of my head. I believe I did that when working with nlp models:
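To make the shapes in this reply concrete, here is a NumPy-only sketch that imitates what happens to the data, not the Keras model itself. The lookup tables stand in for learned embedding weights, and the two cardinalities are invented for the example. In Keras, the last step would correspond to adding a Flatten layer after the merged embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, embed_dim = 4, 10
n_categories = [3, 5]  # hypothetical cardinalities of two categorical inputs

embedded = []
for n_cat in n_categories:
    # Stand-in for learned embedding weights: one vector per category.
    table = rng.normal(size=(n_cat, embed_dim))
    # One integer code per sample, shaped (n_samples, 1) like a single-column input.
    codes = rng.integers(0, n_cat, size=(n_samples, 1))
    # Lookup keeps the length-1 axis: (n_samples, 1, embed_dim).
    embedded.append(table[codes])

# Concatenating on the last axis keeps the data 3D.
merged = np.concatenate(embedded, axis=-1)
print(merged.shape)  # (4, 1, 20)

# Flattening collapses everything after the sample axis into one vector.
flat = merged.reshape((n_samples, -1))
print(flat.shape)  # (4, 20)
```

With the flattened version, the targets can stay 2D (samples by outputs) and no reshape of the output array is needed.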

  15. scott February 1, 2020 at 1:58 am #

    Jason,

    I figured my main problem out. I just needed to change the 3rd dim from 1 to 14 here, assuming I have 14 labels to predict.

    Just a FYI – below is part of the code I used to run the embedding model and then calculate multilabel metrics (using sklearn.metrics).

    Starting with 14 labels and 89 predictors, as pandas dataframes:

    Hope this example is of use to somebody.

    Thanks again for your help, Jason.

    scott

  16. Scott February 8, 2020 at 12:55 am #

    Hi, again, Jason,

    It appears that you use 10 as the embedding dimension in the “prepare head” section of your code. Does that mean you are using 10 as the dimension for ALL the categorical variables?

    If so, I would like to assign a different embedding dimension to each categorical variable, given that the number of levels in each categorical variable can be different. I think my following code would do this, but I’m not sure it integrates with your code properly. Note that the key part I added is “cat_embsizes[cat]”, which replaces your entry of “10”, the embedding dimension. Here I calculate the number of unique values of each categorical variable, calculate the dimension I want to use for each variable, then apply that list to your “prepare head” code. Please let me know what you think. Will this work? Thanks again.
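The commenter's own code is not shown, so the following is only a guess at the general idea: count the levels of each integer-encoded variable, then derive one embedding size per variable. The data is invented, and the sizing rule (half the number of levels, rounded up, capped at 50) is a common rule of thumb chosen here as an assumption, not something taken from the tutorial.

```python
import numpy as np

# Hypothetical integer-encoded inputs, one column per categorical variable.
X_enc = np.array([
    [0, 2, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 3, 1],
    [0, 4, 0],
])

# Number of distinct levels per variable (the role "n_labels" plays in the tutorial).
n_levels = [len(np.unique(X_enc[:, i])) for i in range(X_enc.shape[1])]

# Assumed rule of thumb: half the number of levels, rounded up, capped at 50.
embed_sizes = [min(50, (n + 1) // 2) for n in n_levels]

print(n_levels)     # [2, 5, 2]
print(embed_sizes)  # [1, 3, 1]
```

Each pair could then replace the fixed values in the tutorial's loop, e.g. `Embedding(n_levels[i], embed_sizes[i])` instead of `Embedding(n_labels, 10)`.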

  17. Dana February 13, 2020 at 11:58 pm #

    I think the sentence “In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.” is out of place.
    Am I wrong?

  18. kiki March 12, 2020 at 7:00 pm #

    Hi. I want to ask how to split the X and y variables without ‘,’, since I take the data from Excel without a delimiter. I always get an error like this:

    ValueError: Error when checking target: expected dense_18 to have shape (1,) but got array with shape (31,)

    I need an explanation of this. Thank you.

  19. Cesar March 18, 2020 at 1:00 pm #

    Hi Jason,
    Awesome post thanks!

    I want to ask something. I was trying to apply the one hot and ordinal encodings, but my dataset has both types of variables, categorical and continuous. As you said, for the ordinal encoding I preprocessed each column and concatenated all the variables back together into a single array, and it works. Doing the same with the one hot encoding didn’t work: when I try “np.concatenate((x_cont,x_cat))” I get the error “all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 0 dimension(s)” and I don’t know how to solve it.

    Thanks!
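One likely cause of the error quoted in this comment is that scikit-learn's OneHotEncoder returns a SciPy sparse matrix by default, which NumPy treats as a 0-dimensional object when concatenating. This is an assumption about the commenter's setup, since their code is not shown. A minimal sketch with made-up data, converting the encoded result to a dense array before joining:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one continuous column and one categorical column.
x_cont = np.array([[1.5], [2.0], [3.5]])
x_cat = np.array([["red"], ["blue"], ["red"]])

# By default the encoder returns a SciPy sparse matrix.
encoded = OneHotEncoder().fit_transform(x_cat)

# Convert to a dense 2D NumPy array before concatenating column-wise.
x_cat_dense = encoded.toarray()
X = np.concatenate((x_cont, x_cat_dense), axis=1)

print(X.shape)  # (3, 3)
print(X)
```

For large, high-cardinality data, keeping everything sparse and joining with `scipy.sparse.hstack` may be preferable to densifying.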

  20. Pierre March 21, 2020 at 3:46 pm #

    Hi Jason,
    Another question, concerning cases with both numeric and categorical data and the Learned Embedding technique: do I need to add an input layer to the model for the numerical data (in addition to the layers for the categorical variables)?
    Thanks!

Leave a Reply