How to One Hot Encode Sequence Data in Python

Machine learning algorithms cannot work with categorical data directly.

Categorical data must be converted to numbers.

This applies when you are working with a sequence classification type problem and plan on using deep learning methods such as Long Short-Term Memory recurrent neural networks.

In this tutorial, you will discover how to convert your input or output sequence data to a one hot encoding for use in sequence classification problems with deep learning in Python.

After completing this tutorial, you will know:

  • What an integer encoding and one hot encoding are and why they are necessary in machine learning.
  • How to calculate an integer encoding and one hot encoding by hand in Python.
  • How to use the scikit-learn and Keras libraries to automatically encode your sequence data in Python.

Let’s get started.

How to One Hot Encode Sequence Classification Data in Python
Photo by Elias Levy, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. What is One Hot Encoding?
  2. Manual One Hot Encoding
  3. One Hot Encode with scikit-learn
  4. One Hot Encode with Keras

What is One Hot Encoding?

A one hot encoding is a representation of categorical variables as binary vectors.

This first requires that the categorical values be mapped to integer values.

Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.

Worked Example of a One Hot Encoding

Let’s make this concrete with a worked example.

Assume we have a sequence of labels with the values ‘red’ and ‘green’.

We can assign ‘red’ an integer value of 0 and ‘green’ the integer value of 1. As long as we always assign these numbers to these labels, this is called an integer encoding. Consistency is important so that we can invert the encoding later and get labels back from integer values, such as in the case of making a prediction.

Next, we can create a binary vector to represent each integer value. The vector will have a length of 2 for the 2 possible integer values.

The ‘red’ label encoded as a 0 will be represented with a binary vector [1, 0] where the zeroth index is marked with a value of 1. In turn, the ‘green’ label encoded as a 1 will be represented with a binary vector [0, 1] where the first index is marked with a value of 1.

If we had a sequence of these labels, we could represent it with the integer encoding and, in turn, the one hot encoding, as sketched below.
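
A minimal illustrative sketch in Python, assuming the short example sequence 'red', 'red', 'green' (the specific sequence is chosen here only for demonstration):

# map labels to integers, then integers to binary vectors
sequence = ['red', 'red', 'green']  # assumed example sequence
label_to_int = {'red': 0, 'green': 1}
integer_encoded = [label_to_int[label] for label in sequence]
print(integer_encoded)  # [0, 0, 1]
one_hot_encoded = [[1 if i == value else 0 for i in range(2)] for value in integer_encoded]
print(one_hot_encoded)  # [[1, 0], [1, 0], [0, 1]]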

Why Use a One Hot Encoding?

A one hot encoding allows the representation of categorical data to be more expressive.

Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. This is required for both input and output variables that are categorical.

We could use an integer encoding directly, rescaled where needed. This may work for problems where there is a natural ordinal relationship between the categories, and in turn the integer values, such as labels for temperature 'cold', 'warm', and 'hot'.
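
For instance, a minimal sketch of a plain integer encoding for these ordinal labels (the mapping below is an assumption that simply preserves the natural ordering):

# ordinal labels map naturally to ordered integers
temperature_to_int = {'cold': 0, 'warm': 1, 'hot': 2}
encoded = [temperature_to_int[t] for t in ['cold', 'hot', 'warm']]
print(encoded)  # [0, 2, 1]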

There may be problems when there is no ordinal relationship and allowing the representation to lean on any such relationship might be damaging to learning to solve the problem. An example might be the labels 'dog' and 'cat'.

In these cases, we would like to give the network more expressive power to learn a probability-like number for each possible label value. This can help make the problem easier for the network to model. When a one hot encoding is used for the output variable, it may offer a more nuanced set of predictions than a single label.

Manual One Hot Encoding

In this example, we will assume the case where we have an example string of lowercase alphabet characters, but the example sequence does not cover all possible values.

We will use an input sequence of characters; the specific string appears in the complete example below.

We will assume that the universe of all possible inputs is the complete alphabet of lowercase characters, plus the space character. We will therefore use this as an excuse to demonstrate how to roll our own one hot encoding.

The complete example is listed below.
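
A sketch of the full program follows. The input string 'hello world' and the 27-character alphabet (26 lowercase letters plus a space) are assumptions made for this listing; they are consistent with the walkthrough below, where 'h' maps to the integer 7 and each binary vector has length 27.

from numpy import argmax

# define the input string (assumed for illustration)
data = 'hello world'
print(data)

# define the universe of possible input values: lowercase letters plus space
alphabet = 'abcdefghijklmnopqrstuvwxyz '

# define mappings of characters to integers and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

# integer encode the input data
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)

# one hot encode: one binary vector per character
onehot_encoded = list()
for value in integer_encoded:
    letter = [0 for _ in range(len(alphabet))]
    letter[value] = 1
    onehot_encoded.append(letter)
print(onehot_encoded)

# invert the encoding of the first character
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)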

Running the example first prints the input string.

A mapping of all possible inputs is created from char values to integer values. This mapping is then used to encode the input string. We can see that the first letter in the input ‘h’ is encoded as 7, or the index 7 in the array of possible input values (alphabet).

The integer encoding is then converted to a one hot encoding. This is done one integer encoded character at a time. A list of 0 values, the length of the alphabet, is created so that any expected character can be represented.

Next, the index of the specific character is marked with a 1. We can see that the first letter 'h', integer encoded as a 7, is represented by a binary vector of length 27 with index 7 marked with a 1.

Finally, we invert the encoding of the first letter and print the result. We do this by locating the index of the largest value in the binary vector using the NumPy argmax() function, and then using that integer in a reverse lookup table that maps integers back to character values.

Note: output was formatted for readability.

Now that we have seen how to roll our own one hot encoding from scratch, let’s see how we can use the scikit-learn library to perform this mapping automatically for cases where the input sequence fully captures the expected range of input values.

One Hot Encode with scikit-learn

In this example, we will assume the case where you have an output sequence of the following 3 labels: 'cold', 'warm', and 'hot'.

An example sequence of 10 time steps drawn from these labels appears in the complete example below.

This would first require an integer encoding, such as 0, 1, 2. This would be followed by a one hot encoding of the integers to a binary vector with 3 values, such as [1, 0, 0].

The sequence provides at least one example of every possible value. Therefore we can use automatic methods to define the mapping of labels to integers and integers to binary vectors.

In this example, we will use the encoders from the scikit-learn library. Specifically, the LabelEncoder for creating an integer encoding of labels and the OneHotEncoder for creating a one hot encoding of integer encoded values.

The complete example is listed below.
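
A sketch of the full program follows. The particular 10-step sequence is an assumption chosen for illustration; its first value is 'cold', matching the inverse transform described below. Note that recent scikit-learn releases rename the sparse argument of OneHotEncoder to sparse_output.

from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# define an example sequence of 10 labels (assumed ordering, for illustration)
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)

# integer encode the labels
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

# binary encode; the OneHotEncoder expects 2D input, hence the reshape
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

# invert the first example back to a text label
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)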

Running the example first prints the sequence of labels. This is followed by the integer encoding of the labels and finally the one hot encoding.

The training data contained the set of all possible examples so we could rely on the integer and one hot encoding transforms to create a complete mapping of labels to encodings.

By default, the OneHotEncoder class will return a more efficient sparse encoding. This may not be suitable for some applications, such as use with the Keras deep learning library. In this case, we disabled the sparse return type by setting the sparse=False argument.

If we receive a prediction in this 3-value one hot encoding, we can easily invert the transform back to the original label.

First, we can use the argmax() NumPy function to locate the index of the column with the largest value. This can then be fed to the LabelEncoder to calculate an inverse transform back to a text label.

This is demonstrated at the end of the example with the inverse transform of the first one hot encoded example back to the label value ‘cold’.

Again, note that the output was formatted for readability.

In the next example, we look at how we can directly one hot encode a sequence of integer values.

One Hot Encode with Keras

You may have a sequence that is already integer encoded.

You could work with the integers directly, after some scaling. Alternately, you can one hot encode the integers directly. This is important to consider if the integers do not have a real ordinal relationship and are really just placeholders for labels.

The Keras library offers a function called to_categorical() that you can use to one hot encode integer data.

In this example, we have 4 integer values [0, 1, 2, 3] and an input sequence of the following 10 numbers: [1, 3, 2, 0, 3, 2, 2, 1, 0, 1].

The sequence has an example of all known values, so we can use the to_categorical() function directly. Alternately, if the sequence was 0-based (started at 0) but was not representative of all possible values, we could specify the number of classes via the num_classes argument, e.g. to_categorical(data, num_classes=4).

A complete example of this function is listed below.
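
A sketch of the full program follows. The import path for to_categorical differs between standalone Keras and tf.keras; the standalone form is assumed here.

from numpy import array
from numpy import argmax
from keras.utils import to_categorical

# define the example sequence of integer encoded values
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
print(data)

# one hot encode; num_classes could also be passed explicitly, e.g. to_categorical(data, num_classes=4)
encoded = to_categorical(data)
print(encoded)

# invert the encoding of the first value
inverted = argmax(encoded[0])
print(inverted)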

Running the example first defines and prints the input sequence.

The integers are then encoded as binary vectors and printed. We can see that the first integer value 1 is encoded as [0, 1, 0, 0] just like we would expect.

We then invert the encoding by using the NumPy argmax() function on the first value in the sequence, which returns the expected value of 1 for the first integer.

Further Reading

This section lists some resources for further reading.

Summary

In this tutorial, you discovered how to encode your categorical sequence data for deep learning using a one hot encoding in Python.

Specifically, you learned:

  • What integer encoding and one hot encoding are and why they are necessary in machine learning.
  • How to calculate an integer encoding and one hot encoding by hand in Python.
  • How to use the scikit-learn and Keras libraries to automatically encode your sequence data in Python.

Do you have any questions about preparing your sequence data?
Ask your questions in the comments and I will do my best to answer.

70 Responses to How to One Hot Encode Sequence Data in Python

  1. Natallia Lundqvist July 12, 2017 at 7:02 pm #

    Thank you for this post! It came really timely. A question. Utility to_categorical(data) accepts vector as input. What would one do in case you have a 2D tensor as input? Would you loop over number of samples (could be several hundred of thousands entries + do the model training in a loop) or one would have to do seq = tokenizer.texts_to_sequences(inputseq) and then tokenizer.sequences_to_matrix(seq, mode=’binary’)?? The last option will give a 2D tensor as output in form array([ 0., 1., 1., …, 0., 0., 0.]). In that case model training goes as usual, but for decoding predictions one would have to loop to find all maximums (argmax give ONLY the first maximum).

    • Jason Brownlee July 13, 2017 at 9:52 am #

      Ouch, it is hard to give good advice without specifics. Choose a formulation that preserves the structure of your sequence.

  2. Franco July 14, 2017 at 5:30 am #

    Awesome post again Jason! Thank you. I’ve just added it to my ML checklist.

  3. Rohit July 14, 2017 at 10:22 am #

    how do we perform integer encoding in keras ? Is there any inbuild function like LabelEncoder in scikit ?

  4. Moe July 24, 2017 at 2:20 pm #

    Hi Jason,

    Great article!

    What if the sequence was not representative of all possible values, and we don’t know the num of classes to set the num_classes argument?

    • Jason Brownlee July 25, 2017 at 9:26 am #

      Thanks.

      Yes, that is a challenge.

      Off the cuff, you may need to re-encode data in the future. Or choose an encoding that has “space” for new values that you have not seen yet. I expect there is some good research on this topic.

  5. Malcolm August 17, 2017 at 9:44 pm #

    Informative post as always!

    I understand how this is used to train a model.
    However, how can you ensure that data you want to get predictions for is encoded in the same way as the training data? E.g. ‘hot’ maps to column 3.

    For my application once the model is trained it will need to provide predictions at a later date and on different machines.

    • Jason Brownlee August 18, 2017 at 6:19 am #

      Great question!

      It must be consistent. Either you use the same code, and/or you save the “model” that performs the transform.

  6. sasi September 11, 2017 at 4:16 pm #

    Thank u!!!!!! great work……….

  7. NVK September 29, 2017 at 3:37 am #

    Hey there,
    Great tutorial as always!
    Quick question though – what if my dataset contains both categorical and continuous values? Wouldn’t OH encoding encode the entire dataset, when all I really need is just the categorical columns encoded?
    Also would the recommended flow be to (a) scale the numeric data only (b) encode the whole dataset?
    Or (a) encode the whole dataset and then (b) scale all values?

    • Jason Brownlee September 29, 2017 at 5:09 am #

      I would recommend only encoding the categorical variables.

  8. ed October 4, 2017 at 5:03 am #

    Hi there,

    I am working on model productization. To reduce the data processing gap between experiment and production, would like to embed one hot encoding in Keras model. The questions are

    1. When using “to_categorical”, it will convert categorical on the fly and it seems break the encoding. For example, we have “apple”, “orange” and “banana” when training model. After rollout the model, there include unseen category such as “lemon”. How can we handle this kind of scenario?

    2. As want to embed the encoding in Keras. Therefore, trying to not using sklearn label encoder and one hot encoder. Does it possible to do that?

    Please kindly advise.

    • Jason Brownlee October 4, 2017 at 5:50 am #

      Perhaps you can develop your own small function to perform the encoding and perform it consistently?

  9. Adhaar Sharma October 5, 2017 at 8:38 am #

    Hi Jason,

    Love the website! Had a quick question:

    If I have a X dataset of 6 categorical attributes, where each attribute has lets say the following number of unique categories respectively [4, 4, 4, 3, 3, 3]. Also lets say there are 600 instances

    When I LabelEncode and then OneHotEncode this X dataset, I now rightly get a sequence of 21 0-1 values after the toarray() method on the fit_transform.

    However, when I input the X dataset like that, i.e. with a shape of (600, 21), I get a much worse error than if I had just left it LabelEncoded and with a shape of (600, 6).

    My question is am I doing something wrong? Should I be re-grouping the sequence of 21 integers that I get back into their respective clusters? For ex: I get this array for the first row as a result of the OneHotEncode:
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
    so should I group the digits back into something like this
    [(0, 0, 0, 1), (0, 0, 0, 1), (0, 0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 1, 0)]
    where now we again have a (600, 6) shape for my input dataset?

    I tried creating a numpy array with this formulation but the sci-kit decision tree classifier checks and tries to convert any numpy array where the dtype is an object, and thus the tuples did not validate.

    Essentially, I want to know whether the (600, 21) shape is causing any data loss being in that format. And if it is what is the best way to regroup the encodings into their respective attributes so I can lower my error.

    Thank you!

    • Jason Brownlee October 5, 2017 at 5:21 pm #

      Hmmm.

      The data should be 500, 21 after the encoding, so far so good.

      No need to group.

      Skill better or worse depends on the algorithm and your specific data. Everything is a test to help us discover what works for our problem.

      Consider trying more algorithms.
      Consider trying encoding only some variables and leaving others as-is or integer encode.
      Brainstorm more things to try, see this post for ideas:
      http://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

      I hope that helps as a start.

      • Adhaar Sharma October 6, 2017 at 5:30 am #

        Okay, great! I just wasn’t sure whether to debug the onehot encoder or try other things. Looks like trying other things is the way to go.

        Thanks you so much for your help Jason!

  10. Prashant October 10, 2017 at 4:24 am #

    Hii,
    Great tutorial as always!
    I am having a doubt…Why we are reshaping the integer_encoded vector
    “integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)”?

    • Jason Brownlee October 10, 2017 at 7:53 am #

      Great question, because the sklearn tools expect 2D data as input.

  11. Dolunay December 11, 2017 at 1:18 am #

    Hello,

    Thanks a lot for your great posts!
    There are some studies using only the index of the words without turning them into one-hot,

    such as:
    [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]

    they just use this sequence as input. I’d like to ask that since we generate indexes starting from 0, wouldn’t it be a problem when we’re doing 0-padding?

    Thanks a lot!

    • Jason Brownlee December 11, 2017 at 5:27 am #

      The 0 is often reserved for “no word” or “unknown word”.

      • Dolunay December 11, 2017 at 6:08 am #

        Ok, but in one-hot form, why didn’t we reserve 0 position for unknown word?

        • Jason Brownlee December 11, 2017 at 4:49 pm #

          This happens automatically when using libraries such as Keras to perform the encoding. We can also do that manually.

          • Dolunay December 11, 2017 at 7:40 pm #

            I’m doing it manually, at some point I became suspicious whether I was doing wrong, this is how I create char-index dictionaries.

            ch_ind = dict((c, i+1) for i, c in enumerate(s_chars))
            ind_ch = dict((i+1, c) for i, c in enumerate(s_chars))

            then this is how I create my one hot encoding,

            X = np.zeros((MAX_LEN, len(ch_ind)))

            for i, ch in enumerate(line[:MAX_LEN]):
                X[i, (ch_ind[ch])] = 1

            return X

            This means that I won’t have the following encoding, since I dont have a char at point 0,

            [1 0 0 0 …]

            But does this really achieve saving point 0 for padding?

            Thanks a lot for your time!

  12. Satish Chilloji January 8, 2018 at 5:07 am #

    Hi Jason,

    What if we have categorical variables with levels more than 500? How to deal with this, Is it OHE? Which will in return gives high number columns.

  13. Ion January 10, 2018 at 5:11 am #

    how to proceed when all input data are categories?

    • Jason Brownlee January 10, 2018 at 5:31 am #

      You can use a label encoder and perhaps even a one hot encoder on all of them.

  14. Adwait Chandorkar January 15, 2018 at 6:54 am #

    Hello Dr.Jason,

    I have a CSV dataset where some of the values are floating point values while the rest are labels. If i use one-hot encoding for the labels, they will be converted into binary vectors. However, the other values are floating.

    I want to train a classifier on this data. Is it correct to train a classifier using a dataset with combination of binary vectors and floating point values? Shouldn’t all the parameters in the dataset should be of the same data-type?

    • Jason Brownlee January 15, 2018 at 7:02 am #

      Yes, train on the combined original and encoded variables.

  15. Gunter January 26, 2018 at 5:32 pm #

    Hi Jason,

    your articles are an excellent source for newbies like me! Thank you indeed for putting so much energy into it, Sir!
    I found that pandas.Dataframe has a nice method to do one-hot encoding as well by using get_dummies which is very handy ( https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html ).

  16. Sacha_yacoubi February 11, 2018 at 3:53 am #

    Dear jason,
    Once we use one_hot_encoder to represent categorical attributes that take a large number of values ,we can end up having a very large matrix. Could you please explain the idea behind “Embeddings” in Neural networks to overcome this issue?

    Regards.

  17. vish February 19, 2018 at 11:13 am #

    Hi Jason,
    Great post as always. Since categorical(non-numeric) data in most cases has to be One Hot Encoded, just wondering why there is no direct method which takes categorical data and returns a one-hot-encoded data set directly instead of the user always having to call label Encoder to get the data integer encoded first?

    • Jason Brownlee February 19, 2018 at 3:06 pm #

      Sometimes there is benefit on working with the integer encoded values instead.

  18. maryam March 1, 2018 at 8:08 am #

    Hi Jason,
    I appreciate the clear tutorial u have written, especially for a novice like me.
    I want to gain a probability output instead of the label of class as an output.
    I know I should apply “”model.add(Dense(2, activation=’sigmoid’))”” instead of applying “”model.add(Dense(1, activation=’sigmoid’))”” and also i should use commands which in “One Hot Encode with Keras”.
    But I doubt whether I should put class labels in data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1] or not. I mean instead of data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1] should I write data =[0,1]. I have a binary classification problem for text dataset.
    would u mind please answering me by a tutorial link or even by writing command?
    Waiting for your replying.

    • Jason Brownlee March 1, 2018 at 3:08 pm #

      You can make probability predictions by calling predict_proba() on your model.

      For a binary outcome, you can use a single node in the output layer and the outcome will be a 0 or 1, not a one hot encoded class.

      • Maryam March 2, 2018 at 7:43 am #

        Hi dear Jason,
        I appreciate the time you spend replying me.
        I do not want to achieve 0,1 outcome. I want to achieve possible output such as 90% for class “good”, 10% for class “bad”.
        I do not want to gain neither the label as output nor f-measure or accuracy.
        I want to gain similarity measure = probability measure for lstm, cnn, rnn. I should write k fold cross-validation to build a model which can achieve similarity measure for each label.
        Please give me a tutorial to write k fold-cross validation to building a model to achieving possible output by a similarity measure.

        “””If I want to mention briefly, I want to read about how I can write k-fold cross validation commands and also i want to know what the function of similarity measure is in Keras?”””
        Thank u for helping a novice so clear.
        Best wishes

        • Jason Brownlee March 2, 2018 at 3:19 pm #

          This is a probability outcome. You can achieve this with a binary outcome bu calling predict() and using the value and 1 – value to get the probabilities for class 1 and class 0 respectively.

          Cross validation is only used to estimate the skill of the model. After you have an estimate of model skill, the CV folds and models are discarded and you can fit a final model to be used to make predictions. Learn more about this here:
          https://machinelearningmastery.com/train-final-machine-learning-model/

          • Maryam March 2, 2018 at 10:40 pm #

            Hi Jason,
            Thank you for spending your time for replying.
            Hoping to solve the question.
            Best wishes.
            Maryam

  19. Tim Levine March 2, 2018 at 4:35 am #

    Jason,

    Great post. You’re making me into a machine learning ninja! Do you think one-hot encoding is a good choice for ordinal data like cold,warm,hot? That is by encoding into separate columns don’t you lose the relationship that those three are on a continiuum. I thought the general approach to data preparation is to expose my knowledge of each variable to the machine learning algorithm. It seems that one-hot encoding obfuscates it in this case. Do you think leaving those three as integers might be a better choice? What would you recommend?

    thanks,

    Tim

    • Jason Brownlee March 2, 2018 at 5:37 am #

      It can make the relationship simpler for an algorithm, vectorized rather than compound.

      In general, test and demonstrate the change improves model skill.

  20. Jarad March 30, 2018 at 10:25 am #

    Hi Jason, great tutorial! Do you know how I would go about combining multiple one-hot vectors of different lengths? Let’s say that in addition to each letter’s one-hot coding I also have other categories such as gender and country.

    Also, what if I need to combine these with an integer such as age?

    Cheers! 🙂

    • Jason Brownlee March 31, 2018 at 6:31 am #

      If I follow, you could have other variables “next to” the one hot encoded inputs to form a very long input vector.

  21. Abraham foto April 7, 2018 at 11:50 pm #

    what great work and makes life easier approach on ML

    Hi Jason, recently, iam working with a data that has 921179 rows and about 32 columns. out of the 32 columns, the 22 are Object types and i was trying to encode the dataset using label encoder and oneHotEncoder.
    1, each columns has at least 20 unique values. there will be hounders of categories together.
    2, is it a right way to encode with labelEncoder and then with one hot Encoder to such a big data.
    or if you can suggest me an easier approach.

    tnx inadvance!

    • Jason Brownlee April 8, 2018 at 6:21 am #

      Perhaps try a suite of approaches and evaluate them based on their impact on model skill.

      – try integer encoding.
      – try one hot encoding
      – try removing them
      – try grouping labels and then encoding.
      – …

      Let me know how you go.

  22. Abraham foto April 8, 2018 at 8:57 pm #

    thanks for the quick response!

    – i imported the dataset using pandas : df = pd.read_csv(‘name.csv’)
    – then i selected only object categories : df = df.select_dtypes(include=[‘object’])
    -label encoding df1.apply(LabelEncoder().fit_transform) – worked fine till this point.
    – then i tried to create dummy variables , this is where it went wrong
    – df.apply(OneHotEncoder(categorical_features=’all’).fit_transform) – i got error message like this

    File “”, line 1, in
    X = one.fit_transform(X1)

    File “/Users/afoto/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py”, line 2019, in fit_transform
    self.categorical_features, copy=True)

    File “/Users/afoto/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py”, line 1809, in _transform_selected
    X = check_array(X, accept_sparse=’csc’, copy=copy, dtype=FLOAT_DTYPES)

    File “/Users/afoto/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py”, line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

    ValueError: could not convert string to float: no

    • Jason Brownlee April 9, 2018 at 6:10 am #

      Perhaps confirm that you were applying the one hot encoding on the integer encoded var and not the original string variable?

  23. Abraham foto April 10, 2018 at 7:47 pm #

    thanks for a response.

    lately, i sow one tutorial on handling many categorical attributes. the approach is simple as follows;

    1, iterate through all attributes with object data type object and then identify the unique categories
    2, if the number of categories is more than 15 or 20, for instance a feature called nationality and we found out that the majority is 4000 = Americans, 200 europeans , 10 Africans and 50 Indians, 2 Chinese and etc.
    3, we create a dummy variables for the Americans and Europeans and other( where the other is the remaining counties)

    this approach i found it to be intuitive and easier to implement.
    looking forward for your invaluable comments and feedbacks.

    Thanks and Cheers!

  24. Ankit Tripathi April 17, 2018 at 12:00 am #

    Hey Jason,

    It was an excellent article with exhaustive coverage. However, I have one doubt. I used the following code to use one hot encode for some categorical variables, but, the model fit throws error after successfully using one hot encoding. There is no error if I use ordinal encoding. Here is the code:

    #ONE HOT ENCODING

    from sklearn.preprocessing import OneHotEncoder
    def one_hot_encode_features(df_train, df_test):
        features = ['Fare', 'Cabin', 'Age', 'Sex']
        #features = ['Cabin', 'Sex']
        df_combined = pd.concat([df_train[features], df_test[features]])
        for feature in features:
            le = preprocessing.LabelEncoder()
            onehot_encoder = OneHotEncoder()
            le = le.fit(df_combined[feature])
            integer_encoding_train = le.transform(df_train[feature])
            integer_encoding_test = le.transform(df_test[feature])
            integer_encoding_train = integer_encoding_train.reshape(len(integer_encoding_train), 1)
            integer_encoding_test = integer_encoding_test.reshape(len(integer_encoding_test), 1)
            df_train[feature] = onehot_encoder.fit_transform(integer_encoding_train)
            df_test[feature] = onehot_encoder.fit_transform(integer_encoding_test)
        return df_train, df_test
    data_train, data_test = one_hot_encode_features(data_train, data_test)

    #FITTING THE MODEL:
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import make_scorer, accuracy_score
    from sklearn.model_selection import GridSearchCV

    clf = GaussianNB()
    acc_scorer = make_scorer(accuracy_score)
    clf.fit(X_train, Y_train)

    • Jason Brownlee April 17, 2018 at 6:01 am #

      Sorry, I cannot debug your code.

      Perhaps try posting your code to the developers on stackoverflow?

      • Ankit Tripathi April 17, 2018 at 10:05 pm #

        Hey Jason,

        Here is a question not related to code debugging, I received a one hot encoded sparse matrix after following the steps in this article. However, I am finding trouble to add it to my training dataframe.

        f_train[feature] = onehot_encoder.fit_transform(integer_encoding_train) fills all the n rows with the same values. How can this be achieved correctly?

  25. Ankit Tripathi April 17, 2018 at 5:22 pm #

    Okay, thanks Jason

  26. Adarsh May 4, 2018 at 7:16 pm #

    is this method is advice able for large number of categorical variable say i have 1500 categories for one variable is it advisable to use one-hot encoding

  27. diehumblex May 11, 2018 at 11:44 pm #

    Hi Jason, first I integer encoded the classes then converted that to one hot encoding. during prediction, I used reverse and got the class. Now I also want the confidence of the class. I am using keras and on using predict_prob I get ‘Model’ object has no attribute ‘predict_proba’ as I am not using sequential. any suggestions how to get the confidence of class?

    • Jason Brownlee May 12, 2018 at 6:33 am #

      If you are using softmax or sigmoid as the activation functions on the output layer, you can use the values directly as probability-like values.

  28. Abraham foto May 14, 2018 at 9:11 pm #

    when we use one hot encoding with sklearn, how do we check if the code is free of dummy trap.
    what i meant is that if i have may features with numeric categorical values like feature x1: 3,2,1,5,3,4,2
    and feature x2: 1,2,3,2,1,3 and so on
    if i use one hot encoding all the categories in one go
    oneHot =OneHotEncode(category_feature=[the number of to be encoded] -> example feature 1,2,4
    the feature x1: have 4 categories and after one hot do we get 4 new features or 3 features. in get dummy we get 3, thus there is no dummy trap, hoe about in one hot encoder

    • Jason Brownlee May 15, 2018 at 7:55 am #

      You get n elements in the binary vector where n is the number of unique categories.

  29. shailendra May 24, 2018 at 3:22 pm #

    how can we use one hot encoding for next word prediction is there is post for doing this?
    thank you

  30. Emna May 24, 2018 at 9:47 pm #

    I have a small issue concerning using the onehot encoding. I have a list of tokens and this list contains string, numbers and even special characters like this ”. I have done one hot encoding to this list, fed it into autoencoder model. Then, I fed to the model an unseen one hot encoded list. Then, the output from the autoencoder model is fed to inverse one hot encoding function. The generated list is not even close to the unseen list although the accuracy of my model is 0.9. Do you have any idea what is the problem ? I have used the one hot encoding functions from this tutorial.

    • Jason Brownlee May 25, 2018 at 9:24 am #

      Sounds like a bug in your implementation. I have some ideas here:

      – Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
      – Consider cutting the problem back to just one or a few simple examples.
      – Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
      – Consider posting your question and code to StackOverflow.
