Why One-Hot Encode Data in Machine Learning?

Getting started in applied machine learning can be difficult, especially when working with real-world data.

Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model.

One good example is to use a one-hot encoding on categorical data.

  • Why is a one-hot encoding required?
  • Why can’t you fit a model on your data directly?

In this post, you will discover the answer to these important questions and better understand data preparation in general in applied machine learning.

Let’s get started.

Why One-Hot Encode Data in Machine Learning?
Photo by Karan Jain, some rights reserved.

What is Categorical Data?

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

  • A “pet” variable with the values: “dog” and “cat“.
  • A “color” variable with the values: “red“, “green” and “blue“.
  • A “place” variable with the values: “first”, “second” and “third”.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable.

What is the Problem with Categorical Data?

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

How to Convert Categorical Data to Numerical Data?

This involves two steps:

  1. Integer Encoding
  2. One-Hot Encoding

1. Integer Encoding

As a first step, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called a label encoding or an integer encoding and is easily reversible.
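For example, scikit-learn's LabelEncoder performs such an integer encoding. Note that it assigns integers alphabetically, so the exact codes may differ from the 1/2/3 example above:

```python
# Integer (label) encoding of a categorical variable with scikit-learn.
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)
# LabelEncoder assigns codes alphabetically: blue=0, green=1, red=2
# so this prints [2 1 0 1 2]

# The encoding is easily reversible:
print(encoder.inverse_transform(encoded))
# ['red' 'green' 'blue' 'green' 'red']
```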

For some variables, this may be enough.

The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

Ordinal variables like the “place” variable above are a good example where a label encoding alone may be sufficient.

2. One-Hot Encoding

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the “color” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

For example:

red, green, blue
1, 0, 0
0, 1, 0
0, 0, 1

The binary variables are often called “dummy variables” in other fields, such as statistics.
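A quick sketch of this encoding with scikit-learn's OneHotEncoder (which orders its columns alphabetically, here blue, green, red):

```python
# One-hot encoding of the "color" variable with scikit-learn.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array(["red", "green", "blue"]).reshape(-1, 1)

encoder = OneHotEncoder()
onehot = encoder.fit_transform(colors).toarray()  # output is sparse by default
print(encoder.categories_)  # columns ordered alphabetically: blue, green, red
print(onehot)
# [[0. 0. 1.]    red
#  [0. 1. 0.]    green
#  [1. 0. 0.]]   blue
```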

Summary

In this post, you discovered why categorical data often must be encoded when working with machine learning algorithms.

Specifically, you learned:

  • That categorical data is defined as variables with a finite set of label values.
  • That most machine learning algorithms require numerical input and output variables.
  • That an integer encoding and a one-hot encoding can be used to convert categorical data to numerical data.

Do you have any questions?
Post your questions to comments below and I will do my best to answer.

106 Responses to Why One-Hot Encode Data in Machine Learning?

  1. Varun July 28, 2017 at 6:27 am #

    You didn’t mention that if we have a categorical variable with 3 categories, we only need to define 2 one-hot variables to save us from linear dependency.

  2. Navdeep July 28, 2017 at 6:49 am #

    Hi Jason. I truly follow you a lot and really appreciate your effort and the ease of your tutorials. Just a question: how would one-hot encoding work for multilabel classes? And in coming tutorials, could you help with feature selection of text data for multiclass and multilabel classification using Keras? I tried multiclass for 90 data points and used Keras for MLP, CNN and RNN, where each data point is a long paragraph with labels, but the accuracy I got is 37.5 percent. Let me know if you have any suggestions.

  3. Espoir July 28, 2017 at 7:18 am #

    What are the cons of one hot encoding? Suppose that you have some categorical features, each with 500 or more different values. When you do one hot encoding you will have many columns in the dataset; is that still good for a machine learning algorithm?

    • Jason Brownlee July 28, 2017 at 8:40 am #

      Great question!

      The vectors can get very large, e.g. the length of all words in your vocab in an NLP problem.

      Large vectors make the method slow (increased computational complexity).

      In these cases, a dense representation could be used, e.g. word embeddings in NLP.

      • faadal July 30, 2017 at 1:11 am #

        Hi Jason, thanks again for your amazing pedagogy.

        Back to Espoir’s question: I face this problem with 84 user_IDs. I do an OHE of them and, like you said, when I fit the data with an SVM classifier it looks like I fall into an infinite loop. So, taking into account the fact that I am not in the NLP case, how can I fix this?


    • Vitor July 31, 2017 at 3:11 am #

      Very helpful post, Jason!
      Espoir raised my question here but I did not understand how to apply your answer to my case. I have 11+ thousand different product ids. The database has about 130 thousand entries. This easily leads to a MemoryError when using OHE. What approach/solution should I look for?

      • Jason Brownlee July 31, 2017 at 8:18 am #


        Maybe you can use efficient sparse vector representations to cut down on memory?

        Maybe try exploring dense vector methods that are used in NLP. Maybe you can use something like a word embedding and let the model (e.g. a neural net) learn the relationship between different input labels, if any.
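On the sparse suggestion: scikit-learn's OneHotEncoder already returns a scipy sparse matrix by default, which stores only the non-zero entries, so even a high-cardinality variable need not exhaust memory. A sketch with hypothetical product ids:

```python
# One-hot encoding a high-cardinality variable as a sparse matrix.
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
product_ids = rng.integers(0, 11_000, size=130_000).reshape(-1, 1)

encoded = OneHotEncoder().fit_transform(product_ids)  # scipy sparse output
print(sparse.issparse(encoded))  # True
print(encoded.shape)             # (130000, number of unique ids)
print(encoded.nnz)               # only one stored non-zero per row: 130000
```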

  4. Sasikanth July 28, 2017 at 11:54 am #

    Hello Jason how do we retrieve the features back after OHE if we need to present it visually?

    • Jason Brownlee July 29, 2017 at 8:01 am #

      You can reverse the encoding with an argmax() (e.g. numpy.argmax())
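A sketch of that reversal with NumPy, assuming the column order used during encoding is known:

```python
# Reverse a one-hot encoding with numpy.argmax.
import numpy as np

categories = np.array(["blue", "green", "red"])  # column order of the encoding
onehot = np.array([[0, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0]])

# argmax finds the index of the single "1" in each row;
# indexing the category array maps it back to the label.
labels = categories[np.argmax(onehot, axis=1)]
print(labels)  # ['red' 'green' 'blue']
```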

  5. gezmi July 28, 2017 at 5:20 pm #

    Thank you for these wonderful posts!
    Does data have to be one-hot encoded for classification trees and random forests as well, or can they handle data without it? Or should I just try which gives better results?

  6. Ravindra July 28, 2017 at 5:31 pm #

    Hi Jason, this post is very helpful, thank you!!
    Question- In general what happens to model performance, when we apply One Hot Encoding to a ordinal feature? Would you suggest only to use integer encoding in case of ordinal features?

    • Jason Brownlee July 29, 2017 at 8:07 am #

      It really depends on the problem and the meaning of the feature being encoded.

      If in doubt, test.

      • Ravindra July 29, 2017 at 4:16 pm #

        I see, thanks!

  7. Rajkumar Kaliyaperumal July 28, 2017 at 6:47 pm #

    hey Jason,
    As usual this is another useful post on feature representation of categorical variables. Since logistic regression fits a separation line on the data points of the form w1X1 + w2X2 +.. where X are features such as categorical variables- Places,color etc, and w are weights, intuitively X can take only numerical values for the line to fit. Is this a right intuition?

    • Jason Brownlee July 29, 2017 at 8:11 am #

      Yes, regression algorithms like logistic regression require numeric input variables.

      • Raj July 31, 2017 at 2:16 pm #

        Thanks a lot for your clarifying. I love your blogs and daily email digests. They help me to understand key concepts & practical tips easily.

  8. PabloRQ July 28, 2017 at 7:15 pm #


  9. ritika July 29, 2017 at 8:19 pm #

    very well explained..thanks

  10. Jie July 31, 2017 at 11:41 am #

    I love your blog!
    One question: if we use tree based methods like decision tree, etc. Do we still need one-hot encoding?
    Thanks you very much!

    • Jason Brownlee July 31, 2017 at 3:49 pm #

      No Jie. Most decision trees can work with categorical inputs directly.

  11. Andrew Jabbitt August 3, 2017 at 12:27 am #

    Hi Jason, loving the blog … a lot!

    I’m using your binary classification tutorial as a template (thanks!) for a retail sales data predictor. I’m basically trying to predict future hourly sales using product features and hourly weather forecasts, trained on historical sales and using above/below annual average sales as my binary labels.

    I have encoded my categorical data and I get good accuracy when training my data (87%+), but this falls down (to 26%) when I try to predict using an unseen, and much smaller data set.

    As far as I can see my problem is caused by encoding the categorical data – the same categories in my unseen set have different codes than in my model. Could this be the cause of my poor prediction performance: the encoded prediction categories are not aligned to those used to train and test the model? If so how do you overcome these challenges in practice?

    Hope it makes sense.

  12. Maurice BigMo Flynn August 9, 2017 at 3:05 pm #

    Appreciable and very helpful post, thank you!!!

    Question: What is the best way to one hot encode an array of categorical variables?

    I have also startup with a AI post you can also find some knowledge over there: Thebigmoapproach.com/

  13. tom August 28, 2017 at 4:33 pm #

    hi Jason:

    One question: take the “color” variable as an example. If the color is ‘red’, then after one-hot encoding it becomes 1,0,0. So can we think that it generates three features from one feature?
    Two columns have been added, is that right?

  14. Zhida Li September 4, 2017 at 8:32 pm #

    Hi Jason, if my input data is [1 red 3 4 5] and I use the one hot encoder, red becomes [1,0,0]. Does it mean that the whole feature set of the input data is extended?
    The input data now is [1 1 0 0 3 4 5].

    • Jason Brownlee September 7, 2017 at 12:35 pm #

      Sorry, I don’t follow. Perhaps you can restate your question?

      • Zhida Li December 1, 2017 at 8:10 pm #

        Hi Jason, Thank you for the reply.
        For example, if I have 4 feature of my input, [121 4 red 10; 100 3 green 7; 110 8 blue 6]
        For the first row, the value related to each feature–feature 1:121, feature 2:4, , feature: red, feature 4: 10.

        I want to use one hot encoder now, red = [1,0,0], green = [0,1,0], blue = [0,0,1].
        So my input becomes [121 4 1,0,0 10; 100 3 0,1,0 7; 110 8 0,0,1 6]. After one hot encoding, we now have 6 features, so I use the new data for training, is that right?

  15. Peter Ken Bediako October 13, 2017 at 6:28 pm #

    Hello DR. Brownlee,

    I am training a model to detect attacks and i need someone like you to help me detect the mistakes in my code because my training is not producing any better results. Kindly alert me if you will be interested to help me.
    Thank you

    • Jason Brownlee October 14, 2017 at 5:41 am #

      Sorry, I do not have the capacity to review your code.

  16. Peter Ken Bediako October 13, 2017 at 8:20 pm #

    I am using Tensorflow developing the mode,and would want to know how your book can help me do that. since it makes reference to Keras. Thank you

    • Jason Brownlee October 14, 2017 at 5:44 am #

      My deep learning book shows how to bring deep learning to your projects using the Keras library. It does not cover tensorflow.

      Keras is a library that runs on top of tensorflow and is much easier to use.

  17. Yahia Elgamal October 20, 2017 at 12:26 am #

    This is not entirely correct as far as I understand. As Varun mentioned, you need to have one less column (n-1 columns). What has been described is dummy-encoding (which is not one-hot-encoding). There is a major problem with dummy encoding which is perfect collinearity with the intercept value. As the sum of all the dummy values of one category (n columns) is ALWAYS equal to 1. So it’s basically an intercept

    • Jason Brownlee October 20, 2017 at 5:38 am #

      Other way around I believe. Dummy encoding is n-1 columns, one hot has n columns.
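The difference is easy to see with pandas, using a hypothetical state variable (drop_first=True gives the n-1 column coding):

```python
# n-column one-hot versus (n-1)-column dummy coding with pandas.
import pandas as pd

states = pd.Series(["NJ", "NY", "PA", "DE"])

print(pd.get_dummies(states))                   # 4 columns: DE, NJ, NY, PA
print(pd.get_dummies(states, drop_first=True))  # 3 columns: DE dropped, so an
                                                # all-zero row now means DE
```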

  18. Sakthi October 20, 2017 at 11:13 pm #

    Hi Jason,I have 6 categorical values which are present in the data that I have. The data that I have has many missing categorical values that are left as empty strings. What to do if I have missing categorical values? Do I need to OHE them also? or how to deal with the categorial feature with missing values?
    I’m using sci-kit learn and trying out many algorithms for my dataset.

  19. Thomas October 26, 2017 at 7:23 am #

    Hi Jason,
    First thank you for your post !

    There is something i did not understand about your explanation : let’s take the color example (so red is 1, green is 2, blue is 3).

    I did not understand the “ordinal relationship between catégories” : does the One-Hot-Encode allow better accuracy for some learning algorithms than these categories? (So far here’s what I thought: the algorithm reads 1,2 or 3 instead of red, green or blue, and makes the necessary correlations for predictions, and that has no impact on the predictions accuracy.)

    • Jason Brownlee October 26, 2017 at 4:14 pm #

      Hmm. Sorry for not being clearer.

      Ordinal means ordered. Some categories are naturally ordered and in these cases some algorithms may give better results by using just an integer encoding.

      For problems where the categories are not ordered, the integer encoding may result in worse performance than one hot encoding because the algorithm may assume a false ordering based on the assigned numbers.

      Does that help?

      • Gana February 26, 2018 at 10:29 pm #

        A bit confused. In the case where we order the integer labels correctly, do we need one-hot encoding? Actually it seems a bit odd that the labeling impacts accuracy. I thought one-hot labeling was for simplicity, but you say it matters in the case of integer labels which are not well ordered.

        Is there any reason why we use one-hot encoding in the case where we order the integer labels correctly?

        I accept what you explained about why we need to use integer encoding instead of character labeling.

        Thank you

        • Jason Brownlee February 27, 2018 at 6:28 am #

          I was saying that if your variable values are not ordinal and you treat them as ordinal when fitting the model (e.g. not use one hot encoding), you may lose skill.

          Does that help?

  20. Vlad November 12, 2017 at 9:34 pm #

    I have data from 20,000 stores. Each store has its integer ID. This ID is meaningless, just an ID. Should I add 20,000 binary variables to the dataset? And 20,000 neurons in the input layer of an LSTM? It sounds frightening…

    • Jason Brownlee November 13, 2017 at 10:15 am #

      No, drop the id unless you have a hunch that it is predictive (e.g. numbering maps to geographical region and regions have similar outcomes).

      • Vlad November 14, 2017 at 4:03 am #

        Ok, I have latitude and longitude of each store. Should I use them instead of ID? Similar question. I have 17 states of weather (cloudy, rainy, etc.). Should I replace them with 17 binary variables? Or should I try to give integer code to them to show similarity of heavy rain to rain and light rain, sunny to partial clouds and heavy clouds?

        • Jason Brownlee November 14, 2017 at 10:20 am #

          There are no rules, I would encourage you to try many different framings and see what works best for your specific data.

          I have some biases that I could suggest, but it would be better (your results would be better) if you use experiments to discover what works for your problem.

          • Vlad November 15, 2017 at 6:53 am #

            Yes, it’s right, thanks

  21. Ali November 15, 2017 at 7:49 am #

    Great post Jason! I’m glad I came across it. It really helped me to understand the need for one hot encoding. I’m new to machine learning and I am currently running xgboost in R for a classification problem.
    I have 2 questions:
    (1) If my target variable (the variable I want to predict) is categorical, should I also convert it into numeric form using hot encoding or will a simple label encoding suffice?
    (2) Are there specific R packages for one hot encoding of features?

    • Jason Brownlee November 15, 2017 at 9:59 am #

      It really depends on the method. It can help.

      Sorry, I don’t recall if you must encode variables for xgboost in R, it has been a long time.

      • DH June 6, 2018 at 10:36 am #

        Most R ML methods handle factors without the need for explicit one-hot coding. But xgboost is an exception, so you need to use a function like sparse.model.matrix() to encode your data set before passing it to xgboost. (This function actually encodes factors as “indicator variables” rather than one-hot encoding, but the general idea is the same.)

  22. Anu November 16, 2017 at 1:22 am #

    Hello Jason

    I have dataset having numeric and nominal type. It also has missing values. For nominal datatype, first I applied Labelencoder() to convert them into numeric values, but along with my two categories(normal, abnormal), it also assigns a code to NaN. In such scenario how can I impute values by its Mean?

  23. DEB November 21, 2017 at 12:52 am #

    Hi Jason,

    Since the number of columns created for a categorical column after applying OneHotEncoding is equal to the number of unique values in that categorical column; often it happens that the number of features in the tested model is not equal to the number of features on the dataset to be predicted after applying OHE similarly on the categorical fields. In such cases model throws an error while predicting since it expects equal number of features both in the training and to be predicted dataset. Can you please advise how to handle such situation ?

    • Jason Brownlee November 22, 2017 at 10:42 am #

      The same transform object used for training is then used for test or any other data. It can be saved to disk if need be.

  24. DEB November 22, 2017 at 4:28 pm #

    Hi Jason,

    I couldn’t get what you mean by “same transform object”. The training dataset structure (number of initial features) is the same for both the training and testing/to-be-predicted datasets. But the unique values under each feature/column may differ, which is quite natural. Therefore OneHotEncoding or pandas get_dummies creates a different number of encoded features in the test/to-be-predicted dataset than in the training dataset. How to deal with this issue – that is my question.

    Need your advice please.


    • Jason Brownlee November 23, 2017 at 10:27 am #

      Sorry. To be clearer, you can train the transform objects on the training data and use them on the test set and other new data.

      The transform objects may be the label encoder and the one hot encoder.

      The training data should be such that it covers all possible labels for a given feature.

      Does that help?

  25. Emily December 2, 2017 at 5:16 pm #

    Hi Jason,

    Should I do one-hot encoding for two-level categorical variables? For example, a variable that only contains (yes, no) converted to two variables (0,1) and (1,0)?


  26. Edward Bujak December 6, 2017 at 8:58 am #

    For One-Hot Encoding (OHE) of a categorical variable State with 4 values: NJ, NY, PA, DE

    We can remove one of them, say DE, to reduce complexity.
    So if NJ=0, and NY=0, and PA=0, then it is DE

    Is removing one recommended?

    This becomes more obvious in the case of a binary categorical variable.


    • Jason Brownlee December 6, 2017 at 9:09 am #

      If you can simplify the data, then I would recommend doing that.

      Always test the change on model skill though.

  27. Mike Dilger December 13, 2017 at 8:16 am #

    One Hot Encoding via pd.get_dummies() works when training a data set however this same approach does NOT work when predicting on a single data row using a saved trained model.

    For example, if you have ‘Sex’ in your train set then pd.get_dummies() will create two columns, one for ‘Male’ and one for ‘Female’. Once you save a model (say via pickle) and you want to predict based on a single row, you can only have either ‘Male’ or ‘Female’ in the row, and therefore pd.get_dummies() will only create one column. When this occurs, the number of columns no longer matches the number of columns you trained your model on, and it errors out.

    Do you know a solution to this issue? My actual need uses Zip Code rather than Sex which is more complex.

    • Jason Brownlee December 13, 2017 at 4:12 pm #

      I recommend using LabelEncoder and OneHotEncoders from sklearn on a reasonable sample of your data (all cases covered) and then pickle the encoders for later use.
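A sketch of that workflow with hypothetical values, fitting the encoder once on the training data and pickling it so exactly the same mapping is applied at prediction time:

```python
# Fit an encoder on training data, pickle it, and reuse it on new rows.
import pickle
from sklearn.preprocessing import LabelEncoder

train_values = ["Male", "Female", "Female", "Male"]
encoder = LabelEncoder().fit(train_values)

# Persist the fitted encoder alongside the model.
with open("sex_encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)

# Later, in the prediction code, load and apply the very same encoder:
with open("sex_encoder.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.transform(["Female"]))  # [0] -- the same code as in training
```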

      • Mike Dilger December 14, 2017 at 2:07 am #

        Thank you!!!

  28. FriendofFriend January 11, 2018 at 2:40 pm #

    Hi Jason,

    I am piggybacking on some of the other questions re: n-1 encoding and n encoding. I have a dataset where I predict price based on day of week using sklearn LinearRegression (also playing with Ridge). I used DictVectorizer in sklearn to prep my data and I end up with 7 columns for day of week, rather than 6. In some of the questions above, you indicate simpler is better…though you do say to “test the change on model skill.” Could you elaborate on that – for example, what are the practical implications of using one or the other for a dataset like mine (features = days of week; target = price)? My model seems to spit out a reasonable y-intercept, though I’m not sure exactly what the y-intercept is because my model has no [0, 0, 0, 0, 0, 0, 0] for day (i.e., no “reference” day).

    Is there a mathematical reason to use n-1 vs n encoding? I hope this makes sense. I’ve Googled like 50 times and can’t find an article that really gets into this. Thank you.

    • Jason Brownlee January 12, 2018 at 5:50 am #

      If your goal is the best model skill, then use whatever works to improve that skill.

      No need for idealized justifications.

  29. gezmi January 12, 2018 at 7:59 pm #

    Very helpful, thanks.

    May I ask: if I have 4 possible letters in a string that I would like to encode (let’s say A, B, C, D), what is better for neural networks, one-hot or integer encoding? The groups have no order, so I would say one-hot, but I do not know whether a neural network could deal with integer encoding in this case (it would mean a quarter of the features of one-hot encoding).

    Thank you!

    • Jason Brownlee January 13, 2018 at 5:32 am #

      One hot if there is no ordinal relationship between the labels.

  30. Dhrumil January 23, 2018 at 4:09 pm #

    I am a total noob at this, so maybe a silly question, but this can only be applied when the categories are few in number and the problem is about classification, right?

    • Jason Brownlee January 24, 2018 at 9:51 am #

      One hot encoding can be used on input features for any type of problem and on the output feature for classification problems.

  31. sujal padhiyar February 28, 2018 at 11:28 am #

    Hello Jason, I have a question: if there are categorical variables like 1st class, 2nd class, 3rd class for housing price prediction, and I am converting with OneHotEncoder, how will the algorithm judge the ranking part of the housing? Yes, it does convert it into binary, but does it also take care of the ranking of that categorical variable? And one more question: to get binary output, is “pd.get_dummies” useful or is OneHotEncoder?

    • Jason Brownlee March 1, 2018 at 6:05 am #

      A one hot encoding is for a classification problem, not regression.

      The house price is a regression problem where we predict a quantity.

  32. Sweta Rani March 1, 2018 at 6:29 am #

    Hi Jason. I have data with string values having more than 500 unique values. How can I encode it so that I can pass it to an ML algorithm? Is this a good candidate for categorical encoding?

  33. Ammar Hasan March 31, 2018 at 11:36 am #

    Hi Jason,

    This is a great post, thanks for providing such valuable info. My question is:

    If we have many colors in the color column, say 25 colors, what if we encode the colors in 3 columns with RGB values instead of 25 binary columns? Do you see any abnormality with this approach?

    • Jason Brownlee April 1, 2018 at 5:43 am #

      No problem, it would be a binary vector with 25 elements.

  34. abraham April 15, 2018 at 8:03 pm #

    Hi Jason.
    I am an intern in data science, no exprience in datas. Thanks for your posts and e-mails that boosted my confidence to start my intern on ML and Deep learning.

    Currently, I have a dataset of 10 GB, and after a preliminary investigation of the data I found the following.

    feature ‘x1’ has 78 unique categories
    feature ‘x2’ has 24 unique categories
    feature ‘x3’ has 24 unique categories
    feature ‘x4’ has 35 unique categories
    feature ‘x5’ has 40 unique categories
    feature ‘x6’ has 106 unique categories
    feature ‘x7’ has 285629 unique categories
    feature ‘x8’ has 523912 unique categories
    feature ‘x8’ has 27 unique categories
    feature .x9’ has 224 unique categories
    feature ‘x10’ has 108 unique categories
    feature ‘x11’ has 98 unique categories
    feature ‘x12’ has 10 unique categories

    feature ‘x13’ has 1508604 unique categories
    feature ‘x14’ has 15 unique categories
    feature ‘x15’ has 1323136 unique categories
    feature ‘x16’ has 3446828 unique categories
    feature ‘x17’ has 10 unique categories
    feature ‘x18’ has 200 unique categories
    feature ‘x19’ has 2575092 unique categories
    feature ‘x20’ has 197957 unique categories

    How would you deal with this dataset? It has categorical and int attributes. It is a classification problem: just predict the outcome, let’s say 0 or 1. How would you handle the categories, or how would you encode these attributes?
    Should I simply use a label encoder, one hot encoder or dummies? Is it possible to encode such big categories at all?
    I am confused about where to start.

    Looking forwards for your suggestions and help

  35. Yi Deng April 17, 2018 at 6:06 am #

    The article boils down to one sentence:

    “using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories)”

    And that’s enough said. Thanks.

    • Jason Brownlee April 17, 2018 at 6:13 am #

      Not quite.

      That applies to the integer encoding, not the one hot encoding.

      In fact, that is the problem that the one hot encoding will overcome.

  36. Pranav Pandya May 20, 2018 at 7:15 am #

    I disagree with one hot encoding approach. I mean, it depends on the algorithm. My opinion is based on playing around with categorical data and various algorithms on many Kaggle competitions with real world data.

    For example, LightGBM can offer good accuracy when using its native categorical features. Unlike simple one-hot coding, LightGBM can find the optimal split over categorical features, which can provide much better accuracy than a one-hot coding solution. (official documentation: http://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html)

    PS. Compared to other GBMs (native gbm, h2o gbm or even xgboost), lightgbm is far ahead in terms of speed and accuracy.

    • Jason Brownlee May 21, 2018 at 6:21 am #

      Thanks for the note Pranav.

    • DH June 6, 2018 at 10:44 am #

      Pranav, can you provide a link to any benchmark results which show superior speed and accuracy of lightgbm relative to xgboost?

  37. Chen Chien May 22, 2018 at 10:14 pm #

    Excuse me!

    I am confused by dtype of dummy variable versus normal numeric variable.

    Take Python for example,
    If I use the get_dummies function in pandas to convert a category variable to dummy variables, it will return binary values, and the dtype of those dummy variables is integer. How does Python determine that those “int” variables are dummy variables? Why would Python not confuse those ‘int’s with other normal numeric variables?
    Would Python treat those dummy variables (dtype = int) as numeric variables?
    (The int type is a numeric type and computable, isn’t it?)

    This question may be a bit stupid, but it really confusing me for a while…

    Thanks for your help!

    • Jason Brownlee May 23, 2018 at 6:26 am #

      This is the plan, to have the algorithm treat them like other numeric variables.

      Does that help?

      • Chen Chien May 23, 2018 at 3:13 pm #

        Although the algorithm treats them like other numeric variables, the model can still work as if there were category variables and numeric variables, right?

        In R, I can make a variable a “factor” with “as.factor” and give it to the model directly; it’s very intuitive. So when I use Python to do the same thing I get confused, although I originally knew it should be preprocessed by get_dummies…

        I think that’s the concept of a dummy variable… but I’m lost on how dummy variables work in the programming language…

        Thanks for your help!

        • Jason Brownlee May 24, 2018 at 8:07 am #

          With sklearn and keras, you must integer encode or one hot encode.

          This may not be the case with other libraries in Python.

          • Chen Chien May 24, 2018 at 1:11 pm #

            Thanks a lot!

  38. Winda Serikandi June 2, 2018 at 1:21 am #

    Hi @Jason Brownlee. How can we have consistent dummy variables if we want to predict on a new dataset?

    I have a case: I generated a model using variables converted into dummy variables. There are 13 dummy variables after one-hot encoding. I built the model with neural networks.
    Now I am going to predict some rows of a new dataset. Of course the data must be one-hot encoded, but the result is 9 dummy variables after encoding.

    I am still confused about how to understand this problem. It seems there is a mismatch between those dummy variables. Is there any solution to this problem?

    • Jason Brownlee June 2, 2018 at 6:36 am #

      You must use the same encoding process on new data that was used for the training data.

      You might need to save the objects involved.

  39. Awais Ahmed June 4, 2018 at 2:13 am #

    Hello, @Jason Brownlee,

    Can you guide me about Info column w.r.t protocol column in .pcap (network capture file).

    How should I deal with Info column, at the end I have to apply classification.


  40. Rishi June 22, 2018 at 6:50 pm #

    Can you please list which kinds of ML algorithms do not handle categorical variables properly and need one-hot or dummy coding, unlike decision trees.

    • Jason Brownlee June 23, 2018 at 6:14 am #

      It depends on the implementation of the algorithm (e.g. the library).

      For example, in Python, pretty much all the sklearn implementations require input categorical variables to be encoded.

  41. will July 21, 2018 at 1:04 am #

    jason – thanks for all your help and previous responses to everyone’s questions, it’s genuinely appreciated. but here’s another…

    I’m looking to predict monthly household KwH usage using 100+ variables related to socio-economics, demographics, housing features, quantity/quality of appliances, etc. Since many of the variables have naturally occurring NAs as a result of previously contingent “No” or 0 answers, I’m considering one-hot encoding the categoricals to “get rid” of the NAs. I’m also going to bin the ordinal outcome variable to try out classification algorithms as well. Thoughts?

    thanks again for your help!

Leave a Reply