Getting started in applied machine learning can be difficult, especially when working with real-world data.

Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model.

One good example is to use a one-hot encoding on categorical data.

- Why is a one-hot encoding required?
- Why can’t you fit a model on your data directly?

In this post, you will discover the answer to these important questions and better understand data preparation in general in applied machine learning.

Let’s get started.

## What is Categorical Data?

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

- A “*pet*” variable with the values: “*dog*” and “*cat*”.
- A “*color*” variable with the values: “*red*”, “*green*” and “*blue*”.
- A “*place*” variable with the values: “*first*”, “*second*” and “*third*”.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “*place*” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable.

## What is the Problem with Categorical Data?

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

## How to Convert Categorical Data to Numerical Data?

This involves two steps:

- Integer Encoding
- One-Hot Encoding

### 1. Integer Encoding

As a first step, each unique category value is assigned an integer value.

For example, “*red*” is 1, “*green*” is 2, and “*blue*” is 3.

This is called a label encoding or an integer encoding and is easily reversible.

For some variables, this may be enough.

The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

Ordinal variables like the “*place*” example above are a good case where a label encoding would be sufficient.
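As a minimal sketch of the integer encoding step, the mapping from the “*color*” example can be written as a plain Python dictionary. The sample data and the variable names are assumptions made for illustration. Note that library encoders may assign different integers (for example, starting from 0).

```python
# Integer (label) encoding with a plain dictionary.
colors = ["red", "green", "blue", "green", "red"]  # hypothetical sample data

# Mapping taken from the example above: red=1, green=2, blue=3.
color_to_int = {"red": 1, "green": 2, "blue": 3}

# The encoding is easily reversible by inverting the mapping.
int_to_color = {code: name for name, code in color_to_int.items()}

int_encoded = [color_to_int[c] for c in colors]
int_decoded = [int_to_color[i] for i in int_encoded]

print(int_encoded)
print(int_decoded)
```

Decoding with the inverted dictionary recovers the original labels, which is what makes this encoding reversible.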

### 2. One-Hot Encoding

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the “*color*” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

For example:

    red, green, blue
    1, 0, 0
    0, 1, 0
    0, 0, 1

The binary variables are often called “dummy variables” in other fields, such as statistics.

## Further Reading

- Categorical variable on Wikipedia
- Nominal category on Wikipedia
- Dummy variable on Wikipedia

## Summary

In this post, you discovered why categorical data often must be encoded when working with machine learning algorithms.

Specifically:

- That categorical data is defined as variables with a finite set of label values.
- That most machine learning algorithms require numerical input and output variables.
- That integer encoding and one-hot encoding are used to convert categorical data to numerical data.

Do you have any questions?

Post your questions to comments below and I will do my best to answer.

You didn’t mention that if we have a categorical variable with 3 categories, we only need to define 2 one-hot variables to save us from linear dependency.
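The scheme described in this comment can be sketched as follows: one category is treated as a reference level and encoded as all zeros, so 3 categories need only 2 binary variables. The choice of “*red*” as the reference level and the helper name `dummy_encode` are assumptions for illustration.

```python
# Dummy encoding with one category dropped (k categories -> k - 1 columns).
all_categories = ["red", "green", "blue"]

# Assumption: the first category ("red") is the reference level.
kept_columns = all_categories[1:]  # ["green", "blue"]

def dummy_encode(value):
    """Encode `value` using one binary column per non-reference category."""
    if value not in all_categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == column else 0 for column in kept_columns]

for name in all_categories:
    print(name, dummy_encode(name))
```

The reference category is still recoverable, because it is the only one that encodes to all zeros.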

Hi Jason. I truly follow you a lot and really appreciate your effort and the ease of your tutorials. Just a question: how would one-hot encoding work for multi-label classification? Also, in coming tutorials, could you help with feature selection of text data for multi-class and multi-label classification using Keras? I tried multi-class classification with 90 data points, using Keras for an MLP, CNN and RNN, where each data point is a long paragraph with labels, but the accuracy I got is 37.5 percent. Let me know if you have any suggestions.

The one hot vector would have a length that would equal the number of labels, but multiple 1 values could be specified.

Thanks for the suggestion.

This post suggests ways to lift deep learning model skill:

http://machinelearningmastery.com/improve-deep-learning-performance/

What are the cons of one-hot encoding? Suppose you have some categorical features, each with 500 or more different values. When you one-hot encode them, you will have many columns in the dataset. Is that still good for a machine learning algorithm?

Great question!

The vectors can get very large, e.g. as long as the number of words in your vocabulary in an NLP problem.

Large vectors make the method slow (increased computational complexity).

In these cases, a dense representation could be used, e.g. word embeddings in NLP.

Hi Jason, thanks again for your amazing pedagogy.

Back to Espoir’s question: I face this problem with 84 user_ID values. I one-hot encode them and, like you said, when I fit the data with an SVM classifier, it looks like I fall into an infinite loop. Taking into account the fact that I am not in the NLP case, how can I fix this?

Thanks.

What do you mean you fall into an infinite loop?

Very helpful post, Jason!

Espoir raised my question here, but I did not understand how to apply your answer to my case. I have 11+ thousand different product IDs. The database has about 130 thousand entries. This easily leads to a MemoryError when using OHE. What approach/solution should I look for?

Ouch.

Maybe you can use efficient sparse vector representations to cut down on memory?

Maybe try exploring dense vector methods that are used in NLP. Maybe you can use something like a word embedding and let the model (e.g. a neural net) learn the relationship between different input labels, if any.
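To see why a sparse representation helps, here is a back-of-the-envelope sketch using the figures from the question. The “11+ thousand” products are rounded to 11,000 as an assumption, and the product IDs and helper names are hypothetical. A sparse form stores only the column index of the single 1 in each row instead of every 0 and 1.

```python
# Rough size comparison between dense and sparse one-hot storage.
n_rows = 130_000        # "about 130 thousand entries"
n_categories = 11_000   # assumption: "11+ thousand" rounded to 11,000

dense_cells = n_rows * n_categories  # every 0 and 1 is stored
sparse_cells = n_rows                # only the column index of each 1

def to_column_indices(values, column_index):
    """Store each row as the column index of its single 1 value."""
    return [column_index[v] for v in values]

def to_dense_row(column, width):
    """Expand one stored column index back into a full one-hot row."""
    row = [0] * width
    row[column] = 1
    return row

# Hypothetical product IDs, for illustration only.
product_index = {"p1": 0, "p2": 1, "p3": 2}
stored = to_column_indices(["p3", "p1", "p3"], product_index)

print(dense_cells, sparse_cells)
print(stored, to_dense_row(stored[0], len(product_index)))
```

Sparse matrix types in numerical libraries apply the same idea, and rows can be expanded to dense form only when a model needs them.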

Hello Jason how do we retrieve the features back after OHE if we need to present it visually?

You can reverse the encoding with an argmax() (e.g. numpy.argmax())
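A minimal sketch of that reversal, written in plain Python so it has no dependencies. The local `argmax` helper is an assumption that stands in for `numpy.argmax()`, and the category order is assumed to be the one used when encoding.

```python
# Reverse a one-hot encoding by finding the position of the largest value.
decode_categories = ["red", "green", "blue"]  # assumed encoding order

def argmax(vector):
    """Return the index of the largest value (first one on ties)."""
    return max(range(len(vector)), key=vector.__getitem__)

rows = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
labels_back = [decode_categories[argmax(row)] for row in rows]

print(labels_back)
```

The same approach works on predicted probabilities, where the largest value marks the most likely category.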

Thank you for these wonderful posts!

Does data have to be one-hot encoded for classification trees and random forests as well, or can they handle data without it? Or should I just try both and see which gives better results?

No, trees can deal with categories as-is.

Hi Jason, this post is very helpful, thank you!!

Question: in general, what happens to model performance when we apply one-hot encoding to an ordinal feature? Would you suggest using only integer encoding for ordinal features?

It really depends on the problem and the meaning of the feature being encoded.

If in doubt, test.

I see, thanks!

hey Jason,

As usual, this is another useful post on feature representation of categorical variables. Logistic regression fits a separating line on the data points of the form w1X1 + w2X2 + ..., where X are features such as the categorical variables place, color, etc., and w are weights. Intuitively, X can take only numerical values for the line to fit. Is this the right intuition?

Yes, regression algorithms like logistic regression require numeric input variables.

Thanks a lot for your clarifying. I love your blogs and daily email digests. They help me to understand key concepts & practical tips easily.

Thanks Raj.

nice!

Thanks.

very well explained..thanks

Thanks, I’m glad it helped.

I love your blog!

One question: if we use tree based methods like decision tree, etc. Do we still need one-hot encoding?

Thank you very much!

No Jie. Most decision trees can work with categorical inputs directly.

Thank you very much!

No probs.

Hi Jason, loving the blog … a lot!

I’m using your binary classification tutorial as a template (thanks!) for a retail sales data predictor. I’m basically trying to predict future hourly sales using product features and hourly weather forecasts, trained on historical sales and using above/below annual average sales as my binary labels.

I have encoded my categorical data and I get good accuracy when training my data (87%+), but this falls down (to 26%) when I try to predict using an unseen, and much smaller data set.

As far as I can see my problem is caused by encoding the categorical data – the same categories in my unseen set have different codes than in my model. Could this be the cause of my poor prediction performance: the encoded prediction categories are not aligned to those used to train and test the model? If so how do you overcome these challenges in practice?

Hope it makes sense.

Nice work Andrew!

Your model might be overfitting. Try a smaller model, regularization, a larger dataset, or less training.

Here are more ideas:

http://machinelearningmastery.com/improve-deep-learning-performance/

I hope that helps as a start.
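On the encoding mismatch raised in the question, one common practice is to learn the category-to-integer mapping once from the training data and reuse that same mapping for any new data, rather than building a fresh mapping per dataset. The sketch below is an illustration under assumptions: the weather values, the helper names, and the choice of -1 for unseen categories are all hypothetical.

```python
# Learn the mapping on training data, then reuse it for unseen data.
def fit_mapping(values):
    """Assign an integer to each distinct category (sorted for stability)."""
    return {name: code for code, name in enumerate(sorted(set(values)))}

def transform(values, mapping, unknown=-1):
    """Encode values with an existing mapping; unseen categories get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

train_weather = ["sunny", "rain", "cloudy", "rain"]  # hypothetical training data
new_weather = ["rain", "sunny", "snow"]              # hypothetical unseen data

weather_mapping = fit_mapping(train_weather)
train_codes = transform(train_weather, weather_mapping)
new_codes = transform(new_weather, weather_mapping)

print(weather_mapping)
print(train_codes)
print(new_codes)
```

Because both datasets share one mapping, “rain” receives the same code everywhere, and a category never seen in training is flagged instead of silently taking another category's code.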

Hey Jason, didn’t think I had ‘that’ problem, but I probably do 🙂

Many thanks.

Appreciable and very helpful post, thank you!!!

Question: What is the best way to one hot encode an array of categorical variables?

I have also startup with a AI post you can also find some knowledge over there: Thebigmoapproach.com/

There are many ways and “best” is defined by the tools and problem.

Here are a few ways:

http://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/