Last Updated on August 17, 2020

Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an **Ordinal Encoding** and a **One-Hot Encoding**.

In this tutorial, you will discover how to use encoding schemes for categorical machine learning data.

After completing this tutorial, you will know:

- Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
- How to use ordinal encoding for categorical variables that have a natural rank ordering.
- How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

## Tutorial Overview

This tutorial is divided into six parts; they are:

- Nominal and Ordinal Variables
- Encoding Categorical Data
- Ordinal Encoding
- One-Hot Encoding
- Dummy Variable Encoding

- Breast Cancer Dataset
- OrdinalEncoder Transform
- OneHotEncoder Transform
- Common Questions

## Nominal and Ordinal Variables

Numerical data, as its name suggests, involves features that are only composed of numbers, such as integers or floating-point values.

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

- A “
*pet*” variable with the values: “*dog*” and “*cat*“. - A “
*color*” variable with the values: “*red*“, “*green*“, and “*blue*“. - A “
*place*” variable with the values: “*first*“, “*second*“, and “*third*“.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “*place*” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.

A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.

**Nominal Variable**(*Categorical*). Variable comprises a finite set of discrete values with no relationship between values.**Ordinal Variable**. Variable comprises a finite set of discrete values with a ranked ordering between values.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

### Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:

- Ordinal Encoding
- One-Hot Encoding
- Dummy Variable Encoding

Let’s take a closer look at each in turn.

### Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value.

For example, “*red*” is 1, “*green*” is 2, and “*blue*” is 3.

This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.

For some variables, an ordinal encoding may be enough. The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

It is a natural encoding for ordinal variables. For categorical variables, it imposes an ordinal relationship where no such relationship may exist. This can cause problems and a one-hot encoding may be used instead.

This ordinal encoding transform is available in the scikit-learn Python machine learning library via the OrdinalEncoder class.

By default, it will assign integers to labels in the order that is observed in the data. If a specific order is desired, it can be specified via the “*categories*” argument as a list with the rank order of all expected labels.

We can demonstrate the usage of this class by converting colors categories “red”, “green” and “blue” into integers. First the categories are sorted then numbers are applied. For strings, this means the labels are sorted alphabetically and that blue=0, green=1 and red=2.

The complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 |
# example of a ordinal encoding from numpy import asarray from sklearn.preprocessing import OrdinalEncoder # define data data = asarray([['red'], ['green'], ['blue']]) print(data) # define ordinal encoding encoder = OrdinalEncoder() # transform data result = encoder.fit_transform(data) print(result) |

Running the example first reports the 3 rows of label data, then the ordinal encoding.

We can see that the numbers are assigned to the labels as we expected.

1 2 3 4 5 6 |
[['red'] ['green'] ['blue']] [[2.] [1.] [0.]] |

This OrdinalEncoder class is intended for input variables that are organized into rows and columns, e.g. a matrix.

If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the LabelEncoder class can be used. It does the same thing as the OrdinalEncoder, although it expects a one-dimensional input for the single target variable.

### One-Hot Encoding

For categorical variables where no ordinal relationship exists, the integer encoding may not be enough, at best, or misleading to the model at worst.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the ordinal representation. This is where the integer encoded variable is removed and one new binary variable is added for each unique integer value in the variable.

Each bit represents a possible category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be “on.” This is called one-hot encoding …

— Page 78, Feature Engineering for Machine Learning, 2018.

In the “*color*” variable example, there are three categories, and, therefore, three binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

This one-hot encoding transform is available in the scikit-learn Python machine learning library via the OneHotEncoder class.

We can demonstrate the usage of the OneHotEncoder on the color categories. First the categories are sorted, in this case alphabetically because they are strings, then binary variables are created for each category in turn. This means blue will be represented as [1, 0, 0] with a “1” in for the first binary variable, then green, then finally red.

The complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 |
# example of a one hot encoding from numpy import asarray from sklearn.preprocessing import OneHotEncoder # define data data = asarray([['red'], ['green'], ['blue']]) print(data) # define one hot encoding encoder = OneHotEncoder(sparse=False) # transform data onehot = encoder.fit_transform(data) print(onehot) |

Running the example first lists the three rows of label data, then the one hot encoding matching our expectation of 3 binary variables in the order “blue”, “green” and “red”.

1 2 3 4 5 6 |
[['red'] ['green'] ['blue']] [[0. 0. 1.] [0. 1. 0.] [1. 0. 0.]] |

If you know all of the labels to be expected in the data, they can be specified via the “*categories*” argument as a list.

The encoder is fit on the training dataset, which likely contains at least one example of all expected labels for each categorical variable if you do not specify the list of labels. If new data contains categories not seen in the training dataset, the “*handle_unknown*” argument can be set to “*ignore*” to not raise an error, which will result in a zero value for each label.

### Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “*blue*” and [0, 1, 0] represents “*green*” we don’t need another binary variable to represent “*red*“, instead we could use 0 values for both “*blue*” and “*green*” alone, e.g. [0, 0].

This is called a dummy variable encoding, and always represents C categories with C-1 binary variables.

When there are C possible values of the predictor and only C – 1 dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization

— Page 95, Feature Engineering and Selection, 2019.

In addition to being slightly less redundant, a dummy variable representation is required for some models.

For example, in the case of a linear regression model (and other regression models that have a bias term), a one hot encoding will case the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.

If the model includes an intercept and contains dummy variables […], then the […] columns would add up (row-wise) to the intercept and this linear combination would prevent the matrix inverse from being computed (as it is singular).

— Page 95, Feature Engineering and Selection, 2019.

We rarely encounter this problem in practice when evaluating machine learning algorithms, unless we are using linear regression of course.

… there are occasions when a complete set of dummy variables is useful. For example, the splits in a tree-based model are more interpretable when the dummy variables encode all the information for that predictor. We recommend using the full set if dummy variables when working with tree-based models.

— Page 56, Applied Predictive Modeling, 2013.

We can use the *OneHotEncoder* class to implement a dummy encoding as well as a one hot encoding.

The “*drop*” argument can be set to indicate which category will be come the one that is assigned all zero values, called the “*baseline*“. We can set this to “*first*” so that the first category is used. When the labels are sorted alphabetically, the first “blue” label will be the first and will become the baseline.

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable […] is known as the baseline.

— Page 86, An Introduction to Statistical Learning with Applications in R, 2014.

We can demonstrate this with our color categories. The complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 |
# example of a dummy variable encoding from numpy import asarray from sklearn.preprocessing import OneHotEncoder # define data data = asarray([['red'], ['green'], ['blue']]) print(data) # define one hot encoding encoder = OneHotEncoder(drop='first', sparse=False) # transform data onehot = encoder.fit_transform(data) print(onehot) |

Running the example first lists the three rows for the categorical variable, then the dummy variable encoding, showing that green is “encoded” as [1, 0], “red” is encoded as [0, 1] and “blue” is encoded as [0, 0] as we specified.

1 2 3 4 5 6 |
[['red'] ['green'] ['blue']] [[0. 1.] [1. 0.] [0. 0.]] |

Now that we are familiar with the three approaches for encoding categorical variables, let’s look at a dataset that has categorical variables.

## Breast Cancer Dataset

As the basis of this tutorial, we will use the “Breast Cancer” dataset that has been widely studied in machine learning since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68 percent and 73 percent. We will aim for this region, but note that the models in this tutorial are not optimized: they are designed to demonstrate encoding schemes.

No need to download the dataset as we will access it directly from the code examples.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings. Some variables show an obvious ordinal relationship for ranges of values (like age ranges), and some do not.

1 2 3 4 5 6 |
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events' '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events' '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events' '40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events' '40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events' ... |

Note that this dataset has missing values marked with a “*nan*” value.

We will leave these values as-is in this tutorial and use the encoding schemes to encode “nan” as just another value. This is one possible and quite reasonable approach to handling missing values for categorical variables.

We can load this dataset into memory using the Pandas library.

1 2 3 4 5 |
... # load the dataset dataset = read_csv(url, header=None) # retrieve the array of data data = dataset.values |

Once loaded, we can split the columns into input (X) and output (y) for modeling.

1 2 3 4 |
... # separate into input and output columns X = data[:, :-1].astype(str) y = data[:, -1].astype(str) |

Making use of this function, the complete example of loading and summarizing the raw categorical dataset is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
# load and summarize the dataset from pandas import read_csv # define the location of the dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv" # load the dataset dataset = read_csv(url, header=None) # retrieve the array of data data = dataset.values # separate into input and output columns X = data[:, :-1].astype(str) y = data[:, -1].astype(str) # summarize print('Input', X.shape) print('Output', y.shape) |

Running the example reports the size of the input and output elements of the dataset.

We can see that we have 286 examples and nine input variables.

1 2 |
Input (286, 9) Output (286,) |

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

## OrdinalEncoder Transform

An ordinal encoding involves mapping each unique label to an integer value.

This type of encoding is really only appropriate if there is a known relationship between the categories. This relationship does exist for some of the variables in our dataset, and ideally, this should be harnessed when preparing the data.

In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.

We can use the OrdinalEncoder from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

Note: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.

Once defined, we can call the *fit_transform()* function and pass it to our dataset to create a quantile transformed version of our dataset.

1 2 3 4 |
... # ordinal encode input variables ordinal = OrdinalEncoder() X = ordinal.fit_transform(X) |

We can also prepare the target in the same manner.

1 2 3 4 |
... # ordinal encode target variable label_encoder = LabelEncoder() y = label_encoder.fit_transform(y) |

Let’s try it on our breast cancer dataset.

The complete example of creating an ordinal encoding transform of the breast cancer dataset and summarizing the result is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
# ordinal encode the breast cancer dataset from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OrdinalEncoder # define the location of the dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv" # load the dataset dataset = read_csv(url, header=None) # retrieve the array of data data = dataset.values # separate into input and output columns X = data[:, :-1].astype(str) y = data[:, -1].astype(str) # ordinal encode input variables ordinal_encoder = OrdinalEncoder() X = ordinal_encoder.fit_transform(X) # ordinal encode target variable label_encoder = LabelEncoder() y = label_encoder.fit_transform(y) # summarize the transformed data print('Input', X.shape) print(X[:5, :]) print('Output', y.shape) print(y[:5]) |

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows, and in this case, the number of columns, to be unchanged, except all string values are now integer values.

As expected, in this case, we can see that the number of variables is unchanged, but all values are now ordinal encoded integers.

1 2 3 4 5 6 7 8 |
Input (286, 9) [[2. 2. 2. 0. 1. 2. 1. 2. 0.] [3. 0. 2. 0. 0. 0. 1. 0. 0.] [3. 0. 6. 0. 0. 1. 0. 1. 0.] [2. 2. 6. 0. 1. 2. 1. 1. 1.] [2. 2. 5. 4. 1. 1. 0. 4. 0.]] Output (286,) [1 0 1 0 1] |

Next, let’s evaluate machine learning on this dataset with this encoding.

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

We will first split the dataset, then prepare the encoding on the training set, and apply it to the test set.

1 2 3 |
... # split the dataset into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) |

We can then fit the *OrdinalEncoder* on the training dataset and use it to transform the train and test datasets.

1 2 3 4 5 6 |
... # ordinal encode input variables ordinal_encoder = OrdinalEncoder() ordinal_encoder.fit(X_train) X_train = ordinal_encoder.transform(X_train) X_test = ordinal_encoder.transform(X_test) |

The same approach can be used to prepare the target variable. We can then fit a logistic regression algorithm on the training dataset and evaluate it on the test dataset.

The complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
# evaluate logistic regression on the breast cancer dataset with an ordinal encoding from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OrdinalEncoder from sklearn.metrics import accuracy_score # define the location of the dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv" # load the dataset dataset = read_csv(url, header=None) # retrieve the array of data data = dataset.values # separate into input and output columns X = data[:, :-1].astype(str) y = data[:, -1].astype(str) # split the dataset into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # ordinal encode input variables ordinal_encoder = OrdinalEncoder() ordinal_encoder.fit(X_train) X_train = ordinal_encoder.transform(X_train) X_test = ordinal_encoder.transform(X_test) # ordinal encode target variable label_encoder = LabelEncoder() label_encoder.fit(y_train) y_train = label_encoder.transform(y_train) y_test = label_encoder.transform(y_test) # define the model model = LogisticRegression() # fit on the training set model.fit(X_train, y_train) # predict on test set yhat = model.predict(X_test) # evaluate predictions accuracy = accuracy_score(y_test, yhat) print('Accuracy: %.2f' % (accuracy*100)) |

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieved a classification accuracy of about 75.79 percent, which is a reasonable score.

1 |
Accuracy: 75.79 |

Next, let’s take a closer look at the one-hot encoding.

## OneHotEncoder Transform

A one-hot encoding is appropriate for categorical data where no relationship exists between categories.

The scikit-learn library provides the OneHotEncoder class to automatically one hot encode one or more variables.

By default the *OneHotEncoder* will output data with a sparse representation, which is efficient given that most values are 0 in the encoded representation. We will disable this feature by setting the “*sparse*” argument to *False* so that we can review the effect of the encoding.

Once defined, we can call the *fit_transform()* function and pass it to our dataset to create a quantile transformed version of our dataset.

1 2 3 4 |
... # one hot encode input variables onehot_encoder = OneHotEncoder(sparse=False) X = onehot_encoder.fit_transform(X) |

As before, we must label encode the target variable.

The complete example of creating a one-hot encoding transform of the breast cancer dataset and summarizing the result is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
# one-hot encode the breast cancer dataset from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OneHotEncoder # define the location of the dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv" # load the dataset dataset = read_csv(url, header=None) # retrieve the array of data data = dataset.values # separate into input and output columns X = data[:, :-1].astype(str) y = data[:, -1].astype(str) # one hot encode input variables onehot_encoder = OneHotEncoder(sparse=False) X = onehot_encoder.fit_transform(X) # ordinal encode target variable label_encoder = LabelEncoder() y = label_encoder.fit_transform(y) # summarize the transformed data print('Input', X.shape) print(X[:5, :]) |

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows to remain the same, but the number of columns to dramatically increase.

As expected, in this case, we can see that the number of variables has leaped up from 9 to 43 and all values are now binary values 0 or 1.

1 2 3 4 5 6 7 8 9 10 11 |
Input (286, 43) [[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.] [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.] [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.] [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.] [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.]] |

Next, let’s evaluate machine learning on this dataset with this encoding as we did in the previous section.

The encoding is fit on the training set then applied to both train and test sets as before.

1 2 3 4 5 6 |
... # one-hot encode input variables onehot_encoder = OneHotEncoder() onehot_encoder.fit(X_train) X_train = onehot_encoder.transform(X_train) X_test = onehot_encoder.transform(X_test) |

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
# evaluate logistic regression on the breast cancer dataset with an one-hot encoding from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OneHotEncoder from sklearn.metrics import accuracy_score # define the location of the dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv" # load the dataset dataset = read_csv(url, header=None) # retrieve the array of data data = dataset.values # separate into input and output columns X = data[:, :-1].astype(str) y = data[:, -1].astype(str) # split the dataset into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # one-hot encode input variables onehot_encoder = OneHotEncoder() onehot_encoder.fit(X_train) X_train = onehot_encoder.transform(X_train) X_test = onehot_encoder.transform(X_test) # ordinal encode target variable label_encoder = LabelEncoder() label_encoder.fit(y_train) y_train = label_encoder.transform(y_train) y_test = label_encoder.transform(y_test) # define the model model = LogisticRegression() # fit on the training set model.fit(X_train, y_train) # predict on test set yhat = model.predict(X_test) # evaluate predictions accuracy = accuracy_score(y_test, yhat) print('Accuracy: %.2f' % (accuracy*100)) |

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieved a classification accuracy of about 70.53 percent, which is slightly worse than the ordinal encoding in the previous section.

1 |
Accuracy: 70.53 |

## Common Questions

This section lists some common questions and answers when encoding categorical data.

**Q. What if I have a mixture of numeric and categorical data?**

Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.

Alternately, you can use the ColumnTransformer to conditionally apply different data transforms to different input variables.

**Q. What if I have hundreds of categories?**

Or, what if I concatenate many one-hot encoded vectors to create a many-thousand-element input vector?

You can use a one-hot encoding up to thousands and tens of thousands of categories. Also, having large vectors as input sounds intimidating, but the models can generally handle it.

**Q. What encoding technique is the best?**

This is unknowable.

Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Tutorials

- 3 Ways to Encode Categorical Variables for Deep Learning
- Why One-Hot Encode Data in Machine Learning?
- How to One Hot Encode Sequence Data in Python

### Books

- Feature Engineering for Machine Learning, 2018.
- Applied Predictive Modeling, 2013.
- An Introduction to Statistical Learning with Applications in R, 2014.

### APIs

- sklearn.preprocessing.OneHotEncoder API.
- sklearn.preprocessing.LabelEncoder API.
- sklearn.preprocessing.OrdinalEncoder API.

### Articles

## Summary

In this tutorial, you discovered how to use encoding schemes for categorical machine learning data.

Specifically, you learned:

- Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
- How to use ordinal encoding for categorical variables that have a natural rank ordering.
- How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

“This type of categorical variable is called an nominal variable because the values can be ordered or ranked.” Should here be ordinal variable instead of nominal variable

Thanks, fixed!

Hi,

« The “place” variable above does have a natural ordering of values. This type of categorical variable is called an nominal variable because the values can be ordered or ranked. ».

You mean « ordinal variable » not « nominal variable » ?

Yes, fixed. Thanks for catching this.

An interesting discussion on “Why OneHotEncoder not get_dummies?” in sklearn can be found here: https://stackoverflow.com/questions/36631163/what-are-the-pros-and-cons-between-get-dummies-pandas-and-onehotencoder-sciki

Thanks for sharing.

In case of the OrdinalEncoding, I have observed shift in the encoded value after introducing new value E.g. data = asarray([[‘orange’], [‘red’], [‘green’], [‘blue’]]). In real life scenario, additional values should not shuffle existing encoded values between training and test data sets. Can we use some hashing technique to generate consistent encoded values?

You can specify the fill list of expected values up front and save the object for use on all future data.

Or you can write some custom code to handle the mapping.

Hi,

When I try to do the dummy encoding with argument drop=’first’ I get this error: __init__() got an unexpected keyword argument ‘drop’.

I checked the syntax documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html?highlight=onehotencoder#sklearn.preprocessing.OneHotEncoder

The code seems to be right, but I still get the error. Can you please help?

Thanks so much for your tutorials – they are great!

Perhaps confirm that your version of scikit-learn is up to date?

Ugh. SO sorry. I thought I was on the latest version. I updated and the dummy encoding code works now. Thanks for taking the time to get back to me!

Well done!

No problem at all.

Please, i am trying to fit the OrdinalEncoder on the training dataset and use it to transform the train and test datasets as follows;

ordinal_encoder = OrdinalEncoder()

ordinal_encoder.fit(X_train)

X_train = ordinal_encoder.transform(X_train)

X_test = ordinal_encoder.transform(X_test)

I have this error

————————————————————————–

ValueError Traceback (most recent call last)

in ()

2 ordinal_encoder.fit(x_train)

3 x_train=ordinal_encoder.transform(x_train)

—-> 4 x_test = ordinal_encoder.transform (x_test)

1 frames

/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)

122 msg = (“Found unknown categories {0} in column {1}”

123 ” during transform”.format(diff, i))

–> 124 raise ValueError(msg)

125 else:

126 # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories [69.0, 70.0, 71.0, 72.0] in column 0 during transform

What can be the way out? I appreciate your time.

Sorry to hear that you are having trouble, this may help:

https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

One reason for this problem is a category value exists in the test data set which is not available in the training set. Hence encoder has not seen this value during the fit and hence doesn’t know how to encode it. Few different options to handle such scenario

1. Make use of the handle_unknown parameter, refer OrdinalEncoder documentation.

2. Make use of categories parameter, refer OrdinalEncoder documentation. I personally prefer this method as this gives complete control and also allows consistency between train and test.

However in your case error message is ValueError: Found unknown categories [69.0, 70.0, 71.0, 72.0] in column 0 during transform

this gives me a hit that the first column in your data set is a sequence number, check if this should really be part of the feature set? If this is a simple sequence number there is absolutely no way you can consider them as categorical data.

Typically unseen categories are labeled as all zeros.

Hi!

One of the categorical variables I am dealing with is, employment length, that has 511193 distinct categories. I am getting a memory error ‘Unable to allocate 243. GiB for an array with shape (511193, 511193)’. How to deal with this. I have come across articles which say that one hot encoding can’t process high cardinality and would give misleading results. Kindly help! Thanks

Perhaps try an ordinal encoding.

Also, perhaps try comparing results to an embedding and a hash.

More suggestions here:

https://machinelearningmastery.com/faq/single-faq/how-do-i-handle-a-large-number-of-categories

Hi Jason,

So since the one-hot encoding may cause problems with linear regression. Can I process the data as dummy variable and this won’t affect whether I use it for tree-based methods or linear models?

Yes I believe so. Test to confirm if you’re concerned.

Hi Jason.

I’m a beginner at using machine learning for predictions.

I already followed your code on using OrdinalEncoder for the brest-cancer dataset. How do I input new data for predictions (apart from the test dataset) after encoding with OrdinalEncoder ?

This can help you make a prediction with new data for your model/pipeline:

https://machinelearningmastery.com/make-predictions-scikit-learn/

Hi Jason.

I’m a beginner at using machine learning for predictions.

I already followed your code on using OrdinalEncoder for the brest-cancer dataset. How do I get predictions for new data unknown to the dataset (i.e data different from the test dataset) after encoding with OrdinalEncoder?

Good question, see this:

https://machinelearningmastery.com/make-predictions-scikit-learn/

Thanks for your reply.

I figured this out but my question was how do I make predictions for new

categoricaldataset. Examples given in the link you provided were on numbers already.But say I am working on the breast-cancer.csv dataset, how do I predict the outcome for new categorical data unknown to the dataset?

The approach for making predictions is the same regardless of the data types. E.g. call model.predict()

If you are using encodings, then wrap them up into a pipeline and call pipeline.predict() and pass the raw data as input.

Many thanks. I figured it out.

Well done.

What if you need a default value to cover a categorical level that is unforeseen? Dummy encoding is very good for that, but I think some discussion of the matter could be included.

You can set the handle_unknown argument and specify how to handle labels unseen during training.

”’In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, ‘at least as a point of reference with other encoding schemes.’ ”’ Can you elaborate this sentence?

Yes, you can choose to specify the ordinal relationship between the labels in some of the variables if you wish. This would likely improve the performance of the model fit on an ordinal encoding.

Running the code of “evaluate logistic regression on the breast cancer dataset with an ordinal encoding” a few time gives the same accuracy, because of random_state=1 during splitting to train and test sets. After setting random_state=None you can get an error like:

ValueError: Found unknown categories [“’24-26′”] in column 3 during transform

I suppose that we should make ordinal encoding of input variables before splitting. Modified code:

for i in range(10):

# ordinal encode input variables

ordinal_encoder = OrdinalEncoder()

ordinal_encoder.fit(X)

X = ordinal_encoder.transform(X)

# split the dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=None)

# ordinal encode target variable

label_encoder = LabelEncoder()

label_encoder.fit(y_train)

y_train = label_encoder.transform(y_train)

y_test = label_encoder.transform(y_test)

# define the model

model = LogisticRegression()

# fit on the training set

model.fit(X_train, y_train)

# predict on test set

yhat = model.predict(X_test)

# evaluate predictions

accuracy = accuracy_score(y_test, yhat)

print(f'{i + 1}. Accuracy: {accuracy*100:.2f} %’)

works fine, giving various results in subsequent runs, for example:

1. Accuracy: 64.21 %

2. Accuracy: 77.89 %

3. Accuracy: 75.79 %

4. Accuracy: 77.89 %

5. Accuracy: 76.84 %

6. Accuracy: 68.42 %

7. Accuracy: 75.79 %

8. Accuracy: 73.68 %

9. Accuracy: 71.58 %

10. Accuracy: 80.00 %

Thanks for sharing.

You transform target with Ordinal encoder and then fit the logistic regression model.

Is this process called ordinal logistic regression or is it something different?

Thanks in advance

No, just “logistic regression”.

hi can u pls say after using one hot encoding and making our model with logistic regression or other classification algorithm how we can calculate the categorical data coefficient? coz after using one hot encoding a categorical variable will change to for example 3 columns and variable we should calculate coefficient of each columns that depend of the categorical data seperately and then add to each other or no how is it? i have different type of variables and my goal is show that which variable has most effect on my target variable(output) after build the classification model

What is the “categorical data coefficient”?

alueError Traceback (most recent call last)

in ()

59 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

60 # prepare input data

—> 61 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

62 # prepare output data

63 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

in prepare_inputs(X_train, X_test)

38 # encode

39 train_enc = le.transform(X_train[:, i])

—> 40 test_enc = le.transform(X_test[:, i])

41 # store

42 X_train_enc.append(train_enc)

~\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py in transform(self, y)

131 if len(np.intersect1d(classes, self.classes_)) 133 raise ValueError(“y contains new labels: %s” % str(diff))

134 return np.searchsorted(self.classes_, y)

135

ValueError: y contains new labels: [‘102’ ‘120’ ‘137’ ‘145’ ‘147’ ’15’ ‘159’ ‘174’ ‘195’ ‘208’ ‘214’ ‘222’

‘223’ ’23’ ’24’ ‘247’ ‘248’ ‘259’ ‘264’ ‘286’ ‘289’ ‘290’ ‘291’ ‘298’

‘301’ ‘325’ ‘328’ ‘331’ ’34’ ‘341’ ‘354’ ‘361’ ‘367’ ‘378’ ‘401’ ‘402’

‘410’ ‘426’ ‘441’ ‘446’ ‘452’ ’46’ ‘468’ ‘485’ ‘489’ ‘5’ ‘510’ ‘512’

‘513’ ‘520’ ‘533’ ’54’ ‘550’ ‘579’ ‘586’ ‘587’ ’59’ ‘603’ ‘605’ ’73’ ’79’

’89’ ’90’ ’99’]

###################

I have multiclassification labe

Sorry to hear that you’re having trouble, these tips may help:

https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

How to handle missing values in One Hot Encoder?

I am trying to apply one hot encoding to 2D numpy array containing missing value ‘N’. I have the following code, which works well without missing value.

import numpy as np

X = [[‘A’, ‘G’], [‘N’, ‘C’], [‘T’, ‘A’]]

X = np.array(X)

print(‘Shape of data before one hot encoding: ‘, X.shape)

classes = np.array([‘A’, ‘C’, ‘G’, ‘T’])

X = np.searchsorted(classes, X)

eye = np.eye(classes.shape[0])

onehotlabels = np.concatenate([eye[i] for i in X.T], axis=1)

print(‘Shape of data after pre-processing: ‘, onehotlabels.shape)

print(onehotlabels)

Output:

Shape of data before one hot encoding: (3, 2)

Shape of data after pre-processing: (3, 8)

[[1. 0. 0. 0. 0. 0. 1. 0.]

[0. 0. 0. 1. 0. 1. 0. 0.]

[0. 0. 0. 1. 1. 0. 0. 0.]]

Is it possible to replace ‘N’ by 0 0 0 0 by using this approach? The output will look like the following:

[[1. 0. 0. 0. 0. 0. 1. 0.]

[0. 0. 0. 0. 0. 1. 0. 0.]

[0. 0. 0. 1. 1. 0. 0. 0.]]

You can set “handle_unknown” to “ignore”.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Hi Jason,

Thanks for wonderful article.

I am facing one doubt and not getting answer anywhere.

Suppose my target variable is Category Fruit – Apple, Orange, Mango, Banana.

There is no ordinal relationship between them. Can I one hot encode them or use ordinal encoder.

Everywhere (including other websites). I can find that target variable is converted into Apple -1, Orange-2, Mango-3, Banana-4. Sometimes column with labels 1, 2,3,4 is already created.

My query – 1) If there is no relationship exists between Apple, Orange, Mango, Banana. So should we use One Hot encoding /Dummy Variable or Ordinal encoders for converting target variable. Most of the cases I found Ordinal encoder used.

If I use One Hot encoding /Dummy Variable encoding for target variable Apple, Orange, Mango, Banana. Then there will be many columns created. I think this is not appropriate.

Is it the reason for creating single column for target – Ordinal encoders are used.

2) So if we use one One Hot encoding /Dummy Variable creation for target variable with categories , then which algo will be suitable for prediction?

One hot encode sounds appropriate.

Yes, many columns will be created, it is typically not a problem, even with tens of thousands of new columns.

Hi,

as i understand label encoder is for categorical target variable (i.e y). Does it make sense to use label encoder for the independent feutures (i.e X)?

**I JUST BOUGHT YOU BOOK”DATA PREPARATION FOR MACHINE LEANRING”

THIS IS A FANTASTIC BOOK

Yes, you can use the label encoder on inputs, but we now have the “ordinal encoder” that does the same job and is designed to work with inputs.

Thanks!!!

ok, thanks. I use the label encoder as the technique do not require 2d data. When I tried to use the ordinal encoder I had to reshape the data.

Yes, this is a key difference.

Great technical writing!!!

I wonder which should be done first: data cleaning or data transform?

I think data cleaning includes subtasks such as outlier detection where values of columns are expected to be numeric.

Thanks.

Often both are iterated as you learn more about the data/problem, but generally cleaning then transforms.

Hello,

If I have categorical and nominal variables, can I encode the categorical with one-hot and the nominal ones with label encoding ?

Sure. Use whatever works best for your data.

Hi,

Quoting your article”

“If new data contains categories not seen in the training dataset, the “handle_unknown” argument can be set to “ignore” to not raise an error, which will result in a zero value for each label.”

I guess that because we fit and transform using certain OneHotEncoder on the training set, there can be a case where the testing set has features that haven’t been OneHotEncoded and that’s what you wrote in your blog.

Does setting the “handle_unknow”=’ignore’ in such a case wouldn’t cause a loss of accuracy in the predictions/ cassifications?

Is there another way to deal with that issue, without causing Data Leakage?

Regards

It may or may not impact model performance depending on the model and the dataset.

A better approach is to have a more complete training dataset.

Hi Jason,

Your posts are always my first ‘go-to’ for my ML/AI-related questions! Thanks for your contribution to the community.

My question is, in case of tree-based models, is it best practice to use OHE, or Dummy encoding? The tree-based algorithms build a tree based on all available variables and we if drop one of the levels, the tree will never see that variable and will not use it in the splits. I would love to know your thoughts on this.

Thanks.

Perhaps try both and see what works well or best for your specific dataset and model.

Very interesting article.

I’m still missing something. I’m actually trying to do logistic regression from scratch on a Fractures Dataset and I have several categorical columns to encode myself.

You say that DummyEncoding avoid the problem (when doing linear regression ok, maybe it is less important with logistic regression so ?) of singular matrix that comes with one hot encoding, and it works with n-1 columns.

But if I want to encode the sex column, only possibility is 1 for male 0 for female (or the contrary). If I apply the logic of DummyEncoding, we go back to the problem of ordinal relations (that does not existe here between the different values of a categorical variable) that are not relevant here and that we tried to avoid using precisely a One Hot Encoding ?

Then, I have a column “medication” with “no medication” “medication #1” “medication #2” values, here I can perfectly proceed as you did with the colors and one hot encoding right ?

Finally, where can I find more information about the exact mathematical reasons that make ordinal encoding bad when no ordinal relations exist and conversly make one hot encoder works good ?

Thanks a lot,

Geoffrey (from France)

For why ordinal encoding is bad when no ordinal relations exists: The levels of measurement are nominal, ordinal, interval, and ratio, in order of increasing information. If you do use ordinal number while no ordinal exists, you are introducing non-exist information to the machine to learn, which is essentially noise and confused the model. Think what happened if we introduce the phone number (which is nominal data) together with how many hours students worked (which is ratio data) to predict their test score.

I forgot, could you detail why the matrix becomes singular with one hot encoding ? the problems comes if we have n >=3, n being the number of different values in a categorical feature, right ? as with n = 2, we have 0 in a colum and 1 in the other, but with 3, 2 columns can have the same 0 value and I suppose the problems begin here ?

Yet I still have difficulties to understand why matrix becomes singular. For example in the example you give with the colors, the matrix is not singular, although you did not use a dummy encoding. The only way I see it could be singular is if we have a colum full of zeros but it would mean the value never appears so it’s kind odd…

More broadly, in linear regression, we can have 2 identical examples (for example 2 persons with the same age and height and the weight target is the same), our matrix would be singular right ?

Could you extend on these points ifyou don’t mind plz, I find them so interesting but I have difficulties to figure it out.

Thank you again

Geoffrey

Mathematically, a matrix is singular if you cannot inverse it. A matrix with a lot of zero (which in case of one-hot encoding) make it very difficult to inverse (what number multiplies zero is one?) and hence very easy become a singular matrix. Linear algebra is a very broad topic but it is very useful in the theory side of machine learning. Hope this short answer can help you move forward.

Hello, I have a target vector with 6 different values, after one-hot encoding I would have 6 target Vectors. I want to train a model with kernel machine and I can only have one target vector each time for training. So I would make 6 times training respective to the 6 target vectors. Now How could I compine the results of this 6 to get the prediction and the accuracy?

Thanks.

In your approach, probably you need to have 6 models, each gives a binary (sigmoidal activation at output, in range of 0 to 1) classification for each target value, then compare which model gives the highest floating point score and use it as the prediction.

Or a more common approach, one model with 6 output, each is a sigmoidal activation, and apply softmax to give a probability to each target. In this case, your output is not 6 target vectors, but one matrix of 6 columns.

First of all, thank you so much. I’ve learned a lot from your website.

I like to ask a question about the concept. Is ordinal data a type of nominal data that has order or both of them are branches of categorical data?

Nominal data can be considered categorical. But ordinal data is not, since the order means something, not just names.

Hi Jason, I’m carrying out binary classification with a range of models: some sci kit learn ones, some transformer models and some deep learning neural nets.

For my sci kit learn and transformer models, they both run fine when using get_dummies. For my Deep Learning model, when it outputs an array shape [3625, ] the model fails in terms of accuracy and loss (with a single output neuron). When get_dummies produces an array [3625,2] the model works well, do you know why this could be?

Many thanks

Sorry, not clear on what you’re describing. Are you saying the model failed because your prediction output and the training output are in different shape?

Would it be wise to apply one hot encoding or dummy encoding before pca or pls?

I have a mixed dataset with 2 categorical variables, with 3 and 2 categories respectively

PCA is part of the exploratory step to try to understand data. In that case, you may want to run one-hot encoding to make one nominal feature into multiple to help PCA.

Hi Adrian,

Thanks for the nice and detailed article! A quick question, do you think we need to normalize the original encoded variable (0, 1, 2, 3) to a range between 0 and 1? the reason I’m asking is that I have continuous variables in large scales that need to be normalized to 0 and 1 before feeding to the model. Should the same normalization be done after ordinal encoding the categories with ordinal relationships?

I would suggest you to learn about the level of measurement: https://en.wikipedia.org/wiki/Level_of_measurement

We do one-hot encoding because the data are nominal, and we do normalization because the data is interval or ratio. If you try it out, doing a wrong preprocessing would probably inject noise to the data and confuse your machine learning model.

My categorial variable does have the ordinal relationship (e.g. low, medium and high). After ordinal encoding, it is converted to 0, 1 and 2), my question is, should we further normalize it to a range between 0 and 1 (e.g., 0, 0.5, 1)

I don’t think 0 to 2 and 0 to 1 are vastly different. So in this case, it should be optional.

Hi Jason, Thank you so much for your helpful and informative series of posts.

I have a question about your statement above regarding when to implement encoding, that says:

“The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.”

Why would this be true? It seems that some feel that there is risk of “data leakage” if the full dataset is encoded prior to splitting into train and test. But encoding is different from imputation. Imputation uses statistics from the dataset, so there is a risk of data leakage, for sure. But encoding is simply an alternate expression of the space of possible values of feature. It doesn’t care about the actual values found in train and test, it only cares about the space of possible values, right? Therefore I think that likely is no risk of data leakage in encoding as it does not care about the specific values found in the train and test sets. It only cares about the space of possible values.

This distinction is important, because if one tries, for example, to fit OrdinalEncode on train data only, which happens to be missing one of the possible values and this missing value appears in the test data we get a exception. But this can be avoided by encoding the full dataset prior to splitting.

Perhaps I’m not thinking about this clearly? I’d welcome your thoughts and perspective.

Thanks, mark

Hi Mark…Kindly narrow your query to a single question so that we better assist you.

Hello.

If a do a linear regression model I can include the year as a categorical variable (instead of numerical) in order to estimate a different coefficient for each year. This is commonly done when the variable is not linear or if you want to detect something strange on a specific year…

When you do it with common regression packages with R they automatically take one of the levels (classes) as reference and calculate the coefficients respect this base level. It can also be done taking as reference the grand mean of all levels (I think this is included in packages called contrasts but it’s related with the way we codify dummy variables).

My question is…

What if I want to calculate the model referring each year’s coefficient to the previous one instead of a common one? How would yo codify it?

Hi Amparo…Our content is based upon Python. The following resource provides an implementation of regression from scratch.

https://machinelearningmastery.com/implement-linear-regression-stochastic-gradient-descent-scratch-python/

What is this and what does it do — what and why am I separating into input and output columns? It just appears in the problem with no explanation. Can you please clarify?

# separate into input and output columns

X = data[:, :-1].astype(str)

y = data[:, -1].astype(str)

Hi Jason…You may find the following helpful:

https://machinelearningmastery.com/start-here/#dataprep