Ordinal and One-Hot Encodings for Categorical Data

By Jason Brownlee on August 17, 2020 in Data Preparation

Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an Ordinal Encoding and a One-Hot Encoding.

In this tutorial, you will discover how to use encoding schemes for categorical machine learning data.

After completing this tutorial, you will know:

Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
How to use ordinal encoding for categorical variables that have a natural rank ordering.
How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Let’s get started.

Tutorial Overview

This tutorial is divided into six parts; they are:

Nominal and Ordinal Variables
Encoding Categorical Data
1. Ordinal Encoding
2. One-Hot Encoding
3. Dummy Variable Encoding
Breast Cancer Dataset
OrdinalEncoder Transform
OneHotEncoder Transform
Common Questions

Nominal and Ordinal Variables

Numerical data, as its name suggests, involves features that are only composed of numbers, such as integers or floating-point values.

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

A “pet” variable with the values: “dog” and “cat”.
A “color” variable with the values: “red”, “green”, and “blue”.
A “place” variable with the values: “first”, “second”, and “third”.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.

A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.
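
As a minimal sketch of discretization, the snippet below uses the pandas cut() function to bin a handful of values between 1 and 10 into the five ordered labels from the example above. The sample values and the bin edges are assumptions chosen for illustration.

```python
# sketch: discretize a numerical variable into five ordered bins
from pandas import Series, cut
# made-up sample values between 1 and 10 (assumption for illustration)
values = Series([1, 2, 3, 6, 7, 10])
# bin edges give the intervals (0,2], (2,4], (4,6], (6,8], (8,10]
edges = [0, 2, 4, 6, 8, 10]
labels = ['1-2', '3-4', '5-6', '7-8', '9-10']
# assign each value to its bin; the result is an ordered categorical
binned = cut(values, bins=edges, labels=labels)
print(binned.tolist())
# the integer codes preserve the rank order of the bins
print(binned.cat.codes.tolist())
```

The resulting labels keep their rank order, which is what makes the new variable ordinal rather than nominal.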

Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.
Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:

Ordinal Encoding
One-Hot Encoding
Dummy Variable Encoding

Let’s take a closer look at each in turn.

Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.

For some variables, an ordinal encoding may be enough. The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

It is a natural encoding for ordinal variables. For categorical variables, it imposes an ordinal relationship where no such relationship may exist. This can cause problems and a one-hot encoding may be used instead.

This ordinal encoding transform is available in the scikit-learn Python machine learning library via the OrdinalEncoder class.

By default, it determines the categories from the data and assigns integers to them in sorted order (alphabetical for strings). If a specific order is desired, it can be specified via the “categories” argument as a list (one per input variable) with the rank order of all expected labels.

We can demonstrate the usage of this class by converting the color categories “red”, “green” and “blue” into integers. First the categories are sorted, then integers are assigned. For strings, this means the labels are sorted alphabetically, so that blue=0, green=1 and red=2.

The complete example is listed below.

# example of an ordinal encoding
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

Running the example first reports the 3 rows of label data, then the ordinal encoding.

We can see that the numbers are assigned to the labels as we expected.

[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]
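
The “categories” argument mentioned earlier can be sketched with the “place” variable, where alphabetical order (“first”, “second”, “third”) happens to match the rank order but is spelled out anyway. The sample rows are made up for illustration. The sketch also shows that the encoding is reversible via inverse_transform().

```python
# sketch: ordinal encoding with an explicit rank order
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# made-up sample data for the "place" variable (assumption for illustration)
data = asarray([['second'], ['first'], ['third'], ['first']])
# one list of labels per input column, given in rank order
encoder = OrdinalEncoder(categories=[['first', 'second', 'third']])
result = encoder.fit_transform(data)
print(result)
# map the integers back to the original labels
restored = encoder.inverse_transform(result)
print(restored)
```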

This OrdinalEncoder class is intended for input variables that are organized into rows and columns, e.g. a matrix.

If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the LabelEncoder class can be used. It does the same thing as the OrdinalEncoder, although it expects a one-dimensional input for the single target variable.
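
A minimal sketch of the LabelEncoder on a one-dimensional target is shown below. The three sample labels are borrowed from the class names used later in this tutorial, and the call to inverse_transform() illustrates how predictions could be converted back into label form.

```python
# sketch: label encode a one-dimensional target and reverse the encoding
from numpy import asarray
from sklearn.preprocessing import LabelEncoder
# made-up target values (assumption for illustration)
y = asarray(['recurrence-events', 'no-recurrence-events', 'recurrence-events'])
encoder = LabelEncoder()
# classes are sorted, so 'no-recurrence-events' is 0 and 'recurrence-events' is 1
y_enc = encoder.fit_transform(y)
print(y_enc)
# convert integers (e.g. model predictions) back to the string labels
labels = encoder.inverse_transform(y_enc)
print(labels)
```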

One-Hot Encoding

For categorical variables where no ordinal relationship exists, the integer encoding may be insufficient at best, or misleading to the model at worst.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the ordinal representation. This is where the integer encoded variable is removed and one new binary variable is added for each unique integer value in the variable.

Each bit represents a possible category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be “on.” This is called one-hot encoding …

— Page 78, Feature Engineering for Machine Learning, 2018.

In the “color” variable example, there are three categories, and, therefore, three binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.
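
The step from integer codes to binary variables can be sketched by hand with NumPy. This assumes the codes blue=0, green=1 and red=2 from the ordinal encoding example; indexing an identity matrix by the codes is just one convenient way to do it, not the only one.

```python
# sketch: build a one-hot encoding by hand from integer codes
from numpy import asarray, eye
# integer codes for red, green, blue (assumes blue=0, green=1, red=2)
codes = asarray([2, 1, 0])
# row i of the identity matrix has a single 1 in column i
onehot = eye(3)[codes]
print(onehot)
```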

This one-hot encoding transform is available in the scikit-learn Python machine learning library via the OneHotEncoder class.

We can demonstrate the usage of the OneHotEncoder on the color categories. First the categories are sorted, in this case alphabetically because they are strings, then a binary variable is created for each category in turn. This means blue will be represented as [1, 0, 0], with a “1” for the first binary variable, followed by green as [0, 1, 0], and finally red as [0, 0, 1].

The complete example is listed below.

# example of a one hot encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding (sparse_output was named sparse before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

Running the example first lists the three rows of label data, then the one hot encoding matching our expectation of 3 binary variables in the order “blue”, “green” and “red”.

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

If you know all of the labels to be expected in the data, they can be specified via the “categories” argument as a list.

If you do not specify the list of labels, the encoder learns them from the training dataset, so the training dataset should contain at least one example of every expected label for each categorical variable. If new data contains categories not seen in the training dataset, the “handle_unknown” argument can be set to “ignore” so that no error is raised; the unseen category is then encoded with a zero value in every binary variable for that input variable.
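
A small sketch of the “handle_unknown” behavior is shown below. The “yellow” label and the split into training and new data are made up for illustration. The default sparse output is converted with toarray() so the rows can be printed.

```python
# sketch: one-hot encode new data that contains an unseen category
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# made-up training data and new data (assumption for illustration)
train = asarray([['red'], ['green'], ['blue']])
new = asarray([['green'], ['yellow']])
# ignore unseen categories instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)
# 'yellow' was not seen during fit, so its row is all zeros
encoded = encoder.transform(new).toarray()
print(encoded)
```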

Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green”, we don’t need another binary variable to represent “red”; instead, we could use 0 values for both “blue” and “green” alone, e.g. [0, 0].

This is called a dummy variable encoding, and always represents C categories with C-1 binary variables.

When there are C possible values of the predictor and only C – 1 dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization

— Page 95, Feature Engineering and Selection, 2019.

In addition to being slightly less redundant, a dummy variable representation is required for some models.

For example, in the case of a linear regression model (and other regression models that have a bias term), a one hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.

If the model includes an intercept and contains dummy variables […], then the […] columns would add up (row-wise) to the intercept and this linear combination would prevent the matrix inverse from being computed (as it is singular).

— Page 95, Feature Engineering and Selection, 2019.
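
The linear dependence described above can be checked numerically. The sketch below is an illustration on a tiny made-up design matrix, not a full regression: it prepends an intercept column of ones to a one-hot encoding and to a dummy variable encoding and compares the matrix rank with the number of columns.

```python
# sketch: rank of the design matrix with one-hot vs. dummy variable encoding
from numpy import asarray, hstack, ones
from numpy.linalg import matrix_rank
# one-hot rows for red, green, blue, green (columns: blue, green, red)
onehot = asarray([[0., 0., 1.], [0., 1., 0.], [1., 0., 0.], [0., 1., 0.]])
# intercept (bias) column of ones
intercept = ones((onehot.shape[0], 1))
# intercept + full one-hot: the three binary columns sum to the intercept
X_full = hstack((intercept, onehot))
# intercept + dummy encoding: drop the first (blue) column
X_dummy = hstack((intercept, onehot[:, 1:]))
rank_full = matrix_rank(X_full)
rank_dummy = matrix_rank(X_dummy)
print(X_full.shape[1], rank_full)
print(X_dummy.shape[1], rank_dummy)
```

With the full one-hot encoding the rank is lower than the number of columns, so the matrix is rank deficient; with the dummy variable encoding the two match.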

We rarely encounter this problem in practice when evaluating machine learning algorithms, unless we are using linear regression of course.

… there are occasions when a complete set of dummy variables is useful. For example, the splits in a tree-based model are more interpretable when the dummy variables encode all the information for that predictor. We recommend using the full set of dummy variables when working with tree-based models.

— Page 56, Applied Predictive Modeling, 2013.

We can use the OneHotEncoder class to implement a dummy encoding as well as a one hot encoding.

The “drop” argument can be set to indicate which category will become the one that is assigned all zero values, called the “baseline”. We can set this to “first” so that the first category is used. Because the labels are sorted alphabetically, “blue” is the first label and becomes the baseline.

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable […] is known as the baseline.

— Page 86, An Introduction to Statistical Learning with Applications in R, 2014.

We can demonstrate this with our color categories. The complete example is listed below.

# example of a dummy variable encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define dummy variable encoding (sparse_output was named sparse before scikit-learn 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

Running the example first lists the three rows for the categorical variable, then the dummy variable encoding, showing that “green” is encoded as [1, 0], “red” is encoded as [0, 1] and “blue” is encoded as [0, 0], as we specified.

[['red']
 ['green']
 ['blue']]
[[0. 1.]
 [1. 0.]
 [0. 0.]]
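
The baseline does not have to be the first category. As a sketch, the “drop” argument also accepts one label per input variable; here “red” is chosen as the baseline purely for illustration.

```python
# sketch: dummy variable encoding with an explicitly chosen baseline
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
data = asarray([['red'], ['green'], ['blue']])
# drop takes one category per input column; 'red' becomes the baseline
encoder = OneHotEncoder(drop=['red'])
# convert the default sparse output to a dense array for printing
dummy = encoder.fit_transform(data).toarray()
# remaining columns are blue, green; red is encoded as all zeros
print(dummy)
```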

Now that we are familiar with the three approaches for encoding categorical variables, let’s look at a dataset that has categorical variables.

Breast Cancer Dataset

As the basis of this tutorial, we will use the “Breast Cancer” dataset that has been widely studied in machine learning since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68 percent and 73 percent. We will aim for this region, but note that the models in this tutorial are not optimized: they are designed to demonstrate encoding schemes.

No need to download the dataset as we will access it directly from the code examples.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings. Some variables show an obvious ordinal relationship for ranges of values (like age ranges), and some do not.

'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
...

Note that this dataset has missing values marked with a “nan” value.

We will leave these values as-is in this tutorial and use the encoding schemes to encode “nan” as just another value. This is one possible and quite reasonable approach to handling missing values for categorical variables.
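
The effect can be sketched on a tiny made-up column. The sketch assumes the missing value arrives as a floating-point NaN in an object array; casting to strings turns it into the label “nan”, which the encoder then treats like any other category.

```python
# sketch: a missing value encoded as just another category
from numpy import asarray, nan
from sklearn.preprocessing import OrdinalEncoder
# made-up column with one missing value (assumption for illustration)
raw = asarray([['left'], [nan], ['right'], ['left']], dtype=object)
# casting to str turns the missing value into the string 'nan'
data = raw.astype(str)
print(data)
# sorted categories are 'left', 'nan', 'right'
encoder = OrdinalEncoder()
result = encoder.fit_transform(data)
print(result)
```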

We can load this dataset into memory using the Pandas library.

...
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values

Once loaded, we can split the columns into input (X) and output (y) for modeling.

...
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

Tying this together, the complete example of loading and summarizing the raw categorical dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# summarize
print('Input', X.shape)
print('Output', y.shape)

Running the example reports the size of the input and output elements of the dataset.

We can see that we have 286 examples and nine input variables.

Input (286, 9)
Output (286,)

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

OrdinalEncoder Transform

An ordinal encoding involves mapping each unique label to an integer value.

This type of encoding is really only appropriate if there is a known relationship between the categories. This relationship does exist for some of the variables in our dataset, and ideally, this should be harnessed when preparing the data.

In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.
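
For reference, a mixed treatment can be sketched with the ColumnTransformer class, which applies a different encoder to each group of columns. The two-column toy data below (an age range and a breast side) and the explicit age order are assumptions for illustration; this is not the approach used in the rest of this tutorial.

```python
# sketch: ordinal encode one column and one-hot encode another
from numpy import asarray
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
# made-up rows: [age range, breast side] (assumption for illustration)
X = asarray([['40-49', 'left'], ['50-59', 'right'], ['30-39', 'left']])
# column 0 has a rank order, column 1 does not
transformer = ColumnTransformer([
    ('ordinal', OrdinalEncoder(categories=[['30-39', '40-49', '50-59']]), [0]),
    ('onehot', OneHotEncoder(), [1]),
], sparse_threshold=0)  # sparse_threshold=0 forces a dense result
X_enc = transformer.fit_transform(X)
# output columns: age rank, side=left, side=right
print(X_enc)
```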

We can use the OrdinalEncoder from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

Note: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.
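
As a hint for that exercise, alphabetical sorting does not always match the natural order of range labels: “10-14” sorts before “5-9”. The sketch below uses made-up range labels and a hypothetical helper, lower_bound(), to build the rank order from the numeric lower bound of each range.

```python
# sketch: rank range labels by their numeric lower bound
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical helper: '10-14' -> 10
def lower_bound(label):
    return int(label.split('-')[0])

# made-up range labels (assumption for illustration)
data = asarray([['15-19'], ['0-4'], ['5-9'], ['10-14']])
# default: categories sorted alphabetically ('0-4', '10-14', '15-19', '5-9')
default_codes = OrdinalEncoder().fit_transform(data)
print(default_codes.ravel())
# explicit: categories sorted by lower bound ('0-4', '5-9', '10-14', '15-19')
ordered_labels = sorted({str(v) for v in data[:, 0]}, key=lower_bound)
ranked_codes = OrdinalEncoder(categories=[ordered_labels]).fit_transform(data)
print(ranked_codes.ravel())
```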

Once defined, we can call the fit_transform() function and pass it our dataset to create an ordinal encoded version of our dataset.

...
# ordinal encode input variables
ordinal = OrdinalEncoder()
X = ordinal.fit_transform(X)

We can also prepare the target in the same manner.

...
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

Let’s try it on our breast cancer dataset.

The complete example of creating an ordinal encoding transform of the breast cancer dataset and summarizing the result is listed below.

# ordinal encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])
print('Output', y.shape)
print(y[:5])

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows, and in this case, the number of columns, to be unchanged, except all string values are now integer values.

As expected, in this case, we can see that the number of variables is unchanged, but all values are now ordinal encoded integers.

Input (286, 9)
[[2. 2. 2. 0. 1. 2. 1. 2. 0.]
 [3. 0. 2. 0. 0. 0. 1. 0. 0.]
 [3. 0. 6. 0. 0. 1. 0. 1. 0.]
 [2. 2. 6. 0. 1. 2. 1. 1. 1.]
 [2. 2. 5. 4. 1. 1. 0. 4. 0.]]
Output (286,)
[1 0 1 0 1]

Next, let’s evaluate machine learning on this dataset with this encoding.

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

We will first split the dataset, then prepare the encoding on the training set, and apply it to the test set.

...
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

We can then fit the OrdinalEncoder on the training dataset and use it to transform the train and test datasets.

...
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
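
One caveat of fitting on the training set only is that a label appearing only in the test set causes transform() to raise an error by default. A minimal sketch of one workaround is shown below, assuming scikit-learn 0.24 or later, where the OrdinalEncoder accepts the “handle_unknown” and “unknown_value” arguments. The toy data and the choice of -1 as the code for unseen labels are assumptions for illustration.

```python
# sketch: ordinal encode a test set that contains an unseen label
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# made-up train and test columns; 'central' appears only in the test set
X_train = asarray([['left'], ['right'], ['left']])
X_test = asarray([['right'], ['central']])
# map any label not seen during fit to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(X_train)
X_test_enc = encoder.transform(X_test)
print(X_test_enc)
```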

The same approach can be used to prepare the target variable. We can then fit a logistic regression algorithm on the training dataset and evaluate it on the test dataset.

The complete example is listed below.

# evaluate logistic regression on the breast cancer dataset with an ordinal encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieved a classification accuracy of about 75.79 percent, which is a reasonable score.

Accuracy: 75.79
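
A single train/test split gives one estimate. As a rough sketch of averaging over several runs, the encoder and model can be wrapped in a Pipeline so that the encoding is re-fit on the training portion of every fold. To keep the sketch self-contained it uses a small synthetic categorical dataset rather than the breast cancer data; the data, the number of folds and the number of repeats are all assumptions for illustration.

```python
# sketch: repeated cross-validation with the encoder inside a pipeline
from numpy import mean
from numpy import std
from numpy.random import default_rng
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
# synthetic data: 120 rows, 3 categorical columns (assumption for illustration)
rng = default_rng(1)
X = rng.choice(['low', 'mid', 'high'], size=(120, 3))
# synthetic target: 1 when the first column is 'high', otherwise 0
y = (X[:, 0] == 'high').astype(int)
# the encoder is fit only on the training part of each fold
pipeline = Pipeline([('encode', OrdinalEncoder()), ('model', LogisticRegression())])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Accuracy: %.2f (%.2f)' % (mean(scores) * 100, std(scores) * 100))
```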

Next, let’s take a closer look at the one-hot encoding.

OneHotEncoder Transform

A one-hot encoding is appropriate for categorical data where no relationship exists between categories.

The scikit-learn library provides the OneHotEncoder class to automatically one hot encode one or more variables.

By default the OneHotEncoder will output data with a sparse representation, which is efficient given that most values are 0 in the encoded representation. We will disable this feature by setting the “sparse_output” argument (named “sparse” before scikit-learn 1.2) to False so that we can review the effect of the encoding.
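
For comparison, a short sketch of the default sparse output is shown below, using the color data from earlier. Only the non-zero entries are stored, and toarray() converts the result to a dense array when needed.

```python
# sketch: inspect the default sparse output of the OneHotEncoder
from numpy import asarray
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder
data = asarray([['red'], ['green'], ['blue']])
# default settings return a SciPy sparse matrix
encoded = OneHotEncoder().fit_transform(data)
print(issparse(encoded), encoded.shape, encoded.nnz)
# convert to a dense NumPy array for review
dense = encoded.toarray()
print(dense)
```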

Once defined, we can call the fit_transform() function and pass it our dataset to create a one-hot encoded version of our dataset.

...
# one hot encode input variables (sparse_output was named sparse before scikit-learn 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
X = onehot_encoder.fit_transform(X)

As before, we must label encode the target variable.

The complete example of creating a one-hot encoding transform of the breast cancer dataset and summarizing the result is listed below.

# one-hot encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# one hot encode input variables (sparse_output was named sparse before scikit-learn 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
X = onehot_encoder.fit_transform(X)
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows to remain the same, but the number of columns to dramatically increase.

As expected, in this case, we can see that the number of variables has leaped up from 9 to 43 and all values are now binary values 0 or 1.

Input (286, 43)
[[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.]]
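
The number of output columns is the total number of unique labels across all input variables. A small sketch on made-up two-column data shows how this can be read from the fitted encoder's “categories_” attribute.

```python
# sketch: relate the number of one-hot columns to the categories per variable
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# made-up data with two categorical columns (assumption for illustration)
data = asarray([['red', 'small'], ['green', 'large'], ['blue', 'small'], ['red', 'medium']])
encoder = OneHotEncoder()
encoded = encoder.fit_transform(data).toarray()
# number of unique labels found in each input column
n_per_column = [len(c) for c in encoder.categories_]
print(n_per_column, sum(n_per_column))
print(encoded.shape)
```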

Next, let’s evaluate machine learning on this dataset with this encoding as we did in the previous section.

The encoding is fit on the training set then applied to both train and test sets as before.

...
# one-hot encode input variables
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)

Tying this together, the complete example is listed below.

# evaluate logistic regression on the breast cancer dataset with a one-hot encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# one-hot encode input variables
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

In this case, the model achieved a classification accuracy of about 70.53 percent, which is slightly worse than the ordinal encoding in the previous section.

Accuracy: 70.53

Common Questions

This section lists some common questions and answers when encoding categorical data.

Q. What if I have a mixture of numeric and categorical data?

Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.

Alternatively, you can use the ColumnTransformer to conditionally apply different data transforms to different input variables.
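
As a rough illustration of the ColumnTransformer approach, the sketch below applies a one-hot encoding to a nominal column, an ordinal encoding to a ranked column, and passes a numeric column through unchanged. The small DataFrame, its column names, and the chosen category order are made-up assumptions for the example only.

```python
# sketch: apply different encodings to different columns with a ColumnTransformer
# (the data and column names are hypothetical)
from pandas import DataFrame
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

df = DataFrame({
    'age': [25, 40, 31, 58],
    'color': ['red', 'green', 'blue', 'green'],
    'place': ['first', 'third', 'second', 'first'],
})

transformer = ColumnTransformer(
    transformers=[
        # nominal variable: one binary column per category
        ('nominal', OneHotEncoder(), ['color']),
        # ordinal variable: integer codes in an explicitly stated rank order
        ('ordinal', OrdinalEncoder(categories=[['first', 'second', 'third']]), ['place']),
    ],
    # numeric columns not listed above are kept as-is
    remainder='passthrough',
    # always return a dense array so the result is easy to inspect
    sparse_threshold=0,
)
X_prepared = transformer.fit_transform(df)
print(X_prepared.shape)
print(X_prepared[0])
```

The output columns follow the order of the transformers list, with the passed-through columns last, so here the three color columns come first, then the place code, then age.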

Q. What if I have hundreds of categories?

Or, what if I concatenate many one-hot encoded vectors to create a many-thousand-element input vector?

One-hot encoding can be used with thousands or even tens of thousands of categories. Large input vectors may sound intimidating, but models can generally handle them.
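
One reason large one-hot encodings are often manageable is that the OneHotEncoder returns a sparse matrix by default, which stores only the non-zero entries. The sketch below illustrates this with a single column of 10,000 distinct labels; the labels are generated purely for the example.

```python
# sketch: one-hot encode a single column with 10,000 distinct categories
# (the labels are generated for illustration only)
from numpy import arange
from sklearn.preprocessing import OneHotEncoder

# one row per distinct label, shaped as a 2D array with one column
X_many = arange(10000).astype(str).reshape(-1, 1)
# the default output is a SciPy sparse matrix
encoded_many = OneHotEncoder().fit_transform(X_many)
print(type(encoded_many))
print(encoded_many.shape)
# number of stored (non-zero) values: one per row
print(encoded_many.nnz)
```

Calling toarray() on a result like this would allocate the full dense matrix, which is where memory problems tend to appear with very high cardinality.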

Q. What encoding technique is the best?

This is unknowable in advance.

Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.
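
One way to run such a comparison is to wrap each encoder and the model in a Pipeline and score it with cross-validation. The following is a minimal sketch on synthetic data (the data, the variable names, and the choice of logistic regression are assumptions for illustration); it shows the mechanics only and says nothing about which encoding will work best on your data.

```python
# sketch: compare encoding schemes with the same model using cross-validation
# (synthetic data, for illustration only)
from numpy import mean
from numpy.random import RandomState
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

# three categorical input columns with four unordered labels each
rng = RandomState(1)
X_syn = rng.choice(['a', 'b', 'c', 'd'], size=(200, 3))
# the target depends on one specific label in the first column
y_syn = (X_syn[:, 0] == 'b').astype(int)

encoders = {
    'ordinal': OrdinalEncoder(),
    'one-hot': OneHotEncoder(handle_unknown='ignore'),
}
results = {}
for name, encoder in encoders.items():
    # the encoder is re-fit on the training folds only within each split
    pipeline = Pipeline([('encode', encoder), ('model', LogisticRegression())])
    scores = cross_val_score(pipeline, X_syn, y_syn, scoring='accuracy', cv=5)
    results[name] = mean(scores)
    print('%s: %.2f' % (name, results[name] * 100))
```

Using a Pipeline here keeps the encoding inside each cross-validation fold, which matches the practice of fitting the encoding on the training data only.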

Summary

In this tutorial, you discovered how to use encoding schemes for categorical machine learning data.

Specifically, you learned:

Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
How to use ordinal encoding for categorical variables that have a natural rank ordering.
How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

82 Responses to Ordinal and One-Hot Encodings for Categorical Data

Yuhou June 12, 2020 at 5:26 pm #

“This type of categorical variable is called an nominal variable because the values can be ordered or ranked.” Should here be ordinal variable instead of nominal variable

Reply
- Jason Brownlee June 13, 2020 at 5:52 am #
  
  Thanks, fixed!
  
  Reply
Jérôme Plumecoq June 12, 2020 at 5:55 pm #

Hi,

« The “place” variable above does have a natural ordering of values. This type of categorical variable is called an nominal variable because the values can be ordered or ranked. ».

You mean « ordinal variable » not « nominal variable » ?

Reply
- Jason Brownlee June 13, 2020 at 5:53 am #
  
  Yes, fixed. Thanks for catching this.
  
  Reply
Andy June 14, 2020 at 6:35 pm #

An interesting discussion on “Why OneHotEncoder not get_dummies?” in sklearn can be found here: https://stackoverflow.com/questions/36631163/what-are-the-pros-and-cons-between-get-dummies-pandas-and-onehotencoder-sciki

Reply
- Jason Brownlee June 15, 2020 at 6:02 am #
  
  Thanks for sharing.
  
  Reply
Nilesh Saratkar June 18, 2020 at 3:15 pm #

In case of the OrdinalEncoding, I have observed shift in the encoded value after introducing new value E.g. data = asarray([[‘orange’], [‘red’], [‘green’], [‘blue’]]). In real life scenario, additional values should not shuffle existing encoded values between training and test data sets. Can we use some hashing technique to generate consistent encoded values?

Reply
- Jason Brownlee June 19, 2020 at 6:09 am #
  
  You can specify the full list of expected values up front and save the object for use on all future data.
  
  Or you can write some custom code to handle the mapping.
  
  Reply
Nat June 20, 2020 at 7:25 am #

Hi,
When I try to do the dummy encoding with argument drop=’first’ I get this error: __init__() got an unexpected keyword argument ‘drop’.

I checked the syntax documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html?highlight=onehotencoder#sklearn.preprocessing.OneHotEncoder

The code seems to be right, but I still get the error. Can you please help?

Thanks so much for your tutorials – they are great!

Reply
- Jason Brownlee June 21, 2020 at 5:57 am #
  
  Perhaps confirm that your version of scikit-learn is up to date?
  
  Reply
  - Nat June 21, 2020 at 8:28 am #
    
    Ugh. SO sorry. I thought I was on the latest version. I updated and the dummy encoding code works now. Thanks for taking the time to get back to me!
    
    Reply
    - Jason Brownlee June 22, 2020 at 6:08 am #
      
      Well done!
      
      No problem at all.
      
      Reply
Saheed Yakub June 21, 2020 at 1:09 am #

Please, i am trying to fit the OrdinalEncoder on the training dataset and use it to transform the train and test datasets as follows;
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)

I have this error
————————————————————————–
ValueError Traceback (most recent call last)
in ()
2 ordinal_encoder.fit(x_train)
3 x_train=ordinal_encoder.transform(x_train)
—-> 4 x_test = ordinal_encoder.transform (x_test)

1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
122 msg = (“Found unknown categories {0} in column {1}”
123 ” during transform”.format(diff, i))
–> 124 raise ValueError(msg)
125 else:
126 # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories [69.0, 70.0, 71.0, 72.0] in column 0 during transform

What can be the way out? I appreciate your time.

Reply
- Jason Brownlee June 21, 2020 at 6:27 am #
  
  Sorry to hear that you are having trouble, this may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
- Saket Agrawal March 21, 2021 at 1:55 am #
  
  One reason for this problem is a category value exists in the test data set which is not available in the training set. Hence encoder has not seen this value during the fit and hence doesn’t know how to encode it. Few different options to handle such scenario
  
  1. Make use of the handle_unknown parameter, refer OrdinalEncoder documentation.
  2. Make use of categories parameter, refer OrdinalEncoder documentation. I personally prefer this method as this gives complete control and also allows consistency between train and test.
  
  However in your case error message is ValueError: Found unknown categories [69.0, 70.0, 71.0, 72.0] in column 0 during transform
  
  this gives me a hint that the first column in your data set is a sequence number; check whether this should really be part of the feature set. If this is a simple sequence number, there is absolutely no way you can consider it categorical data.
  
  Reply
  - Jason Brownlee March 21, 2021 at 6:10 am #
    
    Typically unseen categories are labeled as all zeros.
    
    Reply
Prerna July 7, 2020 at 7:56 am #

Hi!
One of the categorical variables I am dealing with is, employment length, that has 511193 distinct categories. I am getting a memory error ‘Unable to allocate 243. GiB for an array with shape (511193, 511193)’. How to deal with this. I have come across articles which say that one hot encoding can’t process high cardinality and would give misleading results. Kindly help! Thanks

Reply
- Jason Brownlee July 7, 2020 at 1:57 pm #
  
  Perhaps try an ordinal encoding.
  
  Also, perhaps try comparing results to an embedding and a hash.
  
  More suggestions here:
  https://machinelearningmastery.com/faq/single-faq/how-do-i-handle-a-large-number-of-categories
  
  Reply
Ale July 9, 2020 at 6:45 am #

Hi Jason,

So since the one-hot encoding may cause problems with linear regression. Can I process the data as dummy variable and this won’t affect whether I use it for tree-based methods or linear models?

Reply
- Jason Brownlee July 9, 2020 at 1:18 pm #
  
  Yes I believe so. Test to confirm if you’re concerned.
  
  Reply
Pelumi Soyombo September 22, 2020 at 2:04 am #

Hi Jason.

I’m a beginner at using machine learning for predictions.

I already followed your code on using OrdinalEncoder for the breast-cancer dataset. How do I input new data for predictions (apart from the test dataset) after encoding with OrdinalEncoder?

Reply
- Jason Brownlee September 22, 2020 at 6:53 am #
  
  This can help you make a prediction with new data for your model/pipeline:
  https://machinelearningmastery.com/make-predictions-scikit-learn/
  
  Reply
Usman Aileru September 22, 2020 at 2:40 am #

Hi Jason.

I’m a beginner at using machine learning for predictions.

I already followed your code on using OrdinalEncoder for the breast-cancer dataset. How do I get predictions for new data unknown to the dataset (i.e. data different from the test dataset) after encoding with OrdinalEncoder?

Reply
- Jason Brownlee September 22, 2020 at 6:53 am #
  
  Good question, see this:
  https://machinelearningmastery.com/make-predictions-scikit-learn/
  
  Reply
  - Usman Aileru September 22, 2020 at 8:07 pm #
    
    Thanks for your reply.
    
    I figured this out but my question was how do I make predictions for new categorical dataset. Examples given in the link you provided were on numbers already.
    
    But say I am working on the breast-cancer.csv dataset, how do I predict the outcome for new categorical data unknown to the dataset?
    
    Reply
    - Jason Brownlee September 23, 2020 at 6:36 am #
      
      The approach for making predictions is the same regardless of the data types. E.g. call model.predict()
      
      If you are using encodings, then wrap them up into a pipeline and call pipeline.predict() and pass the raw data as input.
      
      Reply
      - Usman Aileru September 23, 2020 at 9:45 pm #
        
        Many thanks. I figured it out.
      - Jason Brownlee September 24, 2020 at 6:14 am #
        
        Well done.
MFT October 10, 2020 at 12:40 pm #

What if you need a default value to cover a categorical level that is unforeseen? Dummy encoding is very good for that, but I think some discussion of the matter could be included.

Reply
- Jason Brownlee October 10, 2020 at 1:55 pm #
  
  You can set the handle_unknown argument and specify how to handle labels unseen during training.
  
  Reply
Seema Patel October 26, 2020 at 7:50 pm #

”’In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, ‘at least as a point of reference with other encoding schemes.’ ”’ Can you elaborate this sentence?

Reply
- Jason Brownlee October 27, 2020 at 6:43 am #
  
  Yes, you can choose to specify the ordinal relationship between the labels in some of the variables if you wish. This would likely improve the performance of the model fit on an ordinal encoding.
  
  Reply
Wiktor December 18, 2020 at 6:14 pm #

Running the code of “evaluate logistic regression on the breast cancer dataset with an ordinal encoding” a few time gives the same accuracy, because of random_state=1 during splitting to train and test sets. After setting random_state=None you can get an error like:
ValueError: Found unknown categories [“’24-26′”] in column 3 during transform
I suppose that we should make ordinal encoding of input variables before splitting. Modified code:

for i in range(10):
    # ordinal encode input variables
    ordinal_encoder = OrdinalEncoder()
    ordinal_encoder.fit(X)
    X = ordinal_encoder.transform(X)
    # split the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=None)
    # ordinal encode target variable
    label_encoder = LabelEncoder()
    label_encoder.fit(y_train)
    y_train = label_encoder.transform(y_train)
    y_test = label_encoder.transform(y_test)
    # define the model
    model = LogisticRegression()
    # fit on the training set
    model.fit(X_train, y_train)
    # predict on test set
    yhat = model.predict(X_test)
    # evaluate predictions
    accuracy = accuracy_score(y_test, yhat)
    print(f'{i + 1}. Accuracy: {accuracy*100:.2f} %')

works fine, giving various results in subsequent runs, for example:

1. Accuracy: 64.21 %
2. Accuracy: 77.89 %
3. Accuracy: 75.79 %
4. Accuracy: 77.89 %
5. Accuracy: 76.84 %
6. Accuracy: 68.42 %
7. Accuracy: 75.79 %
8. Accuracy: 73.68 %
9. Accuracy: 71.58 %
10. Accuracy: 80.00 %

Reply
- Jason Brownlee December 19, 2020 at 6:15 am #
  
  Thanks for sharing.
  
  Reply
Pandas December 19, 2020 at 6:01 am #

You transform target with Ordinal encoder and then fit the logistic regression model.
Is this process called ordinal logistic regression or is it something different?
Thanks in advance

Reply
- Jason Brownlee December 19, 2020 at 6:23 am #
  
  No, just “logistic regression”.
  
  Reply
amin December 25, 2020 at 1:58 am #

hi can u pls say after using one hot encoding and making our model with logistic regression or other classification algorithm how we can calculate the categorical data coefficient? coz after using one hot encoding a categorical variable will change to for example 3 columns and variable we should calculate coefficient of each columns that depend of the categorical data seperately and then add to each other or no how is it? i have different type of variables and my goal is show that which variable has most effect on my target variable(output) after build the classification model

Reply
- Jason Brownlee December 25, 2020 at 5:25 am #
  
  What is the “categorical data coefficient”?
  
  Reply
SULAIMAN KHAN February 24, 2021 at 8:51 pm #

ValueError Traceback (most recent call last)
in ()
59 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
60 # prepare input data
—> 61 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
62 # prepare output data
63 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

in prepare_inputs(X_train, X_test)
38 # encode
39 train_enc = le.transform(X_train[:, i])
—> 40 test_enc = le.transform(X_test[:, i])
41 # store
42 X_train_enc.append(train_enc)

~\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py in transform(self, y)
131 if len(np.intersect1d(classes, self.classes_)) 133 raise ValueError(“y contains new labels: %s” % str(diff))
134 return np.searchsorted(self.classes_, y)
135

ValueError: y contains new labels: [‘102’ ‘120’ ‘137’ ‘145’ ‘147’ ’15’ ‘159’ ‘174’ ‘195’ ‘208’ ‘214’ ‘222’
‘223’ ’23’ ’24’ ‘247’ ‘248’ ‘259’ ‘264’ ‘286’ ‘289’ ‘290’ ‘291’ ‘298’
‘301’ ‘325’ ‘328’ ‘331’ ’34’ ‘341’ ‘354’ ‘361’ ‘367’ ‘378’ ‘401’ ‘402’
‘410’ ‘426’ ‘441’ ‘446’ ‘452’ ’46’ ‘468’ ‘485’ ‘489’ ‘5’ ‘510’ ‘512’
‘513’ ‘520’ ‘533’ ’54’ ‘550’ ‘579’ ‘586’ ‘587’ ’59’ ‘603’ ‘605’ ’73’ ’79’
’89’ ’90’ ’99’]
###################
I have multiclassification labe

Reply
- Jason Brownlee February 25, 2021 at 5:31 am #
  
  Sorry to hear that you’re having trouble, these tips may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Tanzila March 2, 2021 at 8:11 pm #

How to handle missing values in One Hot Encoder?

I am trying to apply one hot encoding to 2D numpy array containing missing value ‘N’. I have the following code, which works well without missing value.

import numpy as np
X = [[‘A’, ‘G’], [‘N’, ‘C’], [‘T’, ‘A’]]
X = np.array(X)
print(‘Shape of data before one hot encoding: ‘, X.shape)
classes = np.array([‘A’, ‘C’, ‘G’, ‘T’])
X = np.searchsorted(classes, X)
eye = np.eye(classes.shape[0])
onehotlabels = np.concatenate([eye[i] for i in X.T], axis=1)
print(‘Shape of data after pre-processing: ‘, onehotlabels.shape)
print(onehotlabels)

Output:
Shape of data before one hot encoding: (3, 2)
Shape of data after pre-processing: (3, 8)
[[1. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0. 1. 0. 0.]
[0. 0. 0. 1. 1. 0. 0. 0.]]

Is it possible to replace ‘N’ by 0 0 0 0 by using this approach? The output will look like the following:
[[1. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 1. 1. 0. 0. 0.]]

Reply
- Jason Brownlee March 3, 2021 at 5:33 am #
  
  You can set “handle_unknown” to “ignore”.
  https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
  
  Reply
Ruhul March 16, 2021 at 12:44 pm #

Hi Jason,

Thanks for wonderful article.

I am facing one doubt and not getting answer anywhere.

Suppose my target variable is Category Fruit – Apple, Orange, Mango, Banana.

There is no ordinal relationship between them. Can I one hot encode them or use ordinal encoder.

Everywhere (including other websites). I can find that target variable is converted into Apple -1, Orange-2, Mango-3, Banana-4. Sometimes column with labels 1, 2,3,4 is already created.

My query – 1) If there is no relationship exists between Apple, Orange, Mango, Banana. So should we use One Hot encoding /Dummy Variable or Ordinal encoders for converting target variable. Most of the cases I found Ordinal encoder used.

If I use One Hot encoding /Dummy Variable encoding for target variable Apple, Orange, Mango, Banana. Then there will be many columns created. I think this is not appropriate.

Is it the reason for creating single column for target – Ordinal encoders are used.

2) So if we use one One Hot encoding /Dummy Variable creation for target variable with categories , then which algo will be suitable for prediction?

Reply
- Jason Brownlee March 17, 2021 at 5:56 am #
  
  One hot encode sounds appropriate.
  
  Yes, many columns will be created, it is typically not a problem, even with tens of thousands of new columns.
  
  Reply
AGGELOS PAPOUTSIS April 11, 2021 at 12:04 am #

Hi,

As I understand it, the label encoder is for a categorical target variable (i.e. y). Does it make sense to use the label encoder for the independent features (i.e. X)?

**I JUST BOUGHT YOUR BOOK “DATA PREPARATION FOR MACHINE LEARNING”

THIS IS A FANTASTIC BOOK

Reply
- Jason Brownlee April 11, 2021 at 4:52 am #
  
  Yes, you can use the label encoder on inputs, but we now have the “ordinal encoder” that does the same job and is designed to work with inputs.
  
  Thanks!!!
  
  Reply
  - AGGELOS PAPOUTSIS April 11, 2021 at 6:54 pm #
    
    ok, thanks. I use the label encoder as the technique do not require 2d data. When I tried to use the ordinal encoder I had to reshape the data.
    
    Reply
    - Jason Brownlee April 12, 2021 at 5:05 am #
      
      Yes, this is a key difference.
      
      Reply
Tran Quoc Lap April 20, 2021 at 11:43 am #

Great technical writing!!!
I wonder which should be done first: data cleaning or data transform?
I think data cleaning includes subtasks such as outlier detection where values of columns are expected to be numeric.

Reply
- Jason Brownlee April 21, 2021 at 5:51 am #
  
  Thanks.
  
  Often both are iterated as you learn more about the data/problem, but generally cleaning then transforms.
  
  Reply
Lobelie June 26, 2021 at 6:14 am #

Hello,

If I have categorical and nominal variables, can I encode the categorical with one-hot and the nominal ones with label encoding ?

Reply
- Jason Brownlee June 27, 2021 at 4:33 am #
  
  Sure. Use whatever works best for your data.
  
  Reply
Daniel July 26, 2021 at 5:37 am #

Hi,
Quoting your article”
“If new data contains categories not seen in the training dataset, the “handle_unknown” argument can be set to “ignore” to not raise an error, which will result in a zero value for each label.”

I guess that because we fit and transform using certain OneHotEncoder on the training set, there can be a case where the testing set has features that haven’t been OneHotEncoded and that’s what you wrote in your blog.
Wouldn’t setting “handle_unknown”=’ignore’ in such a case cause a loss of accuracy in the predictions/classifications?
Is there another way to deal with that issue, without causing Data Leakage?

Regards

Reply
- Jason Brownlee July 27, 2021 at 5:02 am #
  
  It may or may not impact model performance depending on the model and the dataset.
  
  A better approach is to have a more complete training dataset.
  
  Reply
Talieh July 30, 2021 at 3:03 am #

Hi Jason,

Your posts are always my first ‘go-to’ for my ML/AI-related questions! Thanks for your contribution to the community.

My question is, in case of tree-based models, is it best practice to use OHE, or Dummy encoding? The tree-based algorithms build a tree based on all available variables and we if drop one of the levels, the tree will never see that variable and will not use it in the splits. I would love to know your thoughts on this.

Thanks.

Reply
- Jason Brownlee July 30, 2021 at 6:31 am #
  
  Perhaps try both and see what works well or best for your specific dataset and model.
  
  Reply
geoffrey laforest August 14, 2021 at 12:48 am #

Very interesting article.

I’m still missing something. I’m actually trying to do logistic regression from scratch on a Fractures Dataset and I have several categorical columns to encode myself.

You say that DummyEncoding avoid the problem (when doing linear regression ok, maybe it is less important with logistic regression so ?) of singular matrix that comes with one hot encoding, and it works with n-1 columns.

But if I want to encode the sex column, only possibility is 1 for male 0 for female (or the contrary). If I apply the logic of DummyEncoding, we go back to the problem of ordinal relations (that does not existe here between the different values of a categorical variable) that are not relevant here and that we tried to avoid using precisely a One Hot Encoding ?

Then, I have a column “medication” with “no medication” “medication #1” “medication #2” values, here I can perfectly proceed as you did with the colors and one hot encoding right ?

Finally, where can I find more information about the exact mathematical reasons that make ordinal encoding bad when no ordinal relations exist and conversly make one hot encoder works good ?

Thanks a lot,

Geoffrey (from France)

Reply
- Adrian Tam August 14, 2021 at 3:36 am #
  
  For why ordinal encoding is bad when no ordinal relations exists: The levels of measurement are nominal, ordinal, interval, and ratio, in order of increasing information. If you do use ordinal number while no ordinal exists, you are introducing non-exist information to the machine to learn, which is essentially noise and confused the model. Think what happened if we introduce the phone number (which is nominal data) together with how many hours students worked (which is ratio data) to predict their test score.
  
  Reply
geoffrey laforest August 14, 2021 at 1:04 am #

I forgot, could you detail why the matrix becomes singular with one hot encoding ? the problems comes if we have n >=3, n being the number of different values in a categorical feature, right ? as with n = 2, we have 0 in a colum and 1 in the other, but with 3, 2 columns can have the same 0 value and I suppose the problems begin here ?
Yet I still have difficulties to understand why matrix becomes singular. For example in the example you give with the colors, the matrix is not singular, although you did not use a dummy encoding. The only way I see it could be singular is if we have a colum full of zeros but it would mean the value never appears so it’s kind odd…

More broadly, in linear regression, we can have 2 identical examples (for example 2 persons with the same age and height and the weight target is the same), our matrix would be singular right ?

Could you expand on these points, if you don’t mind? I find them so interesting but I have difficulty figuring them out.

Thank you again

Geoffrey

Reply
- Adrian Tam August 14, 2021 at 3:38 am #
  
  Mathematically, a matrix is singular if you cannot inverse it. A matrix with a lot of zero (which in case of one-hot encoding) make it very difficult to inverse (what number multiplies zero is one?) and hence very easy become a singular matrix. Linear algebra is a very broad topic but it is very useful in the theory side of machine learning. Hope this short answer can help you move forward.
  
  Reply
Khaled August 19, 2021 at 7:12 am #

Hello, I have a target vector with 6 different values; after one-hot encoding I would have 6 target vectors. I want to train a model with a kernel machine and I can only have one target vector at a time for training. So I would train 6 times, once for each of the 6 target vectors. Now, how could I combine these 6 results to get the prediction and the accuracy?

Thanks.

Reply
- Adrian Tam August 20, 2021 at 1:06 am #
  
  In your approach, probably you need to have 6 models, each gives a binary (sigmoidal activation at output, in range of 0 to 1) classification for each target value, then compare which model gives the highest floating point score and use it as the prediction.
  Or a more common approach, one model with 6 output, each is a sigmoidal activation, and apply softmax to give a probability to each target. In this case, your output is not 6 target vectors, but one matrix of 6 columns.
  
  Reply
Hosna September 8, 2021 at 6:15 pm #

First of all, thank you so much. I’ve learned a lot from your website.
I like to ask a question about the concept. Is ordinal data a type of nominal data that has order or both of them are branches of categorical data?

Reply
- Adrian Tam September 9, 2021 at 4:31 am #
  
  Nominal data can be considered categorical. But ordinal data is not, since the order means something, not just names.
  
  Reply
Crispin October 26, 2021 at 5:08 am #

Hi Jason, I’m carrying out binary classification with a range of models: some sci kit learn ones, some transformer models and some deep learning neural nets.

For my sci kit learn and transformer models, they both run fine when using get_dummies. For my Deep Learning model, when it outputs an array shape [3625, ] the model fails in terms of accuracy and loss (with a single output neuron). When get_dummies produces an array [3625,2] the model works well, do you know why this could be?

Many thanks

Reply
- Adrian Tam October 27, 2021 at 2:54 am #
  
  Sorry, not clear on what you’re describing. Are you saying the model failed because your prediction output and the training output are in different shape?
  
  Reply
Carlos Pires November 17, 2021 at 8:35 pm #

Would it be wise to apply one hot encoding or dummy encoding before pca or pls?
I have a mixed dataset with 2 categorical variables, with 3 and 2 categories respectively

Reply
- Adrian Tam November 18, 2021 at 5:45 am #
  
  PCA is part of the exploratory step to try to understand data. In that case, you may want to run one-hot encoding to make one nominal feature into multiple to help PCA.
  
  Reply
Eddie November 19, 2021 at 2:02 pm #

Hi Adrian,

Thanks for the nice and detailed article! A quick question, do you think we need to normalize the original encoded variable (0, 1, 2, 3) to a range between 0 and 1? the reason I’m asking is that I have continuous variables in large scales that need to be normalized to 0 and 1 before feeding to the model. Should the same normalization be done after ordinal encoding the categories with ordinal relationships?

Reply
- Adrian Tam November 20, 2021 at 1:49 am #
  
  I would suggest you to learn about the level of measurement: https://en.wikipedia.org/wiki/Level_of_measurement
  We do one-hot encoding because the data are nominal, and we do normalization because the data is interval or ratio. If you try it out, doing a wrong preprocessing would probably inject noise to the data and confuse your machine learning model.
  
  Reply
Eddie November 20, 2021 at 3:07 pm #

My categorial variable does have the ordinal relationship (e.g. low, medium and high). After ordinal encoding, it is converted to 0, 1 and 2), my question is, should we further normalize it to a range between 0 and 1 (e.g., 0, 0.5, 1)

Reply
- Adrian Tam November 21, 2021 at 4:01 am #
  
  I don’t think 0 to 2 and 0 to 1 are vastly different. So in this case, it should be optional.
  
  Reply
mark December 23, 2021 at 3:10 pm #

Hi Jason, Thank you so much for your helpful and informative series of posts.

I have a question about your statement above regarding when to implement encoding, that says:

“The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.”

Why would this be true? It seems that some feel that there is risk of “data leakage” if the full dataset is encoded prior to splitting into train and test. But encoding is different from imputation. Imputation uses statistics from the dataset, so there is a risk of data leakage, for sure. But encoding is simply an alternate expression of the space of possible values of feature. It doesn’t care about the actual values found in train and test, it only cares about the space of possible values, right? Therefore I think that likely is no risk of data leakage in encoding as it does not care about the specific values found in the train and test sets. It only cares about the space of possible values.

This distinction is important, because if one tries, for example, to fit OrdinalEncode on train data only, which happens to be missing one of the possible values and this missing value appears in the test data we get a exception. But this can be avoided by encoding the full dataset prior to splitting.

Perhaps I’m not thinking about this clearly? I’d welcome your thoughts and perspective.

Thanks, mark

Reply
- James Carmichael February 28, 2022 at 12:20 pm #
  
  Hi Mark…Kindly narrow your query to a single question so that we can better assist you.
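On the unseen-category exception described above, here is a minimal sketch of two ways to avoid it while still fitting on the training data only, assuming scikit-learn 0.24 or later. The first declares the full set of possible values up front through the `categories` argument. The second maps anything unseen to a sentinel code. The toy arrays and the sentinel value -1 are illustrative choices.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: "high" never appears in the training split
X_train = np.array([["low"], ["medium"], ["low"]], dtype=object)
X_test = np.array([["high"], ["medium"]], dtype=object)

# Option 1: declare the known space of values; no test data is inspected
declared = OrdinalEncoder(categories=[["low", "medium", "high"]])
declared.fit(X_train)
encoded = declared.transform(X_test)

# Option 2: learn categories from train, map unseen values to a sentinel
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tolerant.fit(X_train)
fallback = tolerant.transform(X_test)

print(encoded.ravel(), fallback.ravel())
```

Option 1 matches the argument that the encoding depends only on the space of possible values. Option 2 is useful when that space is not known in advance.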
  
  Reply
Amparo November 13, 2022 at 11:05 pm #

Hello.

If I fit a linear regression model, I can include the year as a categorical variable (instead of a numerical one) in order to estimate a different coefficient for each year. This is commonly done when the effect of the variable is not linear or when you want to detect something unusual in a specific year…

When you do this with common regression packages in R, they automatically take one of the levels (classes) as the reference and calculate the coefficients with respect to this base level. It can also be done taking the grand mean of all levels as the reference (I think this is included in packages called contrasts, but it is related to the way we code dummy variables).

My question is…
What if I want to fit the model so that each year’s coefficient refers to the previous year instead of a common reference? How would you code it?

Reply
- James Carmichael November 14, 2022 at 3:36 pm #
  
  Hi Amparo…Our content is based upon Python. The following resource provides an implementation of regression from scratch.
  
  https://machinelearningmastery.com/implement-linear-regression-stochastic-gradient-descent-scratch-python/
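On coding each year relative to the previous one: one option is a cumulative ("thermometer") dummy coding, where column j is 1 for every observation at or after the j-th year. With an intercept, each fitted coefficient is then the difference between that year's mean and the previous year's mean. Below is a minimal NumPy sketch. The helper name `cumulative_dummies` and the toy numbers are made up for illustration.

```python
import numpy as np

def cumulative_dummies(values, levels):
    """Column j is 1 when the value is at or beyond levels[j + 1]."""
    index = np.array([levels.index(v) for v in values])
    thresholds = np.arange(1, len(levels))
    return (index[:, None] >= thresholds[None, :]).astype(float)

# Illustrative data: group means are 11 (2019), 16 (2020), 15 (2021)
years = [2019, 2019, 2020, 2020, 2021, 2021]
y = np.array([10.0, 12.0, 15.0, 17.0, 14.0, 16.0])
levels = [2019, 2020, 2021]

# Design matrix: intercept plus cumulative dummies
X = np.column_stack([np.ones(len(years)), cumulative_dummies(years, levels)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# coef is approximately [11, 5, -1]:
# the 2019 mean, then 2020 minus 2019, then 2021 minus 2020
print(coef)
```

This gives the same fitted values as ordinary dummy coding. Only the interpretation of the coefficients changes.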
  
  Reply
Jason December 16, 2022 at 9:53 am #

What is this and what does it do — what and why am I separating into input and output columns? It just appears in the problem with no explanation. Can you please clarify?

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

Reply
- James Carmichael December 17, 2022 at 8:00 am #
  
  Hi Jason…You may find the following helpful:
  
  https://machinelearningmastery.com/start-here/#dataprep
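To make the quoted lines concrete: the loaded array has one row per example, and in this tutorial the last column holds the class label while the other columns hold the input features. The slicing separates the two so that the inputs and the target can be encoded and passed to the model separately. Here is a minimal sketch with a tiny made-up array standing in for the dataset.

```python
import numpy as np

# Illustrative stand-in for the loaded dataset: last column is the label
data = np.array([
    ["40-49", "premeno", "no-recurrence-events"],
    ["50-59", "ge40", "recurrence-events"],
    ["60-69", "ge40", "no-recurrence-events"],
])

# All rows, every column except the last: the input features
X = data[:, :-1].astype(str)
# All rows, last column only: the output (target) variable
y = data[:, -1].astype(str)

print(X.shape, y.shape)  # (3, 2) (3,)
```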
  
  Reply
Dani February 4, 2024 at 4:17 pm #

I have a question. Why does my model give better results when using an ordinal encoder even though the feature is not ordinal? What is happening? And should I keep using one-hot encoding even when the model performs worse?

thanks in advance

Reply
- James Carmichael February 5, 2024 at 7:49 am #
  
  Hi Dani…Ordinal encoding is a technique used in machine learning to convert categorical data, which cannot be directly processed by most algorithms, into numerical format. This method is particularly suited for handling categorical variables where the categories have a natural, ordered relationship. Here are several scenarios when ordinal encoding is particularly useful:
  
  1. **Ordered Categories**: When the categorical variable represents an inherent order or ranking. For example, educational levels (e.g., High School < Bachelor's < Master's < PhD), rating scales (e.g., poor < fair < good < excellent), or sizes (e.g., small < medium < large). Ordinal encoding preserves the order information that can be leveraged by the machine learning model.
  2. **Efficiency**: Ordinal encoding converts categories into simple numeric codes, which can be more efficient in terms of memory and computational cost compared to other encoding methods, such as one-hot encoding, especially when the number of categories is large but their order is important.
  3. **Model Compatibility**: Some machine learning models can exploit the ordinal nature of the encoded features to improve performance. For example, tree-based models (like decision trees and gradient boosting machines) can benefit from ordinal encoding as they can make decisions that respect the order of the encoded categories.
  4. **Dimensionality Reduction**: Unlike one-hot encoding, which increases the feature space dimensionality with each additional category (leading to the "curse of dimensionality" in some cases), ordinal encoding keeps the feature space more compact. This is particularly beneficial when dealing with high-cardinality categorical variables with a clear ranking.
  5. **Simplicity and Interpretability**: In some cases, ordinal encoding can make the model's decisions more interpretable since the numerical representation directly corresponds to the order in the categorical variable. This can be advantageous in fields where model interpretability is crucial, such as finance and healthcare.
  
  However, it's important to use ordinal encoding judiciously, as it introduces an assumption of order that may not always be appropriate or may oversimplify the relationships within the data. For categorical variables without a natural order, other encoding techniques like one-hot encoding, target encoding, or binary encoding might be more suitable. Additionally, when using ordinal encoding, the choice of the numerical values can impact the model's performance, so it's essential to ensure that the encoding accurately reflects the underlying order among the categories.
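One practical way to decide between the two encodings is to evaluate both under the same cross-validation procedure and model. The sketch below, assuming scikit-learn is installed, uses synthetic nominal data and a linear model. The data, the target rule, and the choice of LogisticRegression are illustrative assumptions, and results on a real dataset may differ, particularly with tree-based models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic nominal feature; the target depends on the category, not on any order
rng = np.random.default_rng(1)
colors = rng.choice(["red", "green", "blue", "yellow"], size=400)
X = colors.reshape(-1, 1)
y = np.isin(colors, ["green", "yellow"]).astype(int)

# Evaluate each encoder inside a pipeline so it is fit on training folds only
scores = {}
for name, encoder in [
    ("ordinal", OrdinalEncoder()),
    ("one-hot", OneHotEncoder(handle_unknown="ignore")),
]:
    model = make_pipeline(encoder, LogisticRegression())
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

print(scores)
```

In this setup the linear model cannot separate the classes from a single arbitrary integer code, so one-hot encoding scores higher. A tree-based model can split on the integer codes repeatedly, which is one reason ordinal encoding sometimes performs as well or better on nominal features.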
  
  Reply
Petros October 3, 2024 at 11:44 pm #

Hi, nice article. Is there any way to use these Column Transformers as alternative steps in GridSearchCV or RandomizedSearchCV? I mean, how can we fit them only to the training data, since the hyperparameter tuners apply cross-validation? Or how can we feed the Column Transformers with different data?
Thanks.

Reply
- James Carmichael October 4, 2024 at 7:40 am #
  
  Hi Petros…Yes, you can definitely use ColumnTransformer within GridSearchCV or RandomizedSearchCV, and it will be fitted only to the training data during each fold of cross-validation. The key is to combine ColumnTransformer with a pipeline to ensure that the transformers (and any preprocessing steps) are correctly applied only to the training data in each cross-validation split.
  
  Here’s how it works:
  
  ### Steps:
  1. **Create a Pipeline**: Use Pipeline from sklearn.pipeline to combine your preprocessing steps (e.g., ColumnTransformer) and the model into a single object.
  2. **Use GridSearchCV/RandomizedSearchCV**: Apply the GridSearchCV or RandomizedSearchCV to this pipeline, and it will automatically handle the fitting process for the ColumnTransformer within each cross-validation fold, fitting it only on the training data.
  
  ### Example:
  
  from sklearn.compose import ColumnTransformer
  from sklearn.datasets import fetch_openml
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import GridSearchCV, train_test_split
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import OneHotEncoder, StandardScaler
  
  # Load sample data (downloads the "adult" dataset from OpenML)
  data = fetch_openml(name="adult", version=2, as_frame=True)
  X, y = data.data, data.target
  
  # Example of different data types
  numeric_features = ['age', 'hours-per-week']
  categorical_features = ['workclass', 'education', 'marital-status']
  
  # ColumnTransformer to apply different preprocessing to numeric and categorical data
  # handle_unknown='ignore' avoids errors when a fold contains an unseen category
  preprocessor = ColumnTransformer(
      transformers=[
          ('num', StandardScaler(), numeric_features),
          ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
      ])
  
  # Pipeline: first apply ColumnTransformer, then fit the classifier
  pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('classifier', RandomForestClassifier())])
  
  # Define parameter grid for GridSearchCV
  param_grid = {
      'classifier__n_estimators': [100, 200],
      'classifier__max_depth': [5, 10]
  }
  
  # GridSearchCV to tune hyperparameters and automatically handle preprocessing
  grid_search = GridSearchCV(pipeline, param_grid, cv=5)
  
  # Split data
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  
  # Fit the model
  grid_search.fit(X_train, y_train)
  
  # Check the best parameters and score
  print(f"Best params: {grid_search.best_params_}")
  print(f"Best cross-validation score: {grid_search.best_score_}")
  
  # Evaluate on test data
  print(f"Test set score: {grid_search.score(X_test, y_test)}")
  
  ### Key Points:
  1. **Pipeline**: By combining the ColumnTransformer with a model in a pipeline, you ensure that:
  – The transformer (preprocessor) is fitted only on the training portion of each cross-validation fold, then used to transform both the training and validation portions.
  – The transformer is refitted from scratch in every fold.
  
  2. **GridSearchCV**: It will search over different hyperparameters, fitting the ColumnTransformer to the training data each time and applying the model to the validation set, ensuring proper cross-validation.
  
  ### Customizing for Different Data:
  If you need to apply the ColumnTransformer to different subsets of data (e.g., apply one transformer to some features and another to others), you can adjust the transformers within the ColumnTransformer or even dynamically create different pipelines for each case.
  
  Let me know if you’d like more details on a specific use case!
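If the goal is to treat the encoder itself as one of the alternatives being searched, one approach is to wrap the categorical encoder in a nested Pipeline step and list candidate encoders in the parameter grid. Here is a minimal sketch, assuming scikit-learn 0.24 or later. The synthetic data, the step names ("cat", "encoder"), and the DecisionTreeClassifier are illustrative choices.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: two categorical columns (index 0 and 1)
rng = np.random.default_rng(0)
size = rng.choice(["small", "medium", "large"], size=120)
color = rng.choice(["red", "green", "blue"], size=120)
X = np.column_stack([size, color]).astype(object)
y = ((size == "large") | (color == "red")).astype(int)

# The encoder lives in a named step so the search can swap it out
categorical = Pipeline(steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer(transformers=[("cat", categorical, [0, 1])])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# Candidate encoders are searched like any other hyperparameter
param_grid = {
    "preprocessor__cat__encoder": [
        OneHotEncoder(handle_unknown="ignore"),
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    ],
    "classifier__max_depth": [2, 4],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

Because each candidate encoder is cloned and refitted inside every fold, it only ever sees that fold's training portion.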
  
  Reply
