
Ordinal and One-Hot Encodings for Categorical Data

Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an Ordinal Encoding and a One-Hot Encoding.

In this tutorial, you will discover how to use encoding schemes for categorical machine learning data.

After completing this tutorial, you will know:

  • Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
  • How to use ordinal encoding for categorical variables that have a natural rank ordering.
  • How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Ordinal and One-Hot Encoding Transforms for Machine Learning
Photo by Felipe Valduga, some rights reserved.

Tutorial Overview

This tutorial is divided into six parts; they are:

  1. Nominal and Ordinal Variables
  2. Encoding Categorical Data
    1. Ordinal Encoding
    2. One-Hot Encoding
    3. Dummy Variable Encoding
  3. Breast Cancer Dataset
  4. OrdinalEncoder Transform
  5. OneHotEncoder Transform
  6. Common Questions

Nominal and Ordinal Variables

Numerical data, as its name suggests, involves features that are only composed of numbers, such as integers or floating-point values.

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

  • A “pet” variable with the values: “dog” and “cat“.
  • A “color” variable with the values: “red“, “green“, and “blue“.
  • A “place” variable with the values: “first“, “second“, and “third“.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.

A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.
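As a quick sketch of this idea, the pandas cut() function can assign ordered labels to bins (the bin edges and labels below mirror the ranges above):

from numpy import array
from pandas import cut

# illustrative numeric values in the range 1 to 10
values = array([1, 3, 4, 6, 8, 10])
# right-inclusive bins covering 1-2, 3-4, 5-6, 7-8, 9-10
ordinal = cut(values, bins=[0, 2, 4, 6, 8, 10], labels=['1-2', '3-4', '5-6', '7-8', '9-10'])
print(ordinal)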

  • Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.
  • Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than a hard limitation of the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.


Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:

  • Ordinal Encoding
  • One-Hot Encoding
  • Dummy Variable Encoding

Let’s take a closer look at each in turn.

Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.

For some variables, an ordinal encoding may be enough. The integer values have a natural ordered relationship with each other, and machine learning algorithms may be able to understand and harness this relationship.

It is a natural encoding for ordinal variables. For categorical variables, it imposes an ordinal relationship where no such relationship may exist. This can cause problems and a one-hot encoding may be used instead.

This ordinal encoding transform is available in the scikit-learn Python machine learning library via the OrdinalEncoder class.

By default, it will assign integers to labels in the sorted order observed in the data. If a specific order is desired, it can be specified via the “categories” argument as a list with the rank order of all expected labels.

We can demonstrate the usage of this class by converting the color categories “red”, “green”, and “blue” into integers. First the categories are sorted, then numbers are applied. For strings, this means the labels are sorted alphabetically, so that blue=0, green=1, and red=2.

The complete example is listed below.
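A minimal version might look like the following sketch, using an illustrative three-row array of color labels:

# ordinal encode a categorical variable
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data: one categorical variable with three rows
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define the ordinal encoding transform
encoder = OrdinalEncoder()
# fit and apply the transform
result = encoder.fit_transform(data)
print(result)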

Running the example first reports the 3 rows of label data, then the ordinal encoding.

We can see that the numbers are assigned to the labels as we expected.

This OrdinalEncoder class is intended for input variables that are organized into rows and columns, e.g. a matrix.

If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the LabelEncoder class can be used. It does the same thing as the OrdinalEncoder, although it expects a one-dimensional input for the single target variable.
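A brief sketch of the difference, reusing the color labels as an illustrative target variable:

# label encode a target variable (1D input)
from sklearn.preprocessing import LabelEncoder
y = ['red', 'green', 'blue']
encoder = LabelEncoder()
# labels are sorted first, so blue=0, green=1, red=2
print(encoder.fit_transform(y))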

One-Hot Encoding

For categorical variables where no ordinal relationship exists, the integer encoding may be insufficient at best, and misleading to the model at worst.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the ordinal representation. This is where the integer encoded variable is removed and one new binary variable is added for each unique integer value in the variable.

Each bit represents a possible category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be “on.” This is called one-hot encoding …

— Page 78, Feature Engineering for Machine Learning, 2018.

In the “color” variable example, there are three categories, and, therefore, three binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

This one-hot encoding transform is available in the scikit-learn Python machine learning library via the OneHotEncoder class.

We can demonstrate the usage of the OneHotEncoder on the color categories. First the categories are sorted, in this case alphabetically because they are strings, then binary variables are created for each category in turn. This means blue will be represented as [1, 0, 0], with a “1” for the first binary variable, then green, then finally red.

The complete example is listed below.
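A minimal version might look like the following sketch (note: newer versions of scikit-learn rename the “sparse” argument to “sparse_output”):

# one-hot encode a categorical variable
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data: one categorical variable with three rows
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define the one-hot encoding transform; request a dense (non-sparse) array
encoder = OneHotEncoder(sparse=False)
# fit and apply the transform
onehot = encoder.fit_transform(data)
print(onehot)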

Running the example first lists the three rows of label data, then the one-hot encoding, matching our expectation of three binary variables in the order “blue”, “green”, and “red”.

If you know all of the labels to be expected in the data, they can be specified via the “categories” argument as a list.

If you do not specify the list of labels, the encoder is fit on the training dataset, which should contain at least one example of every expected label for each categorical variable. If new data contains categories not seen in the training dataset, the “handle_unknown” argument can be set to “ignore” to avoid raising an error, which will result in a zero value for each such label.
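For example, a sketch combining both arguments (the unseen “yellow” label is hypothetical):

# specify the expected labels and ignore unseen ones
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories=[['blue', 'green', 'red']], handle_unknown='ignore', sparse=False)
encoder.fit([['red'], ['green'], ['blue']])
# an unseen label is encoded as all zeros instead of raising an error
print(encoder.transform([['yellow']]))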

Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green”, we don’t need another binary variable to represent “red”; instead, we could use 0 values for both “blue” and “green”, e.g. [0, 0].

This is called a dummy variable encoding, and always represents C categories with C-1 binary variables.

When there are C possible values of the predictor and only C – 1 dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization

— Page 95, Feature Engineering and Selection, 2019.

In addition to being slightly less redundant, a dummy variable representation is required for some models.

For example, in the case of a linear regression model (and other regression models that have a bias term), a one-hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models, a dummy variable encoding must be used instead.

If the model includes an intercept and contains dummy variables […], then the […] columns would add up (row-wise) to the intercept and this linear combination would prevent the matrix inverse from being computed (as it is singular).

— Page 95, Feature Engineering and Selection, 2019.

We rarely encounter this problem in practice when evaluating machine learning algorithms, unless we are using linear regression of course.

… there are occasions when a complete set of dummy variables is useful. For example, the splits in a tree-based model are more interpretable when the dummy variables encode all the information for that predictor. We recommend using the full set of dummy variables when working with tree-based models.

— Page 56, Applied Predictive Modeling, 2013.

We can use the OneHotEncoder class to implement a dummy variable encoding as well as a one-hot encoding.

The “drop” argument can be set to indicate which category will become the one assigned all zero values, called the “baseline”. We can set this to “first” so that the first category is used. When the labels are sorted alphabetically, “blue” will be the first label and will become the baseline.

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable […] is known as the baseline.

— Page 86, An Introduction to Statistical Learning with Applications in R, 2014.

We can demonstrate this with our color categories. The complete example is listed below.
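A minimal version might look like the following sketch (note: the “drop” argument requires scikit-learn 0.21 or later):

# dummy variable encode a categorical variable
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data: one categorical variable with three rows
data = asarray([['red'], ['green'], ['blue']])
print(data)
# drop the first (alphabetically sorted) category, "blue", as the baseline
encoder = OneHotEncoder(drop='first', sparse=False)
# fit and apply the transform
result = encoder.fit_transform(data)
print(result)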

Running the example first lists the three rows for the categorical variable, then the dummy variable encoding, showing that “green” is encoded as [1, 0], “red” is encoded as [0, 1], and “blue” is encoded as [0, 0], as we specified.

Now that we are familiar with the three approaches for encoding categorical variables, let’s look at a dataset that has categorical variables.

Breast Cancer Dataset

As the basis of this tutorial, we will use the “Breast Cancer” dataset that has been widely studied in machine learning since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68 percent and 73 percent. We will aim for this region, but note that the models in this tutorial are not optimized: they are designed to demonstrate encoding schemes.

No need to download the dataset as we will access it directly from the code examples.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings. Some variables show an obvious ordinal relationship for ranges of values (like age ranges), and some do not.

Note that this dataset has missing values marked with a “nan” value.

We will leave these values as-is in this tutorial and use the encoding schemes to encode “nan” as just another value. This is one possible and quite reasonable approach to handling missing values for categorical variables.

We can load this dataset into memory using the Pandas library.

Once loaded, we can split the columns into input (X) and output (y) for modeling.

Tying this together, the complete example of loading and summarizing the raw categorical dataset is listed below.
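A sketch of that example, assuming the publicly hosted copy of the dataset at the URL below:

# load and summarize the breast cancer dataset
from pandas import read_csv
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input (all but the last column) and output (last column) columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# summarize
print('Input', X.shape)
print('Output', y.shape)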

Running the example reports the size of the input and output elements of the dataset.

We can see that we have 286 examples and nine input variables.

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

OrdinalEncoder Transform

An ordinal encoding involves mapping each unique label to an integer value.

This type of encoding is really only appropriate if there is a known relationship between the categories. This relationship does exist for some of the variables in our dataset, and ideally, this should be harnessed when preparing the data.

In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.

We can use the OrdinalEncoder from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

Note: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.

Once defined, we can call the fit_transform() function and pass it our dataset to create an ordinal encoded version of the dataset.

We can also prepare the target in the same manner.

Let’s try it on our breast cancer dataset.

The complete example of creating an ordinal encoding transform of the breast cancer dataset and summarizing the result is listed below.
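A sketch of that example, assuming the same hosted copy of the dataset as before:

# ordinal encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
dataset = read_csv(url, header=None)
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# ordinal encode the input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)
# ordinal encode the target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])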

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows, and in this case, the number of columns, to be unchanged, except all string values are now integer values.

As expected, in this case, we can see that the number of variables is unchanged, but all values are now ordinal encoded integers.

Next, let’s evaluate machine learning on this dataset with this encoding.

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

We will first split the dataset, then prepare the encoding on the training set, and apply it to the test set.

We can then fit the OrdinalEncoder on the training dataset and use it to transform the train and test datasets.

The same approach can be used to prepare the target variable. We can then fit a logistic regression algorithm on the training dataset and evaluate it on the test dataset.

The complete example is listed below.
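A sketch of the complete example (the train/test split and encoder usage follow the steps described above):

# evaluate logistic regression on the breast cancer dataset with an ordinal encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
dataset = read_csv(url, header=None)
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# ordinal encode the input variables (fit on train only)
# note: with a different random_state, the test set may contain labels unseen during fit
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
# ordinal encode the target variable (fit on train only)
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# predict on the test set and evaluate
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))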

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieved a classification accuracy of about 75.79 percent, which is a reasonable score.

Next, let’s take a closer look at the one-hot encoding.

OneHotEncoder Transform

A one-hot encoding is appropriate for categorical data where no relationship exists between categories.

The scikit-learn library provides the OneHotEncoder class to automatically one-hot encode one or more variables.

By default, the OneHotEncoder will output data with a sparse representation, which is efficient given that most values are 0 in the encoded representation. We will disable this feature by setting the “sparse” argument to False (renamed “sparse_output” in later versions of scikit-learn) so that we can review the effect of the encoding.

Once defined, we can call the fit_transform() function and pass it our dataset to create a one-hot encoded version of the dataset.

As before, we must label encode the target variable.

The complete example of creating a one-hot encoding transform of the breast cancer dataset and summarizing the result is listed below.
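A sketch of that example:

# one-hot encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
dataset = read_csv(url, header=None)
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# one-hot encode the input variables
onehot_encoder = OneHotEncoder(sparse=False)
X = onehot_encoder.fit_transform(X)
# ordinal encode the target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])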

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows to remain the same, but the number of columns to dramatically increase.

As expected, in this case, we can see that the number of variables has leaped up from 9 to 43 and all values are now binary values 0 or 1.

Next, let’s evaluate machine learning on this dataset with this encoding as we did in the previous section.

The encoding is fit on the training set then applied to both train and test sets as before.

Tying this together, the complete example is listed below.
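A sketch of the complete example:

# evaluate logistic regression on the breast cancer dataset with a one-hot encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
dataset = read_csv(url, header=None)
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# one-hot encode the input variables (fit on train only)
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode the target variable (fit on train only)
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# predict on the test set and evaluate
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))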

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieved a classification accuracy of about 70.53 percent, which is slightly worse than the ordinal encoding in the previous section.

Common Questions

This section lists some common questions and answers when encoding categorical data.

Q. What if I have a mixture of numeric and categorical data?

Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.

Alternately, you can use the ColumnTransformer to conditionally apply different data transforms to different input variables.
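For example, a minimal sketch (the column indices and chosen transforms are hypothetical):

# apply different transforms to different columns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
# suppose columns 0 and 1 are numeric, columns 2 and 3 are categorical
transformer = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), [0, 1]),
    ('cat', OneHotEncoder(), [2, 3])])
# X_transformed = transformer.fit_transform(X)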

Q. What if I have hundreds of categories?

Or, what if I concatenate many one-hot encoded vectors to create a many-thousand-element input vector?

You can use a one-hot encoding with thousands, or even tens of thousands, of categories. Also, having large vectors as input sounds intimidating, but the models can generally handle it.

Q. What encoding technique is the best?

This is unknowable.

Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Feature Engineering for Machine Learning, 2018.
  • Feature Engineering and Selection, 2019.
  • Applied Predictive Modeling, 2013.
  • An Introduction to Statistical Learning with Applications in R, 2014.

APIs

  • sklearn.preprocessing.OrdinalEncoder API.
    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
  • sklearn.preprocessing.OneHotEncoder API.
    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
  • sklearn.preprocessing.LabelEncoder API.
    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

Summary

In this tutorial, you discovered how to use encoding schemes for categorical machine learning data.

Specifically, you learned:

  • Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
  • How to use ordinal encoding for categorical variables that have a natural rank ordering.
  • How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


80 Responses to Ordinal and One-Hot Encodings for Categorical Data

  1. Yuhou June 12, 2020 at 5:26 pm

    “This type of categorical variable is called an nominal variable because the values can be ordered or ranked.” Should here be ordinal variable instead of nominal variable

  2. Jérôme Plumecoq June 12, 2020 at 5:55 pm

    Hi,

    « The “place” variable above does have a natural ordering of values. This type of categorical variable is called an nominal variable because the values can be ordered or ranked. ».

    You mean « ordinal variable » not « nominal variable » ?

  3. Andy June 14, 2020 at 6:35 pm

    An interesting discussion on “Why OneHotEncoder not get_dummies?” in sklearn can be found here: https://stackoverflow.com/questions/36631163/what-are-the-pros-and-cons-between-get-dummies-pandas-and-onehotencoder-sciki

  4. Nilesh Saratkar June 18, 2020 at 3:15 pm

    In case of the OrdinalEncoding, I have observed shift in the encoded value after introducing new value E.g. data = asarray([[‘orange’], [‘red’], [‘green’], [‘blue’]]). In real life scenario, additional values should not shuffle existing encoded values between training and test data sets. Can we use some hashing technique to generate consistent encoded values?

    • Jason Brownlee June 19, 2020 at 6:09 am

You can specify the full list of expected values up front and save the object for use on all future data.

      Or you can write some custom code to handle the mapping.

  5. Nat June 20, 2020 at 7:25 am

    Hi,
    When I try to do the dummy encoding with argument drop=’first’ I get this error: __init__() got an unexpected keyword argument ‘drop’.

    I checked the syntax documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html?highlight=onehotencoder#sklearn.preprocessing.OneHotEncoder

    The code seems to be right, but I still get the error. Can you please help?

    Thanks so much for your tutorials – they are great!

    • Jason Brownlee June 21, 2020 at 5:57 am

      Perhaps confirm that your version of scikit-learn is up to date?

      • Nat June 21, 2020 at 8:28 am

        Ugh. SO sorry. I thought I was on the latest version. I updated and the dummy encoding code works now. Thanks for taking the time to get back to me!

  6. Saheed Yakub June 21, 2020 at 1:09 am

    Please, i am trying to fit the OrdinalEncoder on the training dataset and use it to transform the train and test datasets as follows;
    ordinal_encoder = OrdinalEncoder()
    ordinal_encoder.fit(X_train)
    X_train = ordinal_encoder.transform(X_train)
    X_test = ordinal_encoder.transform(X_test)

    I have this error
    ————————————————————————–
    ValueError Traceback (most recent call last)
    in ()
    2 ordinal_encoder.fit(x_train)
    3 x_train=ordinal_encoder.transform(x_train)
    —-> 4 x_test = ordinal_encoder.transform (x_test)

    1 frames
    /usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    122 msg = (“Found unknown categories {0} in column {1}”
    123 ” during transform”.format(diff, i))
    –> 124 raise ValueError(msg)
    125 else:
    126 # Set the problematic rows to an acceptable value and

    ValueError: Found unknown categories [69.0, 70.0, 71.0, 72.0] in column 0 during transform

    What can be the way out? I appreciate your time.

    • Jason Brownlee June 21, 2020 at 6:27 am

      Sorry to hear that you are having trouble, this may help:
      https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

    • Saket Agrawal March 21, 2021 at 1:55 am

      One reason for this problem is a category value exists in the test data set which is not available in the training set. Hence encoder has not seen this value during the fit and hence doesn’t know how to encode it. Few different options to handle such scenario

      1. Make use of the handle_unknown parameter, refer OrdinalEncoder documentation.
      2. Make use of categories parameter, refer OrdinalEncoder documentation. I personally prefer this method as this gives complete control and also allows consistency between train and test.

      However in your case error message is ValueError: Found unknown categories [69.0, 70.0, 71.0, 72.0] in column 0 during transform

      this gives me a hit that the first column in your data set is a sequence number, check if this should really be part of the feature set? If this is a simple sequence number there is absolutely no way you can consider them as categorical data.

      • Jason Brownlee March 21, 2021 at 6:10 am

        Typically unseen categories are labeled as all zeros.

  7. Prerna July 7, 2020 at 7:56 am

    Hi!
    One of the categorical variables I am dealing with is, employment length, that has 511193 distinct categories. I am getting a memory error ‘Unable to allocate 243. GiB for an array with shape (511193, 511193)’. How to deal with this. I have come across articles which say that one hot encoding can’t process high cardinality and would give misleading results. Kindly help! Thanks

  8. Ale July 9, 2020 at 6:45 am

    Hi Jason,

    So since the one-hot encoding may cause problems with linear regression. Can I process the data as dummy variable and this won’t affect whether I use it for tree-based methods or linear models?

    • Jason Brownlee July 9, 2020 at 1:18 pm

      Yes I believe so. Test to confirm if you’re concerned.

  9. Pelumi Soyombo September 22, 2020 at 2:04 am

    Hi Jason.

    I’m a beginner at using machine learning for predictions.

    I already followed your code on using OrdinalEncoder for the brest-cancer dataset. How do I input new data for predictions (apart from the test dataset) after encoding with OrdinalEncoder ?

  10. Usman Aileru September 22, 2020 at 2:40 am

    Hi Jason.

    I’m a beginner at using machine learning for predictions.

    I already followed your code on using OrdinalEncoder for the brest-cancer dataset. How do I get predictions for new data unknown to the dataset (i.e data different from the test dataset) after encoding with OrdinalEncoder?

    • Jason Brownlee September 22, 2020 at 6:53 am

      Good question, see this:
      https://machinelearningmastery.com/make-predictions-scikit-learn/

      • Usman Aileru September 22, 2020 at 8:07 pm

        Thanks for your reply.

        I figured this out but my question was how do I make predictions for new categorical dataset. Examples given in the link you provided were on numbers already.

        But say I am working on the breast-cancer.csv dataset, how do I predict the outcome for new categorical data unknown to the dataset?

        • Jason Brownlee September 23, 2020 at 6:36 am

          The approach for making predictions is the same regardless of the data types. E.g. call model.predict()

          If you are using encodings, then wrap them up into a pipeline and call pipeline.predict() and pass the raw data as input.

          • Usman Aileru September 23, 2020 at 9:45 pm

            Many thanks. I figured it out.

          • Jason Brownlee September 24, 2020 at 6:14 am

            Well done.

  11. MFT October 10, 2020 at 12:40 pm

    What if you need a default value to cover a categorical level that is unforeseen? Dummy encoding is very good for that, but I think some discussion of the matter could be included.

    • Jason Brownlee October 10, 2020 at 1:55 pm

      You can set the handle_unknown argument and specify how to handle labels unseen during training.

  12. Seema Patel October 26, 2020 at 7:50 pm

    ”’In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, ‘at least as a point of reference with other encoding schemes.’ ”’ Can you elaborate this sentence?

    • Jason Brownlee October 27, 2020 at 6:43 am

      Yes, you can choose to specify the ordinal relationship between the labels in some of the variables if you wish. This would likely improve the performance of the model fit on an ordinal encoding.

  13. Wiktor December 18, 2020 at 6:14 pm

    Running the code of “evaluate logistic regression on the breast cancer dataset with an ordinal encoding” a few time gives the same accuracy, because of random_state=1 during splitting to train and test sets. After setting random_state=None you can get an error like:
    ValueError: Found unknown categories [“’24-26′”] in column 3 during transform
    I suppose that we should make ordinal encoding of input variables before splitting. Modified code:

    for i in range(10):
        # ordinal encode input variables
        ordinal_encoder = OrdinalEncoder()
        ordinal_encoder.fit(X)
        X = ordinal_encoder.transform(X)
        # split the dataset into train and test sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=None)
        # ordinal encode target variable
        label_encoder = LabelEncoder()
        label_encoder.fit(y_train)
        y_train = label_encoder.transform(y_train)
        y_test = label_encoder.transform(y_test)
        # define the model
        model = LogisticRegression()
        # fit on the training set
        model.fit(X_train, y_train)
        # predict on test set
        yhat = model.predict(X_test)
        # evaluate predictions
        accuracy = accuracy_score(y_test, yhat)
        print(f'{i + 1}. Accuracy: {accuracy * 100:.2f} %')

    works fine, giving various results in subsequent runs, for example:

    1. Accuracy: 64.21 %
    2. Accuracy: 77.89 %
    3. Accuracy: 75.79 %
    4. Accuracy: 77.89 %
    5. Accuracy: 76.84 %
    6. Accuracy: 68.42 %
    7. Accuracy: 75.79 %
    8. Accuracy: 73.68 %
    9. Accuracy: 71.58 %
    10. Accuracy: 80.00 %

  14. Pandas December 19, 2020 at 6:01 am

    You transform target with Ordinal encoder and then fit the logistic regression model.
    Is this process called ordinal logistic regression or is it something different?
    Thanks in advance

  15. amin December 25, 2020 at 1:58 am

    hi can u pls say after using one hot encoding and making our model with logistic regression or other classification algorithm how we can calculate the categorical data coefficient? coz after using one hot encoding a categorical variable will change to for example 3 columns and variable we should calculate coefficient of each columns that depend of the categorical data seperately and then add to each other or no how is it? i have different type of variables and my goal is show that which variable has most effect on my target variable(output) after build the classification model

    • Jason Brownlee December 25, 2020 at 5:25 am

      What is the “categorical data coefficient”?

  16. SULAIMAN KHAN February 24, 2021 at 8:51 pm

ValueError Traceback (most recent call last)
    in ()
    59 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    60 # prepare input data
    —> 61 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    62 # prepare output data
    63 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

    in prepare_inputs(X_train, X_test)
    38 # encode
    39 train_enc = le.transform(X_train[:, i])
    —> 40 test_enc = le.transform(X_test[:, i])
    41 # store
    42 X_train_enc.append(train_enc)

    ~\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py in transform(self, y)
    131 if len(np.intersect1d(classes, self.classes_)) 133 raise ValueError(“y contains new labels: %s” % str(diff))
    134 return np.searchsorted(self.classes_, y)
    135

    ValueError: y contains new labels: [‘102’ ‘120’ ‘137’ ‘145’ ‘147’ ’15’ ‘159’ ‘174’ ‘195’ ‘208’ ‘214’ ‘222’
    ‘223’ ’23’ ’24’ ‘247’ ‘248’ ‘259’ ‘264’ ‘286’ ‘289’ ‘290’ ‘291’ ‘298’
    ‘301’ ‘325’ ‘328’ ‘331’ ’34’ ‘341’ ‘354’ ‘361’ ‘367’ ‘378’ ‘401’ ‘402’
    ‘410’ ‘426’ ‘441’ ‘446’ ‘452’ ’46’ ‘468’ ‘485’ ‘489’ ‘5’ ‘510’ ‘512’
    ‘513’ ‘520’ ‘533’ ’54’ ‘550’ ‘579’ ‘586’ ‘587’ ’59’ ‘603’ ‘605’ ’73’ ’79’
    ’89’ ’90’ ’99’]
    ###################
    I have multiclassification labe

  17. Tanzila March 2, 2021 at 8:11 pm

    How to handle missing values in One Hot Encoder?

    I am trying to apply one hot encoding to 2D numpy array containing missing value ‘N’. I have the following code, which works well without missing value.

    import numpy as np
    X = [['A', 'G'], ['N', 'C'], ['T', 'A']]
    X = np.array(X)
    print('Shape of data before one hot encoding: ', X.shape)
    classes = np.array(['A', 'C', 'G', 'T'])
    X = np.searchsorted(classes, X)
    eye = np.eye(classes.shape[0])
    onehotlabels = np.concatenate([eye[i] for i in X.T], axis=1)
    print('Shape of data after pre-processing: ', onehotlabels.shape)
    print(onehotlabels)

    Output:
    Shape of data before one hot encoding: (3, 2)
    Shape of data after pre-processing: (3, 8)
    [[1. 0. 0. 0. 0. 0. 1. 0.]
    [0. 0. 0. 1. 0. 1. 0. 0.]
    [0. 0. 0. 1. 1. 0. 0. 0.]]

    Is it possible to replace ‘N’ by 0 0 0 0 by using this approach? The output will look like the following:
    [[1. 0. 0. 0. 0. 0. 1. 0.]
    [0. 0. 0. 0. 0. 1. 0. 0.]
    [0. 0. 0. 1. 1. 0. 0. 0.]]

  18. Ruhul March 16, 2021 at 12:44 pm

    Hi Jason,

    Thanks for wonderful article.

    I am facing one doubt and not getting answer anywhere.

    Suppose my target variable is Category Fruit – Apple, Orange, Mango, Banana.

    There is no ordinal relationship between them. Can I one hot encode them or use ordinal encoder.

    Everywhere (including other websites). I can find that target variable is converted into Apple -1, Orange-2, Mango-3, Banana-4. Sometimes column with labels 1, 2,3,4 is already created.

    My query – 1) If there is no relationship exists between Apple, Orange, Mango, Banana. So should we use One Hot encoding /Dummy Variable or Ordinal encoders for converting target variable. Most of the cases I found Ordinal encoder used.

    If I use One Hot encoding /Dummy Variable encoding for target variable Apple, Orange, Mango, Banana. Then there will be many columns created. I think this is not appropriate.

    Is it the reason for creating single column for target – Ordinal encoders are used.

    2) So if we use one One Hot encoding /Dummy Variable creation for target variable with categories , then which algo will be suitable for prediction?

    • Jason Brownlee March 17, 2021 at 5:56 am

      One hot encode sounds appropriate.

      Yes, many columns will be created, it is typically not a problem, even with tens of thousands of new columns.

  19. AGGELOS PAPOUTSIS April 11, 2021 at 12:04 am

    Hi,

    as i understand label encoder is for categorical target variable (i.e y). Does it make sense to use label encoder for the independent feutures (i.e X)?

    **I JUST BOUGHT YOU BOOK”DATA PREPARATION FOR MACHINE LEANRING”

    THIS IS A FANTASTIC BOOK

    • Jason Brownlee April 11, 2021 at 4:52 am

      Yes, you can use the label encoder on inputs, but we now have the “ordinal encoder” that does the same job and is designed to work with inputs.

      Thanks!!!

      • AGGELOS PAPOUTSIS April 11, 2021 at 6:54 pm

        ok, thanks. I use the label encoder as the technique do not require 2d data. When I tried to use the ordinal encoder I had to reshape the data.

  20. Tran Quoc Lap April 20, 2021 at 11:43 am

    Great technical writing!!!
    I wonder which should be done first: data cleaning or data transform?
    I think data cleaning includes subtasks such as outlier detection where values of columns are expected to be numeric.

    • Jason Brownlee April 21, 2021 at 5:51 am

      Thanks.

      Often both are iterated as you learn more about the data/problem, but generally cleaning then transforms.

  21. Lobelie June 26, 2021 at 6:14 am

    Hello,

    If I have categorical and nominal variables, can I encode the categorical with one-hot and the nominal ones with label encoding ?

  22. Daniel July 26, 2021 at 5:37 am

    Hi,
    Quoting your article”
    “If new data contains categories not seen in the training dataset, the “handle_unknown” argument can be set to “ignore” to not raise an error, which will result in a zero value for each label.”

    I guess that because we fit and transform using certain OneHotEncoder on the training set, there can be a case where the testing set has features that haven’t been OneHotEncoded and that’s what you wrote in your blog.
    Does setting the “handle_unknow”=’ignore’ in such a case wouldn’t cause a loss of accuracy in the predictions/ cassifications?
    Is there another way to deal with that issue, without causing Data Leakage?

    Regards

    • Jason Brownlee July 27, 2021 at 5:02 am

      It may or may not impact model performance depending on the model and the dataset.

      A better approach is to have a more complete training dataset.

  23. Talieh July 30, 2021 at 3:03 am

    Hi Jason,

    Your posts are always my first ‘go-to’ for my ML/AI-related questions! Thanks for your contribution to the community.

    My question is, in case of tree-based models, is it best practice to use OHE, or Dummy encoding? The tree-based algorithms build a tree based on all available variables and we if drop one of the levels, the tree will never see that variable and will not use it in the splits. I would love to know your thoughts on this.

    Thanks.

    • Jason Brownlee July 30, 2021 at 6:31 am

      Perhaps try both and see what works well or best for your specific dataset and model.

  24. geoffrey laforest August 14, 2021 at 12:48 am

    Very interesting article.

    I’m still missing something. I’m actually trying to do logistic regression from scratch on a Fractures Dataset and I have several categorical columns to encode myself.

    You say that DummyEncoding avoid the problem (when doing linear regression ok, maybe it is less important with logistic regression so ?) of singular matrix that comes with one hot encoding, and it works with n-1 columns.

    But if I want to encode the sex column, only possibility is 1 for male 0 for female (or the contrary). If I apply the logic of DummyEncoding, we go back to the problem of ordinal relations (that does not existe here between the different values of a categorical variable) that are not relevant here and that we tried to avoid using precisely a One Hot Encoding ?

    Then, I have a column “medication” with “no medication” “medication #1” “medication #2” values, here I can perfectly proceed as you did with the colors and one hot encoding right ?

    Finally, where can I find more information about the exact mathematical reasons that make ordinal encoding bad when no ordinal relations exist and conversly make one hot encoder works good ?

    Thanks a lot,

    Geoffrey (from France)

    • Adrian Tam August 14, 2021 at 3:36 am

      For why ordinal encoding is bad when no ordinal relations exists: The levels of measurement are nominal, ordinal, interval, and ratio, in order of increasing information. If you do use ordinal number while no ordinal exists, you are introducing non-exist information to the machine to learn, which is essentially noise and confused the model. Think what happened if we introduce the phone number (which is nominal data) together with how many hours students worked (which is ratio data) to predict their test score.

  25. geoffrey laforest August 14, 2021 at 1:04 am

    I forgot, could you detail why the matrix becomes singular with one hot encoding ? the problems comes if we have n >=3, n being the number of different values in a categorical feature, right ? as with n = 2, we have 0 in a colum and 1 in the other, but with 3, 2 columns can have the same 0 value and I suppose the problems begin here ?
    Yet I still have difficulties to understand why matrix becomes singular. For example in the example you give with the colors, the matrix is not singular, although you did not use a dummy encoding. The only way I see it could be singular is if we have a colum full of zeros but it would mean the value never appears so it’s kind odd…

    More broadly, in linear regression, we can have 2 identical examples (for example 2 persons with the same age and height and the weight target is the same), our matrix would be singular right ?

    Could you extend on these points ifyou don’t mind plz, I find them so interesting but I have difficulties to figure it out.

    Thank you again

    Geoffrey

    • Adrian Tam August 14, 2021 at 3:38 am

      Mathematically, a matrix is singular if you cannot inverse it. A matrix with a lot of zero (which in case of one-hot encoding) make it very difficult to inverse (what number multiplies zero is one?) and hence very easy become a singular matrix. Linear algebra is a very broad topic but it is very useful in the theory side of machine learning. Hope this short answer can help you move forward.

  26. Khaled August 19, 2021 at 7:12 am

    Hello, I have a target vector with 6 different values, after one-hot encoding I would have 6 target Vectors. I want to train a model with kernel machine and I can only have one target vector each time for training. So I would make 6 times training respective to the 6 target vectors. Now How could I compine the results of this 6 to get the prediction and the accuracy?

    Thanks.

    • Adrian Tam August 20, 2021 at 1:06 am

      In your approach, probably you need to have 6 models, each gives a binary (sigmoidal activation at output, in range of 0 to 1) classification for each target value, then compare which model gives the highest floating point score and use it as the prediction.
      Or a more common approach, one model with 6 output, each is a sigmoidal activation, and apply softmax to give a probability to each target. In this case, your output is not 6 target vectors, but one matrix of 6 columns.

  27. Hosna September 8, 2021 at 6:15 pm

    First of all, thank you so much. I’ve learned a lot from your website.
    I like to ask a question about the concept. Is ordinal data a type of nominal data that has order or both of them are branches of categorical data?

    • Adrian Tam September 9, 2021 at 4:31 am

      Nominal data can be considered categorical. But ordinal data is not, since the order means something, not just names.

  28. Crispin October 26, 2021 at 5:08 am

    Hi Jason, I’m carrying out binary classification with a range of models: some sci kit learn ones, some transformer models and some deep learning neural nets.

    For my sci kit learn and transformer models, they both run fine when using get_dummies. For my Deep Learning model, when it outputs an array shape [3625, ] the model fails in terms of accuracy and loss (with a single output neuron). When get_dummies produces an array [3625,2] the model works well, do you know why this could be?

    Many thanks

    • Adrian Tam October 27, 2021 at 2:54 am

      Sorry, not clear on what you’re describing. Are you saying the model failed because your prediction output and the training output are in different shape?

  29. Carlos Pires November 17, 2021 at 8:35 pm

    Would it be wise to apply one hot encoding or dummy encoding before pca or pls?
    I have a mixed dataset with 2 categorical variables, with 3 and 2 categories respectively

    • Adrian Tam November 18, 2021 at 5:45 am

      PCA is part of the exploratory step to try to understand data. In that case, you may want to run one-hot encoding to make one nominal feature into multiple to help PCA.

  30. Eddie November 19, 2021 at 2:02 pm

    Hi Adrian,

    Thanks for the nice and detailed article! A quick question, do you think we need to normalize the original encoded variable (0, 1, 2, 3) to a range between 0 and 1? the reason I’m asking is that I have continuous variables in large scales that need to be normalized to 0 and 1 before feeding to the model. Should the same normalization be done after ordinal encoding the categories with ordinal relationships?

    • Adrian Tam November 20, 2021 at 1:49 am

      I would suggest you to learn about the level of measurement: https://en.wikipedia.org/wiki/Level_of_measurement
      We do one-hot encoding because the data are nominal, and we do normalization because the data is interval or ratio. If you try it out, doing a wrong preprocessing would probably inject noise to the data and confuse your machine learning model.

  31. Eddie November 20, 2021 at 3:07 pm

    My categorial variable does have the ordinal relationship (e.g. low, medium and high). After ordinal encoding, it is converted to 0, 1 and 2), my question is, should we further normalize it to a range between 0 and 1 (e.g., 0, 0.5, 1)

    • Adrian Tam November 21, 2021 at 4:01 am

      I don’t think 0 to 2 and 0 to 1 are vastly different. So in this case, it should be optional.

  32. mark December 23, 2021 at 3:10 pm

    Hi Jason, Thank you so much for your helpful and informative series of posts.

    I have a question about your statement above regarding when to implement encoding, that says:

    “The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.”

    Why would this be true? It seems that some feel that there is risk of “data leakage” if the full dataset is encoded prior to splitting into train and test. But encoding is different from imputation. Imputation uses statistics from the dataset, so there is a risk of data leakage, for sure. But encoding is simply an alternate expression of the space of possible values of feature. It doesn’t care about the actual values found in train and test, it only cares about the space of possible values, right? Therefore I think that likely is no risk of data leakage in encoding as it does not care about the specific values found in the train and test sets. It only cares about the space of possible values.

    This distinction is important, because if one tries, for example, to fit OrdinalEncode on train data only, which happens to be missing one of the possible values and this missing value appears in the test data we get a exception. But this can be avoided by encoding the full dataset prior to splitting.

    Perhaps I’m not thinking about this clearly? I’d welcome your thoughts and perspective.

    Thanks, mark

    • James Carmichael February 28, 2022 at 12:20 pm

      Hi Mark…Kindly narrow your query to a single question so that we better assist you.

  33. Amparo November 13, 2022 at 11:05 pm

    Hello.

    If a do a linear regression model I can include the year as a categorical variable (instead of numerical) in order to estimate a different coefficient for each year. This is commonly done when the variable is not linear or if you want to detect something strange on a specific year…

    When you do it with common regression packages with R they automatically take one of the levels (classes) as reference and calculate the coefficients respect this base level. It can also be done taking as reference the grand mean of all levels (I think this is included in packages called contrasts but it’s related with the way we codify dummy variables).

    My question is…
    What if I want to calculate the model referring each year’s coefficient to the previous one instead of a common one? How would yo codify it?

  34. Jason December 16, 2022 at 9:53 am

    What is this and what does it do — what and why am I separating into input and output columns? It just appears in the problem with no explanation. Can you please clarify?

    # separate into input and output columns
    X = data[:, :-1].astype(str)
    y = data[:, -1].astype(str)

  35. Dani February 4, 2024 at 4:17 pm

    I have a question in mind. Why my model gives better result when using ordinal encoder even the feature is not an ordinal data? so what is happening? and should i keep use one hot encoding even when the model performs worse?

    thanks in advance

    • James Carmichael February 5, 2024 at 7:49 am

      Hi Dani…Ordinal encoding is a technique used in machine learning to convert categorical data, which cannot be directly processed by most algorithms, into numerical format. This method is particularly suited for handling categorical variables where the categories have a natural, ordered relationship. Here are several scenarios when ordinal encoding is particularly useful:

      1. **Ordered Categories**: When the categorical variable represents an inherent order or ranking. For example, educational levels (e.g., High School < Bachelor's < Master's < PhD), rating scales (e.g., poor < fair < good < excellent), or sizes (e.g., small < medium < large). Ordinal encoding preserves the order information that can be leveraged by the machine learning model.

      2. **Efficiency**: Ordinal encoding converts categories into simple numeric codes, which can be more efficient in terms of memory and computational cost compared to other encoding methods, such as one-hot encoding, especially when the number of categories is large but their order is important.

      3. **Model Compatibility**: Some machine learning models can exploit the ordinal nature of the encoded features to improve performance. For example, tree-based models (like decision trees and gradient boosting machines) can benefit from ordinal encoding as they can make decisions that respect the order of the encoded categories.

      4. **Dimensionality Reduction**: Unlike one-hot encoding, which increases the feature space dimensionality with each additional category (leading to the "curse of dimensionality" in some cases), ordinal encoding keeps the feature space more compact. This is particularly beneficial when dealing with high-cardinality categorical variables with a clear ranking.

      5. **Simplicity and Interpretability**: In some cases, ordinal encoding can make the model's decisions more interpretable since the numerical representation directly corresponds to the order in the categorical variable. This can be advantageous in fields where model interpretability is crucial, such as finance and healthcare.

      However, it's important to use ordinal encoding judiciously, as it introduces an assumption of order that may not always be appropriate or may oversimplify the relationships within the data. For categorical variables without a natural order, other encoding techniques like one-hot encoding, target encoding, or binary encoding might be more suitable. Additionally, when using ordinal encoding, the choice of the numerical values can impact the model's performance, so it's essential to ensure that the encoding accurately reflects the underlying order among the categories.

Leave a Reply