Autoencoder Feature Extraction for Classification

An autoencoder is a type of neural network that can be used to learn a compressed representation of raw data.

An autoencoder is composed of encoder and decoder sub-models. The encoder compresses the input and the decoder attempts to recreate the input from the compressed version provided by the encoder. After training, the encoder model is saved and the decoder is discarded.

The encoder can then be used as a data preparation technique to perform feature extraction on raw data that can be used to train a different machine learning model.

In this tutorial, you will discover how to develop and evaluate an autoencoder for classification predictive modeling.

After completing this tutorial, you will know:

  • An autoencoder is a neural network model that can be used to learn a compressed representation of raw data.
  • How to train an autoencoder model on a training dataset and save just the encoder part of the model.
  • How to use the encoder as a data preparation step when training a machine learning model.

Let’s get started.

How to Develop an Autoencoder for Classification
Photo by Bernd Thaller, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Autoencoders for Feature Extraction
  2. Autoencoder for Classification
  3. Encoder as Data Preparation for Predictive Model

Autoencoders for Feature Extraction

An autoencoder is a neural network model that seeks to learn a compressed representation of an input.

An autoencoder is a neural network that is trained to attempt to copy its input to its output.

— Page 502, Deep Learning, 2016.

They are an unsupervised learning method, although technically, they are trained using supervised learning methods, referred to as self-supervised.

Autoencoders are typically trained as part of a broader model that attempts to recreate the input.

For example:

  • X = model.predict(X)

The design of the autoencoder model purposefully makes this challenging by restricting the architecture to a bottleneck at the midpoint of the model, from which the reconstruction of the input data is performed.

There are many types of autoencoders, and their use varies, but perhaps the most common use is as a learned or automatic feature extraction model.

In this case, once the model is fit, the reconstruction aspect of the model can be discarded and the model up to the point of the bottleneck can be used. The output of the model at the bottleneck is a fixed-length vector that provides a compressed representation of the input data.

Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.

— Page 502, Deep Learning, 2016.

Input data from the domain can then be provided to the model and the output of the model at the bottleneck can be used as a feature vector in a supervised learning model, for visualization, or more generally for dimensionality reduction.

Next, let’s explore how we might develop an autoencoder for feature extraction on a classification predictive modeling problem.

Autoencoder for Classification

In this section, we will develop an autoencoder to learn a compressed representation of the input features for a classification predictive modeling problem.

First, let’s define a classification predictive modeling problem.

We will use the make_classification() scikit-learn function to define a synthetic binary (2-class) classification task with 100 input features (columns) and 1,000 examples (rows). Importantly, we will define the problem in such a way that most of the input variables are redundant (90 of the 100 or 90 percent), allowing the autoencoder later to learn a useful compressed representation.

The example below defines the dataset and summarizes its shape.
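A minimal sketch of such a dataset definition follows. The 90 redundant features come from the description above; treating the remaining 10 as informative and fixing `random_state=1` are assumptions made so that the run is repeatable.

```python
# synthetic binary classification dataset (sketch)
from sklearn.datasets import make_classification

# 1,000 rows and 100 columns; 90 columns are redundant combinations of the
# 10 informative ones (n_informative and random_state are assumptions)
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=90, random_state=1)

# summarize the shape of the input and output arrays
print(X.shape, y.shape)  # (1000, 100) (1000,)
```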

Running the example defines the dataset and prints the shape of the arrays, confirming the number of rows and columns.

Next, we will develop a Multilayer Perceptron (MLP) autoencoder model.

The model will take all of the input columns, then output the same values. It will learn to recreate the input pattern exactly.

The autoencoder consists of two parts: the encoder and the decoder. The encoder learns how to interpret the input and compress it to an internal representation defined by the bottleneck layer. The decoder takes the output of the encoder (the bottleneck layer) and attempts to recreate the input.

Once the autoencoder is trained, the decoder is discarded and we only keep the encoder and use it to compress examples of input to vectors output by the bottleneck layer.

In this first autoencoder, we won’t compress the input at all and will use a bottleneck layer the same size as the input. This should be an easy problem that the model will learn nearly perfectly and is intended to confirm our model is implemented correctly.

We will define the model using the Keras functional API.

Prior to defining and fitting the model, we will split the data into train and test sets and scale the input data by normalizing the values to the range 0-1, a good practice with MLPs.
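The split and scaling step might look like the sketch below. The 33 percent test size and the seeds are assumptions. The scaler is fit on the training set only, so that no information from the test set leaks into the data preparation.

```python
# split into train/test sets, then normalize inputs to the range 0-1 (sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=90, random_state=1)

# hold out a test set (test_size and random_state are assumptions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)

# learn the min and max of each column from the training data only
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Because the test set is scaled with the training minimum and maximum, a few test values can fall slightly outside 0-1; that is expected.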

We will define the encoder to have two hidden layers, the first with two times the number of inputs (e.g. 200) and the second with the same number of inputs (100), followed by the bottleneck layer with the same number of inputs as the dataset (100).

To ensure the model learns well, we will use batch normalization and leaky ReLU activation.

The decoder will be defined with a similar structure, although in reverse.

It will have two hidden layers, the first with the number of inputs in the dataset (e.g. 100) and the second with double the number of inputs (e.g. 200). The output layer will have the same number of nodes as there are columns in the input data and will use a linear activation function to output numeric values.

The model will be fit using the efficient Adam version of stochastic gradient descent to minimize the mean squared error, given that reconstruction is a type of multi-output regression problem.
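Putting the description above into code, the model could be sketched roughly as follows. It assumes TensorFlow's bundled Keras is installed and hard-codes `n_inputs = 100` to match the dataset; the final shape check on a dummy batch is an addition for illustration only.

```python
# autoencoder with no compression in the bottleneck (sketch)
import numpy as np
from tensorflow.keras.layers import Input, Dense, LeakyReLU, BatchNormalization
from tensorflow.keras.models import Model

n_inputs = 100  # number of columns in the dataset

# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1: twice the number of inputs
e = Dense(n_inputs * 2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2: same number of nodes as inputs
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck: same size as the input, so no compression yet
n_bottleneck = n_inputs
bottleneck = Dense(n_bottleneck)(e)

# define decoder: the encoder structure in reverse
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
d = Dense(n_inputs * 2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# linear output layer with one node per input column
output = Dense(n_inputs, activation='linear')(d)

# tie encoder and decoder together and compile for reconstruction
model = Model(inputs=visible, outputs=output)
model.compile(optimizer='adam', loss='mse')

# sanity check: the output has the same width as the input
out_shape = model.predict(np.zeros((2, n_inputs)), verbose=0).shape
print(out_shape)  # (2, 100)
```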

We can plot the layers in the autoencoder model to get a feeling for how the data flows through the model.

The image below shows a plot of the autoencoder.

Plot of Autoencoder Model for Classification With No Compression

Next, we can train the model to reproduce the input and keep track of the performance of the model on the hold-out test set.

After training, we can plot the learning curves for the train and test sets to confirm the model learned the reconstruction problem well.

Finally, we can save the encoder model for use later, if desired.
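The training and saving steps can be sketched as below. To keep the block short and quick to run it uses a single hidden block on each side and only 5 epochs; both are simplifications and not the configuration described above. The temporary directory is also an assumption, used so the sketch does not write into the working directory.

```python
# train an autoencoder on its own input, then keep only the encoder (sketch)
import os
import tempfile
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Input, Dense, LeakyReLU, BatchNormalization
from tensorflow.keras.models import Model

# dataset, split and scaling (seeds and split size are assumptions)
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=90, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# simplified autoencoder: one hidden block per side for brevity
n_inputs = X.shape[1]
visible = Input(shape=(n_inputs,))
e = Dense(n_inputs * 2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
bottleneck = Dense(n_inputs)(e)
d = Dense(n_inputs * 2)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
output = Dense(n_inputs, activation='linear')(d)
model = Model(inputs=visible, outputs=output)
model.compile(optimizer='adam', loss='mse')

# the input is also the target; the test set is used to track hold-out loss
history = model.fit(X_train, X_train, epochs=5, batch_size=16, verbose=0,
                    validation_data=(X_test, X_test))
print(history.history['loss'][-1], history.history['val_loss'][-1])

# a second Model over the same layers reuses the trained weights,
# so the encoder needs no compiling or fitting of its own
encoder = Model(inputs=visible, outputs=bottleneck)

# save only the encoder; the decoder layers are not part of this model
path = os.path.join(tempfile.mkdtemp(), 'encoder.h5')
encoder.save(path)
```

The lists in `history.history` hold one loss value per epoch and can be passed to a plotting library to draw the learning curves.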

As part of saving the encoder, we will also plot the encoder model to get a feeling for the shape of the output of the bottleneck layer, e.g. a 100 element vector.

An example of this plot is provided below.

Plot of Encoder Model for Classification With No Compression

Tying this all together, the complete example of an autoencoder for reconstructing the input data for a classification dataset without any compression in the bottleneck layer is listed below.

Running the example fits the model and reports loss on the train and test sets along the way.

Note: if you have problems creating the plots of the model, you can comment out both the import of and the call to the plot_model() function.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we see that loss gets low, but does not go to zero (as we might have expected) with no compression in the bottleneck layer. Perhaps further tuning the model architecture or learning hyperparameters is required.

A plot of the learning curves is created showing that the model achieves a good fit in reconstructing the input, which holds steady throughout training, not overfitting.

Learning Curves of Training the Autoencoder Model Without Compression

So far, so good. We know how to develop an autoencoder without compression.

Next, let’s change the configuration of the model so that the bottleneck layer has half the number of nodes (e.g. 50).

Tying this together, the complete example is listed below.

Running the example fits the model and reports loss on the train and test sets along the way.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we see that the loss gets about as low as in the example above without compression, suggesting that perhaps the model performs just as well with a bottleneck half the size.

A plot of the learning curves is created, again showing that the model achieves a good fit in reconstructing the input, which holds steady throughout training, not overfitting.

Learning Curves of Training the Autoencoder Model With Compression

The trained encoder is saved to the file “encoder.h5” that we can load and use later.

Next, let’s explore how we might use the trained encoder model.

Encoder as Data Preparation for Predictive Model

In this section, we will use the trained encoder from the autoencoder to compress input data and train a different predictive model.

First, let’s establish a baseline in performance on this problem. This is important because if the compressed encoding does not improve the performance of a model, then it does not add value to the project and should not be used.

We can train a logistic regression model on the training dataset directly and evaluate the performance of the model on the holdout test set.
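A sketch of such a baseline is shown below. It repeats the dataset, split and scaling steps so that it runs on its own; the seeds, the split size and the default `LogisticRegression` settings are assumptions.

```python
# baseline: logistic regression on the raw (scaled) inputs (sketch)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# dataset, split and scaling (seeds and split size are assumptions)
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=90, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# fit on the training set and score on the holdout test set
model = LogisticRegression()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print(acc)
```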

The complete example is listed below.

Running the example fits a logistic regression model on the training dataset and evaluates it on the test set.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieves a classification accuracy of about 89.3 percent.

For the encoding to be considered useful, we would hope and expect a logistic regression model fit on an encoded version of the input to achieve better accuracy.

We can update the example to first encode the data using the encoder model trained in the previous section.

First, we can load the trained encoder model from the file.

We can then use the encoder to transform the raw input data (e.g. 100 columns) into bottleneck vectors (e.g. 50 element vectors).

This process can be applied to the train and test datasets.

We can then use this encoded data to train and evaluate the logistic regression model, as before.
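The encode-then-classify flow can be sketched as follows. So that the block runs on its own, it fits a small stand-in autoencoder for a few epochs instead of loading the saved file; with a saved encoder available, that part would be replaced by `encoder = load_model('encoder.h5')`. The simplified architecture, the 5 epochs and `max_iter=1000` are assumptions.

```python
# encode the inputs with the encoder, then fit logistic regression (sketch)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Input, Dense, LeakyReLU, BatchNormalization
from tensorflow.keras.models import Model

# dataset, split and scaling (seeds and split size are assumptions)
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=90, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# stand-in for loading 'encoder.h5': a small autoencoder with a
# 50-node bottleneck, fit briefly on the training inputs
n_inputs = X.shape[1]
visible = Input(shape=(n_inputs,))
e = Dense(n_inputs * 2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
bottleneck = Dense(n_inputs // 2)(e)
d = Dense(n_inputs * 2)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
output = Dense(n_inputs, activation='linear')(d)
autoencoder = Model(inputs=visible, outputs=output)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_train, X_train, epochs=5, batch_size=16, verbose=0)
encoder = Model(inputs=visible, outputs=bottleneck)

# transform 100-column rows into 50-element bottleneck vectors
X_train_encode = encoder.predict(X_train, verbose=0)
X_test_encode = encoder.predict(X_test, verbose=0)

# fit and evaluate the classifier on the encoded data
# (max_iter is raised as a precaution because the encoding is unscaled)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_encode, y_train)
yhat = model.predict(X_test_encode)
acc = accuracy_score(y_test, yhat)
print(X_train_encode.shape, acc)
```

The same two `encoder.predict()` calls apply unchanged to an encoder loaded from file, since only the inputs and the bottleneck output are involved.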

Tying this together, the complete example is listed below.

Running the example first encodes the dataset using the encoder, then fits a logistic regression model on the training dataset and evaluates it on the test set.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieves a classification accuracy of about 93.9 percent.

This is a better classification accuracy than the same model evaluated on the raw dataset, suggesting that the encoding is helpful for our chosen model and test harness.

Summary

In this tutorial, you discovered how to develop and evaluate an autoencoder for classification predictive modeling.

Specifically, you learned:

  • An autoencoder is a neural network model that can be used to learn a compressed representation of raw data.
  • How to train an autoencoder model on a training dataset and save just the encoder part of the model.
  • How to use the encoder as a data preparation step when training a machine learning model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

43 Responses to Autoencoder Feature Extraction for Classification

  1. Mike December 7, 2020 at 12:33 pm #

    Thanks Jason,

    Can you explain again why we would expect the results of a compressed dataset with the encoder to give better results than the raw dataset? Aren’t we just losing information by compressing?

    I thought that the value of the compression would be that we would be dealing with a smaller dataset with less features.

    I guess somehow it’s learned more useful latent features similar to how embeddings work? Is that the case?

    Thanks

    • Jason Brownlee December 7, 2020 at 1:36 pm #

      We don’t expect it to give better performance, but if it does, it’s great for our project.

      It is similar to an embedding for discrete data.

      Yes – similar to dimensionality reduction or feature selection, but using less features is only useful if we get same or better performance.

  2. John December 10, 2020 at 7:29 pm #

    Thanks for this tutorial!

    How does encoder.save(‘encoder.h5’) get the learned weights from the model object? How does instantiating a new model object using encoder = Model(inputs=visible, outputs=bottleneck) allow us to keep the weights?

  3. Hilal December 11, 2020 at 6:45 am #

    Dear Jason, thank you for all the informative posts. Like John, I am confused on one point. How does the new encoder model learn its weights from the autoencoder, and why don’t we compile the encoder model?

    • Jason Brownlee December 11, 2020 at 7:42 am #

      You’re welcome.

      We train the encoder as part of the autoencoder, but then only save the encoder part. The weights are shared between the two models.

      No need to compile the encoder as it is not trained directly.

  4. Usman December 11, 2020 at 1:44 pm #

    Dear Jason,
    Thanks for the nice tutorial. Is there an efficient way to see how the data is projected on the bottleneck? I would like to compare the projection with PCA.

  5. Sampa December 11, 2020 at 4:07 pm #

    Thank you so much for this informative tutorial. Please let me know the required versions of Keras and TensorFlow to run this code.

    • Jason Brownlee December 12, 2020 at 6:22 am #

      You can use the latest version of Keras and TensorFlow libraries.

  6. Siddheshwar Harkal December 11, 2020 at 5:35 pm #

    Dear Jason
    this is a classification problem then why we take the loss as MSE

    • Jason Brownlee December 12, 2020 at 6:23 am #

      We use MSE loss for the reconstruction error for the inputs – which are numeric.

  7. arkadia December 11, 2020 at 8:45 pm #

    Thanks for this tutorial. Is it possible to make a single prediction? Which transformation should we apply?

    • Jason Brownlee December 12, 2020 at 6:26 am #

      Yes, encode the input with the encoder, then pass the input to the predict() function of the trained model.

  8. Abdelrahim December 11, 2020 at 10:36 pm #

    Dear Jason, I think there is a typo in
    # train autoencoder for classification with no compression in the bottleneck layer
    in the fit call.
    You wrote “history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2, validation_data=(X_test,X_test))”
    I think it should be y_train, not X_train twice.
    With best regards,
    thanks for the tutorial

    • Jason Brownlee December 12, 2020 at 6:28 am #

      No, it is correct.

      The autoencoder is being trained to reconstruct the input – that is the whole idea of the autoencoder.

  9. Igors Papka December 12, 2020 at 10:24 pm #

    Dear Dr. Jason,
    Thank you for the tutorial.

    The method looks good for determining the number of clusters in unsupervised learning. I tried to reduce the dimensions with it and estimate the number of clusters, first on a large synthetic dataset (more than 25000 instances and 100 features) with 10 informative features, and then again on real noisy data of the same size. I achieved good results in both cases by reducing the number of features to fewer than the informative ones, five in my case. This method helps to reveal clear “elbows” in the AIC and BIC information criteria plots for the Gaussian Mixture Model, and speeds up the algorithm several times over.

    • Jason Brownlee December 13, 2020 at 6:04 am #

      You’re welcome.

      Nice work, thanks for sharing your finding!

    • Hareem Ayesha December 17, 2020 at 6:51 pm #

      Hi… can we use this tutorial for multi label classification problem??

      • Jason Brownlee December 18, 2020 at 7:15 am #

        The autoencoder can be used directly, just change the predictive model that makes use of the encoded input.

  10. JG December 13, 2020 at 7:41 am #

    Hi Jason:

    Thank you very much for your great free tutorial catalog, one of the best in the world! It serves as inspiration for my work below.

    I share my conclusions after applying several modification to your baseline autoencoder classification code:

    1.) Code Modifications:

    1.1) I decided to compare accuracies results from 5 different classification models:
    (LogisticRegression, SVC, ExtratreesClassifier, RandomForestClassifier, XGBClassifier)
    1.2) I apply statistical evaluation to model results through the well-known “KFold()” and “cross_val_score()” functions of the scikit-learn library
    1.3) and, very importantly, I apply several rates of autoencoder feature compression: 1 (no compression at all), 1/2 (your choice), 1/4 (even more compressed), and of course no autoencoding at all. I even expand the features to double the number to see what happens (some kind of embedding?) …

    2.) My conclusion, after reproducing the results of your LogisticRegression model, is that the results are more sensitive to the model chosen:
    sometimes autoencoding gives no better results than not autoencoding, and sometimes 1/4 compression is the best … so there is a lot of variation, which indicates you have to work in a heuristic way for every particular problem!
    In particular, my best results come from the SVC classification model without autoencoding, but for the logistic regression model it is true that the best results are achieved by autoencoding with feature compression (1/2).

    It is a pity that I cannot insert my graphs here (I do not know how) to visualize the results!

    As I said you provide us with the basic tools and concepts and then we can experiment variations on those ideas

    • Jason Brownlee December 13, 2020 at 1:03 pm #

      Thanks!

      Well done, that sounds like a great experiment.

      Likely results are limited by the synthetic dataset. Perhaps the results would be more interesting/varied with a larger and more realistic dataset where feature extraction can play an important role.

  11. JG December 13, 2020 at 11:48 pm #

    As a matter of fact, I applied the same autoencoder analysis to more “realistic” datasets, “breast cancer” and “diabetes pima india”, and I got results similar to the previous ones but with lower accuracy, around 75% for Cancer and 77% for Diabetes, probably because of the small number of samples (286 for cancer and 768 for diabetes)…

    In both cases LogisticRegression is now the best model, with and without autoencoding and compression… I remember getting the same results using ‘onehotencoding’ in the cancer case …

    So “trial and error” with different models and different encoding methods for each particular problem seems to be the only way out…

    • Jason Brownlee December 14, 2020 at 6:18 am #

      Very nice work!

      No silver bullet for feature extraction, and all that. Just another method in our toolbox.

  12. Selma December 18, 2020 at 12:29 am #

    Hello
    I need a matlab code for this tutorial

  13. Robert December 31, 2020 at 5:44 am #

    Hi Jason,

    Thank you very much for this insightful guide.

    When using an AE solely for feature creation, can you skip the steps on decoding and fitting? i.e. just use the encoder part:

    # define encoder
    visible = Input(shape=(n_inputs,))

    # encoder level 1
    e = Dense(n_inputs*2)(visible)
    e = BatchNormalization()(e)
    e = LeakyReLU()(e)

    # encoder level 2
    e = Dense(n_inputs)(e)
    e = BatchNormalization()(e)
    e = LeakyReLU()(e)

    # bottleneck
    n_bottleneck = n_inputs
    bottleneck = Dense(n_bottleneck)(e)

    And then ‘create’ the new features by jumping to:

    encoder = Model(inputs=visible, outputs=bottleneck)
    X_train_encode = encoder.predict(X_train)
    X_test_encode = encoder.predict(X_test)

    In other words, is there any need to encode and fit when only using the AE to create features?

    Thank you very much.

    • Jason Brownlee December 31, 2020 at 9:23 am #

      This is exactly what we do at the end of the tutorial.

  14. Robert December 31, 2020 at 8:30 pm #

    But you load and use the saved encoder at the end of the tutorial – encoder = load_model(‘encoder.h5’). Just wondering if encoding and fitting prior to saving the encoder has any impact at the end when creating. Thanks

    • Robert December 31, 2020 at 8:31 pm #

      * decoding and fitting

    • Jason Brownlee January 1, 2021 at 5:24 am #

      The encoder model must be fit before it can be used.

      You can choose to save the fit encoder model to file or not, it does not make a difference to its performance.

      The decoder is not saved, it is discarded.

      • Robert January 9, 2021 at 2:25 am #

        Why do we fit the encoder model in feature creation, if fitting is just used to reconstruct the input (which we don’t need)?

        • Jason Brownlee January 9, 2021 at 6:44 am #

          It is fit on the reconstruction project, then we discard the decoder and are left with just the encoder that knows how to compress input data in a useful way.

          • Robert January 9, 2021 at 7:00 am #

            Got it, thank you very much. Just wanted to ensure that the loss and val_loss are still relevant when using the latent representation, even though the decoder is discarded.

          • Jason Brownlee January 9, 2021 at 8:35 am #

            The loss is only relevant to the task of reconstructing input.

            The encoding achieved at the bottleneck layer may or may not be helpful to a prediction task using the input data, it depends on the specific dataset.

            Generally, it can be helpful – the whole idea of the tutorial is to teach you how to do this so you can test it on your data and find out.

          • Robert January 9, 2021 at 7:35 pm #

            Ok so loss is not relevant when only taking the encoded representation. I am trying to compare different (feature extraction) autoencoders. I was hoping to do so by comparing the loss and val_loss, but I guess doing so is only relevant when fitting a model for classification, after extracting the AE features.
            Thanks

          • Jason Brownlee January 10, 2021 at 5:38 am #

            Yes, the only relevant comparison (for predictive modeling) is the effect on a classifier/regressor that uses the encoded input.

  15. Abdelrahman January 11, 2021 at 9:56 am #

    Dear Jason,
    I am going to use the encoder part as a tool that generates a new features and I will combine them with the original data set.
    So, How can I control the number of new features I want to get, in the code?

    • Jason Brownlee January 11, 2021 at 10:28 am #

      Good question.

      Control over the number of features in the encoding is via the number of nodes in the bottleneck layer.

      • Abdelrahman Fayed January 12, 2021 at 11:32 am #

        I already did, but it always gives me the same number of features as my original input.
        Here is the code I changed.
        Or if you have time, please send me the modified version that gives 10 new features.
        abdelrahmanahmedfayed@gmail.com

        # define encoder
        visible = Input(shape=(n_inputs,))
        # encoder level 1
        e = Dense(round(float(n_inputs) / 2.0))(visible)
        e = BatchNormalization()(e)
        e = LeakyReLU()(e)
        # encoder level 2
        e = Dense(round(float(n_inputs) / 2.0))(e)
        e = BatchNormalization()(e)
        e = LeakyReLU()(e)
        # bottleneck
        n_bottleneck = 10
        bottleneck = Dense(n_bottleneck)(e)

        • Jason Brownlee January 12, 2021 at 12:35 pm #

          Sorry, I don’t have the capacity to customize the tutorial for you.
