Introduction to Dimensionality Reduction for Machine Learning

By Jason Brownlee on June 30, 2020 in Data Preparation

The number of input variables or features for a dataset is referred to as its dimensionality.

Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.

Having more input features often makes a predictive modeling task more challenging, a problem generally referred to as the curse of dimensionality.

High-dimensionality statistics and dimensionality reduction techniques are often used for data visualization. Nevertheless, these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.

In this post, you will discover a gentle introduction to dimensionality reduction for machine learning.

After reading this post, you will know:

Large numbers of input features can cause poor performance for machine learning algorithms.
Dimensionality reduction is a general field of study concerned with reducing the number of input features.
Dimensionality reduction methods include feature selection, linear algebra methods, projection methods, and autoencoders.

Let’s get started.

Updated May/2020: Changed section headings to be more accurate.

A Gentle Introduction to Dimensionality Reduction for Machine Learning
Photo by Kevin Jarrett, some rights reserved.

Overview

This tutorial is divided into three parts; they are:

Problem With Many Input Variables
Dimensionality Reduction
Techniques for Dimensionality Reduction
1. Feature Selection Methods
2. Matrix Factorization
3. Manifold Learning
4. Autoencoder Methods
5. Tips for Dimensionality Reduction

Problem With Many Input Variables

The performance of machine learning algorithms can degrade with too many input variables.

If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input to a model to predict the target variable. Input variables are also called features.

We can consider the columns of data as representing the dimensions of an n-dimensional feature space and the rows of data as points in that space. This is a useful geometric interpretation of a dataset.

Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that we have in that space (rows of data) often represent a small and non-representative sample.

This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as the “curse of dimensionality.”
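One way to get a feel for this effect is to look at how distances between points behave as dimensions are added. The sketch below is only an illustration under assumed settings (500 uniformly random points, an arbitrary seed, and a hypothetical helper named distance_contrast); it is not a result from a real dataset. It compares the farthest and nearest distances from one reference point to all the others. As the number of dimensions grows, the ratio tends to shrink toward 1, which means the points become almost equally far apart and a fixed-size sample tells us less about any local neighborhood.

```python
import numpy as np


def distance_contrast(n_points, n_dims, seed=0):
    """Ratio of the farthest to the nearest distance from one point to the rest."""
    rng = np.random.default_rng(seed)
    # uniform random rows inside the unit hypercube
    points = rng.random((n_points, n_dims))
    # Euclidean distance from the first row to every other row
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return dists.max() / dists.min()


# same number of rows, increasing number of columns
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(500, n_dims), 2))
```

With only a few dimensions the ratio is typically large; with many dimensions it is typically close to 1.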

Therefore, it is often desirable to reduce the number of input features.

This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.”

Dimensionality Reduction

Dimensionality reduction refers to techniques for reducing the number of input variables in training data.

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data. This is called dimensionality reduction.

— Page 11, Machine Learning: A Probabilistic Perspective, 2012.

High-dimensionality might mean hundreds, thousands, or even millions of input variables.

Fewer input dimensions often mean correspondingly fewer parameters, or a simpler structure, in the machine learning model; the number of free parameters is referred to as the model’s degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data.

It is desirable to have simple models that generalize well, and in turn, input data with few input variables. This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related.

The fundamental reason for the curse of dimensionality is that high-dimensional functions have the potential to be much more complicated than low-dimensional ones, and that those complications are harder to discern. The only way to beat the curse is to incorporate knowledge about the data that is correct.

— Page 15, Pattern Classification, 2000.

Dimensionality reduction is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model.

… dimensionality reduction yields a more compact, more easily interpretable representation of the target concept, focusing the user’s attention on the most relevant variables.

— Page 289, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

As such, any dimensionality reduction performed on training data must also be performed on new data, such as a test dataset, validation dataset, and data when making a prediction with the final model.
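A convenient way to honor this requirement is to chain the transform and the model so that the reduction is learned from the training data only and then re-applied unchanged to new data. The following is a minimal sketch using scikit-learn's Pipeline; the synthetic dataset, the choice of PCA with 5 components, and the logistic regression model are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic dataset (assumed for illustration): 20 inputs, 5 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# scaling, reduction, and the model are fit as one unit
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=5)),
    ("model", LogisticRegression()),
])

# the scaler and PCA are fit on the training rows only
pipeline.fit(X_train, y_train)

# the same fitted transforms are applied to the test rows before prediction
accuracy = pipeline.score(X_test, y_test)

# the reduced test data can be inspected by running every step except the model
X_test_reduced = pipeline[:-1].transform(X_test)
print(X_test_reduced.shape, accuracy)
```

Keeping the steps in one object also means the same pipeline can be used when making predictions with the final model.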

Techniques for Dimensionality Reduction

There are many techniques that can be used for dimensionality reduction.

In this section, we will review the main techniques.

Feature Selection Methods

Perhaps the most common are so-called feature selection techniques that use scoring or statistical methods to select which features to keep and which features to delete.

… perform feature selection, to remove “irrelevant” features that do not help much with the classification problem.

— Page 86, Machine Learning: A Probabilistic Perspective, 2012.

The two main classes of feature selection techniques are wrapper methods and filter methods.

For more on feature selection in general, see the tutorial:

An Introduction to Feature Selection

Wrapper methods, as the name suggests, wrap a machine learning model, fitting and evaluating the model with different subsets of input features and selecting the subset that results in the best model performance. Recursive feature elimination (RFE) is an example of a wrapper feature selection method.
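As a rough sketch of a wrapper method, the example below uses scikit-learn's RFE class around a decision tree. The synthetic dataset, the wrapped model, and the number of features to keep (4) are assumptions chosen for illustration. Note that RFE does not try every possible subset; it repeatedly fits the wrapped model and drops the least important features until the requested number remains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# synthetic dataset (assumed for illustration): 10 inputs, 4 of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=2, random_state=1)

# wrap a model and ask RFE to keep 4 input features
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1),
          n_features_to_select=4)
X_selected = rfe.fit_transform(X, y)

print(X_selected.shape)  # rows are unchanged, columns are reduced
print(rfe.support_)      # boolean mask of the columns that were kept
print(rfe.ranking_)      # 1 marks a kept column, larger values were dropped earlier
```

Because the kept columns are original columns, the result remains directly interpretable.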

Filter methods use scoring methods, like the correlation between each feature and the target variable, to select the subset of input features that is most predictive. Examples include Pearson’s correlation and the chi-squared test.
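A filter method can be sketched with scikit-learn's SelectKBest class. In the example below, the synthetic regression dataset and the choice of k=3 are assumptions for illustration; the f_regression score function is based on the correlation between each input column and the target.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic dataset (assumed for illustration): 10 inputs, 3 of them informative
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=0.1, random_state=1)

# score each column against the target independently and keep the top 3
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)
print(selector.get_support(indices=True))  # indices of the columns that were kept
print(selector.scores_)                    # one score per original column
```

Unlike a wrapper method, no predictive model is fit here; each feature is scored on its own, which is fast but ignores interactions between features.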

For more on filter-based feature selection methods, see the tutorial:

How to Choose a Feature Selection Method for Machine Learning

Matrix Factorization

Techniques from linear algebra can be used for dimensionality reduction.

Specifically, matrix factorization methods can be used to reduce a dataset matrix into its constituent parts.

Examples include the eigendecomposition and singular value decomposition.

For more on matrix factorization, see the tutorial:

A Gentle Introduction to Matrix Factorization for Machine Learning

The parts can then be ranked, and the subset of parts that best captures the salient structure of the matrix can be selected and used to represent the dataset.
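To make the idea of ranking and selecting parts concrete, here is a small sketch using the singular value decomposition from NumPy. The toy matrix (six columns built from two hidden factors plus a little noise) and the choice of keeping k=2 parts are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy dataset (assumed): 100 rows, 6 columns that are noisy mixtures of 2 hidden factors
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
A = latent @ mixing + 0.01 * rng.normal(size=(100, 6))

# factorize the matrix; the singular values in s are returned in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s.round(2))

# keep only the k highest-ranked parts
k = 2
A_reduced = U[:, :k] * s[:k]      # 100 x 2 representation of the dataset
A_approx = A_reduced @ Vt[:k, :]  # rank-k reconstruction in the original 6 columns

# relative error of the reconstruction
rel_error = np.linalg.norm(A - A_approx) / np.linalg.norm(A)
print(A_reduced.shape, rel_error)
```

The singular values provide the ranking: large values mark parts that carry most of the structure, and small values mark parts that can be dropped with little loss.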

The most common method for ranking the components is principal components analysis, or PCA for short.

The most common approach to dimensionality reduction is called principal components analysis or PCA.

— Page 11, Machine Learning: A Probabilistic Perspective, 2012.
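As a brief sketch of PCA in practice, the example below uses scikit-learn's PCA class. The synthetic dataset and the choice of 3 components are assumptions for illustration; on a real project the number of components would be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# synthetic dataset (assumed for illustration): 10 inputs, several of them redundant
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=5, random_state=1)

# project the 10 input columns onto the 3 highest-ranked components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
# fraction of the total variance captured by each kept component, highest first
print(pca.explained_variance_ratio_)
```

The explained variance ratios give a simple way to judge how much of the data's spread survives the reduction.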

For more on PCA, see the tutorial:

How to Calculate Principal Component Analysis (PCA) From Scratch in Python

Manifold Learning

Techniques from high-dimensionality statistics can also be used for dimensionality reduction.

In mathematics, a projection is a kind of function or mapping that transforms data in some way.

— Page 304, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

These techniques are sometimes referred to as “manifold learning” and are used to create a low-dimensional projection of high-dimensional data, often for the purposes of data visualization.

The projection is designed to create a low-dimensional representation of the dataset whilst best preserving the salient structure or relationships in the data.

Examples of manifold learning techniques include:

Kohonen Self-Organizing Map (SOM)
Sammon Mapping
Multidimensional Scaling (MDS)
t-distributed Stochastic Neighbor Embedding (t-SNE)

The features in the projection often have little relationship with the original columns, e.g. they do not have column names, which can be confusing to beginners.
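The sketch below shows what such a projection looks like in code, using scikit-learn's TSNE class on a slice of the bundled handwritten digits dataset. The dataset, the 300-row subset, and the perplexity value are assumptions chosen to keep the example small and quick.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images flattened to 64 input columns; a small subset keeps it quick
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# project 64 dimensions down to 2, e.g. for a scatter plot
tsne = TSNE(n_components=2, perplexity=30, random_state=1)
X_embedded = tsne.fit_transform(X)

# the two output columns are unnamed coordinates, not original pixels
print(X.shape, "->", X_embedded.shape)
```

One practical point worth knowing: scikit-learn's TSNE provides fit_transform but no separate transform for unseen rows, which is one reason it is mostly used for visualization rather than as a step in a predictive modeling pipeline.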

Autoencoder Methods

Deep learning neural networks can be constructed to perform dimensionality reduction.

A popular approach is the autoencoder. This involves framing a self-supervised learning problem where a model must reproduce its input correctly.

For more on self-supervised learning, see the tutorial:

14 Different Types of Learning in Machine Learning

A network model is used that seeks to compress the data flow to a bottleneck layer with far fewer dimensions than the original input data. The part of the model prior to and including the bottleneck is referred to as the encoder, and the part of the model that reads the bottleneck output and reconstructs the input is called the decoder.

An auto-encoder is a kind of unsupervised neural network that is used for dimensionality reduction and feature discovery. More precisely, an auto-encoder is a feedforward neural network that is trained to predict the input itself.

— Page 1000, Machine Learning: A Probabilistic Perspective, 2012.

After training, the decoder is discarded and the output from the bottleneck is used directly as the reduced dimensionality of the input. Inputs transformed by this encoder can then be fed into another model, not necessarily a neural network model.

Deep autoencoders are an effective framework for nonlinear dimensionality reduction. Once such a network has been built, the top-most layer of the encoder, the code layer hc, can be input to a supervised classification procedure.

— Page 448, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
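The encoder and decoder idea can be sketched without a deep learning library. The example below is a deliberately small stand-in, not a deep autoencoder: it trains scikit-learn's MLPRegressor with a single 3-unit hidden layer to predict its own input, then reuses the learned first-layer weights as the encoder. The synthetic dataset, the bottleneck size, the tanh activation, and the helper name encode are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# synthetic dataset (assumed for illustration); the labels are not used
X, _ = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=5, random_state=1)
X = StandardScaler().fit_transform(X)

# a single hidden layer of 3 units is the bottleneck; the target is the input itself
autoencoder = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                           max_iter=2000, random_state=1)
autoencoder.fit(X, X)


def encode(model, data):
    """Apply only the encoder half: input layer to bottleneck activations."""
    return np.tanh(data @ model.coefs_[0] + model.intercepts_[0])


# the decoder half is ignored; the bottleneck output is the reduced dataset
X_encoded = encode(autoencoder, X)
print(X.shape, "->", X_encoded.shape)
```

The rows of X_encoded could then be passed to any other model in place of the original ten columns.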

The output of the encoder is a type of projection and, as with other projection methods, there is no direct relationship from the bottleneck output back to the original input variables, making the new features challenging to interpret.

For an example of an autoencoder, see the tutorial:

A Gentle Introduction to LSTM Autoencoders

Tips for Dimensionality Reduction

There is no best technique for dimensionality reduction and no mapping of techniques to problems.

Instead, the best approach is to use systematic controlled experiments to discover what dimensionality reduction techniques, when paired with your model of choice, result in the best performance on your dataset.
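A controlled experiment of this kind might look like the sketch below, which evaluates the same model with no reduction, with PCA, and with a filter-based selection under identical cross-validation. The synthetic dataset, the logistic regression model, the candidate techniques, and the choice of 5 output features are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic dataset (assumed for illustration)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=1)

# candidate reduction steps; "passthrough" leaves the data unchanged as a baseline
candidates = {
    "none": "passthrough",
    "pca": PCA(n_components=5),
    "select_k_best": SelectKBest(score_func=f_classif, k=5),
}

results = {}
for name, reducer in candidates.items():
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("reduce", reducer),
        ("model", LogisticRegression()),
    ])
    # identical model and folds for every candidate, so only the reduction differs
    scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because each reduction step sits inside the pipeline, it is re-fit on the training folds only, which keeps the comparison fair.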

Typically, linear algebra and manifold learning methods assume that all input features have the same scale or distribution. This suggests that it is good practice to either normalize or standardize data prior to using these methods if the input variables have differing scales or units.
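The effect of differing scales is easy to demonstrate. In the sketch below, the two made-up columns are independent but one has a spread roughly 1,000 times larger than the other (for example, the same kind of quantity recorded in different units); the data and seed are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# two independent columns (assumed data) with very different scales
X = np.column_stack([
    rng.normal(0, 1, 500),     # small-scale column
    rng.normal(0, 1000, 500),  # large-scale column
])

# PCA on the raw data: the large-scale column dominates the first component
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing: both columns contribute on equal terms
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

On the raw data, nearly all of the variance is attributed to the first component purely because of the units; after standardizing, the variance is split roughly evenly, as expected for two unrelated columns.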

Summary

In this post, you discovered a gentle introduction to dimensionality reduction for machine learning.

Specifically, you learned:

Large numbers of input features can cause poor performance for machine learning algorithms.
Dimensionality reduction is a general field of study concerned with reducing the number of input features.
Dimensionality reduction methods include feature selection, linear algebra methods, projection methods, and autoencoders.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

11 Responses to Introduction to Dimensionality Reduction for Machine Learning

Dominic September 29, 2020 at 11:15 pm #

I’ve a question regarding the term ‘dimensionality reduction’:

Assume I have a digital invoice that consists of n feature vectors, each with m features.

What if I want to reduce the number m, for example to ‘squash’ all feature vectors that belong to the same invoice?

What is the proper way to do this? Could I try Principal Component Analysis or Non-negative matrix factorization? Or am I completely misunderstanding the term ‘dimension’?

Sincerely,

Dominic

- Jason Brownlee September 30, 2020 at 6:35 am #
  
  Your understanding is correct; we reduce the number of “features”, generally columns in a table of data.
  
  In your case, you might need to use an alternate method – not sure that PCA would be directly applicable. I could be wrong. Perhaps an autoencoder of some kind would be better.
  
  - Dominic October 1, 2020 at 10:24 pm #
    
    All right, I see. Thanks for the information!
    
Saurabh nand December 17, 2020 at 2:09 pm #

I have tried 3 times and am still not getting your mail for the crash course. Disappointing.

- Jason Brownlee December 18, 2020 at 7:14 am #
  
  Sorry, there was a massive google/email outage this week and I was impacted.
  
  It should be working now.
  
  Contact me directly any time if you have issues:
  https://machinelearningmastery.com/contact/
  
Ignacio March 25, 2021 at 1:46 am #

I also have a question.

Imagine a dataset with a lot of variables. Would it be possible to perform dimensionality reduction on some of the data but not on the whole dataset?

The question is just whether it makes any sense, mathematically speaking.

- Jason Brownlee March 25, 2021 at 4:46 am #
  
  Sure, you can select a subset of data on which to apply the method.
  
Angel de la Vega June 23, 2021 at 11:34 pm #

Hello Jason, I have a question about dimensionality reduction and convolutional neural networks.

I have seen that dimensionality reduction often gives good results in “classical” machine learning. However, I have not found examples where dimensionality reduction is used for the inputs of convolutional neural networks.

Is there a case where dimensionality reduction brings any advantage to CNNs? And if not, what would be the reason?

Thank you very much in advance

- Jason Brownlee June 24, 2021 at 6:02 am #
  
  Models like CNN perform their own automatic feature extraction process.
  
Juan October 29, 2021 at 9:41 pm #

Hello Jason, thanks for your article.
I’m working on a model to predict demand for a category of products.
Every group contains different SKUs with different prices; currently we aggregate all the prices together using an average or a weighted average.
However, after reading your article, I’ve started thinking about reducing the dimensionality of prices.
Do you see any potential problem with this approach?
Thank you in advance.

- Adrian Tam October 30, 2021 at 12:40 pm #
  
  I would suggest collecting previous data on sales by number of items sold. Then use PCA, for example, to find the hidden category. In economics, there are complementary goods and substitute goods. Identifying these relationships might help.
  
