Last Updated on August 9, 2019
Linear algebra is a sub-field of mathematics concerned with vectors, matrices, and linear transforms.
It is a key foundation to the field of machine learning, from notations used to describe the operation of algorithms to the implementation of algorithms in code.
Although linear algebra is integral to the field of machine learning, the tight relationship is often left unexplained or explained using abstract concepts such as vector spaces or specific matrix operations.
In this post, you will discover 10 common examples of machine learning that you may be familiar with that use, require and are really best understood using linear algebra.
After reading this post, you will know:
- The use of linear algebra structures when working with data, such as tabular datasets and images.
- Linear algebra concepts when working with data preparation, such as one hot encoding and dimensionality reduction.
- The ingrained use of linear algebra notation and methods in sub-fields such as deep learning, natural language processing, and recommender systems.
Let’s get started.
In this post, we will review 10 obvious and concrete examples of linear algebra in machine learning.
I tried to pick examples that you may be familiar with or have even worked with before. They are:
- Dataset and Data Files
- Images and Photographs
- One Hot Encoding
- Linear Regression
- Regularization
- Principal Component Analysis
- Singular-Value Decomposition
- Latent Semantic Analysis
- Recommender Systems
- Deep Learning
Do you have your own favorite example of linear algebra in machine learning?
Let me know in the comments below.
1. Dataset and Data Files
In machine learning, you fit a model on a dataset.
A dataset is a table-like set of numbers where each row represents an observation and each column represents a feature of that observation.
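For example, below is a snippet of the Iris flowers dataset:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa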
This data is in fact a matrix: a key data structure in linear algebra.
Further, when you split the data into inputs and outputs to fit a supervised machine learning model, such as the measurements and the flower species, you have a matrix (X) and a vector (y). The vector is another key data structure in linear algebra.
Each row has the same length, i.e. the same number of columns, so we can say that the data is vectorized: rows can be provided to a model one at a time or in a batch, and the model can be pre-configured to expect rows of a fixed width.
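As a minimal sketch of the split into inputs and outputs, the snippet below holds the first five Iris measurements in a NumPy matrix and the species labels in a vector. The variable names `X`, `y`, `row`, and `batch` are illustrative choices, not part of any library.

```python
import numpy as np

# First five rows of the Iris measurements: sepal length, sepal width,
# petal length, petal width (one observation per row).
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

# The output variable (flower species) is a vector with one entry per row of X.
y = np.array(["Iris-setosa"] * 5)

print(X.shape)  # (5, 4): 5 observations, 4 features
print(y.shape)  # (5,)

# Rows can be handed to a model one at a time or as a batch of fixed width.
row = X[0]     # a single observation, a vector of length 4
batch = X[:3]  # a batch of three observations, a 3x4 matrix
```

Because every row has the same width, slicing out one row or a batch of rows always yields the shape a model was configured to expect.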
2. Images and Photographs
Perhaps you are more used to working with images or photographs in computer vision applications.
Each image that you work with is itself a table structure with a width and height, with one pixel value in each cell for a black and white image or 3 pixel values in each cell for a color image.
A photo is yet another example of a matrix from linear algebra; a color photo is, strictly, a stack of three such matrices, one per color channel.
Operations on the image, such as cropping, scaling, shearing, and so on are all described using the notation and operations of linear algebra.
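The sketch below illustrates the idea with small made-up arrays rather than a real photo. The image sizes and pixel values are assumptions for illustration, and the horizontal flip is an extra example chosen because it is easy to write as a matrix product.

```python
import numpy as np

# A hypothetical 4x6 grayscale image: one pixel value per cell.
gray = np.arange(24).reshape(4, 6)

# A hypothetical color image of the same size: three values (R, G, B) per cell.
color = np.zeros((4, 6, 3), dtype=np.uint8)

# Cropping is array slicing: keep rows 1-2 and columns 2-4.
crop = gray[1:3, 2:5]

# A horizontal flip written as a matrix product: multiplying on the right
# by the identity matrix with its rows reversed reverses the column order.
flip = np.eye(6, dtype=int)[::-1]
flipped = gray @ flip

print(gray.shape, color.shape, crop.shape)  # (4, 6) (4, 6, 3) (2, 3)
```

Real images would normally be loaded from disk with an imaging library, but once loaded they behave like the arrays above.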
3. One Hot Encoding
Sometimes you work with categorical data in machine learning.
These may be the class labels for classification problems, or perhaps categorical input variables.
It is common to encode categorical variables to make them easier for some techniques to work with and learn from. A popular encoding for categorical variables is the one hot encoding.
A one hot encoding is where a table is created to represent the variable with one column for each category and a row for each example in the dataset. A check, or one-value, is added in the column for the categorical value for a given row, and a zero-value is added to all other columns.
For example, the color variable with the 3 rows:
red
green
blue
Might be encoded as:
red, green, blue
1, 0, 0
0, 1, 0
0, 0, 1
Each row is encoded as a binary vector, a vector with zero or one values. This is an example of a sparse representation, and working with sparse vectors and matrices is a whole sub-field of linear algebra.
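One simple way to build this encoding by hand, sketched here with NumPy, is to look up each value's column index and select the matching row of an identity matrix. The names `colors`, `categories`, and `one_hot` are illustrative, and the column order is chosen to match the table above.

```python
import numpy as np

# The categorical variable, one value per example.
colors = np.array(["red", "green", "blue"])

# Column order for the encoding (an assumption: red, green, blue).
categories = ["red", "green", "blue"]

# Map each value to the index of its column.
index = np.array([categories.index(c) for c in colors])

# Row i of the identity matrix is the binary vector for category i.
one_hot = np.eye(len(categories), dtype=int)[index]

print(one_hot)
```

Libraries such as scikit-learn provide their own encoders; this version only shows that the result is an ordinary matrix of binary row vectors.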
4. Linear Regression
Linear regression is an old method from statistics for describing the relationships between variables.
It is often used in machine learning for predicting numerical values in simpler regression problems.
There are many ways to describe and solve the linear regression problem, i.e. finding a set of coefficients that when multiplied by each of the input variables and added together results in the best prediction of the output variable.
If you have used a machine learning tool or library, the most common way of solving linear regression is via a least squares optimization that is solved using matrix factorization methods from linear algebra, such as an LU decomposition or a singular-value decomposition, or SVD.
Even the common way of summarizing the linear regression equation uses linear algebra notation:
y = A . b
Where y is the output variable, A is the dataset, and b are the model coefficients.
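As a small sketch of the least squares approach, the example below fits the coefficients b for a tiny made-up dataset using NumPy's `lstsq()` function. The data values are assumptions chosen so that the exact answer is known (y = 1 + 2x), and the column of ones in `A` is one common way to include an intercept.

```python
import numpy as np

# Hypothetical dataset: a column of ones for the intercept plus one input variable.
A = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [1.0, 2.0],
    [1.0, 3.0],
])

# Output variable, generated exactly as y = 1 + 2x for illustration.
y = np.array([1.0, 3.0, 5.0, 7.0])

# Solve the least squares problem: find b that minimizes ||A . b - y||.
b, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

# Predictions use the same notation as the equation: yhat = A . b
yhat = A @ b

print(b)  # approximately [1, 2]
```

With noisy real data the fit would not be exact, but the call and the matrix notation stay the same.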
5. Regularization
In applied machine learning, we often seek the simplest possible models that achieve the best skill on our problem.
Simpler models are often better at generalizing from specific examples to unseen data.
In many methods that involve coefficients, such as regression methods and artificial neural networks, simpler models are often characterized by models that have smaller coefficient values.
A technique that is often used to encourage a model to minimize the size of coefficients while it is being fit on data is called regularization. Common implementations include the L2 and L1 forms of regularization.
Both of these forms of regularization are in fact measures of the magnitude or length of the coefficients as a vector. They are lifted directly from linear algebra, where such a measure is called a vector norm.
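To make the connection concrete, the sketch below computes the L1 and L2 norms of a made-up coefficient vector and turns them into penalty terms. The coefficient values and the weighting `alpha` are assumptions for illustration; note that L2 regularization is commonly written with the squared L2 norm.

```python
import numpy as np

# Hypothetical model coefficients.
b = np.array([3.0, -4.0])

# L1 norm: sum of the absolute coefficient values.
l1 = np.linalg.norm(b, 1)

# L2 norm: square root of the sum of squared values (Euclidean length).
l2 = np.linalg.norm(b)

# Penalty terms that would be added to a model's loss during fitting.
alpha = 0.1                    # hypothetical regularization strength
l1_penalty = alpha * l1        # lasso-style penalty
l2_penalty = alpha * l2 ** 2   # ridge-style penalty (squared L2 norm)

print(l1, l2)  # 7.0 5.0
```

Larger coefficients give larger norms and therefore larger penalties, which is what pushes a fitted model toward smaller coefficient values.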
6. Principal Component Analysis
Often, a dataset has many columns, perhaps tens, hundreds, thousands, or more.
Modeling data with many features is challenging, and models built from data that include irrelevant features are often less skillful than models trained from the most relevant data.
It is hard to know which features of the data are relevant and which are not.
Methods for automatically reducing the number of columns of a dataset are called dimensionality reduction, and perhaps the most popular method is called the principal component analysis, or PCA for short.
This method is used in machine learning to create projections of high-dimensional data for both visualization and for training models.
The core of the PCA method is a matrix factorization method from linear algebra. The eigendecomposition can be used and more robust implementations may use the singular-value decomposition, or SVD.
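The sketch below walks through both routes on a tiny made-up matrix: an eigendecomposition of the covariance matrix, and an SVD of the centered data. The data and variable names are assumptions for illustration, and this is a bare-bones outline rather than a production implementation.

```python
import numpy as np

# Hypothetical dataset: 3 observations, 2 features.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Center each column on its mean.
C = A - A.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix of the features.
cov = np.cov(C, rowvar=False)
values, vectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort so the largest comes first.
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

# Project the data onto the first principal component (2 columns down to 1).
P = C @ vectors[:, :1]

# Route 2: SVD of the centered data. The squared singular values divided
# by (n - 1) match the eigenvalues of the covariance matrix.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
explained = s ** 2 / (len(A) - 1)

print(values)     # approximately [8, 0]
print(explained)  # approximately [8, 0]
```

The sign of a principal component is arbitrary, so the projected values may come out negated depending on the routine used.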
7. Singular-Value Decomposition
Another popular dimensionality reduction method is the singular-value decomposition method, or SVD for short.
As mentioned, and as the name of the method suggests, it is a matrix factorization method from the field of linear algebra.
It has wide use in linear algebra and can be used directly in applications such as feature selection, visualization, noise reduction, and more.
We will see two more cases below of using the SVD in machine learning.
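As a minimal sketch, the example below factorizes a small made-up matrix with NumPy's `svd()` function, reconstructs it from its factors, and then keeps only the largest singular value to form a lower-rank approximation. The matrix values and the choice `k = 1` are assumptions for illustration.

```python
import numpy as np

# Hypothetical 3x2 data matrix.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Factorize: A = U . diag(s) . Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct the original matrix from its factors.
B = U @ np.diag(s) @ Vt

# Keep only the k largest singular values for a rank-k approximation.
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The approximation error (Frobenius norm) equals the discarded singular value.
error = np.linalg.norm(A - A_k)

print(np.allclose(A, B))  # True
```

Dropping the small singular values is what gives the SVD its dimensionality reduction and noise reduction uses.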
8. Latent Semantic Analysis
In the sub-field of machine learning for working with text data called natural language processing, it is common to represent documents as large matrices of word occurrences.
For example, the columns of the matrix may be the known words in the vocabulary and rows may be sentences, paragraphs, pages, or documents of text with cells in the matrix marked as the count or frequency of the number of times the word occurred.
This is a sparse matrix representation of the text. Matrix factorization methods, such as the singular-value decomposition, can be applied to this sparse matrix, which has the effect of distilling the representation down to its most relevant essence. Documents processed in this way are much easier to compare, query, and use as the basis for a supervised machine learning model.
This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also known by the name Latent Semantic Indexing, or LSI.
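A toy sketch of the idea is shown below, assuming a made-up vocabulary of five words and four short documents. The counts, the choice of two retained components, and the `cosine()` helper are all illustrative assumptions.

```python
import numpy as np

# Hypothetical vocabulary (columns) and word counts for four documents (rows).
vocab = ["cat", "dog", "pet", "stock", "market"]
counts = np.array([
    [2, 0, 1, 0, 0],  # document about cats
    [0, 2, 1, 0, 0],  # document about dogs
    [0, 0, 0, 2, 1],  # document about stocks
    [0, 0, 0, 1, 2],  # document about markets
], dtype=float)

# Factorize the document-term matrix and keep the top k components.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
docs_k = U[:, :k] * s[:k]  # each document as a vector of k latent features


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# In raw counts the cat and dog documents only share the word "pet".
print(cosine(counts[0], counts[1]))  # approximately 0.2

# In the reduced space they land on the same latent direction.
print(cosine(docs_k[0], docs_k[1]))  # approximately 1.0
```

In this toy case the reduced representation groups the two pet documents together and keeps them apart from the two finance documents, which is the kind of distillation described above.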
9. Recommender Systems
Predictive modeling problems that involve the recommendation of products are called recommender systems, a sub-field of machine learning.
Examples include the recommendation of books based on previous purchases and purchases by customers like you on Amazon, and the recommendation of movies and TV shows to watch based on your viewing history and viewing history of subscribers like you on Netflix.
The development of recommender systems is primarily concerned with linear algebra methods. A simple example is in the calculation of the similarity between sparse customer behavior vectors using distance measures such as Euclidean distance or dot products.
Matrix factorization methods like the singular-value decomposition are used widely in recommender systems to distill item and user data to their essence for querying and searching and comparison.
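The similarity calculation can be sketched in a few lines of NumPy. The ratings matrix below is made up for illustration, with zero standing in for "no rating"; real systems would have far larger and sparser data.

```python
import numpy as np

# Hypothetical ratings: rows are customers, columns are items, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
], dtype=float)

# Compare every customer to the first one.
target = ratings[0]

# Euclidean distance: smaller means more similar behavior.
euclid = np.linalg.norm(ratings - target, axis=1)

# Dot product: larger means more overlap in what was rated highly.
dots = ratings @ target

# The closest other customer (position 0 in the sort is the target itself).
nearest = np.argsort(euclid)[1]

print(dots)     # [41. 40.  0.]
print(nearest)  # 1
```

Items rated highly by the nearest customers but not yet seen by the target customer would then be candidates for recommendation.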
10. Deep Learning
Artificial neural networks are nonlinear machine learning algorithms that are inspired by elements of the information processing in the brain and have proven effective at a range of problems, not the least of which is predictive modeling.
Deep learning is the recent resurgence in the use of artificial neural networks with newer methods and faster hardware that allow for the development and training of larger and deeper (more layers) networks on very large datasets. Deep learning methods are routinely achieving state-of-the-art results on a range of challenging problems such as machine translation, photo captioning, speech recognition, and much more.
At their core, the execution of neural networks involves linear algebra data structures multiplied and added together. Scaled up to multiple dimensions, deep learning methods work with vectors, matrices, and even tensors of inputs and coefficients, where a tensor is, loosely, an array with more than two dimensions.
Linear algebra is central to deep learning, from the description of methods via matrix notation to their implementation in libraries such as Google’s TensorFlow Python library, which has the word “tensor” in its name.
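As a rough sketch of "multiplied and added together", the example below runs a forward pass through a tiny two-layer network using plain NumPy. The layer sizes, random weights, ReLU activation, and variable names are all assumptions for illustration; a library such as TensorFlow performs the same kind of operations at much larger scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 8 input rows with 4 features each (a matrix).
X = rng.normal(size=(8, 4))

# Hypothetical coefficients: a hidden layer of 3 units and 1 output unit.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

# Each layer is a matrix multiplication plus a bias vector,
# followed here by a ReLU nonlinearity on the hidden layer.
hidden = np.maximum(0.0, X @ W1 + b1)
output = hidden @ W2 + b2

# A batch of 8 color images of 32x32 pixels is an array with four dimensions.
images = np.zeros((8, 32, 32, 3))

print(output.shape)  # (8, 1)
print(images.ndim)   # 4
```

Training adjusts `W1`, `b1`, `W2`, and `b2`, but the forward computation stays a sequence of matrix products and additions.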
In this post, you discovered 10 common examples of machine learning that you may be familiar with that use and require linear algebra.
Specifically, you learned:
- The use of linear algebra structures when working with data such as tabular datasets and images.
- Linear algebra concepts when working with data preparation such as one hot encoding and dimensionality reduction.
- The ingrained use of linear algebra notation and methods in sub-fields such as deep learning, natural language processing, and recommender systems.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Thanks Jason for explaining the examples in a simple way. I read your blogs regularly.
I am new to this field, catching up slowly and hence some basic questions on SVD/PCA –
1. How should we arrive on the best decision for feature selection and which features have contributed more to improve the performance?
2. Will it resolve bias-variance tradeoff issue?
3. Assuming that model is already implemented (say model version v1) in production, post implementation of model, if a new variable/column is added in the dataset which is critical as per business requirement and after/while rebuilding the model as per new variable requirement, the new model performance is not meeting the performance of previous model and also as per business expectation on performance. A significant change in the model performance observed and predicted values differ. In such scenario, any approach or suggestions to meet the performance expectation?
Looking forward to your reply.
You would experiment with the results of different feature selection methods as inputs to models and choose based on the resulting model skill.
Bias-variance tradeoff cannot be resolved, it is always present. A fact of applied machine learning.
Yes, you might need to start a new project, discarding assumptions/findings that helped achieve the prior model.
Thanks Jason for your response. Will sure work on experimenting with different feature selection methods and compare the results.
The above is riddled with 🍒 picking and confirmation bias. Skill in linear algebra is completely unnecessary to be effective at machine learning. Some of what you said above is a complete stretch.
If you are in fact a scientist, you won’t hide/delete this comment
Thanks for the opinions Michael.
You can learn more about eigendecomposition here:
Very informative article Jason! As more and more people learn and study machine learning, the deep learning curve always goes back to the challenges of arithmetic. At the end of the day, in order to truly learn machine learning, one must have basic knowledge of algebra. Will not knowing algebra make or break one to be truly great at machine learning? Most likely not, but having the core knowledge of algebra will most certainly help.
No. You can get great results and deliver a ton of value without a deep knowledge of linear algebra.
It can make a difference when trying to squeeze more skill/performance from models.
I was searching for tutorials on Linear Algebra that can clear my understanding about the relation between linear algebra and machine learning.
But I want to have a tutorial that not only teaches me about what Linear Algebra is but also teaches me on how to alternatively implement ML projects using Linear Algebra instead of Keras or Scikit Learn. I want this so that I can relate Linear Algebra with ML. Knowing only Linear Algebra will take me nowhere.
I wanted to confirm from you if your book “Basics of Linear Algebra for Machine Learning” contains some sample ML projects that were implemented using Linear Algebra instead of Scikit Learn or Keras. I could not find any details about sample ML projects using Linear Algebra in the index section of the book.
I am interested in buying this book but interested in this particular area.
Please let me know.
No, the book teaches you the linear algebra methods that are useful in machine learning, not how to code ML algorithms.
If you want to learn how to code ML algorithms, you can start here:
Thanks for pointing out the link.
I am still not clear about few things, please help.
My current assumption about the connection between Linear Algebra and ML is that “What LA is to ML is what Assembly Language is to Java”.
Whether or not one knows Assembly language, it seldom helps the person to redefine or rephrase his Java program in a better way.
I may be horribly wrong in my understanding and that’s why I am seeking your help in knowing about the connection between LA and ML.
If you could clear my doubts given below, it will help me in learning LA with lots of enthusiasm that I am moving in the right direction:
1. Lots of people tell that LA is must to be a good ML scientist. But none clarifies on what aspect of ML will the person excel after learning LA. You said in one of your comment above that one can squeeze more skill/performance if he knows LA. I could not understand it, the ML guys are limited to the libraries provided by Scikit learn or Keras, how knowing LA is going to help there ?
2. Will I be able to use Keras or scikit-learn in a different way after learning LA? Because Keras is what I will be using for all ML stuff, as I am not going to write ML algorithms using LA, does learning LA mean that I will be able to use Keras in a better way, and what is that “better” way?
3. Finally, will learning LA improve my intuition about the ML problems and is its impact limited to intuition improvement or does it help on anything beyond that ?
LA is a big ocean and I don’t want to jump into it without knowing why and what I am doing.
In some cases LA is a way of doing a thing, e.g. LA vs gradient descent for solving linear regression.
In some cases LA is a way of very efficiently describing a thing, e.g. more like the pseudocode to Java relationship.
A LA perspective on your Keras model can make data prep and connecting layers a snap, no more confusion. It can also help see different ways/architectures of approaching the same problem. A great example is implementing an operation like attention with a few LA transform layers rather than coding it as a custom layer.
Does that help?
Thanks for your reply and importantly for your patience.
I understand that I could be irritating you with my silly questions.
Now at least I understand that knowing LA will be more beneficial than not knowing it.
As you said that we can in some cases do certain things better with LA like LA vs Gradient Descent. I assume we will be able to do some custom modifications to optimization functions as per our need if we knew LA.
I was looking for these things that you mentioned. Just knowing LA for better intuition wasn’t sufficient; beyond that, I must be able to implement my custom requirements, not necessarily for algorithms because that’s not my cup of tea, but at least some minor changes here and there.
Implementation of knowledge in terms of custom code was more important for me.
Thanks a lot for clearing my doubts.
Happy to help.
I have a QA background and I’m looking to change my field. I don’t want to be irrelevant in the future economy hence thinking about Machine Learning.
This looks like an ocean. Can you give suggestions on this.
Yes, start right here:
CAN U PLEASE TELL HOW PROBABILITY THEORY HELPS IN ML
& HOW DO WE HAVE TO STUDY PROBABILITY THEORY TO BE A PERFECT DATA SCIENTIST ???
Yes, you can get started with probability for machine learning here:
Where would I start my Son if he wanted to jump into this World.
Hi Rob…A great starting point can be found here: