Matrix decomposition, also known as matrix factorization, involves describing a given matrix using its constituent elements.

Perhaps the best known and most widely used matrix decomposition method is the Singular-Value Decomposition, or SVD. All matrices have an SVD, which makes it more stable than other methods, such as the eigendecomposition. As such, it is used in a wide array of applications including compression, denoising, and data reduction.

In this tutorial, you will discover the Singular-Value Decomposition method for decomposing a matrix into its constituent elements.

After completing this tutorial, you will know:

- What Singular-value decomposition is and what is involved.
- How to calculate an SVD and reconstruct a rectangular and square matrix from SVD elements.
- How to calculate the pseudoinverse and perform dimensionality reduction using the SVD.

Let’s get started.

**Update Mar/2018**: Fixed typo in reconstruction. Changed V in code to VT for clarity. Fixed typo in the pseudoinverse equation.

**Update Apr/2019**: Fixed a small typo re array sizes in the explanation of the SVD reconstruction example.

## Tutorial Overview

This tutorial is divided into 5 parts; they are:

- Singular-Value Decomposition
- Calculate Singular-Value Decomposition
- Reconstruct Matrix from SVD
- SVD for Pseudoinverse
- SVD for Dimensionality Reduction

## Singular-Value Decomposition

The Singular-Value Decomposition, or SVD for short, is a matrix decomposition method for reducing a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler.

For the sake of simplicity, we will focus on the SVD for real-valued matrices and ignore the case for complex numbers.

```
A = U . Sigma . V^T
```

Where A is the real m x n matrix that we wish to decompose, U is an m x m matrix, Sigma (often represented by the uppercase Greek letter Sigma) is an m x n diagonal matrix, and V^T is the transpose of an n x n matrix where T is a superscript.

The Singular Value Decomposition is a highlight of linear algebra.

— Page 371, Introduction to Linear Algebra, Fifth Edition, 2016.

The diagonal values in the Sigma matrix are known as the singular values of the original matrix A. The columns of the U matrix are called the left-singular vectors of A, and the columns of V are called the right-singular vectors of A.
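Two properties of these elements are easy to check numerically, and the short sketch below does so for a small matrix. It relies on standard facts that are not stated above: U and V are orthogonal matrices, and the squared singular values equal the eigenvalues of A^T . A. The variable names are only illustrative.

```python
# Check two standard properties of the SVD elements
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
U, s, VT = np.linalg.svd(A)

# the singular vectors form orthonormal sets: U^T . U = I and V^T . V = I
u_orthonormal = np.allclose(U.T @ U, np.eye(U.shape[0]))
v_orthonormal = np.allclose(VT @ VT.T, np.eye(VT.shape[0]))

# squared singular values match the eigenvalues of A^T . A
# (eigvalsh returns ascending order, so reverse to match s)
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
matches = np.allclose(s ** 2, eigvals)

print(u_orthonormal, v_orthonormal, matches)
```

This is only a sanity check on one example, not a proof, but it can help build intuition for what the three elements contain.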

The SVD is calculated via iterative numerical methods. We will not go into the details of these methods. Every rectangular matrix has a singular value decomposition. For a real matrix the resulting matrices are real as well (complex values only arise when the matrix itself is complex), although the limitations of floating point arithmetic may cause some matrices to fail to decompose neatly.

The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition. However, the SVD is more generally applicable.

— Pages 44-45, Deep Learning, 2016.

The SVD is used widely both in the calculation of other matrix operations, such as the matrix inverse, and as a data reduction method in machine learning. SVD can also be used in least squares linear regression, image compression, and denoising data.

The singular value decomposition (SVD) has numerous applications in statistics, machine learning, and computer science. Applying the SVD to a matrix is like looking inside it with X-ray vision…

— Page 297, No Bullshit Guide To Linear Algebra, 2017

## Calculate Singular-Value Decomposition

The SVD can be calculated by calling the svd() function from SciPy (NumPy provides an equivalent in numpy.linalg).

The function takes a matrix and returns the U, Sigma and V^T elements. The Sigma diagonal matrix is returned as a vector of singular values. The V matrix is returned in a transposed form, e.g. V.T.

The example below defines a 3×2 matrix and calculates the Singular-value decomposition.

```python
# Singular-value decomposition
from numpy import array
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# SVD
U, s, VT = svd(A)
print(U)
print(s)
print(VT)
```

Running the example first prints the defined 3×2 matrix, then the 3×3 U matrix, 2 element Sigma vector, and 2×2 V^T matrix elements calculated from the decomposition.

```
[[1 2]
 [3 4]
 [5 6]]

[[-0.2298477   0.88346102  0.40824829]
 [-0.52474482  0.24078249 -0.81649658]
 [-0.81964194 -0.40189603  0.40824829]]

[ 9.52551809  0.51430058]

[[-0.61962948 -0.78489445]
 [-0.78489445  0.61962948]]
```
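As a side note, svd() can also return a smaller "reduced" form of the decomposition. The sketch below compares the shapes of the two forms for the same matrix. The full_matrices argument is part of scipy.linalg.svd; the names U_r, s_r and VT_r are just illustrative.

```python
# Compare the shapes of the full and reduced SVD
from numpy import array
from scipy.linalg import svd

A = array([[1, 2], [3, 4], [5, 6]])

# full form: U is m x m
U, s, VT = svd(A)
print(U.shape, s.shape, VT.shape)

# reduced form: U is m x k, where k = min(m, n)
U_r, s_r, VT_r = svd(A, full_matrices=False)
print(U_r.shape, s_r.shape, VT_r.shape)
```

The singular values are the same in both cases; only the number of columns kept in U changes for this tall matrix.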

## Reconstruct Matrix from SVD

The original matrix can be reconstructed from the U, Sigma, and V^T elements.

The U, s, and V^T elements returned from svd() cannot be multiplied directly.

The s vector must be converted into a diagonal matrix using the diag() function. By default, this function will create a square matrix that is n x n, relative to our original matrix. This causes a problem as the size of the matrices do not fit the rules of matrix multiplication, where the number of columns in a matrix must match the number of rows in the subsequent matrix.

After creating the square Sigma diagonal matrix, the sizes of the matrices are relative to the original m x n matrix that we are decomposing, as follows:

```
U (m x m) . Sigma (n x n) . V^T (n x n)
```

Where, in fact, we require:

```
U (m x m) . Sigma (m x n) . V^T (n x n)
```

We can achieve this by creating a new Sigma matrix of all zero values that is m x n (e.g. more rows) and populating the first n x n part of the matrix with the square diagonal matrix calculated via diag().

```python
# Reconstruct SVD
from numpy import array
from numpy import diag
from numpy import dot
from numpy import zeros
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# Singular-value decomposition
U, s, VT = svd(A)
# create m x n Sigma matrix
Sigma = zeros((A.shape[0], A.shape[1]))
# populate Sigma with n x n diagonal matrix
Sigma[:A.shape[1], :A.shape[1]] = diag(s)
# reconstruct matrix
B = U.dot(Sigma.dot(VT))
print(B)
```

Running the example first prints the original matrix, then the matrix reconstructed from the SVD elements.

```
[[1 2]
 [3 4]
 [5 6]]

[[ 1.  2.]
 [ 3.  4.]
 [ 5.  6.]]
```
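The manual padding of Sigma shown above can also be delegated to SciPy. The sketch below uses scipy.linalg.diagsvd(), which builds the m x n Sigma matrix from the vector of singular values, and checks the reconstruction for both a tall and a wide matrix. The reconstruct() helper is a hypothetical name introduced here for illustration.

```python
# Reconstruct a matrix of any shape using scipy.linalg.diagsvd
from numpy import array, allclose
from scipy.linalg import svd, diagsvd

def reconstruct(A):
    """Decompose A and multiply the elements back together (illustrative helper)."""
    U, s, VT = svd(A)
    # diagsvd places s on the diagonal of an m x n matrix of zeros
    Sigma = diagsvd(s, A.shape[0], A.shape[1])
    return U @ Sigma @ VT

tall = array([[1, 2], [3, 4], [5, 6]])   # more rows than columns
wide = tall.T                            # more columns than rows

print(allclose(tall, reconstruct(tall)))
print(allclose(wide, reconstruct(wide)))
```

Because diagsvd() takes the target shape explicitly, the same code works whether m is larger or smaller than n.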

The above complication with the Sigma diagonal only exists with the case where m and n are not equal. The diagonal matrix can be used directly when reconstructing a square matrix, as follows.

```python
# Reconstruct SVD
from numpy import array
from numpy import diag
from numpy import dot
from scipy.linalg import svd
# define a matrix
A = array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(A)
# Singular-value decomposition
U, s, VT = svd(A)
# create n x n Sigma matrix
Sigma = diag(s)
# reconstruct matrix
B = U.dot(Sigma.dot(VT))
print(B)
```

Running the example prints the original 3×3 matrix and the version reconstructed directly from the SVD elements.

```
[[1 2 3]
 [4 5 6]
 [7 8 9]]

[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]]
```

## SVD for Pseudoinverse

The pseudoinverse is the generalization of the matrix inverse for square matrices to rectangular matrices where the number of rows and columns are not equal.

It is also called the Moore-Penrose Inverse, after two independent discoverers of the method, or the Generalized Inverse.

Matrix inversion is not defined for matrices that are not square. […] When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions.

— Page 46, Deep Learning, 2016.

The pseudoinverse is denoted as A^+, where A is the matrix that is being inverted and + is a superscript.

The pseudoinverse is calculated using the singular value decomposition of A:

```
A^+ = V . D^+ . U^T
```

Or, without the dot notation:

```
A^+ = VD^+U^T
```

Where A^+ is the pseudoinverse, D^+ is the pseudoinverse of the diagonal matrix Sigma and U^T is the transpose of U.

We can get U and V from the SVD operation.

```
A = U . Sigma . V^T
```

The D^+ can be calculated by creating a diagonal matrix from Sigma, calculating the reciprocal of each non-zero element in Sigma, and taking the transpose if the original matrix was rectangular.

```
         s11,   0,   0
Sigma = (  0, s22,   0)
           0,   0, s33
```

```
       1/s11,     0,     0
D^+ = (    0, 1/s22,     0)
           0,     0, 1/s33
```

The pseudoinverse provides one way of solving the linear regression equation, specifically when there are more rows than there are columns, which is often the case.

NumPy provides the function pinv() for calculating the pseudoinverse of a rectangular matrix.

The example below defines a 4×2 matrix and calculates the pseudoinverse.

```python
# Pseudoinverse
from numpy import array
from numpy.linalg import pinv
# define matrix
A = array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8]])
print(A)
# calculate pseudoinverse
B = pinv(A)
print(B)
```

Running the example first prints the defined matrix, and then the calculated pseudoinverse.

```
[[ 0.1  0.2]
 [ 0.3  0.4]
 [ 0.5  0.6]
 [ 0.7  0.8]]

[[ -1.00000000e+01  -5.00000000e+00   9.04289323e-15   5.00000000e+00]
 [  8.50000000e+00   4.50000000e+00   5.00000000e-01  -3.50000000e+00]]
```
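To make the linear regression use mentioned earlier concrete, here is a minimal sketch that fits a line with the pseudoinverse. The data is made up for illustration (it follows y = 1 + 2x exactly), and numpy.linalg.lstsq() is used only as a cross-check on the result.

```python
# Least squares linear regression via the pseudoinverse
import numpy as np

# hypothetical data: a column of ones for the intercept, then x
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# coefficients b that minimize ||X . b - y||
b = np.linalg.pinv(X) @ y
print(b)

# cross-check against NumPy's least squares solver
b_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(b, b_lstsq))
```

Here X has more rows than columns, so there is no ordinary inverse, and the pseudoinverse provides the least squares coefficients instead.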

We can calculate the pseudoinverse manually via the SVD and compare the results to the pinv() function.

First we must calculate the SVD. Next we must calculate the reciprocal of each value in the s array. Then the s array can be transformed into a diagonal matrix with added rows of zeros to make it rectangular. Finally, we can calculate the pseudoinverse from the elements.

The specific implementation is:

```
A^+ = V . D^+ . U^T
```

The full example is listed below.

```python
# Pseudoinverse via SVD
from numpy import array
from numpy.linalg import svd
from numpy import zeros
from numpy import diag
# define matrix
A = array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8]])
print(A)
# calculate svd
U, s, VT = svd(A)
# reciprocals of s
d = 1.0 / s
# create m x n D matrix
D = zeros(A.shape)
# populate D with n x n diagonal matrix
D[:A.shape[1], :A.shape[1]] = diag(d)
# calculate pseudoinverse
B = VT.T.dot(D.T).dot(U.T)
print(B)
```

Running the example first prints the defined rectangular matrix and the pseudoinverse that matches the above results from the pinv() function.

```
[[ 0.1  0.2]
 [ 0.3  0.4]
 [ 0.5  0.6]
 [ 0.7  0.8]]

[[ -1.00000000e+01  -5.00000000e+00   9.04831765e-15   5.00000000e+00]
 [  8.50000000e+00   4.50000000e+00   5.00000000e-01  -3.50000000e+00]]
```
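The example above takes the reciprocal of every singular value, which is fine for this matrix but would blow up if a singular value were zero or nearly zero. The sketch below takes the reciprocal only of values above a small cutoff, in line with the "non-zero element" rule described earlier. The pinv_via_svd() helper and the particular cutoff are assumptions made for illustration, not necessarily what pinv() does internally.

```python
# Pseudoinverse via SVD, skipping (near) zero singular values
import numpy as np

def pinv_via_svd(A):
    """Illustrative helper: V . D^+ . U^T using the reduced SVD."""
    U, s, VT = np.linalg.svd(A, full_matrices=False)
    # assumed cutoff: singular values below this are treated as zero
    cutoff = max(A.shape) * np.finfo(float).eps * s.max()
    d = np.zeros_like(s)
    keep = s > cutoff
    d[keep] = 1.0 / s[keep]
    # scaling the columns of V by d is the same as multiplying by D^+
    return (VT.T * d) @ U.T

# the full-rank matrix from the example above
A = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))

# a singular 3 x 3 matrix: its smallest singular value is effectively zero
S = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
P = pinv_via_svd(S)
print(np.allclose(S @ P @ S, S))
```

The last check is one of the defining properties of the pseudoinverse, A . A^+ . A = A, which holds even though S has no ordinary inverse.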

## SVD for Dimensionality Reduction

A popular application of SVD is for dimensionality reduction.

Data with a large number of features, such as more features (columns) than observations (rows), may be reduced to a smaller subset of features that are most relevant to the prediction problem.

The result is a matrix with a lower rank that is said to approximate the original matrix.

To do this we can perform an SVD operation on the original data and select the top k largest singular values in Sigma. These columns can be selected from Sigma and the rows selected from V^T.

An approximation B of the original matrix A can then be reconstructed.

```
B = U . Sigma_k . V^T_k
```
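A minimal sketch of this approximation on a small full-rank matrix is given below, keeping only the largest singular value (k = 1). It also checks a standard result that is not covered above: the Frobenius norm of the approximation error equals the square root of the sum of the squared discarded singular values. The choice of matrix and of k is arbitrary.

```python
# Rank-k approximation and its error
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
U, s, VT = np.linalg.svd(A, full_matrices=False)

k = 1
# keep the first k columns of U, k singular values and k rows of V^T
B = (U[:, :k] * s[:k]) @ VT[:k, :]
print(B)
print(np.linalg.matrix_rank(B))

# Frobenius norm of the error vs. the discarded singular values
error = np.linalg.norm(A - B)
discarded = np.sqrt(np.sum(s[k:] ** 2))
print(np.isclose(error, discarded))
```

B has the same shape as A but a lower rank, which is what is meant by an approximation of the original matrix.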

In natural language processing, this approach can be used on matrices of word occurrences or word frequencies in documents and is called Latent Semantic Analysis or Latent Semantic Indexing.

In practice, we can retain and work with a descriptive subset of the data called T. This is a dense summary of the matrix or a projection.

```
T = U . Sigma_k
```

Further, this transform can be calculated and applied to the original matrix A as well as other similar matrices, by multiplying by V_k, the transpose of the first k rows of V^T.

```
T = A . V_k
```

The example below demonstrates data reduction with the SVD.

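First a 3×10 matrix is defined, with more columns than rows. The SVD is calculated and only the first two features are selected. The elements are recombined to give an accurate reproduction of the original matrix. The reproduction is accurate here because the rows of this matrix are evenly spaced, so it has rank 2 and the discarded third singular value is effectively zero. Finally the transform is calculated two different ways.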

```python
from numpy import array
from numpy import diag
from numpy import zeros
from scipy.linalg import svd
# define a matrix
A = array([
    [1,2,3,4,5,6,7,8,9,10],
    [11,12,13,14,15,16,17,18,19,20],
    [21,22,23,24,25,26,27,28,29,30]])
print(A)
# Singular-value decomposition
U, s, VT = svd(A)
# create m x n Sigma matrix
Sigma = zeros((A.shape[0], A.shape[1]))
# populate Sigma with m x m diagonal matrix (m < n for this matrix)
Sigma[:A.shape[0], :A.shape[0]] = diag(s)
# select
n_elements = 2
Sigma = Sigma[:, :n_elements]
VT = VT[:n_elements, :]
# reconstruct
B = U.dot(Sigma.dot(VT))
print(B)
# transform
T = U.dot(Sigma)
print(T)
T = A.dot(VT.T)
print(T)
```

Running the example first prints the defined matrix then the reconstructed approximation, followed by two equivalent transforms of the original matrix.

```
[[ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27 28 29 30]]

[[  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.]
 [ 11.  12.  13.  14.  15.  16.  17.  18.  19.  20.]
 [ 21.  22.  23.  24.  25.  26.  27.  28.  29.  30.]]

[[-18.52157747   6.47697214]
 [-49.81310011   1.91182038]
 [-81.10462276  -2.65333138]]

[[-18.52157747   6.47697214]
 [-49.81310011   1.91182038]
 [-81.10462276  -2.65333138]]
```
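The two transforms print the same values, but they are computed along different paths, so they may differ in the last few decimal places. The sketch below compares them with a tolerance rather than with ==, and then applies the same projection to a new row. The new row is made-up data, and the names T1, T2 and V_k are only illustrative.

```python
# Compare the two transforms and project new data
import numpy as np
from scipy.linalg import svd

# same 3 x 10 matrix as above (values 1..30)
A = np.arange(1.0, 31.0).reshape(3, 10)
U, s, VT = svd(A)

k = 2
V_k = VT[:k, :].T            # first k right-singular vectors as columns
T1 = U[:, :k] * s[:k]        # T = U . Sigma_k
T2 = A @ V_k                 # T = A . V_k

# exact equality can fail because of floating point rounding
print(np.array_equal(T1, T2))
# comparing with a tolerance is the more reliable check
print(np.allclose(T1, T2))

# a hypothetical new observation with the same 10 features
new_row = np.arange(31.0, 41.0).reshape(1, 10)
projected = new_row @ V_k
print(projected.shape)
```

The second form is the one that can be reused on new data, since it only needs V_k and not the U matrix of the original decomposition.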

The scikit-learn library provides a TruncatedSVD class that implements this capability directly.

When creating a TruncatedSVD instance, you must specify the number of desired features or components to select, e.g. 2. Once created, you can fit the transform (e.g. calculate V^T_k) by calling the fit() function, then apply it to the original matrix by calling the transform() function. The result is the transform of A called T above.

The example below demonstrates the TruncatedSVD class.

```python
from numpy import array
from sklearn.decomposition import TruncatedSVD
# define array
A = array([
    [1,2,3,4,5,6,7,8,9,10],
    [11,12,13,14,15,16,17,18,19,20],
    [21,22,23,24,25,26,27,28,29,30]])
print(A)
# svd
svd = TruncatedSVD(n_components=2)
svd.fit(A)
result = svd.transform(A)
print(result)
```

Running the example first prints the defined matrix, followed by the transformed version of the matrix.

We can see that the values match those calculated manually above, except for the sign on some values. We can expect there to be some instability when it comes to the sign given the nature of the calculations involved and the differences in the underlying libraries and methods used. This instability of sign should not be a problem in practice as long as the transform is trained for reuse.

```
[[ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27 28 29 30]]

[[ 18.52157747   6.47697214]
 [ 49.81310011   1.91182038]
 [ 81.10462276  -2.65333138]]
```
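After fitting, the TruncatedSVD object also exposes a few attributes that can help when choosing the number of components. The sketch below prints them for the same matrix. The attribute names (singular_values_, explained_variance_ratio_, components_) come from the scikit-learn API; fixing random_state is an assumption made here so that repeated runs behave the same.

```python
# Inspect a fitted TruncatedSVD
from numpy import array
from sklearn.decomposition import TruncatedSVD

A = array([
    [1,2,3,4,5,6,7,8,9,10],
    [11,12,13,14,15,16,17,18,19,20],
    [21,22,23,24,25,26,27,28,29,30]])

svd = TruncatedSVD(n_components=2, random_state=0)
# fit_transform() is equivalent to calling fit() then transform()
result = svd.fit_transform(A)

# singular values kept by the transform
print(svd.singular_values_)
# fraction of the variance in A captured by each component
print(svd.explained_variance_ratio_)
# rows of V^T that define the projection (n_components x n_features)
print(svd.components_.shape)
```

A common heuristic is to increase n_components until the explained variance ratios sum to a level you are comfortable with, then check the effect on your model.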

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

- Experiment with the SVD method on your own data.
- Research and list 10 applications of SVD in machine learning.
- Apply SVD as a data reduction technique on a tabular dataset.

If you explore any of these extensions, I’d love to know.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

- Chapter 12, Singular-Value and Jordan Decompositions, Linear Algebra and Matrix Analysis for Statistics, 2014.
- Chapter 4, The Singular Value Decomposition and Chapter 5, More on the SVD, Numerical Linear Algebra, 1997.
- Section 2.4 The Singular Value Decomposition, Matrix Computations, 2012.
- Chapter 7 The Singular Value Decomposition (SVD), Introduction to Linear Algebra, Fifth Edition, 2016.
- Section 2.8 Singular Value Decomposition, Deep Learning, 2016.
- Section 7.D Polar Decomposition and Singular Value Decomposition, Linear Algebra Done Right, Third Edition, 2015.
- Lecture 3 The Singular Value Decomposition, Numerical Linear Algebra, 1997.
- Section 2.6 Singular Value Decomposition, Numerical Recipes: The Art of Scientific Computing, Third Edition, 2007.
- Section 2.9 The Moore-Penrose Pseudoinverse, Deep Learning, 2016.

### API

- numpy.linalg.svd() API
- numpy.matrix.H API
- numpy.diag() API
- numpy.linalg.pinv() API
- sklearn.decomposition.TruncatedSVD API

### Articles

- Matrix decomposition on Wikipedia
- Singular-value decomposition on Wikipedia
- Singular value on Wikipedia
- Moore-Penrose inverse on Wikipedia
- Latent semantic analysis on Wikipedia

## Summary

In this tutorial, you discovered the Singular-value decomposition method for decomposing a matrix into its constituent elements.

Specifically, you learned:

- What Singular-value decomposition is and what is involved.
- How to calculate an SVD and reconstruct a rectangular and square matrix from SVD elements.
- How to calculate the pseudoinverse and perform dimensionality reduction using the SVD.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

One important thing that needs clarification: SVD is valid only for real numbers, therefore it should not be applied to ordinal or categorical variables.

Great point.

Thanks for the article! Some proofreading corrections:

“Where A is the real n x m matrix that we wish to decompose…”

A is m x n.

“Running the example first prints the defined 3×2 matrix, then the 3×3 U matrix, 2 element Sigma vector, and 2×3 V^T matrix elements calculated from the decomposition.”

It’s 2×2 V^T.

“Where A^+ is the pseudoinverse, D^+ is the pseudoinverse of the diagonal matrix Sigma and V^T is the transpose of V^T.”

The last part should be about U^T being a transpose of U.

Thanks, fixed!

Hello. great tutorial. One question. What happens if the A matrix has more rows than columns. I tend to define my A as [features, samples].

Good question, I’m not sure off hand.

If you are using it for dimensionality reduction, perhaps try it and see how the projection impacts model skill.

Thank you. Your course is great!

Thank you again.

Thanks!

Thank you Mr. Jason! I am working on MxN matrix where M>N. What if I want to implement SVD Dimensionality Reduction to it? I get matrix size error at Sigma[:A.shape[0], :A.shape[0]] = diag(s) what if I change it to Sigma[:diag(s).shape[0], :diag(s).shape] = diag(s) ? It worked very well btw.

You said that U.dot(Sigma) and A.dot(VT.T) are two equivalent transforms of the original matrix.

So why U.dot(Sigma) == A.dot(VT.T) returns mostly False.

Second thing, could you show us how to get Randomized SVD transformed data with the randomized_svd function?

Are you sure, where exactly do I say that?

Thanks for the suggestion.

Iterative SVD methods like FunkSVD can be updated incrementally, but standard SVD needs to be fully recomputed to incorporate a new row or column in the ratings matrix if used for recommender systems. This quick update feature is essential for practical recommender systems. FunkSVD is not an exact SVD but is close enough and is super efficient for adding a user or an item, which happens all the time on larger websites like Amazon or Netflix. The full recomputation is way too expensive for large recommender systems, and when would you perform it on a global website that gets 24 hour traffic? You cannot do it.

For research papers or wherever the data are static, however, the plain SVD might be perfectly fine, as long as it's a small enough dataset.

SVD is also unsuited to highly sparse ratings matrices, because SVD cannot incorporate any missing data at all. You need to supply values so that no data are missing. You can arbitrarily assign 0 or the mean of a user or item or a global mean to missing values. But then you are telling the SVD something that is not true, and the SVD just takes your lies as if they are honest data that actually exist. The SVD can only incorporate the values you give it as if they are all actual true values. Thus the decomposition likewise is strongly affected by the nonsense data values which you chose in advance of the decomposition. That said, SVD and iterative SVD approximations like FunkSVD have actually both been used anyway with some success on systems with missing data in the original ratings matrix, showing that despite the problems in the “truth” data, there are still some signals produced by the SVD.

It’s worse when high sparsity is present. 99.99% sparsity is present in a reddit recommender for subreddits I saw recently, where each comment by a user to a subreddit is a 1 and no comment is missing data. Most reddit users don’t comment in more than a handful of subreddits. There are 2 million unique users and 50,000 subreddits where comments are submitted in a 5-day sample of reddit commenting traffic.

There are some recommenders where missing data is really 0, such as implicit recommenders where a 1 indicates a user clicked and missing data is 0 clicks. Putting zero there is the right thing to do for the missing data because it’s really 0 in truth. SVD then is perfectly applicable and it can be fed 100% true data to factorize, so the outputs it produces can be good too and not built on garbage inputs.

You can use algorithms that specifically exclude missing ratings from the loss function to correctly train a matrix factorization model in tensorflow using a gradient descent algorithm such as adam. Such a model design and code design is not influenced at all by missing data which are preimputed. This well addresses the missing value problem in some recommender systems.

It may be interesting to run an experiment to empirically see just how much better the predictions are if you handle missing values properly and not using them during model training, versus feeding in fake data values where data are missing and incorporating them into the model training.

I encourage you to run such experiments Geoffrey.

I think the sizes of the matrices specified in the first equation under the section “Reconstruct Matrix from SVD” are incorrect. diag(S) should result in a matrix of size (n by n) based on the discussion above.

Why do you say that?

Indeed.

Based on docs, svd() returns “s : ndarray […] Of shape (K,), with K = min(M, N).”

(https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.svd.html)

Meaning, S is a vector with its length being the smallest size of the original matrix.

So diag(S) will create a diagonal matrix of size (min(M, N), min(M, N))

Then, if m > n (as in the example A’s shape is (3, 2) ), the formula

U (m x m) . Sigma (m x m) . V^T (n x n)

would become :

U (m x m) . Sigma (n x n) . V^T (n x n)

Can we use SVD for clustering?

I don’t have material on clustering so I cannot give you good off the cuff advice.

Thank you for everything, Jason.

You’re very welcome! I’m happy that the posts are useful!

Thanks very much for this tutorial.

Could you please explain what you mean by:

“This instability of sign should not be a problem in practice as long as the transform is trained for reuse.”

It means the sign (+ or -) may change based on different solutions found and to not worry about that.

Hi, How do we know the amount of variance captured or in other way to decide on the number of n_components if we are using TruncatedSVD. For PCA, we get the explained variance by “pca.explained_variance_ratio_ “method. Any similar option for SVD?

You can use svd.explained_variance_ratio_

More here:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

As mentioned above, SVD does the task of feature reduction. Feature reduction means specifically feature extraction or feature selection (because both are feature reduction techniques). I guess what SVD is doing is feature extraction. Am I right? And if so, can it also be used for feature selection?

Yes, it is a type of projection or feature extraction.

It is an alternative to feature selection.

Hi, I want to find first singular vector from a Matrix… if I use svd then how should I find singular vector from it. please help me….

Same Question, Please help!

Perhaps this is what you are referring to?

https://en.wikipedia.org/wiki/Singular_value

Hi, Jason — thanks for the great tutorials. They have been super helpful in my research. I realize this may be a bit off-topic, but I can’t seem to locate an answer that makes sense to me. It pertains to sklearn’s FactorAnalysis — not the same as this post’s topic, but related. Basically, in traditional exploratory factor analysis I believe that having more variables than observations would keep the model from converging. I had errors when I tried running such a model in R. However, I got interpretable results running the same data with sklearn’s FactorAnalysis. I would love a brief explanation as to how the machine learning version of EFA can converge while my traditional EFA did not. (Note: I didn’t try with more than one package in R, so I could be wrong.) Thanks in advance!

Great question!

Sorry, I cannot give you a good answer, I’m not familiar with the sklearn factor analysis implementation.

Perhaps the API:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FactorAnalysis.html

Or Source:

https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/decomposition/factor_analysis.py#L35

will give you ideas.