How to Calculate Principal Component Analysis (PCA) from Scratch in Python

By Jason Brownlee on August 9, 2019 in Linear Algebra

An important machine learning method for dimensionality reduction is called Principal Component Analysis.

It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.

In this tutorial, you will discover the Principal Component Analysis machine learning method for dimensionality reduction and how to implement it from scratch in Python.

After completing this tutorial, you will know:

The procedure for calculating the Principal Component Analysis and how to choose principal components.
How to calculate the Principal Component Analysis from scratch in NumPy.
How to calculate the Principal Component Analysis for reuse on more data in scikit-learn.

Let’s get started.

Update Apr/2018: Fixed typo in the explanation of the sklearn PCA attributes. Thanks kris.

Tutorial Overview

This tutorial is divided into 3 parts; they are:

Principal Component Analysis
Manually Calculate Principal Component Analysis
Reusable Principal Component Analysis

Principal Component Analysis

Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of data.

It can be thought of as a projection method where data with m-columns (features) is projected into a subspace with m or fewer columns, whilst retaining the essence of the original data.

The PCA method can be described and implemented using the tools of linear algebra.

PCA is an operation applied to a dataset, represented by an n x m matrix A that results in a projection of A which we will call B. Let’s walk through the steps of this operation.

     a11, a12
A = (a21, a22)
     a31, a32

B = PCA(A)

The first step is to calculate the mean values of each column.

M = mean(A)

m11 = (a11 + a21 + a31) / 3
m12 = (a12 + a22 + a32) / 3
M = (m11, m12)

Next, we need to center the values in each column by subtracting the mean column value.

C = A - M
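The mean and centering steps can be sketched in NumPy as follows. This is a minimal illustration that assumes the same 3×2 values used in the worked example later in the tutorial; it relies on NumPy broadcasting to subtract the column means from every row.

```python
import numpy as np

# small 3x2 data matrix: 3 rows (samples), 2 columns (features)
A = np.array([[1, 2], [3, 4], [5, 6]])

# mean of each column (axis=0 averages down the rows)
M = A.mean(axis=0)

# broadcasting subtracts the column means from every row
C = A - M

print(M)
print(C)
# after centering, every column mean is zero
print(C.mean(axis=0))
```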

The next step is to calculate the covariance matrix of the centered matrix C.

Correlation is a normalized measure of the amount and direction (positive or negative) that two columns change together. Covariance is a generalized and unnormalized version of correlation across multiple columns. A covariance matrix is a calculation of covariance of a given matrix with covariance scores for every column with every other column, including itself.

V = cov(C)
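For intuition, the covariance matrix of centered data can also be written directly as C^T . C / (n - 1), where n is the number of rows. The sketch below, assuming the 3×2 example values, compares that expression with NumPy's cov() function (called with rowvar=False so that columns are treated as variables).

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
C = A - A.mean(axis=0)

# number of rows (samples)
n = C.shape[0]

# covariance from its definition for centered data
V_manual = C.T @ C / (n - 1)

# the same quantity from NumPy; rowvar=False means columns are variables
V_numpy = np.cov(C, rowvar=False)

print(V_manual)
print(V_numpy)
```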

Finally, we calculate the eigendecomposition of the covariance matrix V. This results in a list of eigenvalues and a list of eigenvectors.

values, vectors = eig(V)

The eigenvectors represent the directions or components for the reduced subspace of B, whereas the eigenvalues represent the magnitudes for the directions. For more on this topic, see the post:

Gentle Introduction to Eigendecomposition, Eigenvalues, and Eigenvectors for Machine Learning
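As a quick sanity check on what eig() returns, the sketch below verifies the defining property of each eigenvector (V . v equals the eigenvalue times v) and that each eigenvector has unit length. The matrix V here is assumed to be the covariance matrix of the 3×2 example used later in this tutorial.

```python
import numpy as np

# covariance matrix of the 3x2 example used later in this tutorial
V = np.array([[4.0, 4.0], [4.0, 4.0]])
values, vectors = np.linalg.eig(V)

# each column of `vectors` is one eigenvector, paired with values[i]
for i in range(len(values)):
    v = vectors[:, i]
    # defining property: V . v equals eigenvalue * v
    print(np.allclose(V @ v, values[i] * v))
    # eigenvectors returned by eig() are normalized to unit length
    print(np.isclose(np.linalg.norm(v), 1.0))
```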

The eigenvectors can be sorted by the eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for A.

If all eigenvalues have a similar value, then we know that the existing representation may already be reasonably compressed or dense and that the projection may offer little. If there are eigenvalues close to zero, they represent components or axes of B that may be discarded.
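One common way to judge which components can be discarded is to look at the fraction of the total variance that each eigenvalue accounts for. The sketch below uses made-up eigenvalues and an arbitrary 90% threshold, both of which are assumptions for illustration only.

```python
import numpy as np

# hypothetical eigenvalues, already sorted in descending order
values = np.array([5.0, 3.0, 1.5, 0.5])

# fraction of the total variance carried by each component
ratio = values / values.sum()

# running total of variance explained by the first 1, 2, ... components
cumulative = np.cumsum(ratio)

# smallest number of components whose cumulative share reaches 90%
# (the 0.9 threshold is an arbitrary choice for this sketch)
k = int(np.searchsorted(cumulative, 0.9)) + 1

print(ratio)
print(cumulative)
print(k)
```

If the ratios were all roughly equal, no small k would reach the threshold, which matches the observation above that the projection may offer little in that case.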

A total of m or fewer components must be selected to comprise the chosen subspace. Ideally, we would select k eigenvectors, called principal components, that have the k largest eigenvalues.

B = select(values, vectors)
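The select() step is pseudocode; NumPy has no function by that name. One possible sketch is shown below, where the extra parameter k (the number of components to keep) is an assumption added for illustration. It sorts the eigenvalues in descending order and keeps the matching eigenvector columns.

```python
import numpy as np

def select(values, vectors, k):
    """Hypothetical helper: keep the k eigenvectors with the largest eigenvalues."""
    # indices of the eigenvalues from largest to smallest, truncated to k
    order = np.argsort(values)[::-1][:k]
    # eigenvectors are stored as columns, so index the columns
    return vectors[:, order]

# eigenvalues deliberately given out of order to show the sorting
values = np.array([0.0, 8.0])
vectors = np.array([[-0.70710678, 0.70710678],
                    [0.70710678, 0.70710678]])

B = select(values, vectors, k=1)
print(B)
```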

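Other matrix decomposition methods can be used, such as Singular-Value Decomposition, or SVD. When the SVD of the centered data is used, the singular values play the role of the eigenvalues (each eigenvalue of the covariance matrix equals the corresponding squared singular value divided by n - 1), and the vectors of the subspace are again referred to as principal components.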
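The relationship between the two decompositions can be checked numerically. The sketch below, assuming the 3×2 example values, applies NumPy's svd() to the centered data and converts the singular values into the eigenvalues of the covariance matrix.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
C = A - A.mean(axis=0)

# SVD of the centered data: C = U . diag(s) . Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# eigenvalues of the covariance matrix recovered from the singular values
eigenvalues_from_svd = s ** 2 / (C.shape[0] - 1)

# rows of Vt are the principal directions (their sign is arbitrary)
print(s)
print(eigenvalues_from_svd)
print(Vt)
```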

Once chosen, data can be projected into the subspace via matrix multiplication.

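P = B^T . A

Where A is the data that we wish to project, B^T is the transpose of the chosen principal components, and P is the projection of A. In practice, the data is centered first (the matrix C above) and arranged with one sample per column so that the dimensions agree; the worked example below projects C rather than the raw values in A.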
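Putting the steps together, the whole procedure might be sketched as a single function. The name pca_covariance and its parameter k are hypothetical, chosen here for illustration; the function projects the centered data by computing C . B, which is the transpose of B^T . C^T.

```python
import numpy as np

def pca_covariance(A, k):
    """Hypothetical helper: covariance-method PCA keeping k components."""
    # center each column
    C = A - A.mean(axis=0)
    # covariance matrix of the columns
    V = np.cov(C, rowvar=False)
    # eigendecomposition of the covariance matrix
    values, vectors = np.linalg.eig(V)
    # keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(values)[::-1][:k]
    B = vectors[:, order]
    # project the centered data: rows are samples, columns are components
    P = C @ B
    return P, B, values[order]

A = np.array([[1, 2], [3, 4], [5, 6]])
P, B, top_values = pca_covariance(A, k=1)
print(P)
print(top_values)
```

The sign of the projected values depends on the sign of the eigenvector returned by the solver, so only their magnitudes should be compared between runs or libraries.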

This is called the covariance method for calculating the PCA, although there are alternative ways to calculate it.

Manually Calculate Principal Component Analysis

There is no pca() function in NumPy, but we can easily calculate the Principal Component Analysis step-by-step using NumPy functions.

The example below defines a small 3×2 matrix, centers the data in the matrix, calculates the covariance matrix of the centered data, and then the eigendecomposition of the covariance matrix. The eigenvectors and eigenvalues are taken as the principal components and their magnitudes, and the eigenvectors are used to project the centered data.

from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)

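Running the example first prints the original matrix, then the eigenvectors and eigenvalues of the covariance matrix of the centered data, followed finally by the projection of the centered data. The printouts of the column means, the centered matrix, and the covariance matrix are not shown in the output below.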

Interestingly, the second eigenvalue is zero, so only the first eigenvector is required, suggesting that we could project our 3×2 matrix onto a 3×1 matrix with little loss.

[[1 2]
 [3 4]
 [5 6]]

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

[ 8.  0.]

[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]
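To see how little is lost, the sketch below keeps only the eigenvector with the largest eigenvalue, projects the centered data onto it to get a 3×1 matrix, and then maps the result back to the original space. The variable names are illustrative assumptions; for this particular matrix the reconstruction error is zero up to floating point rounding, because the second eigenvalue is zero.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
M = A.mean(axis=0)
C = A - M

values, vectors = np.linalg.eig(np.cov(C, rowvar=False))

# keep only the eigenvector with the largest eigenvalue, as a 2x1 column
first = vectors[:, [np.argmax(values)]]

# project the centered data: 3x2 times 2x1 gives 3x1
P1 = C @ first

# map back to the original space and add the column means again
A_restored = P1 @ first.T + M

# largest absolute difference between original and restored values
error = np.abs(A - A_restored).max()
print(P1)
print(A_restored)
print(error)
```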

Reusable Principal Component Analysis

We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.

When creating the class, the number of components can be specified as a parameter.

The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function.

Once fit, the eigenvalues and principal components can be accessed on the PCA class via the explained_variance_ and components_ attributes.

The example below demonstrates using this class by first creating an instance, fitting it on a 3×2 matrix, accessing the values and vectors of the projection, and transforming the original data.

# Principal Component Analysis
from numpy import array
from sklearn.decomposition import PCA
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# create the PCA instance
pca = PCA(2)
# fit on data
pca.fit(A)
# access values and vectors
print(pca.components_)
print(pca.explained_variance_)
# transform data
B = pca.transform(A)
print(B)

Running the example first prints the 3×2 data matrix, then the principal components and values, followed by the projection of the original matrix.

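We can see that, apart from some very minor floating point rounding and differences in the sign and layout of the components (scikit-learn stores each component as a row, whereas eig() returns eigenvectors as columns), we achieve the same principal components, eigenvalues, and projection as in the previous example.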

[[1 2]
 [3 4]
 [5 6]]

[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]

[  8.00000000e+00   2.25080839e-33]

[[ -2.82842712e+00   2.22044605e-16]
 [  0.00000000e+00   0.00000000e+00]
 [  2.82842712e+00  -2.22044605e-16]]
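The benefit of reuse can be sketched by fitting the PCA on one matrix and then transforming rows it has not seen. The new_rows values below are made up for illustration, and a single component is kept on the assumption that one dimension is enough for this data; inverse_transform() maps the projection back to the original feature space.

```python
import numpy as np
from sklearn.decomposition import PCA

A = np.array([[1, 2], [3, 4], [5, 6]])

# keep a single component and fit it on the original data only
pca = PCA(n_components=1)
pca.fit(A)

# hypothetical unseen rows; the transform reuses the mean and components from fit()
new_rows = np.array([[7, 8], [0, 1]])
projected = pca.transform(new_rows)

# map the 1-D projection back to the original 2-D space
restored = pca.inverse_transform(projected)

print(pca.explained_variance_ratio_)
print(projected)
print(restored)
```

The sign of the projected values may differ between library versions and platforms, so it is safer to compare magnitudes.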

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

Re-run the examples with your own small contrived matrix values.
Load a dataset and calculate the PCA on it and compare the results from the two methods.
Search for and locate 10 examples where PCA has been used in machine learning papers.

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered the Principal Component Analysis machine learning method for dimensionality reduction.

Specifically, you learned:

The procedure for calculating the Principal Component Analysis and how to choose principal components.
How to calculate the Principal Component Analysis from scratch in NumPy.
How to calculate the Principal Component Analysis for reuse on more data in scikit-learn.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

99 Responses to How to Calculate Principal Component Analysis (PCA) from Scratch in Python

John W March 2, 2018 at 1:38 pm #

Great article! I have been more of an R programmer in the past but have started to mess with Python. Python is a very versatile language and has started to draw my attention over the last few months.

Reply
- Jason Brownlee March 2, 2018 at 3:25 pm #
  
  Thanks John. I’m a big fan of Python myself these days.
  
  Reply
- Rajesh June 23, 2021 at 1:30 am #
  
  Hi Jason,
  
  This was fantastic explanation, thank you!
  
  Reply
  - Jason Brownlee June 23, 2021 at 5:39 am #
    
    You’re welcome!
    
    Reply
Saeed Ullah March 2, 2018 at 3:24 pm #

Hello Jason, it’s very nice you are doing great work and I request you to make such a post on ISOMAP Dimensionality Reduction too..

Reply
- Jason Brownlee March 2, 2018 at 3:26 pm #
  
  Thanks for the suggestion.
  
  Reply
john March 6, 2018 at 8:24 am #

Hello

Could you make a post on the Scree plot ?

Thank you

Reply
- Jason Brownlee March 6, 2018 at 2:53 pm #
  
  Thanks for the suggestion John.
  
  Reply
Ranjeet Singh March 8, 2018 at 5:57 pm #

Is there any direct relation between SVD and PCA since both perform dimensionality reduction?

Reply
- Jason Brownlee March 9, 2018 at 6:21 am #
  
  Yes, they both can be used for dimensionality reduction.
  
  Reply
Kaviyarasi March 19, 2018 at 8:01 pm #

Can we apply this for loaded file .csv format?

Reply
- Jason Brownlee March 20, 2018 at 6:15 am #
  
  Yes.
  
  Reply
- Vishwanath June 27, 2021 at 1:24 pm #
  
  Hi, I have one doubt. What happens if we give n_components=d where d is the no of dimensions. Does it denoise the data? Because it can’t reduce the dimensions.
  
  Reply
  - Jason Brownlee June 28, 2021 at 7:56 am #
    
    It will do something, likely something not useful.
    
    Reply
kris April 13, 2018 at 8:35 pm #

Hi Jason, thanks for the great work you are doing with your blog!

I think the attribute “explained_variance_” of the PCA class from scikit-learn returns the eigenvalues and not the singular values as you mention in the section “Reusable Principal Component Analysis”. For the singular values there is another attribute which is “singular_values_”. Correct?

Also, “single values” should read “eigenvalues” in the sentence “…that we achieve the same principal components, singular values, and projection as in…”. Correct?

Reply
- Jason Brownlee April 14, 2018 at 6:40 am #
  
  Correct, fixed.
  
  Thanks for pointing out the typo!
  
  Reply
Baron May 3, 2018 at 10:33 pm #

Hello teacher. Can you help me? I want to know how to implement a CPA?

Reply
- Jason Brownlee May 4, 2018 at 7:44 am #
  
  What is CPA?
  
  Reply
  - Baron May 10, 2018 at 9:10 pm #
    
    I´m sorry. I mean PCA
    
    Reply
    - Surya May 2, 2019 at 6:55 pm #
      
      I think he has explained that in tutorial
      
      Reply
Gravey May 15, 2018 at 11:27 pm #

Hi Jason,

Is there similar support for R or Matlab users? I’m trying to find a workshop / training in this area, if you could recommend anything that may help.

Reply
- Jason Brownlee May 16, 2018 at 6:03 am #
  
  I don’t know sorry.
  
  Reply
Mohammad June 18, 2018 at 11:23 am #

Great post!

I found a typo: In the initial explanation, it’s said:
P = B^T . A

In the manual calculation:
P = vectors.T.dot(C.T)

Which one is correct? The original A or the mean-centered C?

Reply
- Jason Brownlee June 18, 2018 at 3:10 pm #
  
  No typo, perhaps confusing explanation.
  
  B == vectors (components)
  A == C (centered data to project)
  
  Reply
  - KN October 1, 2024 at 11:10 pm #
    
    Yes this is confusing, not sure what you mean by A == C, when
    A = [[1 2], [3 4], [5 6]]
    C = [[-2. -2.], [ 0. 0.], [ 2. 2.]]
    so they are not the same.
    
    Also in the initial explanation it says B is the chosen subset of the vectors and in the python manual calculation example it seems to be all the vectors, as we did not choose anything. Confusing to a beginner like me. 🙂
    
    Reply
    - James Carmichael October 2, 2024 at 8:23 am #
      
      Hi KN…Let’s clarify the confusion around PCA step by step.
      
      ### PCA Overview (Simplified):
      1. **Center the Data**: The first step in PCA is to subtract the mean from each data point. This centers the data around zero, which is important for covariance calculation.
      
      2. **Covariance Matrix**: The covariance matrix is computed to understand the relationships between different dimensions (features) of your data.
      
      3. **Eigenvalues and Eigenvectors**: By decomposing the covariance matrix into its eigenvalues and eigenvectors, we can identify the principal components, which are the directions in which the data varies the most.
      
      4. **Project Data onto Principal Components**: The last step is to project the original data onto these principal components, reducing its dimensionality.
      
      Now let’s address the specific points you mentioned:
      
      ### 1. **Why is A != C?**
      
      You’re right: **A** and **C** are not the same. Here’s what’s happening:
      – **A** is the original data matrix.
      A = [[1, 2], [3, 4], [5, 6]]
      – **C** is the centered data matrix, obtained by subtracting the mean from each feature (column) in **A**.
      
      If you calculate the mean of each column in **A**:
      – Column 1: mean = (1 + 3 + 5) / 3 = 3
      – Column 2: mean = (2 + 4 + 6) / 3 = 4
      
      Now subtract these means from each element of **A** to get **C**:
      C = A - mean(A)
      C = [[1 - 3, 2 - 4],  # Subtracting column-wise means
           [3 - 3, 4 - 4],
           [5 - 3, 6 - 4]]
      C = [[-2, -2], [ 0, 0], [ 2, 2]]
      
      So, **A != C** because **C** is the centered version of **A** (the data with the means removed).
      
      ### 2. **Choosing B as a Subset of Vectors**
      In PCA, the goal is to reduce the dimensionality by projecting your data onto fewer dimensions (principal components). You’re correct that **B** is supposed to be a subset of vectors (principal components).
      
      – In the **manual calculation** example you saw, you didn’t need to choose a subset explicitly because you were calculating the **full PCA** (using all vectors). When the goal is dimensionality reduction, you **only keep the top k components**, which explain the most variance.
      
      To clarify, let’s break it down:
      – **All vectors**: When you calculate PCA for learning purposes, you often use all eigenvectors for explanation.
      – **Chosen subset of vectors**: In practice, you only use the first k principal components (those corresponding to the largest eigenvalues). This is the subset referred to as **B**. For example, if you reduce 3D data to 2D, you choose the top 2 components (vectors) from PCA.
      
      ### PCA Calculation from Scratch
      
      Here’s a breakdown of how you would do PCA step by step in Python:
      
      import numpy as np

      # 1. Input matrix A (Original data)
      A = np.array([[1, 2], [3, 4], [5, 6]])

      # 2. Centering the data (Subtracting the mean)
      mean = np.mean(A, axis=0)  # Mean of each column
      C = A - mean  # Center the matrix A
      print("Centered Data:\n", C)

      # 3. Covariance matrix calculation
      cov_matrix = np.cov(C, rowvar=False)
      print("Covariance Matrix:\n", cov_matrix)

      # 4. Eigen decomposition (Eigenvalues and eigenvectors)
      eigen_values, eigen_vectors = np.linalg.eig(cov_matrix)
      print("Eigenvalues:\n", eigen_values)
      print("Eigenvectors:\n", eigen_vectors)

      # 5. Sort the eigenvalues and choose the top k eigenvectors
      # Sorting by eigenvalue (descending order)
      idx = np.argsort(eigen_values)[::-1]
      eigen_values = eigen_values[idx]
      eigen_vectors = eigen_vectors[:, idx]

      # 6. Transform the original matrix using the eigenvectors
      # Here, using all components, so no selection
      transformed_data = np.dot(C, eigen_vectors)
      print("Transformed Data:\n", transformed_data)
      
      ### Key Takeaways:
      – **C** is the centered data, not the same as **A**. Centering is essential before PCA.
      – When we refer to “choosing a subset of vectors,” we are talking about **choosing k principal components** (which are eigenvectors). Initially, we calculate all the principal components, but we typically choose only a few (those with the highest eigenvalues) for dimensionality reduction.
      
      I hope this clears things up! Let me know if you need further clarification.
      
      Reply
      - KN October 3, 2024 at 8:06 pm #
        
        Thank you very much James for the fast reply and explanations. This clarifies my second question, but not the first one yet (the same question Mohammad had).
        
        The still remaining question/confusion is:
        
        In the last step formula there is A (the original data matrix), and also in the explanations you posted also in your comment you are talking about the original data (A) for the last step. The formula: P = B^T . A
        
        In the manual calculation (or in step 6 of your comment) you are talking about the original data (A) like mentioned, but not using the original data (A) but the centered version C:
        P = vectors.T.dot(C.T)
        transformed_data = np.dot(C, eigen_vectors)
        
        So the question is, why in the last step and the formula are we talking about original data A and then using C which is the centered version? Thank you!
      - KN October 5, 2024 at 3:57 am #
        
        I have now done some more studying and testing, and this is how I now understand it, maybe useful for other learners: In the end you can transform with the eigenvector(s) either the original data or the centralized original data, whichever makes more sense for the task/situation.
Martin Power October 13, 2018 at 9:02 pm #

When I copy the code from section “Reusable Principal Component Analysis” and run in a Jupyter notebook with a Python3.6 kernel, I get a different output to what is shown on site.

The values for the Eigenvectors and Matrix B are the same but the polarity is not the same.

Any idea what is causing the mismatch?

[[1 2]
[3 4]
[5 6]]
[[ 0.70710678 0.70710678]
[-0.70710678 0.70710678]]
[8. 0.]
[[-2.82842712e+00 -2.22044605e-16]
[ 0.00000000e+00 0.00000000e+00]
[ 2.82842712e+00 2.22044605e-16]]

Reply
- Jason Brownlee October 14, 2018 at 6:03 am #
  
  Yes, I address this in the post.
  
  Minor differences and differences in sign can occur due to differences across platforms from multiple runs of the solver (used under the covers).
  
  These matrix operations require converging a solution, they are not entirely deterministic like arithmetic, we are approximating.
  
  Reply
  - Praveen Kumar September 15, 2019 at 5:18 pm #
    
    Hi Jason,
    Is there any way to get PCs with same polarity and order?
    
    Reply
    - Jason Brownlee September 16, 2019 at 6:34 am #
      
      Sort them by magnitude and ignore sign.
      
      Reply
RB October 19, 2018 at 7:53 am #

Is there a way to store the PCA model after fit() during training and reuse that model later (by loading from saved file) on live data ?

Reply
- Jason Brownlee October 19, 2018 at 10:57 am #
  
  Yes, you can save the elements to file in plain text or as pickled python objects.
  
  Reply
Sanjay November 22, 2018 at 1:00 pm #

Hi Jason

while computing the mean, shouldn’t the axis be equal to 0 rather than 1? since each dimension or feature must be averaged rather than each data point

Reply
- Jason Brownlee November 22, 2018 at 2:12 pm #
  
  I believe 0 would be row-wise, 1 is column wise
  
  Reply
uluc December 31, 2018 at 1:20 am #

This is not from scratch at all. Calculating the covariance matrix and its eigenvalue decomposition is an important part, which this tutorial skips totally.

Reply
- Jason Brownlee December 31, 2018 at 6:13 am #
  
  Thanks for the note, more on covar here:
  https://machinelearningmastery.com/introduction-to-expected-value-variance-and-covariance/
  
  More on eigendecomposition here:
  https://machinelearningmastery.com/introduction-to-eigendecomposition-eigenvalues-and-eigenvectors/
  
  Reply
  - Tim April 17, 2019 at 5:34 pm #
    
    Dude this is still not from scratch. You just explain what eigenvectors and eigenvalues are then use a toolbox to do the dirty work for you. Can you please explain the details of finding the eigenvectors?
    
    Reply
    - Jason Brownlee April 18, 2019 at 8:22 am #
      
      Sure, see this post:
      https://machinelearningmastery.com/introduction-to-eigendecomposition-eigenvalues-and-eigenvectors/
      
      Reply
Yogesh February 1, 2019 at 6:25 pm #

HI Jason,

I have a doubt , is there u are saying PCA with eigenvector and PCA with svd both are different ? or i understood wrong,

secondly can we use together ?

Reply
- Jason Brownlee February 2, 2019 at 6:13 am #
  
  PCA and SVD are different.
  
  Reply
Venkat February 18, 2019 at 7:57 pm #

Hi Jason,

Can you extend PCA and Hotelling’s T^2 for confidence interval in python.

Thanks,
Venkat

Reply
- Jason Brownlee February 19, 2019 at 7:23 am #
  
  Sorry, what are you referring to exactly?
  
  Reply
Al February 22, 2019 at 5:41 am #

Hi Jason, I found extracting top PCA explaining 90% of the variance, boosting to a large degree my h2o.deeplearning model to a +99% overall accuracy, AUC, tpr and npr. It is so good once the model is applied to my the test set to look unreal (basically only one misprediction out of 1k+ observations in my confusion matrix). I am not versant with the orthogonal transformations underlying PCA, but I was wondering: would PCA be the cause of overfitting on my data set? How is it possible to get to such an amazing result? How reliable would be my model over future and unseen observations?
Thanks

Reply
- Jason Brownlee February 22, 2019 at 6:27 am #
  
  Yes, the transform must be calculated on the train dataset only, then applied to train and test sets.
  
  Reply
  - Al February 23, 2019 at 3:58 am #
    
    I see what you mean. Thanks!
    
    Reply
Samim April 30, 2019 at 6:22 pm #

Could you please explain more about pca.fit() and pca.transform what exactly is happening when we call these two ?

Reply
- Jason Brownlee May 1, 2019 at 7:01 am #
  
  Great question, fit is converging on a solution, e.g. finding the eigenvectors and eigenvalues.
  
  It might help to check the API documentation.
  
  Reply
Elvis Dennis June 4, 2019 at 12:32 am #

What is the difference between Split Zone design and Split Plot design?

Reply
- Jason Brownlee June 4, 2019 at 7:52 am #
  
  I have not heard these terms before, sorry.
  
  What is the content?
  
  Reply
Rajshree June 13, 2019 at 9:55 pm #

Amazing description Sir, but in the manual computation of PCA I’m having a different dataset having 1140 eigen vectors and want only 100 of them corresponding to their eigen values. So, how to choose the components and form the feature vector.

Reply
- Jason Brownlee June 14, 2019 at 6:44 am #
  
  Perhaps choose the 100 largest?
  
  Reply
marco August 22, 2019 at 5:14 am #

I am still confused by this, could you give me an explanation of what “.T” does in this code?
V = cov(C.T)

Reply
- Jason Brownlee August 22, 2019 at 6:33 am #
  
  Transpose.
  https://en.wikipedia.org/wiki/Transpose
  
  Reply
Biserka September 12, 2019 at 10:34 pm #

Hi Jason,

have you ever tried PCA to a existing data sets?

LIke UC Merced LandUse or AID?

I want to calculate PCA on features of these data sets extracted with some-pretrained CNN (the dimensions of the feature vectors are 100.000+).

Do you recommend it and how?

Reply
- Jason Brownlee September 13, 2019 at 5:41 am #
  
  I have, and I believe I have tutorials on it:
  https://machinelearningmastery.com/feature-selection-machine-learning-python/
  
  Reply
  - Biserka September 14, 2019 at 5:43 am #
    
    Thank you very much for your answer
    
    Reply
    - Jason Brownlee September 14, 2019 at 6:24 am #
      
      You’re welcome.
      
      Reply
Praveen Kumar September 15, 2019 at 5:01 pm #

Hi Jason,
Really nice Blog.
But I don’t understand why you ‘d to transpose Centered matrix to calculate covariance matrix.

# calculate covariance matrix of centered matrix
V = cov(C.T)

Reply
- Jason Brownlee September 16, 2019 at 6:34 am #
  
  Yes, it could be simpler, thanks.
  
  Reply
Praveen Kumar September 15, 2019 at 5:17 pm #

Hi Jason,
One more thing which I don’t understand is why the sign & order of principal components are different than PCs that obtained from scikit-learn PCA?

[[ 0.70710678 0.70710678]
[ 0.70710678 -0.70710678]]

[[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]

Reply
- Jason Brownlee September 16, 2019 at 6:34 am #
  
  Different numerical solvers used under the covers – you can ignore the sign/order.
  
  Reply
Hessam October 12, 2019 at 8:51 am #

Dear Jason,

Thank you very much for this useful article.

A small note about the centering:
”
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
”
I guess that there is no need to center A, when we calculate the covariance.
cov(C.T) = cov(A.T)

However, it could be helpful for the readers to calculate the covariance from C:

V = np.matmul(C.T, C) / (C.shape[0] - 1)

Reply
- Jason Brownlee October 13, 2019 at 8:22 am #
  
  Nice, thanks!
  
  Reply
- Pawel Szafałowicz March 2, 2020 at 8:55 pm #
  
  Hi Jason,
  
  Very useful article.
  How do I make a prediction for a single row with a model trained on data after a PCA transformation?
  Do I have to apply the PCA transformation to this new row as well, which seems senseless?
  
  Thanks in advance
  
  Reply
  - Jason Brownlee March 3, 2020 at 5:58 am #
    
    Use a pipeline that has the pca and model in it, fit on all data, then call predict.
    
    Reply
    - Pawel Szafałowicz March 4, 2020 at 1:49 am #
      
      Very thank you!
      
      Reply
Xiao November 2, 2019 at 9:11 pm #

In the discussion, you said we need to use B = select(values, vectors) to select the K largest values and vectors, but how can I set the select value? How can I define it in the code, like K = 10?

Reply
- Jason Brownlee November 3, 2019 at 5:55 am #
  
  Perhaps test different values for your dataset.
  
  Reply
Suraj November 21, 2019 at 11:52 pm #

thank for great tutorial but i have question regarding how to get new data from the pca1 and pca2 to implement another machine learning alog ,

Reply
- Jason Brownlee November 22, 2019 at 6:04 am #
  
  Sorry, I don’t understand. What do you mean exactly?
  
  Reply
efronova April 3, 2020 at 7:45 am #

when calculating mean axis=1 calculates mean rowwise. I believe it should be axis = 0

Reply
- Jason Brownlee April 3, 2020 at 8:07 am #
  
  Note we calculate the mean on A.T not A.
  
  Reply
Jose Q April 10, 2020 at 3:46 pm #

Hi Jason,
Great post as usual!

If I train a model using the complete train data set, then I test it on unseen test data set, I get to some accuracy and recall results.
If I do the same training on the 3 principal components version of the same train data set, then I test it on the 3 principal components version of the same unseen test data set, then I get to different accuracy and recall results (these are better results).

Despite the temptation of having better accuracy results, I suppose that this improvement was circumstantial, so I guess we should use the complete data set (not the reduced PC version) because it represents the complete data variability, while PCA is a projection of the same data. I guess that in the long run we will have more consistent results in the complete data set.

What do you think?

Reply
- Jason Brownlee April 11, 2020 at 6:07 am #
  
  Thanks!
  
  You cannot “test” a model on new data where you do not have the target values.
  
  You can train a model on all data and make predictions on new data, but you cannot calculate a score, as you will not have the targets. You will already know the score of the model from your test harness (e.g. cross-validation, etc.)
  
  Reply
  - Jose Q April 11, 2020 at 7:39 am #
    
    Yes , I agree.
    I should’ve said hold-out data rather than new unseen data, because I do have the target values for those hold-out data.
    In the concrete case I am working, I have 22 days of data. I used the first 21 days for training (80%) and validation (20%) with great results in accuracy & recall in all the cross-validations.
    Then I used hold-out data from day 22 for testing, and that’s where I got a terrible accuracy and recall.
    After that is when I tried using PCAs data rather than regular data. Since I got (apparently) better accuracy results with PCs, I felt somehow that it wasn’t correct, so that’s why I sent my previous post.
    Thank you for your patience in answering all these comments.
    Jose
    
    Reply
    - Jason Brownlee April 11, 2020 at 7:57 am #
      
      Right.
      
      For PCA, you can prepare or fit the transform on the train set then apply it to the train and test sets, just like scaling and other transforms. That would be the appropriate way to use it to avoid data leakage.
      
      Also, for time series, consider using walk-forward validation.
      
      Does that help?
      
      Reply
      - Jose Q April 11, 2020 at 10:14 am #
        
        Yes Jason! Thank you!
        I was already following your suggestion on PCA: fit the transform on the training set and apply it to the test set to keep the data transformations consistent.
        Thank you also for your post https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/ on walk-forward validation. I always learn something more with your posts.
        
        I understand that PCA is often used to make data easy to explore and visualize. The concern in my initial question was about the convenience or correctness of using principal component data (a small number of features) for training and predicting instead of the original data set (a large number of features). If you have a comment on this last point, I would appreciate it.
        
        Thank you again
      - Jason Brownlee April 11, 2020 at 11:55 am #
        
        If it results in better performance, use it.
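
The advice in this thread (fit the PCA transform on the training set only, then apply it to both the training and test sets) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the commenter's actual setup; the array sizes, the 80/20 chronological split, and the choice of two components are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 100 time-ordered rows, 5 features (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))

# Chronological split: earlier rows for training, later rows held out.
X_train, X_test = X[:80], X[80:]

# Fit the transform on the training rows only, so no statistics
# from the held-out rows leak into the projection.
pca = PCA(n_components=2)
pca.fit(X_train)

# Apply the same fitted transform to both sets.
X_train_pc = pca.transform(X_train)
X_test_pc = pca.transform(X_test)
print(X_train_pc.shape, X_test_pc.shape)
```

The mean and the components stored on the fitted `PCA` object come from the training rows alone, which is what keeps the held-out evaluation honest.
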
dong zhan August 7, 2020 at 3:25 pm #

Thank you so much. I spent a whole day learning PCA (matrices, null space, correlation, covariance, eigenvectors, etc.) and finally got here. This is the best; it connected the abstract theory to concrete reality. Without this practice, I think I could never really understand it.

Reply
- Jason Brownlee August 8, 2020 at 5:58 am #
  
  Thanks, well done on your progress!
  
  Reply
Sachin September 22, 2020 at 12:59 am #

Hey Jason! Thanks for this tutorial. I applied PCA to the iris dataset and chose 2 components, both manually and using the sklearn library. But the signs of my 2nd component’s values are flipped (positive to negative and vice versa) compared to the sklearn result. Is that an issue?

Reply
- Jason Brownlee September 22, 2020 at 6:49 am #
  
  Yes, the signs can change, this is to be expected.
  
  I believe this is discussed in the above tutorial.
  
  Reply
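
The sign difference discussed in this thread can be reproduced with a short sketch. An eigenvector multiplied by -1 is still an eigenvector, so a manual eigendecomposition and scikit-learn may return components pointing in opposite directions. The data below is synthetic (not iris), and the per-feature scaling is an assumption made only so the eigenvalues are clearly distinct.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with different spreads per feature (assumed for illustration).
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Manual PCA: center, eigendecompose the covariance matrix, keep the top 2 components.
C = A - A.mean(axis=0)
values, vectors = np.linalg.eigh(np.cov(C.T))
order = np.argsort(values)[::-1][:2]
manual = C @ vectors[:, order]

# scikit-learn PCA on the same data.
sk = PCA(n_components=2).fit_transform(A)

# The projections agree up to the sign of each column.
same_up_to_sign = np.allclose(np.abs(manual), np.abs(sk))
print(same_up_to_sign)

# Optionally flip the manual columns so they point the same way as scikit-learn's.
signs = np.sign((manual * sk).sum(axis=0))
aligned = manual * signs
```

Either orientation describes the same axis, so the variance captured by each component is unchanged.
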
Muhammad Usama Zahid January 24, 2021 at 3:04 am #

Sir!
Can you please explain PCA with an example such as iris? I mean loading the file from CSV, splitting the vectors and labels, doing PCA on the vectors, then concatenating the PCA vectors and labels and storing them back to Excel.
Regards

Reply
- Jason Brownlee January 24, 2021 at 6:02 am #
  
  Perhaps this example will help:
  https://machinelearningmastery.com/principal-components-analysis-for-dimensionality-reduction-in-python/
  
  If you need help loading a dataset, see this:
  https://machinelearningmastery.com/load-machine-learning-data-python/
  
  Reply
sukhpal April 23, 2021 at 12:33 am #

Sir, do we include components which have a higher value of correlation for classification, or components which have a lesser value of correlation?

Reply
- Jason Brownlee April 23, 2021 at 5:05 am #
  
  Sorry, I don’t understand your question, perhaps you could elaborate?
  
  Reply
y June 18, 2021 at 9:59 am #

How can I know the features selected by PCA?

Reply
- Jason Brownlee June 19, 2021 at 5:44 am #
  
  PCA does not select features, it creates new features from the data.
  
  Reply
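
To illustrate the point that PCA builds new features rather than selecting existing ones, the sketch below prints how much each original feature contributes to each component. It relies on scikit-learn's `components_` attribute; the feature names and the synthetic data are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature names and synthetic data (assumed for illustration).
feature_names = ["f0", "f1", "f2"]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)  # f2 nearly duplicates f0

pca = PCA(n_components=2).fit(X)

# Each row of components_ is one new feature, expressed as a
# weighted combination of all the original features.
for i, component in enumerate(pca.components_):
    weights = ", ".join(f"{name}: {w:+.2f}" for name, w in zip(feature_names, component))
    print(f"PC{i + 1} = {weights}")
```

Large absolute weights show which original features dominate a component, but every component still mixes all of the inputs.
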
E. Saf June 22, 2021 at 12:25 am #

Dear Dr. Jason,
Thank you a lot for all your work.
I have a different case where I want to use a dimensionality reduction model.
I have a dataset with 40 features, of which 25 are categorical (nominal) features.
The space created by one hot encoding them is giant: it needs 900 GiB to allocate.

Is there any method for reducing the dimensionality of categorical variables “before” one hot encoding them?

Best Regards,

Reply
- Jason Brownlee June 22, 2021 at 6:32 am #
  
  Good question, I’m not sure off the cuff. I recommend checking the literature. I bet there is a version of PCA that supports categorical inputs!
  
  Reply
  - E. Saf June 26, 2021 at 11:59 pm #
    
    The problem is that I want to do the reduction before transforming with a one hot encoder.
    I found the hash encoder in the literature. Do you advise working with it?
    
    Best Regards,
    
    Reply
    - Jason Brownlee June 27, 2021 at 4:38 am #
      
      No, sorry. Perhaps try it and compare results to other methods.
      
      Reply
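
For readers curious about the hash encoder mentioned in this thread, here is a minimal sketch using scikit-learn's `FeatureHasher`, which maps categorical values into a fixed number of columns instead of one column per category. It is only one possible approach and is not endorsed in the replies above; the rows, column names, and the width of 16 columns are hypothetical choices for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical categorical rows (assumed for illustration).
rows = [
    {"color": "red", "city": "paris"},
    {"color": "blue", "city": "tokyo"},
    {"color": "green", "city": "lima"},
    {"color": "red", "city": "tokyo"},
]

# Turn each row into "column=value" tokens so equal values in
# different columns hash to different features.
tokens = [[f"{key}={value}" for key, value in row.items()] for row in rows]

# Hash every token into one of 16 columns; the width stays fixed
# no matter how many distinct categories appear in the data.
hasher = FeatureHasher(n_features=16, input_type="string")
X_hashed = hasher.transform(tokens)  # SciPy sparse matrix
print(X_hashed.shape)
```

The trade-off is that different categories can collide in the same column, so it is worth comparing results against other encodings, as suggested above.
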
Morteza November 1, 2021 at 10:23 pm #

Thank you so much. Great!

Reply
Chris Thron April 14, 2022 at 7:14 am #

This is a great tutorial, and I will share it with my students. But I don’t think you need to subtract the mean to compute the covariance; the covariance calculation does that automatically.

Reply
- James Carmichael April 15, 2022 at 7:36 am #
  
  Thank you for the feedback Chris!
  
  Reply
Chris Thron April 14, 2022 at 7:33 am #

Sorry, I see now you centered the data so that you could project the centered data vectors onto the eigenvectors. It might be clearer if you do the centering at the end, so you don’t leave the impression that centering is necessary to compute PCA.

Reply
- James Carmichael April 15, 2022 at 7:36 am #
  
  Thank you for the feedback Chris!
  
  Reply
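
The two comments above can be checked with a small NumPy sketch. It shows that `np.cov` gives the same covariance matrix whether or not the data is centered first, while the projection onto the eigenvectors does change by a constant offset if centering is skipped. The 3x2 matrix is just a small example chosen for illustration.

```python
import numpy as np

# Small example matrix (assumed for illustration).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Column means and the centered data.
M = A.mean(axis=0)
C = A - M

# np.cov subtracts the mean internally, so both calls agree.
cov_raw = np.cov(A.T)
cov_centered = np.cov(C.T)
print(np.allclose(cov_raw, cov_centered))

# Eigendecomposition of the (symmetric) covariance matrix.
values, vectors = np.linalg.eigh(cov_raw)

# Projecting centered vs. uncentered data differs by a constant row offset.
P_centered = C @ vectors
P_raw = A @ vectors
offset = P_raw - P_centered
print(offset)
```

So centering is not needed to obtain the eigenvectors, but it determines whether the projected data ends up with zero mean.
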
