Last Updated on November 16, 2019

Fundamental statistics are useful tools in applied machine learning for a better understanding your data.

They are also the tools that provide the foundation for more advanced linear algebra operations and machine learning methods, such as the covariance matrix and principal component analysis respectively. As such, it is important to have a strong grip on fundamental statistics in the context of linear algebra notation.

In this tutorial, you will discover how fundamental statistical operations work and how to implement them using NumPy with notation and terminology from linear algebra.

After completing this tutorial, you will know:

- What the expected value, average, and mean are and how to calculate them.
- What the variance and standard deviation are and how to calculate them.
- What the covariance, correlation, and covariance matrix are and how to calculate them.

**Kick-start your project** with my new book Linear Algebra for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Updated Mar/2018**: Fixed a small typo in the result for vector variance example. Thanks Bob.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

- Expected Value
- Variance
- Covariance
- Covariance Matrix

### Need help with Linear Algebra for Machine Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Expected Value

In probability, the average value of some random variable X is called the expected value or the expectation.

The expected value uses the notation E with square brackets around the name of the variable; for example:

1 |
E[X] |

It is calculated as the probability weighted sum of values that can be drawn.

1 |
E[X] = sum(x1 * p1, x2 * p2, x3 * p3, ..., xn * pn) |

In simple cases, such as the flipping of a coin or rolling a dice, the probability of each event is just as likely. Therefore, the expected value can be calculated as the sum of all values multiplied by the reciprocal of the number of values.

1 |
E[X] = sum(x1, x2, x3, ..., xn) . 1/n |

In statistics, the mean, or more technically the arithmetic mean or sample mean, can be estimated from a sample of examples drawn from the domain. It is confusing because mean, average, and expected value are used interchangeably.

In the abstract, the mean is denoted by the lower case Greek letter mu and is calculated from the sample of observations, rather than all possible values.

1 |
mu = sum(x1, x2, x3, ..., xn) . 1/n |

Or, written more compactly:

1 |
mu = sum(x . P(x)) |

Where x is the vector of observations and P(x) is the calculated probability for each value.

When calculated for a specific variable, such as x, the mean is denoted as a lower case variable name with a line above, called x-bar.

1 2 |
_ x = sum from 1 to n (xi) . 1/n |

The arithmetic mean can be calculated for a vector or matrix in NumPy by using the mean() function.

The example below defines a 6-element vector and calculates the mean.

1 2 3 4 5 6 |
from numpy import array from numpy import mean v = array([1,2,3,4,5,6]) print(v) result = mean(v) print(result) |

Running the example first prints the defined vector and the mean of the values in the vector.

1 2 3 |
[1 2 3 4 5 6] 3.5 |

The mean function can calculate the row or column means of a matrix by specifying the axis argument and the value 0 or 1 respectively.

The example below defines a 2×6 matrix and calculates both column and row means.

1 2 3 4 5 6 7 8 |
from numpy import array from numpy import mean M = array([[1,2,3,4,5,6],[1,2,3,4,5,6]]) print(M) col_mean = mean(M, axis=0) print(col_mean) row_mean = mean(M, axis=1) print(row_mean) |

Running the example first prints the defined matrix, then the calculated column and row mean values.

1 2 3 4 5 6 |
[[1 2 3 4 5 6] [1 2 3 4 5 6]] [ 1. 2. 3. 4. 5. 6.] [ 3.5 3.5] |

## Variance

In probability, the variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean.

The variance is denoted as the function Var() on the variable.

1 |
Var[X] |

Variance is calculated as the average squared difference of each value in the distribution from the expected value. Or the expected squared difference from the expected value.

1 |
Var[X] = E[(X - E[X])^2] |

Assuming the expected value of the variable has been calculated (E[X]), the variance of the random variable can be calculated as the sum of the squared difference of each example from the expected value multiplied by the probability of that value.

1 |
Var[X] = sum (p(x1) . (x1 - E[X])^2, p(x2) . (x2 - E[X])^2, ..., p(x1) . (xn - E[X])^2) |

If the probability of each example in the distribution is equal, variance calculation can drop the individual probabilities and multiply the sum of squared differences by the reciprocal of the number of examples in the distribution.

1 |
Var[X] = sum ((x1 - E[X])^2, (x2 - E[X])^2, ...,(xn - E[X])^2) . 1/n |

In statistics, the variance can be estimated from a sample of examples drawn from the domain.

In the abstract, the sample variance is denoted by the lower case sigma with a 2 superscript indicating the units are squared, not that you must square the final value. The sum of the squared differences is multiplied by the reciprocal of the number of examples minus 1 to correct for a bias.

1 |
sigma^2 = sum from 1 to n ( (xi - mu)^2 ) . 1 / (n - 1) |

In NumPy, the variance can be calculated for a vector or a matrix using the var() function. By default, the var() function calculates the population variance. To calculate the sample variance, you must set the ddof argument to the value 1.

The example below defines a 6-element vector and calculates the sample variance.

1 2 3 4 5 6 |
from numpy import array from numpy import var v = array([1,2,3,4,5,6]) print(v) result = var(v, ddof=1) print(result) |

Running the example first prints the defined vector and then the calculated sample variance of the values in the vector.

1 2 3 |
[1 2 3 4 5 6] 3.5 |

The var function can calculate the row or column variances of a matrix by specifying the axis argument and the value 0 or 1 respectively, the same as the mean function above.

The example below defines a 2×6 matrix and calculates both column and row sample variances.

1 2 3 4 5 6 7 8 |
from numpy import array from numpy import var M = array([[1,2,3,4,5,6],[1,2,3,4,5,6]]) print(M) col_mean = var(M, ddof=1, axis=0) print(col_mean) row_mean = var(M, ddof=1, axis=1) print(row_mean) |

Running the example first prints the defined matrix and then the column and row sample variance values.

1 2 3 4 5 6 |
[[1 2 3 4 5 6] [1 2 3 4 5 6]] [ 0. 0. 0. 0. 0. 0.] [ 3.5 3.5] |

The standard deviation is calculated as the square root of the variance and is denoted as lowercase “s”.

1 |
s = sqrt(sigma^2) |

To keep with this notation, sometimes the variance is indicated as s^2, with 2 as a superscript, again showing that the units are squared.

NumPy also provides a function for calculating the standard deviation directly via the std() function. As with the var() function, the ddof argumentmust be set to 1 to calculate the unbiased sample standard deviation and column and row standard deviations can be calculated by setting the axis argument to 0 and 1 respectively.

The example below demonstrates how to calculate the sample standard deviation for the rows and columns of a matrix.

1 2 3 4 5 6 7 8 |
from numpy import array from numpy import std M = array([[1,2,3,4,5,6],[1,2,3,4,5,6]]) print(M) col_mean = std(M, ddof=1, axis=0) print(col_mean) row_mean = std(M, ddof=1, axis=1) print(row_mean) |

Running the example first prints the defined matrix and then the column and row sample standard deviation values.

1 2 3 4 5 6 |
[[1 2 3 4 5 6] [1 2 3 4 5 6]] [ 0. 0. 0. 0. 0. 0.] [ 1.87082869 1.87082869] |

## Covariance

In probability, covariance is the measure of the joint probability for two random variables. It describes how the two variables change together.

It is denoted as the function cov(X, Y), where X and Y are the two random variables being considered.

1 |
cov(X,Y) |

Covariance is calculated as expected value or average of the product of the differences of each random variable from their expected values, where E[X] is the expected value for X and E[Y] is the expected value of y.

1 |
cov(X, Y) = E[(X - E[X]) . (Y - E[Y])] |

Assuming the expected values for X and Y have been calculated, the covariance can be calculated as the sum of the difference of x values from their expected value multiplied by the difference of the y values from their expected values multiplied by the reciprocal of the number of examples in the population.

1 |
cov(X, Y) = sum (x - E[X]) * (y - E[Y]) * 1/n |

In statistics, the sample covariance can be calculated in the same way, although with a bias correction, the same as with the variance.

1 |
cov(X, Y) = sum (x - E[X]) * (y - E[Y]) * 1/(n - 1) |

The sign of the covariance can be interpreted as whether the two variables increase together (positive) or decrease together (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that both variables are completely independent.

NumPy does not have a function to calculate the covariance between two variables directly. Instead, it has a function for calculating a covariance matrix called cov() that we can use to retrieve the covariance. By default, the cov()function will calculate the unbiased or sample covariance between the provided random variables.

The example below defines two vectors of equal length with one increasing and one decreasing. We would expect the covariance between these variables to be negative.

We access just the covariance for the two variables as the [0,1] element of the square covariance matrix returned.

1 2 3 4 5 6 7 8 |
from numpy import array from numpy import cov x = array([1,2,3,4,5,6,7,8,9]) print(x) y = array([9,8,7,6,5,4,3,2,1]) print(y) Sigma = cov(x,y)[0,1] print(Sigma) |

Running the example first prints the two vectors followed by the covariance for the values in the two vectors. The value is negative, as we expected.

1 2 3 4 |
[1 2 3 4 5 6 7 8 9] [9 8 7 6 5 4 3 2 1] -7.5 |

The covariance can be normalized to a score between -1 and 1 to make the magnitude interpretable by dividing it by the standard deviation of X and Y. The result is called the correlation of the variables, also called the Pearson correlation coefficient, named for the developer of the method.

1 |
r = cov(X, Y) / sX sY |

Where r is the correlation coefficient of X and Y, cov(X, Y) is the sample covariance of X and Y and sX and sY are the standard deviations of X and Y respectively.

NumPy provides the corrcoef() function for calculating the correlation between two variables directly. Like cov(), it returns a matrix, in this case a correlation matrix. As with the results from cov() we can access just the correlation of interest from the [0,1] value from the returned squared matrix.

1 2 3 4 5 6 7 8 |
from numpy import array from numpy import corrcoef x = array([1,2,3,4,5,6,7,8,9]) print(x) y = array([9,8,7,6,5,4,3,2,1]) print(y) Sigma = corrcoef(x,y) print(Sigma) |

Running the example first prints the two defined vectors followed by the correlation coefficient. We can see that the vectors are maximally negatively correlated as we designed.

1 2 3 4 |
[1 2 3 4 5 6 7 8 9] [9 8 7 6 5 4 3 2 1] -1.0 |

## Covariance Matrix

The covariance matrix is a square and symmetric matrix that describes the covariance between two or more random variables.

The diagonal of the covariance matrix are the variances of each of the random variables.

A covariance matrix is a generalization of the covariance of two variables and captures the way in which all variables in the dataset may change together.

The covariance matrix is denoted as the uppercase Greek letter Sigma. The covariance for each pair of random variables is calculated as above.

1 |
Sigma = E[(X - E[X] . (Y - E[Y])] |

Where:

1 |
Sigma(ij) = cov(Xi, Xj) |

And X is a matrix where each column represents a random variable.

The covariance matrix provides a useful tool for separating the structured relationships in a matrix of random variables. This can be used to decorrelate variables or applied as a transform to other variables. It is a key element used in the Principal Component Analysis data reduction method, or PCA for short.

The covariance matrix can be calculated in NumPy using the cov() function. By default, this function will calculate the sample covariance matrix.

The cov() function can be called with a single matrix containing columns on which to calculate the covariance matrix, or two arrays, such as one for each variable.

Below is an example that defines two 9-element vectors and calculates the unbiased covariance matrix from them.

1 2 3 4 5 6 7 8 |
from numpy import array from numpy import cov x = array([1,2,3,4,5,6,7,8,9]) print(x) y = array([9,8,7,6,5,4,3,2,1]) print(y) Sigma = cov(x,y) print(Sigma) |

Running the example first prints the two vectors and then the calculated covariance matrix.

The values of the arrays were contrived such that as one variable increases, the other decreases. We would expect to see a negative sign on the covariance for these two variables, and this is what we see in the covariance matrix.

1 2 3 4 5 6 |
[1 2 3 4 5 6 7 8 9] [9 8 7 6 5 4 3 2 1] [[ 7.5 -7.5] [-7.5 7.5]] |

The covariance matrix is used widely in linear algebra and the intersection of linear algebra and statistics called multivariate analysis. We have only had a small taste in this post.

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

- Explore each example using your own small contrived data.
- Load data from a CSV file and apply each operation to the data columns.
- Write your own functions to implement each statistical operation.

If you explore any of these extensions, I’d love to know.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

- Applied Multivariate Statistical Analysis, 2012.
- Applied Multivariate Statistical Analysis, 2015.
- Chapter 12 Linear Algebra in Probability & Statistics, Introduction to Linear Algebra, Fifth Edition, 2016.
- Chapter 3, Probability and Information Theory, Deep Learning, 2016.

### API

- NumPy Statistics Functions
- numpy.mean() API
- numpy.var() API
- numpy.std() API
- numpy.cov() API
- numpy.corrcoef() API

### Articles

- Expected value on Wikipedia
- Mean on Wikipedia
- Variance on Wikipedia
- Standard deviation on Wikipedia
- Covariance on Wikipedia
- Sample mean and covariance
- Pearson correlation coefficient
- Covariance matrix on Wikipedia
- Estimation of covariance matrices on Wikipedia

### Posts

## Summary

In this tutorial, you discovered how fundamental statistical operations work and how to implement them using NumPy with notation and terminology from linear algebra.

Specifically, you learned:

- What the expected value, average, and mean are and how to calculate then.
- What the variance and standard deviation are and how to calculate them.
- What the covariance, correlation, and covariance matrix are and how to calculate them.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Hi Jason

Don’t understand something. Immediately below “The example below defines a 6-element vector and calculates the sample variance.” is a code block that purports to compute the variance.

Under that block the answer is shown as var = 3.5. This is not what I get. Going back to the definition of variance, I get var = 2.9167.

I’m confused.

Sorry to hear that. I can confirm that the code and result as listed are correct.

Are you able to confirm that you copied the code exactly?

Also, are you able to confirm that your Python libraries are up to date?

Hi Jason,

Just spotted a typo – the definition of covariance (second code block beneath the Covariance heading) is missing a closing parenthesis. It reads:

cov(X, Y) = E[(X – E[X] . (Y – E[Y])]

but should read:

cov(X, Y) = E[(X – E[X]) . (Y – E[Y])]

Cheers,

Ari

Thanks. Fixed.

x = array([1,2,3,4,5,6,7,8,9])

python does not need the array. Use python list.

Amazing article anyway.

Thanks!

Thanks.

What about non-linear distributions? Like set of data generated in exponential distribution. Python already take into account in the list? Thanks!

What do you mean exactly “in the list”?