**Computational Linear Algebra for Coders Review**

Numerical linear algebra is an area that requires some previous experience with linear algebra and is focused on both the performance and precision of matrix operations. The company fast.ai released a free course titled “*Computational Linear Algebra*” on the topic, including Python notebooks and video lectures recorded at the University of San Francisco.

In this post, you will discover the fast.ai free course on computational linear algebra.

After reading this post, you will know:

- The motivation and prerequisites for the course.
- An overview of the topics covered in the course.
- Who exactly this course is a good fit for, and who it is not.

Let’s get started.

The course “*Computational Linear Algebra for Coders*” is a free online course provided by fast.ai, a company dedicated to providing free education resources related to deep learning.

The course was originally taught in 2017 by Rachel Thomas at the University of San Francisco as part of a master’s degree program. Rachel Thomas is a professor at the University of San Francisco, a co-founder of fast.ai, and holds a Ph.D. in mathematics.

The focus of the course is numerical methods for linear algebra: the application of matrix algebra on computers, addressing all of the concerns around the implementation and use of the methods, such as performance and precision.

This course is focused on the question: How do we do matrix computations with acceptable speed and acceptable accuracy?

The course uses Python, with examples based on NumPy, scikit-learn, Numba, PyTorch, and more.

The material is taught using a top-down approach, much like Machine Learning Mastery, intended to give a feeling for how to do things before explaining how the methods work.

Knowing how these algorithms are implemented will allow you to better combine and utilize them, and will make it possible for you to customize them if needed.

The course does assume familiarity with linear algebra.

This includes topics such as vectors and matrices, and operations such as matrix multiplication and transforms.

The course is not for novices to the field of linear algebra.

Three references are suggested for you to review prior to taking the course if you are new or rusty with linear algebra. They are:

- 3Blue1Brown, Essence of Linear Algebra, Video Course
- Immersive Linear Algebra, Interactive Textbook
- Chapter 2 of Deep Learning, 2016.

Further, while working through the course, references are provided as needed.

Two general reference texts are suggested up front. They are the following textbooks:

- Numerical Linear Algebra, 1997.
- Numerical Methods, 2012.

This section provides a summary of the 9 parts of the course. They are:

- 0. Course Logistics
- 1. Why are we here?
- 2. Topic Modeling with NMF and SVD
- 3. Background Removal with Robust PCA
- 4. Compressed Sensing with Robust Regression
- 5. Predicting Health Outcomes with Linear Regressions
- 6. How to Implement Linear Regression
- 7. PageRank with Eigen Decompositions
- 8. Implementing QR Factorization

Really, there are only 8 parts to the course, as the first is just administrative details for the students who took the course at the University of San Francisco.

In this section, we will step through the 9 parts of the course and summarize their contents and topics covered to give you a feel for what to expect and to see whether it is a good fit for you.

This first lecture is not really part of the course.

It provides an introduction to the lecturer, the material, the way it will be taught, and the expectations of the student in the masters program.

I’ll be using a top-down teaching method, which is different from how most math courses operate. Typically, in a bottom-up approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures. The problems with this are that students often lose motivation, don’t have a sense of the “big picture”, and don’t know what they’ll need.

The topics covered in this lecture are:

- Lecturer background
- Teaching Approach
- Importance of Technical Writing
- List of Excellent Technical Blogs
- Linear Algebra Review Resources

Videos and Notebook:

This part introduces the motivation for the course and touches on the importance of matrix factorization, the performance and accuracy of these calculations, and some example applications.

Matrices are everywhere: anything that can be put in an Excel spreadsheet is a matrix, and language and pictures can be represented as matrices as well.

A great point made in this lecture is how the whole class of matrix factorization methods and one specific method, the QR decomposition, were reported as being among the top 10 most important algorithms of the 20th century.

A list of the top 10 algorithms of science and engineering during the 20th century includes the matrix decompositions approach to linear algebra, as well as the QR algorithm.

The topics covered in this lecture are:

- Matrix and Tensor Products
- Matrix Decompositions
- Accuracy
- Memory use
- Speed
- Parallelization & Vectorization

Videos and Notebook:

This part focuses on the use of matrix factorization in the application to topic modeling for text, specifically the Singular Value Decomposition method, or SVD.

Useful in this part are the comparisons between calculating the methods from scratch with NumPy and calculating them with the scikit-learn library.

Topic modeling is a great way to get started with matrix factorizations.
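
To make the ingredients concrete, here is a minimal sketch of the pipeline this part builds up, using scikit-learn conveniences (the tiny document set is invented for illustration; the course develops these steps in far more detail):

```python
# a tiny topic modeling sketch: TF-IDF features, then a truncated SVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "the dog sat on the log",
        "stocks fell sharply today", "markets and stocks rallied"]
X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
topics = TruncatedSVD(n_components=2).fit_transform(X)
print(topics.shape)  # (4, 2): each document described by 2 latent topics
```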

The topics covered in this lecture are:

- Term Frequency-Inverse Document Frequency (TF-IDF)
- Singular Value Decomposition (SVD)
- Non-negative Matrix Factorization (NMF)
- Stochastic Gradient Descent (SGD)
- Intro to PyTorch
- Truncated SVD

Videos and Notebook:

- Computational Linear Algebra 2: Topic Modelling with SVD & NMF
- Computational Linear Algebra 3: Review, New Perspective on NMF, & Randomized SVD
- Notebook

This part focuses on the Principal Component Analysis method, or PCA, that uses the eigendecomposition and multivariate statistics.

The focus is on using PCA on image data such as separating background from foreground to isolate changes. This part also introduces the LU decomposition from scratch.

When dealing with high-dimensional data sets, we often leverage the fact that the data has low intrinsic dimensionality in order to alleviate the curse of dimensionality and scale (perhaps it lies in a low-dimensional subspace or on a low-dimensional manifold).
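
As a taste of the LU material, below is a minimal sketch using SciPy; the course implements the factorization from scratch, whereas this only demonstrates the decomposition itself (the matrix is invented):

```python
# LU factorization with partial pivoting: A = P L U
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
P, L, U = lu(A)  # permutation, lower triangular, upper triangular
print(np.allclose(P @ L @ U, A))  # True: the factors reconstruct A
```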

The topics covered in this lecture are:

- Load and View Video Data
- SVD
- Principal Component Analysis (PCA)
- L1 Norm Induces Sparsity
- Robust PCA
- LU factorization
- Stability of LU
- LU factorization with Pivoting
- History of Gaussian Elimination
- Block Matrix Multiplication

Videos and Notebook:

- Computational Linear Algebra 3: Review, New Perspective on NMF, & Randomized SVD
- Computational Linear Algebra 4: Randomized SVD & Robust PCA
- Computational Linear Algebra 5: Robust PCA & LU Factorization
- Notebook

This part introduces the important concepts of broadcasting used in NumPy arrays (and elsewhere) and sparse matrices that crop up a lot in machine learning.

The application focus of this part is compressed sensing, demonstrated on CT scan images.

The term broadcasting describes how arrays with different shapes are treated during arithmetic operations. The term broadcasting was first used by NumPy, although it is now used in other libraries such as TensorFlow and Matlab; the rules can vary by library.

The topics covered in this lecture are:

- Broadcasting
- Sparse matrices
- CT Scans and Compressed Sensing
- L1 and L2 regression

Videos and Notebook:

- Computational Linear Algebra 6: Block Matrix Mult, Broadcasting, & Sparse Storage
- Computational Linear Algebra 7: Compressed Sensing for CT Scans
- Notebook

This part focuses on the development of linear regression models demonstrated with scikit-learn.

The Numba library is also used to demonstrate how to speed up the matrix operations involved.

We would like to speed this up. We will use Numba, a Python library that compiles Python code directly to fast machine code.
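
For a flavor of what this looks like in practice (this sketch is not taken from the course notebook), Numba’s JIT decorator can compile a hand-written loop:

```python
# a hand-written loop compiled to fast machine code with Numba's @njit
import numpy as np
from numba import njit

@njit
def row_sums(X):
    m, n = X.shape
    out = np.empty(m)
    for i in range(m):
        s = 0.0
        for j in range(n):
            s += X[i, j]
        out[i] = s
    return out

print(row_sums(np.arange(6.0).reshape(2, 3)))  # [ 3. 12.]
```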

The topics covered in this lecture are:

- Linear regression in sklearn
- Polynomial Features
- Speeding up with Numba
- Regularization and Noise

Videos and Notebook:

- Computational Linear Algebra 8: Numba, Polynomial Features, How to Implement Linear Regression
- Notebook

This part looks at how to solve linear least squares for linear regression using a suite of different matrix factorization methods. Results are compared to the implementation in scikit-learn.

Linear regression via QR has been recommended by numerical analysts as the standard method for years. It is natural, elegant, and good for “daily use”.
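
A minimal sketch of the QR route in NumPy, on an invented toy dataset: factorize the data matrix, then solve the triangular system R b = Q^T y.

```python
# least squares linear regression via the QR factorization
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # ones column models the intercept
y = np.array([1.1, 1.9, 3.2, 3.9])
Q, R = np.linalg.qr(A)           # reduced QR factorization
b = np.linalg.solve(R, Q.T @ y)  # back-substitute for the coefficients
print(b)                         # intercept and slope of the best fit line
```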

The topics covered in this lecture are:

- How did Scikit Learn do it?
- Naive solution
- Normal equations and Cholesky factorization
- QR factorization
- SVD
- Timing Comparison
- Conditioning & Stability
- Full vs Reduced Factorizations
- Matrix Inversion is Unstable

Videos and Notebook:

- Computational Linear Algebra 8: Numba, Polynomial Features, How to Implement Linear Regression
- Notebook

This part introduces the eigendecomposition and the implementation and application of the PageRank algorithm to a Wikipedia links dataset.

The QR algorithm uses something called the QR decomposition. Both are important, so don’t get them confused.
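
The power method covered in this part is simple enough to sketch in a few lines; repeated multiplication by a (toy, invented) column-stochastic link matrix converges to the dominant eigenvector, which is the PageRank vector:

```python
# the power method: iterate v <- A v to find the dominant eigenvector
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])     # toy column-stochastic "link" matrix
v = np.ones(2) / 2             # start from a uniform vector
for _ in range(50):
    v = A @ v
    v /= np.linalg.norm(v, 1)  # renormalize so the entries sum to 1
print(v)                       # approaches [2/3, 1/3], the dominant eigenvector
```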

The topics covered in this lecture are:

- SVD
- DBpedia Dataset
- Power Method
- QR Algorithm
- Two-phase approach to finding eigenvalues
- Arnoldi Iteration

Videos and Notebook:

- Computational Linear Algebra 9: PageRank with Eigen Decompositions
- Computational Linear Algebra 10: QR Algorithm to find Eigenvalues, Implementing QR Decomposition
- Notebook

This final part introduces three ways to implement the QR decomposition from scratch and compares the precision and performance of each method.

We used QR factorization in computing eigenvalues and to compute least squares regression. It is an important building block in numerical linear algebra.
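
For a flavor of the from-scratch material, below is a rough sketch of QR via classical Gram-Schmidt (illustrative only; the course also covers Householder reflections, which are preferred for numerical stability):

```python
# QR factorization via classical Gram-Schmidt orthogonalization
import numpy as np

def gram_schmidt_qr(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].astype(float)
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]  # component along an earlier basis vector
            v -= R[i, j] * Q[:, i]       # subtract that component out
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]            # normalize what remains
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))  # True: the factors reconstruct A
```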

The topics covered in this lecture are:

- Gram-Schmidt
- Householder
- Stability Examples

Videos and Notebook:

- Computational Linear Algebra 10: QR Algorithm to find Eigenvalues, Implementing QR Decomposition
- Notebook

I think the course is excellent.

A fun walk through numerical linear algebra with a focus on applications and executable code.

The course delivers on the promise of focusing on the practical concerns of matrix operations, such as memory, speed, and precision (numerical stability). The course begins with a careful look at issues of floating point precision and overflow.
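
Two one-line illustrations of the kind of precision issue the course opens with:

```python
# floating point arithmetic is approximate, not exact
print(0.1 + 0.2 == 0.3)  # False: neither side is exactly representable
print(1e16 + 1 == 1e16)  # True: the added 1 is lost to rounding at this scale
```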

Throughout the course, comparisons are frequently made between methods in terms of execution speed.

This course is not an introduction to linear algebra for developers, and if that is the expectation going in, you may be left behind.

The course does assume a reasonable fluency with the basics of linear algebra, notation, and operations, and it states this assumption up front.

I don’t think this course is required if you are interested in deep learning or learning more about the linear algebra operations used in deep learning methods.

If you are implementing matrix algebra methods in your own work and you’re looking to get more out of them, I would highly recommend this course.

I would also recommend this course if you are generally interested in the practical implications of matrix algebra.

This section provides more resources on the topic if you are looking to go deeper.

- New fast.ai course: Computational Linear Algebra
- Computational Linear Algebra on GitHub
- Computational Linear Algebra Video Lectures
- Community Forums

- 3Blue 1Brown Essence of Linear Algebra, Video Course
- Immersive Linear Algebra, Interactive Textbook
- Chapter 2 of Deep Learning
- Numerical Linear Algebra, 1997.
- Numerical Methods, 2012.

In this post, you discovered the fast.ai free course on computational linear algebra.

Specifically, you learned:

- The motivation and prerequisites for the course.
- An overview of the topics covered in the course.
- Who exactly this course is a good fit for, and who it is not.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

**Introduction to Linear Algebra by Gilbert Strang for Machine Learning**

The textbook “*Introduction to Linear Algebra*” by Gilbert Strang provides a reference for his linear algebra course, taught to undergraduate students at MIT.

In this post, you will discover the book “*Introduction to Linear Algebra*” by Gilbert Strang and how you can make the best use of it as a machine learning practitioner.

After reading this post, you will know:

- About the goals and benefits of the book to a beginner or practitioner.
- The contents of the book and general topics presented in each chapter.
- A selected reading list targeted for machine learning practitioners looking to get up to speed fast.

Let’s get started.

Gilbert Strang teaches an introductory course to linear algebra at MIT.

His textbook titled “Introduction to Linear Algebra” is designed to support this course. His course and this textbook are widely regarded and often the first book recommended to undergraduate students looking to learn linear algebra.

The book does assume some mathematical background, namely some calculus and familiarity with vectors and matrices.

18.02 Multivariable Calculus is a formal prerequisite for MIT students wishing to enroll in 18.06 Linear Algebra, but knowledge of calculus is not required to learn the subject. […] To succeed in this course you will need to be comfortable with vectors, matrices, and three-dimensional coordinate systems.

— Prerequisites, Linear Algebra, MITOpenCourseware

His book is excellent, if a little theoretical.

It can be used as a good starting point for machine learning practitioners interested in getting started with or brushing up on their linear algebra. I think this is still the case even if you do not have a background in calculus.

The century of data has begun! […] The truth is that vectors and matrices have become the language to know.

— Page ix, Introduction to Linear Algebra, Fifth Edition, 2016.

Concepts in the book are laid out clearly, often with diagrams, but the book moves quickly. The book expects you to keep up or you will fall behind.

That being said, each section has an overview of the concepts to be covered and ends with worked examples and quiz questions, the answers to which are available on the book’s website.

This section provides a summary of the table of contents of the book.

- Chapter 1: Introduction to Vectors
- Chapter 2: Solving Linear Equations
- Chapter 3: Vector Spaces and Subspaces
- Chapter 4: Orthogonality
- Chapter 5: Determinants
- Chapter 6: Eigenvalues and Eigenvectors
- Chapter 7: The Singular Value Decomposition (SVD)
- Chapter 8: Linear Transformations
- Chapter 9: Complex Vectors and Matrices
- Chapter 10: Applications
- Chapter 11: Numerical Linear Algebra
- Chapter 12: Linear Algebra in Probability & Statistics

The book’s homepage provides a fuller summary including chapter sections.

The back cover provides a beautiful and elegant way of describing the goal of the book:

This book is designed to help students understand and solve the four central problems of linear algebra:

- Ax = b (n by n), Chapters 1-2: Linear Systems
- Ax = b (m by n), Chapters 3-4: Least Squares
- Ax = λx (n by n), Chapters 5-6: Eigenvalues
- Av = σu (m by n), Chapters 7-8: Singular Values

The book is excellent and I recommend reading it from cover-to-cover, if you’re really into it.

But, as a machine learning practitioner, you do not need to read it all.

Below is a list of selected reading from the book that I recommend to get on top of linear algebra fast:

- Section 1.1 Vectors and Linear Combinations
- Section 1.2 Lengths and Dot Products
- Section 1.3 Matrices
- Section 2.4 Rules for Matrix Operations
- Section 2.5 Inverse Matrices
- Section 2.6 Elimination = Factorization: A = LU
- Section 2.7 Transposes and Permutations
- Section 4.3 Least Squares Approximations
- Section 5.1 The Properties of Determinants
- Section 6.1 Introduction to Eigenvalues
- Section 6.2 Diagonalizing a Matrix
- Section 6.4 Symmetric Matrices
- Section 7.1 Image Processing by Linear Algebra
- Section 7.2 Bases and Matrices in the SVD
- Section 7.3 Principal Component Analysis (PCA by the SVD)
- Section 12.1 Mean, Variance, and Probability
- Section 12.2 Covariance Matrices and Joint Probabilities

Further, I would make the following recommendations:

- Attempt the end of section questions and check your answers.
- Consider implementing the methods directly in Python using NumPy function calls.
- Research and find examples where some or all of these operations are used in machine learning algorithms, papers, or textbooks.

Did you explore these extensions?

Post your findings in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Linear Algebra, Fifth Edition, 2016.
- Book homepage
- MIT Course 18.06 Linear Algebra Course homepage
- MIT Course 18.06 Linear Algebra Course on MITOpenCourseware (2011)
- Gilbert Strang’s homepage
- Course Videos on YouTube (2005)

In this post, you discovered the book “Introduction to Linear Algebra” by Gilbert Strang and how you can make the best use of it as a machine learning practitioner.

Specifically, you learned:

- About the goals and benefits of the book to a beginner or practitioner.
- The contents of the book and general topics presented in each chapter.
- A selected reading list targeted for machine learning practitioners looking to get up to speed fast.

Have you read this book? What did you think?

Let me know in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

**Linear Algebra for Deep Learning**

Generally, an understanding of linear algebra (or parts thereof) is presented as a prerequisite for machine learning. Although important, this area of mathematics is seldom covered by computer science or software engineering degree programs.

In their seminal textbook on deep learning, Ian Goodfellow and others present chapters covering the prerequisite mathematical concepts for deep learning, including a chapter on linear algebra.

In this post, you will discover the crash course in linear algebra for deep learning presented in the de facto textbook on deep learning.

After reading this post, you will know:

- The topics suggested as prerequisites for deep learning by experts in the field.
- The progression through these topics and their culmination.
- Suggestions for how to get the most out of the chapter as a crash course in linear algebra.

Let’s get started.

The book “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville is the de facto textbook for deep learning.

In the book, the authors provide a part titled “*Applied Math and Machine Learning Basics*” intended to provide the background in applied mathematics and machine learning required to understand the deep learning material presented in the rest of the book.

This part of the book includes four chapters; they are:

- Linear Algebra
- Probability and Information Theory
- Numerical Computation
- Machine Learning Basics

Given the expertise of the authors of the book, it is fair to say that the chapter on linear algebra provides a well-reasoned set of prerequisites for deep learning, and perhaps more generally for much of machine learning.

This part of the book introduces the basic mathematical concepts needed to understand deep learning.

— Page 30, Deep Learning, 2016.

Therefore, we can use the topics covered in the chapter on linear algebra as a guide to the topics you may be expected to be familiar with as a deep learning and machine learning practitioner.

Linear algebra is less likely to be covered in computer science courses than other types of math, such as discrete mathematics. This is specifically called out by the authors.

Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it.

— Page 31, Deep Learning, 2016.

We can assume that the topics in this chapter are laid out in a way tailored to computer science graduates with little to no prior exposure.

The chapter on linear algebra is divided into 12 sections.

As a first step, it is useful to use this as a high-level road map. The complete list of sections from the chapter are listed below.

- Scalars, Vectors, Matrices and Tensors
- Multiplying Matrices and Vectors
- Identity and Inverse Matrices
- Linear Dependence and Span
- Norms
- Special Kinds of Matrices and Vectors
- Eigendecomposition
- Singular Value Decomposition
- The Moore-Penrose Pseudoinverse
- The Trace Operator
- The Determinant
- Example: Principal Components Analysis

There’s not much value in enumerating the specifics covered in each section, as the topics are mostly self-explanatory, if familiar.

A reading of the chapter shows a progression in concepts and methods from the most primitive (vectors and matrices) to the derivation of the principal components analysis (known as PCA), a method used in machine learning.

It is a clean progression and well designed. Topics are presented with textual descriptions and consistent notation, allowing the reader to see exactly how elements come together through matrix factorization, the pseudoinverse, and ultimately PCA.

The focus is on the application of the linear algebra operations rather than theory. However, no worked examples are given for any of the operations.
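
A motivated reader can easily supply worked examples; for instance, here is a minimal sketch of the chapter’s Moore-Penrose pseudoinverse, which NumPy computes internally via the SVD:

```python
# the pseudoinverse of a non-square matrix
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
A_pinv = np.linalg.pinv(A)                 # computed via the SVD
print(np.allclose(A_pinv @ A, np.eye(2)))  # True: a left inverse of A
```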

Finally, the derivation of PCA is perhaps a bit much. A beginner may want to skip this full derivation, or perhaps reduce it to the application of some of the elements learned throughout the chapter (e.g. eigendecomposition).

One area I would like to have seen covered is linear least squares and the various matrix algebra methods used to solve it, such as solving it directly, or via the LU decomposition, QR decomposition, or SVD. This might be more of a general machine learning perspective and less a deep learning perspective, and I can see why it was excluded.

The authors also suggest two other texts to consult if further depth in linear algebra is required.

They are:

- The Matrix Cookbook, Petersen and Pedersen, 2006.
- Linear Algebra, Shilov, 1977.

The Matrix Cookbook is a free PDF filled with the notations and equations of practically any matrix operation you can conceive.

These pages are a collection of facts (identities, approximations, inequalities, relations, …) about matrices and matters relating to them. It is collected in this form for the convenience of anyone who wants a quick desktop reference.

— page 2, The Matrix Cookbook, 2006.

Linear Algebra by Georgi Shilov is a classic and well regarded textbook on the topics designed for undergraduate students.

This book is intended as a text for undergraduate students majoring in mathematics and physics.

— Page v, Linear Algebra, 1977.

If you are a machine learning practitioner looking to use this chapter as a linear algebra crash course, then I would make a few recommendations to make the topics more concrete:

- Implement each operation in Python using NumPy functions on small contrived data.
- Implement each operation manually in Python without NumPy functions.
- Apply key operations, such as the factorization methods (eigendecomposition and SVD) and PCA to real but small datasets loaded from CSV.
- Create a cheat sheet of notation that you can use as a quick reference going forward.
- Research and list examples of each operation/topic used in machine learning papers or texts.

Did you take on any of these suggestions?

List your results in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- The Matrix Cookbook, Petersen and Pedersen, 2006.
- Linear Algebra, Shilov, 1977.

In this post, you discovered the crash course in linear algebra for deep learning presented in the de facto textbook on deep learning.

Specifically, you learned:

- The topics suggested as prerequisites for deep learning by experts in the field.
- The progression through these topics and their culmination.
- Suggestions for how to get the most out of the chapter as a crash course in linear algebra.

Did you read this chapter of the Deep Learning book? What did you think of it?

Let me know in the comments below.

**A Gentle Introduction to Sparse Matrices for Machine Learning**

Large sparse matrices are common in general and especially in applied machine learning, such as in data that contains counts, data encodings that map categories to counts, and even in whole subfields of machine learning such as natural language processing.

It is computationally expensive to represent and work with sparse matrices as though they are dense, and much improvement in performance can be achieved by using representations and operations that specifically handle the matrix sparsity.

In this tutorial, you will discover sparse matrices, the issues they present, and how to work with them directly in Python.

After completing this tutorial, you will know:

- That sparse matrices contain mostly zero values and are distinct from dense matrices.
- The myriad of areas where you are likely to encounter sparse matrices in data, data preparation, and sub-fields of machine learning.
- That there are many efficient ways to store and work with sparse matrices and SciPy provides implementations that you can use directly.

Let’s get started.

This tutorial is divided into 5 parts; they are:

- Sparse Matrix
- Problems with Sparsity
- Sparse Matrices in Machine Learning
- Working with Sparse Matrices
- Sparse Matrices in Python

A sparse matrix is a matrix that is comprised of mostly zero values.

Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices.

A matrix is sparse if many of its coefficients are zero. The interest in sparsity arises because its exploitation can lead to enormous computational savings and because many large matrix problems that occur in practice are sparse.

— Page 1, Direct Methods for Sparse Matrices, Second Edition, 2017.

The sparsity of a matrix can be quantified with a score, which is the number of zero values in the matrix divided by the total number of elements in the matrix.

sparsity = count zero elements / total elements

Below is an example of a small 3 x 6 sparse matrix.

```
     1, 0, 0, 1, 0, 0
A = (0, 0, 2, 0, 0, 1)
     0, 0, 0, 2, 0, 0
```

The example has 13 zero values of the 18 elements in the matrix, giving this matrix a sparsity score of 0.722 or about 72%.

Sparse matrices can cause problems with regards to space and time complexity.

Very large matrices require a lot of memory, and some very large matrices that we wish to work with are sparse.

In practice, most large matrices are sparse — almost all entries are zeros.

— Page 465, Introduction to Linear Algebra, Fifth Edition, 2016.

An example of a very large matrix that is too large to be stored in memory is a link matrix that shows the links from one website to another.

An example of a smaller sparse matrix might be a word or term occurrence matrix for words in one book against all known words in English.

In both cases, the matrix is sparse, with many more zero values than data values. The problem with representing these sparse matrices as dense matrices is that memory must be allocated for each 32-bit or even 64-bit zero value in the matrix.

This is clearly a waste of memory resources as those zero values do not contain any information.

Assuming a very large sparse matrix can be fit into memory, we will want to perform operations on this matrix.

Simply put, if the matrix contains mostly zero values, i.e. no data, then performing operations across this matrix may take a long time, where the bulk of the computation performed will involve adding or multiplying zero values together.

It is wasteful to use general methods of linear algebra on such problems, because most of the O(N^3) arithmetic operations devoted to solving the set of equations or inverting the matrix involve zero operands.

— Page 75, Numerical Recipes: The Art of Scientific Computing, Third Edition, 2007.

This is a problem of time complexity: the cost of matrix operations grows with the size of the matrix.

This problem is compounded when we consider that even trivial machine learning methods may require many operations on each row, column, or even across the entire matrix, resulting in vastly longer execution times.

Sparse matrices turn up a lot in applied machine learning.

In this section, we will look at some common examples to motivate you to be aware of the issues of sparsity.

Sparse matrices come up in some specific types of data, most notably observations that record the occurrence or count of an activity.

Three examples include:

- Whether or not a user has watched a movie in a movie catalog.
- Whether or not a user has purchased a product in a product catalog.
- Count of the number of listens of a song in a song catalog.

Sparse matrices come up in encoding schemes used in the preparation of data.

Three common examples include:

- One-hot encoding, used to represent categorical data as sparse binary vectors.
- Count encoding, used to represent the frequency of words in a vocabulary for a document.
- TF-IDF encoding, used to represent normalized word frequency scores in a vocabulary.

Some areas of study within machine learning must develop specialized methods to address sparsity directly as the input data is almost always sparse.

Three examples include:

- Natural language processing for working with documents of text.
- Recommender systems for working with product usage within a catalog.
- Computer vision when working with images that contain lots of black pixels.

If there are 100,000 words in the language model, then the feature vector has length 100,000, but for a short email message almost all the features will have count zero.

— Page 22, Artificial Intelligence: A Modern Approach, Third Edition, 2009.

The solution to representing and working with sparse matrices is to use an alternate data structure to represent the sparse data.

The zero values can be ignored and only the data or non-zero values in the sparse matrix need to be stored or acted upon.

There are multiple data structures that can be used to efficiently construct a sparse matrix; three common examples are listed below.

- **Dictionary of Keys**. A dictionary is used where a row and column index is mapped to a value.
- **List of Lists**. Each row of the matrix is stored as a list, with each sublist containing the column index and the value.
- **Coordinate List**. A list of tuples is stored, with each tuple containing the row index, column index, and the value.
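
All three structures above are available in SciPy; the sketch below builds the same 3 x 6 matrix from the earlier example with each one:

```python
# building one small sparse matrix with three SciPy structures
from scipy.sparse import dok_matrix, lil_matrix, coo_matrix

# Dictionary of Keys: assign values by (row, column) index
D = dok_matrix((3, 6))
D[0, 0], D[0, 3], D[1, 2], D[1, 5], D[2, 3] = 1, 1, 2, 1, 2

# List of Lists: efficient for incremental, row-wise construction
L = lil_matrix((3, 6))
L[0, 0], L[0, 3], L[1, 2], L[1, 5], L[2, 3] = 1, 1, 2, 1, 2

# Coordinate List: parallel lists of values, row indices, and column indices
C = coo_matrix(([1, 1, 2, 1, 2], ([0, 0, 1, 1, 2], [0, 3, 2, 5, 3])), shape=(3, 6))

print((D.toarray() == C.toarray()).all())  # True: identical matrices
```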

There are also data structures that are more suitable for performing efficient operations; two commonly used examples are listed below.

- **Compressed Sparse Row**. The sparse matrix is represented using three one-dimensional arrays for the non-zero values, the extents of the rows, and the column indexes.
- **Compressed Sparse Column**. The same as the Compressed Sparse Row method, except the column indices are compressed and read first, before the row indices.

The Compressed Sparse Row, also called CSR for short, is often used to represent sparse matrices in machine learning given the efficient access and matrix multiplication that it supports.

SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix.

Many linear algebra NumPy and SciPy functions that operate on NumPy arrays can transparently operate on SciPy sparse arrays. Further, machine learning libraries that use NumPy data structures can also operate transparently on SciPy sparse arrays, such as scikit-learn for general machine learning and Keras for deep learning.

A dense matrix stored in a NumPy array can be converted into a sparse matrix using the CSR representation by calling the *csr_matrix()* function.

In the example below, we define a 3 x 6 sparse matrix as a dense array, convert it to a CSR sparse representation, and then convert it back to a dense array by calling the *todense()* function.

```python
# dense to sparse
from numpy import array
from scipy.sparse import csr_matrix
# create dense matrix
A = array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
print(A)
# convert to sparse matrix (CSR method)
S = csr_matrix(A)
print(S)
# reconstruct dense matrix
B = S.todense()
print(B)
```

Running the example first prints the defined dense array, followed by the CSR representation, and then the reconstructed dense matrix.

```
[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]

  (0, 0)  1
  (0, 3)  1
  (1, 2)  2
  (1, 5)  1
  (2, 3)  2

[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]
```

NumPy does not provide a function to calculate the sparsity of a matrix.

Nevertheless, we can calculate it easily by first finding the density of the matrix and subtracting it from one. The number of non-zero elements in a NumPy array can be given by the *count_nonzero()* function, and the total number of elements in the array can be given by the *size* property of the array. Array sparsity can therefore be calculated as:

sparsity = 1.0 - count_nonzero(A) / A.size

The example below demonstrates how to calculate the sparsity of an array.

```python
# calculate sparsity
from numpy import array
from numpy import count_nonzero
# create dense matrix
A = array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
print(A)
# calculate sparsity
sparsity = 1.0 - count_nonzero(A) / A.size
print(sparsity)
```

Running the example first prints the defined sparse matrix followed by the sparsity of the matrix.

```
[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]

0.7222222222222222
```

This section lists some ideas for extending the tutorial that you may wish to explore.

- Develop your own examples for converting a dense array to sparse and calculating sparsity.
- Develop an example for each sparse matrix representation method supported by SciPy.
- Select one sparsity representation method and implement it yourself from scratch.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Linear Algebra, Fifth Edition, 2016.
- Section 2.7 Sparse Linear Systems, Numerical Recipes: The Art of Scientific Computing, Third Edition, 2007.
- Artificial Intelligence: A Modern Approach, Third Edition, 2009.
- Direct Methods for Sparse Matrices, Second Edition, 2017.

- Sparse matrices (scipy.sparse) API
- scipy.sparse.csr_matrix() API
- numpy.count_nonzero() API
- numpy.ndarray.size API

In this tutorial, you discovered sparse matrices, the issues they present, and how to work with them directly in Python.

Specifically, you learned:

- That sparse matrices contain mostly zero values and are distinct from dense matrices.
- The myriad of areas where you are likely to encounter sparse matrices in data, data preparation, and sub-fields of machine learning.
- That there are many efficient ways to store and work with sparse matrices and SciPy provides implementations that you can use directly.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

**A Gentle Introduction to Broadcasting with NumPy Arrays**

Arrays with different sizes cannot generally be used directly in arithmetic. A way to overcome this is to duplicate the smaller array so that it has the same dimensionality and size as the larger array. This is called array broadcasting and is available in NumPy when performing array arithmetic, which can greatly reduce and simplify your code.

In this tutorial, you will discover the concept of array broadcasting and how to implement it in NumPy.

After completing this tutorial, you will know:

- The problem of arithmetic with arrays with different sizes.
- The solution of broadcasting and common examples in one and two dimensions.
- The rule of array broadcasting and when broadcasting fails.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Limitation with Array Arithmetic
- Array Broadcasting
- Broadcasting in NumPy
- Limitations of Broadcasting

You can perform arithmetic directly on NumPy arrays, such as addition and subtraction.

For example, two arrays can be added together to create a new array where the values at each index are added together.

For example, an array a can be defined as [1, 2, 3] and array b can be defined as [1, 2, 3] and adding together will result in a new array with the values [2, 4, 6].

```
a = [1, 2, 3]
b = [1, 2, 3]
c = a + b
c = [1 + 1, 2 + 2, 3 + 3]
```

Strictly, arithmetic may only be performed on arrays that have the same number of dimensions and the same size in each dimension.

This means that a one-dimensional array with the length of 10 can only perform arithmetic with another one-dimensional array with the length 10.

This limitation on array arithmetic is quite limiting indeed. Thankfully, NumPy provides a built-in workaround to allow arithmetic between arrays with differing sizes.

Broadcasting is the name given to the method that NumPy uses to allow array arithmetic between arrays with a different shape or size.

Although the technique was developed for NumPy, it has also been adopted more broadly in other numerical computational libraries, such as Theano, TensorFlow, and Octave.

Broadcasting solves the problem of arithmetic between arrays of differing shapes by in effect replicating the smaller array along the last mismatched dimension.

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

— Broadcasting, SciPy.org

NumPy does not actually duplicate the smaller array; instead, it makes memory and computationally efficient use of existing structures in memory that in effect achieve the same result.
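
One way to peek at this behavior is *numpy.broadcast_to()*, which returns a read-only view that appears to repeat the smaller array without copying it:

```python
# broadcast_to creates a virtual 3 x 3 view of a 3-element array
from numpy import array, broadcast_to

a = array([1, 2, 3])
B = broadcast_to(a, (3, 3))
print(B)          # three identical "rows", but no extra memory was allocated
print(B.strides)  # the row stride is 0: every row reuses the same data
```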

The concept has also permeated linear algebra notation to simplify the explanation of simple operations.

In the context of deep learning, we also use some less conventional notation. We allow the addition of matrix and a vector, yielding another matrix: C = A + b, where Ci,j = Ai,j + bj. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row before doing the addition. This implicit copying of b to many locations is called broadcasting.

— Page 34, Deep Learning, 2016.

We can make broadcasting concrete by looking at three examples in NumPy.

The examples in this section are not exhaustive, but instead are common to the types of broadcasting you may see or implement.

A single value or scalar can be used in arithmetic with a one-dimensional array.

For example, we can imagine a one-dimensional array “a” with three values [a1, a2, a3] added to a scalar “b”.

```
a = [a1, a2, a3]

b
```

The scalar will need to be broadcast across the one-dimensional array by duplicating the value two more times.

b = [b1, b2, b3]

The two one-dimensional arrays can then be added directly.

```
c = a + b
c = [a1 + b1, a2 + b2, a3 + b3]
```

The example below demonstrates this in NumPy.

```python
# scalar and one-dimensional
from numpy import array
a = array([1, 2, 3])
print(a)
b = 2
print(b)
c = a + b
print(c)
```

Running the example first prints the defined one-dimensional array, then the scalar, followed by the result where the scalar is added to each value in the array.

```
[1 2 3]
2
[3 4 5]
```

A scalar value can be used in arithmetic with a two-dimensional array.

For example, we can imagine a two-dimensional array “A” with 2 rows and 3 columns added to the scalar “b”.

```
     a11, a12, a13
A = (a21, a22, a23)

b
```

The scalar will need to be broadcast across each row of the two-dimensional array by duplicating it 5 more times.

```
     b11, b12, b13
B = (b21, b22, b23)
```

The two two-dimensional arrays can then be added directly.

```
C = A + B

     a11 + b11, a12 + b12, a13 + b13
C = (a21 + b21, a22 + b22, a23 + b23)
```

The example below demonstrates this in NumPy.

```python
# scalar and two-dimensional
from numpy import array
A = array([[1, 2, 3], [1, 2, 3]])
print(A)
b = 2
print(b)
C = A + b
print(C)
```

Running the example first prints the defined two-dimensional array, then the scalar, then the result of the addition with the value “2” added to each value in the array.

```
[[1 2 3]
 [1 2 3]]
2
[[3 4 5]
 [3 4 5]]
```

A one-dimensional array can be used in arithmetic with a two-dimensional array.

For example, we can imagine a two-dimensional array “A” with 2 rows and 3 columns added to a one-dimensional array “b” with 3 values.

```
     a11, a12, a13
A = (a21, a22, a23)

b = (b1, b2, b3)
```

The one-dimensional array is broadcast across each row of the two-dimensional array by creating a second copy to result in a new two-dimensional array “B”.

```
     b11, b12, b13
B = (b21, b22, b23)
```

The two two-dimensional arrays can then be added directly.

```
C = A + B

     a11 + b11, a12 + b12, a13 + b13
C = (a21 + b21, a22 + b22, a23 + b23)
```

Below is a worked example in NumPy.

```python
# one-dimensional and two-dimensional
from numpy import array
A = array([[1, 2, 3], [1, 2, 3]])
print(A)
b = array([1, 2, 3])
print(b)
C = A + b
print(C)
```

Running the example first prints the defined two-dimensional array, then the defined one-dimensional array, followed by the result C where in effect each value in the two-dimensional array is doubled.

```
[[1 2 3]
 [1 2 3]]
[1 2 3]
[[2 4 6]
 [2 4 6]]
```

Broadcasting is a handy shortcut that proves very useful in practice when working with NumPy arrays.

That being said, it does not work for all cases, and in fact imposes a strict rule that must be satisfied for broadcasting to be performed.

Arithmetic, including broadcasting, can only be performed when the shape of each dimension in the arrays are equal or one has the dimension size of 1. The dimensions are considered in reverse order, starting with the trailing dimension; for example, looking at columns before rows in a two-dimensional case.

This makes more sense when we consider that NumPy in effect pads missing dimensions with a size of “1” when comparing arrays.

Therefore, consider the comparison between a two-dimensional array “A” with 2 rows and 3 columns and a vector “b” with 3 elements:

```
A.shape = (2 x 3)
b.shape = (3)
```

In effect, this becomes a comparison between:

```
A.shape = (2 x 3)
b.shape = (1 x 3)
```

This same notion applies to the comparison between a scalar that is treated as an array with the required number of dimensions:

```
A.shape = (2 x 3)
b.shape = (1)
```

This becomes a comparison between:

```
A.shape = (2 x 3)
b.shape = (1 x 1)
```

When the comparison fails, the broadcast cannot be performed, and an error is raised.
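
The rule is simple enough to sketch as a small function; note that *broadcastable()* below is a hypothetical helper for illustration, not part of NumPy:

```python
# check NumPy's broadcasting rule: compare dimensions from the trailing
# end; each pair must be equal, or one of the pair must be 1
def broadcastable(shape_a, shape_b):
    for m, n in zip(reversed(shape_a), reversed(shape_b)):
        if m != n and m != 1 and n != 1:
            return False
    return True

print(broadcastable((2, 3), (3,)))  # True
print(broadcastable((2, 3), (2,)))  # False: trailing dimensions 3 vs 2
```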

The example below attempts to broadcast a two-element array to a 2 x 3 array. This comparison is in effect:

```
A.shape = (2 x 3)
b.shape = (1 x 2)
```

We can see that the last dimensions (columns) do not match and we would expect the broadcast to fail.

The example below demonstrates this in NumPy.

```python
# broadcasting error
from numpy import array
A = array([[1, 2, 3], [1, 2, 3]])
print(A.shape)
b = array([1, 2])
print(b.shape)
C = A + b
print(C)
```

Running the example first prints the shapes of the arrays then raises an error when attempting to broadcast, as we expected.

```
(2, 3)
(2,)
ValueError: operands could not be broadcast together with shapes (2,3) (2,)
```

This section lists some ideas for extending the tutorial that you may wish to explore.

- Create three new and different examples of broadcasting with NumPy arrays.
- Implement your own broadcasting function for manually broadcasting in one and two-dimensional cases.
- Benchmark NumPy broadcasting and your own custom broadcasting functions with one and two dimensional cases with very large arrays.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 2, Deep Learning, 2016.

- Broadcasting, NumPy API, SciPy.org
- Broadcasting semantics in TensorFlow
- Array Broadcasting in numpy, EricsBroadcastingDoc
- Broadcasting, Theano
- Broadcasting arrays in Numpy, 2015.
- Broadcasting in Octave

In this tutorial, you discovered the concept of array broadcasting and how to implement it in NumPy.

Specifically, you learned:

- The problem of arithmetic with arrays with different sizes.
- The solution of broadcasting and common examples in one and two dimensions.
- The rule of array broadcasting and when broadcasting fails.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

**10 Examples of Linear Algebra in Machine Learning**

Linear algebra is a key foundation of the field of machine learning, from the notations used to describe the operation of algorithms to the implementation of algorithms in code.

Although linear algebra is integral to the field of machine learning, the tight relationship is often left unexplained or explained using abstract concepts such as vector spaces or specific matrix operations.

In this post, you will discover 10 common examples of machine learning that you may be familiar with that use, require and are really best understood using linear algebra.

After reading this post, you will know:

- The use of linear algebra structures when working with data, such as tabular datasets and images.
- Linear algebra concepts when working with data preparation, such as one hot encoding and dimensionality reduction.
- The ingrained use of linear algebra notation and methods in sub-fields such as deep learning, natural language processing, and recommender systems.

Let’s get started.

In this post, we will review 10 obvious and concrete examples of linear algebra in machine learning.

I tried to pick examples that you may be familiar with or have even worked with before. They are:

- Dataset and Data Files
- Images and Photographs
- One-Hot Encoding
- Linear Regression
- Regularization
- Principal Component Analysis
- Singular-Value Decomposition
- Latent Semantic Analysis
- Recommender Systems
- Deep Learning

Do you have your own favorite example of linear algebra in machine learning?

Let me know in the comments below.

In machine learning, you fit a model on a dataset.

This is the table-like set of numbers where each row represents an observation and each column represents a feature of the observation.

For example, below is a snippet of the Iris flowers dataset:

```
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
```

This data is in fact a matrix: a key data structure in linear algebra.

Further, when you split the data into inputs and outputs to fit a supervised machine learning model, such as the measurements and the flower species, you have a matrix (X) and a vector (y). The vector is another key data structure in linear algebra.

Each row has the same length, i.e. the same number of columns; therefore, we can say that the data is vectorized: rows can be provided to a model one at a time or in a batch, and the model can be pre-configured to expect rows of a fixed width.

Perhaps you are more used to working with images or photographs in computer vision applications.

Each image that you work with is itself a table structure with a width and height and one pixel value in each cell for black and white images or 3 pixel values in each cell for a color image.

A photo is yet another example of a matrix from linear algebra.
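
A minimal sketch of this, assuming the Pillow library is installed and a file named 'photo.png' (a hypothetical path) exists on disk:

```python
# a photograph loaded as a NumPy array: one matrix per color channel
from numpy import asarray
from PIL import Image

image = asarray(Image.open('photo.png'))
print(image.shape)  # e.g. (height, width, 3) for an RGB image
```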

Operations on the image, such as cropping, scaling, shearing, and so on are all described using the notation and operations of linear algebra.

Sometimes you work with categorical data in machine learning.

Perhaps the class labels for classification problems, or perhaps categorical input variables.

It is common to encode categorical variables to make them easier to work with and learn by some techniques. A popular encoding for categorical variables is the one hot encoding.

A one hot encoding is where a table is created to represent the variable with one column for each category and a row for each example in the dataset. A check, or one-value, is added in the column for the categorical value for a given row, and a zero-value is added to all other columns.

For example, the color variable with the 3 rows:

```
red
green
blue
...
```

Might be encoded as:

```
red, green, blue
1, 0, 0
0, 1, 0
0, 0, 1
...
```

Each row is encoded as a binary vector, a vector with zero or one values and this is an example of a sparse representation, a whole sub-field of linear algebra.
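
A minimal sketch of the encoding in plain NumPy (in practice, scikit-learn's OneHotEncoder does the same job):

```python
# one hot encode a small categorical variable
from numpy import array, unique, zeros

colors = array(['red', 'green', 'blue'])
categories, index = unique(colors, return_inverse=True)
onehot = zeros((colors.size, categories.size))
onehot[range(colors.size), index] = 1  # one 1 per row, in the category's column
print(categories)  # note the columns are ordered alphabetically
print(onehot)
```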

Linear regression is an old method from statistics for describing the relationships between variables.

It is often used in machine learning for predicting numerical values in simpler regression problems.

There are many ways to describe and solve the linear regression problem, i.e. finding a set of coefficients that when multiplied by each of the input variables and added together results in the best prediction of the output variable.

If you have used a machine learning tool or library, the most common way of solving linear regression is via a least squares optimization that is solved using matrix factorization methods from linear algebra, such as the LU decomposition or the singular-value decomposition (SVD).

Even the common way of summarizing the linear regression equation uses linear algebra notation:

y = A . b

Where y is the output variable, A is the dataset, and b is the vector of model coefficients.
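
A minimal sketch using NumPy's SVD-based least squares solver on an invented toy dataset:

```python
# solve a linear regression: find b minimizing ||A . b - y||
from numpy import array
from numpy.linalg import lstsq

A = array([[1, 0], [1, 1], [1, 2]])  # ones column models the intercept
y = array([1.0, 3.0, 5.0])
b, residuals, rank, singular_values = lstsq(A, y, rcond=None)
print(b)  # [1. 2.]: intercept 1 and slope 2
```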

In applied machine learning, we often seek the simplest possible models that achieve the best skill on our problem.

Simpler models are often better at generalizing from specific examples to unseen data.

In many methods that involve coefficients, such as regression methods and artificial neural networks, simpler models are often characterized by models that have smaller coefficient values.

A technique that is often used to encourage a model to minimize the size of coefficients while it is being fit on data is called regularization. Common implementations include the L2 and L1 forms of regularization.

Both of these forms of regularization are in fact a measure of the magnitude or length of the coefficients as a vector and are methods lifted directly from linear algebra called the vector norm.
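
A minimal sketch of the two vector norms applied to an invented coefficient vector:

```python
# the L1 and L2 norms of a coefficient vector, the quantities penalized
# by lasso (L1) and ridge (L2) regularization respectively
from numpy import array
from numpy.linalg import norm

b = array([0.5, -1.2, 0.0, 3.1])
print(norm(b, 1))  # L1 norm: the sum of the absolute values
print(norm(b, 2))  # L2 norm: the square root of the sum of squared values
```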

Often, a dataset has many columns, perhaps tens, hundreds, thousands, or more.

Modeling data with many features is challenging, and models built from data that include irrelevant features are often less skillful than models trained from the most relevant data.

It is hard to know which features of the data are relevant and which are not.

Methods for automatically reducing the number of columns of a dataset are called dimensionality reduction, and perhaps the most popular method is called the principal component analysis, or PCA for short.

This method is used in machine learning to create projections of high-dimensional data for both visualization and for training models.

The core of the PCA method is a matrix factorization method from linear algebra. The eigendecomposition can be used and more robust implementations may use the singular-value decomposition, or SVD.
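
A minimal sketch of such a projection with scikit-learn's PCA class, which uses the SVD internally:

```python
# project a tiny 2-feature dataset onto its first principal component
from numpy import array
from sklearn.decomposition import PCA

X = array([[1, 2], [3, 4], [5, 6], [7, 8]])
pca = PCA(n_components=1)
print(pca.fit_transform(X))  # each row reduced to a single component
```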

Another popular dimensionality reduction method is the singular-value decomposition method, or SVD for short.

As mentioned, and as the name of the method suggests, it is a matrix factorization method from the field of linear algebra.

It has wide use in linear algebra and can be used directly in applications such as feature selection, visualization, noise reduction, and more.
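
A minimal sketch of the factorization itself in NumPy:

```python
# factorize a small matrix into its singular value decomposition
from numpy import array
from numpy.linalg import svd

A = array([[1, 2], [3, 4], [5, 6]])
U, s, VT = svd(A)
print(U.shape, s.shape, VT.shape)  # (3, 3) (2,) (2, 2)
```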

We will see two more cases below of using the SVD in machine learning.

In the sub-field of machine learning for working with text data called natural language processing, it is common to represent documents as large matrices of word occurrences.

For example, the columns of the matrix may be the known words in the vocabulary and rows may be sentences, paragraphs, pages, or documents of text with cells in the matrix marked as the count or frequency of the number of times the word occurred.

This is a sparse matrix representation of the text. Matrix factorization methods, such as the singular-value decomposition, can be applied to this sparse matrix, which has the effect of distilling the representation down to its most relevant essence. Documents processed in this way are much easier to compare, query, and use as the basis for a supervised machine learning model.

This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also known by the name Latent Semantic Indexing, or LSI.

Predictive modeling problems that involve the recommendation of products are called recommender systems, a sub-field of machine learning.

Examples include the recommendation of books based on previous purchases and purchases by customers like you on Amazon, and the recommendation of movies and TV shows to watch based on your viewing history and viewing history of subscribers like you on Netflix.

The development of recommender systems is primarily concerned with linear algebra methods. A simple example is in the calculation of the similarity between sparse customer behavior vectors using distance measures such as Euclidean distance or dot products.
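
A minimal sketch of such a similarity calculation between two invented customer behavior vectors:

```python
# cosine similarity between two users' sparse viewing vectors
from numpy import array, dot
from numpy.linalg import norm

user_a = array([1, 0, 1, 0, 1])  # which of 5 items each user interacted with
user_b = array([1, 0, 0, 0, 1])
print(dot(user_a, user_b) / (norm(user_a) * norm(user_b)))  # approx. 0.816
```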

Matrix factorization methods like the singular-value decomposition are used widely in recommender systems to distill item and user data to their essence for querying and searching and comparison.

Artificial neural networks are nonlinear machine learning algorithms that are inspired by elements of the information processing in the brain and have proven effective at a range of problems, not the least of which is predictive modeling.

Deep learning is the recent resurgence in the use of artificial neural networks with newer methods and faster hardware that allow for the development and training of larger and deeper (more layers) networks on very large datasets. Deep learning methods are routinely achieving state-of-the-art results on a range of challenging problems such as machine translation, photo captioning, speech recognition, and much more.

At their core, the execution of neural networks involves linear algebra data structures multiplied and added together. Scaled up to multiple dimensions, deep learning methods work with vectors, matrices, and even tensors of inputs and coefficients, where a tensor is an array with more than two dimensions.
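
A minimal sketch of that core computation for a single dense layer (the names and sizes are invented):

```python
# one layer of a neural network: matrix product, broadcast bias, nonlinearity
import numpy as np

X = np.random.rand(4, 3)           # a batch of 4 samples with 3 features
W = np.random.rand(3, 2)           # weights mapping 3 inputs to 2 units
b = np.zeros(2)                    # bias vector, broadcast across the batch
print(np.maximum(0.0, X @ W + b))  # ReLU activation of the layer output
```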

Linear algebra is central to deep learning, from the description of methods via matrix notation to their implementation in libraries such as Google’s TensorFlow, a Python library with the word “tensor” in its name.

In this post, you discovered 10 common examples of machine learning that you may be familiar with that use and require linear algebra.

Specifically, you learned:

- The use of linear algebra structures when working with data such as tabular datasets and images.
- Linear algebra concepts when working with data preparation such as one hot encoding and dimensionality reduction.
- The ingrained use of linear algebra notation and methods in sub-fields such as deep learning, natural language processing, and recommender systems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 10 Examples of Linear Algebra in Machine Learning appeared first on Machine Learning Mastery.

]]>The post No Bullshit Guide To Linear Algebra Review appeared first on Machine Learning Mastery.

]]>Most are textbooks targeted at undergraduate students and are full of theoretical digressions that are barely relevant and mostly distracting to a beginner or practitioner in the field.

In this post, you will discover the book “No bullshit guide to linear algebra” that provides a gentle introduction to the field of linear algebra and assumes no prior mathematical knowledge.

After reading this post, you will know:

- About the goals and benefits of the book to a beginner or practitioner.
- The contents of the book and general topics presented in each chapter.
- A selected reading list targeted for machine learning practitioners looking to get up to speed fast.

Let’s get started.

The book provides an introduction to linear algebra, comparable to an undergraduate university course on the subject.

The key approach of the book is no crap and straight to the point. This means a laser focus on a given operation or technique and no (or few) detours or digressions.

The book was written by Ivan Savov, the second edition of which was released in 2017. Ivan has an undergraduate degree in electrical engineering and a master’s degree and Ph.D. in physics, and has worked for the last 15 years as a private tutor for math and physics. He knows the subject and where students encounter difficulties.

What makes this an excellent book for the machine learning practitioner is that the book is self-contained. It does not assume any prior mathematics background and all prerequisite math, which is minimal, is covered in the first chapter titled “*Math fundamentals*.”

It is the perfect book if you have never studied linear algebra, or if you studied it in school decades ago and have forgotten practically everything.

Another aspect that makes this book great for machine learning practitioners is that it includes exercises.

Each section ends with a few pop-quiz style questions.

Each chapter ends with a problem set for you to work through.

Finally, Appendix A provides answers to all exercises in the book.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

This section provides a summary of the table of contents of the book.

- **Math fundamentals**. Covers the prerequisite math topics required to start learning linear algebra. Topics include numbers, functions, trigonometry, complex numbers, and set notation.
- **Intro to linear algebra**. An introduction to vector and matrix algebra, the very foundation of linear algebra. Topics include vector and matrix operations and linearity.
- **Computational linear algebra**. This chapter covers the issues that you will encounter when you start to implement linear algebra and must deal with the operations at any kind of scale. Topics include matrix equations, matrix multiplication, and determinants. Some Python examples are given.
- **Geometric aspects of linear algebra**. Covers the geometric intuition for vector algebra, which is quite common. Topics include lines and planes, projections, and vector spaces.
- **Linear transformations**. Covers the core fiber of linear algebra, as Ivan describes it. Introduces linear transformations.
- **Theoretical linear algebra**. Covers the last steps of matrix algebra prior to applications. Covers topics such as matrix factorization methods, types of matrices, and more.
- **Applications**. This chapter covers an impressive list of applications of linear algebra to a range of domains, from electronics and graphs to computer graphics and more. An impressive chapter that makes the methods learned throughout the book concrete.
- **Probability theory**. Provides a crash course on probability theory in the context of linear algebra, including Markov chains and the PageRank algorithm.
- **Quantum mechanics**. Provides a crash course in quantum mechanics through the lens of linear algebra, a specialty area of the author.

The book is excellent, and I recommend reading it from cover-to-cover, if you’re really into it.

But, as a machine learning practitioner, you do not need to read it all.

Below is a list of selected reading from the book that I recommend to get on top of linear algebra fast:

- **Concept Maps**. Page v. A collection of mind-map type diagrams, provided directly after the table of contents, that show how the concepts in the book, and, in fact, the concepts in the field of linear algebra, relate. If you are a visual thinker, these may help fit the pieces together.
- Section 1.15, **Vectors**. Page 69. Provides a terse introduction to vectors, prior to any vector algebra. Useful background.
- Chapter 2, **Intro to Linear Algebra**. Pages 101-130. Read this whole chapter. It covers:
  - Definitions of terms in linear algebra.
  - Vector operations such as arithmetic and vector norms.
  - Matrix operations such as arithmetic and the dot product.
  - Linearity and what exactly this key concept means in linear algebra.
  - An overview of how the different aspects of linear algebra (geometric, theoretical, etc.) relate.
- Section 3.2, **Matrix Equations**. Page 147. Includes explanations and clear diagrams for calculating matrix operations, not least the must-know matrix multiplication.
- Section 6.1, **Eigenvalues and eigenvectors**. Page 262. Provides an introduction to the eigendecomposition that is used as a key operation in methods such as the principal component analysis.
- Section 6.2, **Special types of matrices**. Page 275. Provides an introduction to various types of matrices such as diagonal, symmetric, orthogonal, and more.
- Section 6.6, **Matrix Decompositions**. Page 295. An introduction to matrix factorization methods, re-covering the eigendecomposition, but also covering the LU, QR, and singular-value decompositions.
- Section 7.7, **Least squares approximate solutions**. Page 241. An introduction to the matrix formulation of least squares, called linear least squares.
- Appendix B, **Notation**. A summary of math and linear algebra notation.

This section provides more resources on the topic if you are looking to go deeper.

- No Bullshit Guide To Linear Algebra on Amazon
- Mini Reference Publisher Homepage
- Ivan Savov on Twitter
- Linear algebra explained in four pages, 2013.

In this post, you discovered the book “No Bullshit Guide To Linear Algebra” that provides a gentle introduction to the field of linear algebra and assumes no prior mathematical knowledge.

Specifically, you learned:

- About the goals and benefits of the book to a beginner or practitioner.
- The contents of the book and general topics presented in each chapter.
- A selected reading list targeted for machine learning practitioners looking to get up to speed fast.

Have you read this book? What did you think?

Let me know in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post No Bullshit Guide To Linear Algebra Review appeared first on Machine Learning Mastery.

]]>The post How to Solve Linear Regression Using Linear Algebra appeared first on Machine Learning Mastery.

]]>It is a staple of statistics and is often considered a good introductory machine learning method. It is also a method that can be reformulated using matrix notation and solved using matrix operations.

In this tutorial, you will discover the matrix formulation of linear regression and how to solve it using direct and matrix factorization methods.

After completing this tutorial, you will know:

- Linear regression and the matrix reformulation with the normal equations.
- How to solve linear regression using a QR matrix decomposition.
- How to solve linear regression using SVD and the pseudoinverse.

Let’s get started.

This tutorial is divided into 6 parts; they are:

- Linear Regression
- Matrix Formulation of Linear Regression
- Linear Regression Dataset
- Solve Directly
- Solve via QR Decomposition
- Solve via Singular-Value Decomposition

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Linear regression is a method for modeling the relationship between two scalar values: the input variable x and the output variable y.

The model assumes that y is a linear function or a weighted sum of the input variable.

y = f(x)

Or, stated with the coefficients.

y = b0 + b1 . x1

The model can also be used to model an output variable given multiple input variables, in which case it is commonly called multiple linear regression (below, brackets were added for readability).

y = b0 + (b1 . x1) + (b2 . x2) + ...

The objective of creating a linear regression model is to find the coefficient values (b) that minimize the error in the prediction of the output variable y.

Matrix Formulation of Linear Regression

Linear regression can be stated using matrix notation; for example:

y = X . b

Or, without the dot notation.

y = Xb

Where X is the input data and each column is a data feature, b is a vector of coefficients and y is a vector of output variables for each row in X.

```
     x11, x12, x13
X = (x21, x22, x23)
     x31, x32, x33
     x41, x42, x43

     b1
b = (b2)
     b3

     y1
y = (y2)
     y3
     y4
```

Reformulated, the problem becomes a system of linear equations where the b vector values are unknown. This type of system is referred to as overdetermined because there are more equations than there are unknowns, i.e. each coefficient is used on each row of data.

It is a challenging problem to solve analytically because the system is typically inconsistent, i.e. no single set of coefficient values satisfies every equation exactly. Further, all solutions will have some error because there is no line that passes exactly through all of the points, so the approach to solving the equations must be able to handle that.

The way this is typically achieved is by finding a solution where the values for b in the model minimize the squared error. This is called linear least squares.

||X . b - y||^2 = sum from i=1 to m ( (sum from j=1 to n Xij . bj) - yi )^2

This formulation has a unique solution as long as the input columns are linearly independent (e.g. no column is a perfect combination of the others).

We cannot always get the error e = b – Ax down to zero. When e is zero, x is an exact solution to Ax = b. When the length of e is as small as possible, xhat is a least squares solution.

— Page 219, Introduction to Linear Algebra, Fifth Edition, 2016.

In matrix notation, this problem is formulated using the so-named normal equation:

X^T . X . b = X^T . y

This can be re-arranged in order to specify the solution for b as:

b = (X^T . X)^-1 . X^T . y

This can be solved directly, although the presence of the matrix inverse can make the calculation numerically challenging or unstable.

In order to explore the matrix formulation of linear regression, let’s first define a dataset as a context.

We will use a simple 2D dataset where the data is easy to visualize as a scatter plot and models are easy to visualize as a line that attempts to fit the data points.

The example below defines a 5×2 matrix dataset, splits it into X and y components, and plots the dataset as a scatter plot.

```python
from numpy import array
from matplotlib import pyplot

data = array([
    [0.05, 0.12],
    [0.18, 0.22],
    [0.31, 0.35],
    [0.42, 0.38],
    [0.5, 0.49]])
print(data)
X, y = data[:, 0], data[:, 1]
X = X.reshape((len(X), 1))
# plot dataset
pyplot.scatter(X, y)
pyplot.show()
```

Running the example first prints the defined dataset.

```
[[ 0.05  0.12]
 [ 0.18  0.22]
 [ 0.31  0.35]
 [ 0.42  0.38]
 [ 0.5   0.49]]
```

A scatter plot of the dataset is then created showing that a straight line cannot fit this data exactly.

The first approach is to attempt to solve the regression problem directly.

That is, given X, what is the set of coefficients b that, when multiplied by X, will give y? As we saw in a previous section, the normal equations define how to calculate b directly.

b = (X^T . X)^-1 . X^T . y

This can be calculated directly in NumPy using the inv() function for calculating the matrix inverse.

b = inv(X.T.dot(X)).dot(X.T).dot(y)

Once the coefficients are calculated, we can use them to predict outcomes given X.

yhat = X.dot(b)

Putting this together with the dataset defined in the previous section, the complete example is listed below.

```python
# solve directly
from numpy import array
from numpy.linalg import inv
from matplotlib import pyplot

data = array([
    [0.05, 0.12],
    [0.18, 0.22],
    [0.31, 0.35],
    [0.42, 0.38],
    [0.5, 0.49]])
X, y = data[:, 0], data[:, 1]
X = X.reshape((len(X), 1))
# linear least squares
b = inv(X.T.dot(X)).dot(X.T).dot(y)
print(b)
# predict using coefficients
yhat = X.dot(b)
# plot data and predictions
pyplot.scatter(X, y)
pyplot.plot(X, yhat, color='red')
pyplot.show()
```

Running the example performs the calculation and prints the coefficient vector b.

[ 1.00233226]

A scatter plot of the dataset is then created with a line plot for the model, showing a reasonable fit to the data.

A problem with this approach is the matrix inverse that is both computationally expensive and numerically unstable. An alternative approach is to use a matrix decomposition to avoid this operation. We will look at two examples in the following sections.

The QR decomposition is an approach of breaking a matrix down into its constituent elements.

A = Q . R

Where A is the matrix that we wish to decompose, Q is an m x m matrix, and R is an upper triangular matrix with the size m x n.

The QR decomposition is a popular approach for solving the linear least squares equation.

Stepping over all of the derivation, the coefficients can be found using the Q and R elements as follows:

b = R^-1 . Q^T . y

The approach still involves a matrix inversion, but in this case only on the simpler R matrix.

The QR decomposition can be found using the qr() function in NumPy. The calculation of the coefficients in NumPy looks as follows:

```python
# QR decomposition
Q, R = qr(X)
b = inv(R).dot(Q.T).dot(y)
```

Tying this together with the dataset, the complete example is listed below.

```python
# least squares via QR decomposition
from numpy import array
from numpy.linalg import inv
from numpy.linalg import qr
from matplotlib import pyplot

data = array([
    [0.05, 0.12],
    [0.18, 0.22],
    [0.31, 0.35],
    [0.42, 0.38],
    [0.5, 0.49]])
X, y = data[:, 0], data[:, 1]
X = X.reshape((len(X), 1))
# QR decomposition
Q, R = qr(X)
b = inv(R).dot(Q.T).dot(y)
print(b)
# predict using coefficients
yhat = X.dot(b)
# plot data and predictions
pyplot.scatter(X, y)
pyplot.plot(X, yhat, color='red')
pyplot.show()
```

Running the example first prints the coefficient solution and plots the data with the model.

[ 1.00233226]

The QR decomposition approach is more computationally efficient and more numerically stable than calculating the normal equation directly, but does not work for all data matrices.

The Singular-Value Decomposition, or SVD for short, is a matrix decomposition method like the QR decomposition.

X = U . Sigma . V^*

Where X is the real m x n matrix that we wish to decompose, U is an m x m matrix, Sigma is an m x n diagonal matrix (often represented by the uppercase Greek letter sigma), and V^* is the conjugate transpose of an n x n matrix V, where * is a superscript.

Unlike the QR decomposition, all matrices have a singular-value decomposition. As a basis for solving the system of linear equations for linear regression, SVD is more stable and the preferred approach.

Once decomposed, the coefficients can be found by calculating the pseudoinverse of the input matrix X and multiplying that by the output vector y.

b = X^+ . y

Where the pseudoinverse is calculated as following:

X^+ = V . D^+ . U^T

Where X^+ is the pseudoinverse of X and the + is a superscript, D^+ is the pseudoinverse of the diagonal matrix Sigma, V is the matrix whose conjugate transpose is V^*, and U^T is the transpose of U.

Matrix inversion is not defined for matrices that are not square. […] When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions.

— Page 46, Deep Learning, 2016.

We can get U and V from the SVD operation. D^+ can be calculated by creating a diagonal matrix from Sigma and calculating the reciprocal of each non-zero element in Sigma.

```
         s11,   0,   0
Sigma = (  0, s22,   0)
           0,   0, s33

        1/s11,     0,     0
D^+ = (     0, 1/s22,     0)
            0,     0, 1/s33
```

We can calculate the SVD, then the pseudoinverse manually. Instead, NumPy provides the function pinv() that we can use directly.
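For completeness, here is a minimal sketch of the manual route, assuming the same small 5x1 dataset used throughout this tutorial; it reproduces what pinv() does for us:

```python
from numpy import array, diag, zeros
from numpy.linalg import svd

X = array([[0.05], [0.18], [0.31], [0.42], [0.5]])
y = array([0.12, 0.22, 0.35, 0.38, 0.49])
# decompose X
U, s, VT = svd(X)
# D^+ is n x m: reciprocate the non-zero singular values
D = zeros((X.shape[1], X.shape[0]))
D[:len(s), :len(s)] = diag(1.0 / s)
# X^+ = V . D^+ . U^T
X_pinv = VT.T.dot(D).dot(U.T)
b = X_pinv.dot(y)
print(b)
```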

The complete example is listed below.

```python
# least squares via SVD with pseudoinverse
from numpy import array
from numpy.linalg import pinv
from matplotlib import pyplot

data = array([
    [0.05, 0.12],
    [0.18, 0.22],
    [0.31, 0.35],
    [0.42, 0.38],
    [0.5, 0.49]])
X, y = data[:, 0], data[:, 1]
X = X.reshape((len(X), 1))
# calculate coefficients
b = pinv(X).dot(y)
print(b)
# predict using coefficients
yhat = X.dot(b)
# plot data and predictions
pyplot.scatter(X, y)
pyplot.plot(X, yhat, color='red')
pyplot.show()
```

Running the example prints the coefficient and plots the data with a red line showing the predictions from the model.

[ 1.00233226]

In fact, NumPy provides the lstsq() function, which performs these steps for you and can be used directly.
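A minimal sketch of using it on the same dataset might look as follows (the rcond=None argument assumes a reasonably recent version of NumPy):

```python
from numpy import array
from numpy.linalg import lstsq

data = array([
    [0.05, 0.12],
    [0.18, 0.22],
    [0.31, 0.35],
    [0.42, 0.38],
    [0.5, 0.49]])
X, y = data[:, 0], data[:, 1]
X = X.reshape((len(X), 1))
# solve the linear least squares problem directly
b, residuals, rank, singular_values = lstsq(X, y, rcond=None)
print(b)
```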

This section lists some ideas for extending the tutorial that you may wish to explore.

- Implement linear regression using the built-in lstsq() NumPy function
- Test each linear regression on your own small contrived dataset.
- Load a tabular dataset and test each linear regression method and compare the results.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Section 7.7 Least squares approximate solutions. No Bullshit Guide To Linear Algebra, 2017.
- Section 4.3 Least Squares Approximations, Introduction to Linear Algebra, Fifth Edition, 2016.
- Lecture 11, Least Squares Problems, Numerical Linear Algebra, 1997.
- Chapter 5, Orthogonalization and Least Squares, Matrix Computations, 2012.
- Chapter 12, Singular-Value and Jordan Decompositions, Linear Algebra and Matrix Analysis for Statistics, 2014.
- Section 2.9 The Moore-Penrose Pseudoinverse, Deep Learning, 2016.
- Section 15.4 General Linear Least Squares, Numerical Recipes: The Art of Scientific Computing, Third Edition, 2007.

- numpy.linalg.inv() API
- numpy.linalg.qr() API
- numpy.linalg.svd() API
- numpy.diag() API
- numpy.linalg.pinv() API
- numpy.linalg.lstsq() API

- Linear regression on Wikipedia
- Least squares on Wikipedia
- Linear least squares (mathematics) on Wikipedia
- Overdetermined system on Wikipedia
- QR decomposition on Wikipedia
- Singular-value decomposition on Wikipedia
- Moore–Penrose inverse

In this tutorial, you discovered the matrix formulation of linear regression and how to solve it using direct and matrix factorization methods.

Specifically, you learned:

- Linear regression and the matrix reformulation with the normal equations.
- How to solve linear regression using a QR matrix decomposition.
- How to solve linear regression using SVD and the pseudoinverse.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Solve Linear Regression Using Linear Algebra appeared first on Machine Learning Mastery.

]]>The post How to Calculate the Principal Component Analysis from Scratch in Python appeared first on Machine Learning Mastery.

]]>It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.

In this tutorial, you will discover the Principal Component Analysis machine learning method for dimensionality reduction and how to implement it from scratch in Python.

After completing this tutorial, you will know:

- The procedure for calculating the Principal Component Analysis and how to choose principal components.
- How to calculate the Principal Component Analysis from scratch in NumPy.
- How to calculate the Principal Component Analysis for reuse on more data in scikit-learn.

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Principal Component Analysis
- Manually Calculate Principal Component Analysis
- Reusable Principal Component Analysis

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of data.

It can be thought of as a projection method where data with m-columns (features) is projected into a subspace with m or fewer columns, whilst retaining the essence of the original data.

The PCA method can be described and implemented using the tools of linear algebra.

PCA is an operation applied to a dataset, represented by an n x m matrix A that results in a projection of A which we will call B. Let’s walk through the steps of this operation.

```
     a11, a12
A = (a21, a22)
     a31, a32

B = PCA(A)
```

The first step is to calculate the mean values of each column.

M = mean(A)

or

```
M = (m11, m12)

m11 = (a11 + a21 + a31) / 3
m12 = (a12 + a22 + a32) / 3
```

Next, we need to center the values in each column by subtracting the mean column value.

C = A - M

The next step is to calculate the covariance matrix of the centered matrix C.

Correlation is a normalized measure of the amount and direction (positive or negative) that two columns change together. Covariance is a generalized and unnormalized version of correlation across multiple columns. A covariance matrix is a calculation of covariance of a given matrix with covariance scores for every column with every other column, including itself.

V = cov(C)

Finally, we calculate the eigendecomposition of the covariance matrix V. This results in a list of eigenvalues and a list of eigenvectors.

values, vectors = eig(V)

The eigenvectors represent the directions or components for the reduced subspace of B, whereas the eigenvalues represent the magnitudes for the directions.

The eigenvectors can be sorted by the eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for A.

If all eigenvalues have a similar value, then we know that the existing representation may already be reasonably compressed or dense and that the projection may offer little. If there are eigenvalues close to zero, they represent components or axes of B that may be discarded.

A total of m or fewer components must be selected to comprise the chosen subspace. Ideally, we would select k eigenvectors, called principal components, that have the k largest eigenvalues.

B = select(values, vectors)
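As a sketch of what this select step might look like in NumPy, assuming the covariance method described above and the same contrived 3x2 matrix used later in this tutorial, the components can be ranked by eigenvalue and the top k retained:

```python
from numpy import argsort, array, cov, mean
from numpy.linalg import eig

A = array([[1, 2], [3, 4], [5, 6]])
C = A - mean(A, axis=0)
values, vectors = eig(cov(C.T))
# rank the components by eigenvalue, largest first
order = argsort(values)[::-1]
k = 1
B = vectors[:, order[:k]]   # the chosen principal components
P = B.T.dot(C.T)            # project the centered data
print(P.T)
```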

Other matrix decomposition methods can be used, such as the singular-value decomposition, or SVD; in that case, the values are generally referred to as singular values and the vectors of the subspace are referred to as principal components.

Once chosen, data can be projected into the subspace via matrix multiplication.

P = B^T . A

Where A is the original data that we wish to project, B^T is the transpose of the chosen principal components and P is the projection of A.

This is called the covariance method for calculating the PCA, although there are alternative ways to calculate it.

There is no pca() function in NumPy, but we can easily calculate the Principal Component Analysis step-by-step using NumPy functions.

The example below defines a small 3×2 matrix, centers the data in the matrix, calculates the covariance matrix of the centered data, and then the eigendecomposition of the covariance matrix. The eigenvectors and eigenvalues are taken as the principal components and singular values and used to project the original data.

```python
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig

# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
```

Running the example first prints the original matrix, then the eigenvectors and eigenvalues of the centered covariance matrix, followed finally by the projection of the original matrix.

Interestingly, we can see that only the first eigenvector is required, suggesting that we could project our 3×2 matrix onto a 3×1 matrix with little loss.

```
[[1 2]
 [3 4]
 [5 6]]

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

[ 8.  0.]

[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]
```

We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.

When creating the class, the number of components can be specified as a parameter.

The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function.

Once fit, the singular values and principal components can be accessed on the PCA class via the explained_variance_ and components_ attributes.

The example below demonstrates using this class by first creating an instance, fitting it on a 3×2 matrix, accessing the values and vectors of the projection, and transforming the original data.

```python
# Principal Component Analysis
from numpy import array
from sklearn.decomposition import PCA

# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# create the PCA instance
pca = PCA(2)
# fit on data
pca.fit(A)
# access values and vectors
print(pca.components_)
print(pca.explained_variance_)
# transform data
B = pca.transform(A)
print(B)
```

Running the example first prints the 3×2 data matrix, then the principal components and values, followed by the projection of the original matrix.

We can see that, with some very minor floating-point rounding, we achieve the same principal components, singular values, and projection as in the previous example.

```
[[1 2]
 [3 4]
 [5 6]]

[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]

[  8.00000000e+00   2.25080839e-33]

[[ -2.82842712e+00   2.22044605e-16]
 [  0.00000000e+00   0.00000000e+00]
 [  2.82842712e+00  -2.22044605e-16]]
```

This section lists some ideas for extending the tutorial that you may wish to explore.

- Re-run the examples with your own small contrived matrix values.
- Load a dataset and calculate the PCA on it and compare the results from the two methods.
- Search for and locate 10 examples where PCA has been used in machine learning papers.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Section 7.3 Principal Component Analysis (PCA by the SVD), Introduction to Linear Algebra, Fifth Edition, 2016.
- Section 2.12 Example: Principal Components Analysis, Deep Learning, 2016.

- Principal Component Analysis with numpy, 2011.
- PCA and image compression with numpy, 2011.
- Implementing a Principal Component Analysis (PCA), 2014.

In this tutorial, you discovered the Principal Component Analysis machine learning method for dimensionality reduction.

Specifically, you learned:

- The procedure for calculating the Principal Component Analysis and how to choose principal components.
- How to calculate the Principal Component Analysis from scratch in NumPy.
- How to calculate the Principal Component Analysis for reuse on more data in scikit-learn.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Calculate the Principal Component Analysis from Scratch in Python appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Expected Value, Variance, and Covariance with NumPy appeared first on Machine Learning Mastery.

]]>They are also the tools that provide the foundation for more advanced linear algebra operations and machine learning methods, such as the covariance matrix and principal component analysis respectively. As such, it is important to have a strong grip on fundamental statistics in the context of linear algebra notation.

In this tutorial, you will discover how fundamental statistical operations work and how to implement them using NumPy with notation and terminology from linear algebra.

After completing this tutorial, you will know:

- What the expected value, average, and mean are and how to calculate them.
- What the variance and standard deviation are and how to calculate them.
- What the covariance, correlation, and covariance matrix are and how to calculate them.

Let’s get started.

**Updated Mar/2018**: Fixed a small typo in the result for vector variance example. Thanks Bob.

This tutorial is divided into 4 parts; they are:

- Expected Value
- Variance
- Covariance
- Covariance Matrix

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In probability, the average value of some random variable X is called the expected value or the expectation.

The expected value uses the notation E with square brackets around the name of the variable; for example:

E[X]

It is calculated as the probability weighted sum of values that can be drawn.

E[X] = sum(x1 * p1, x2 * p2, x3 * p3, ..., xn * pn)
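As a quick sketch of this probability-weighted sum, with contrived values and probabilities:

```python
from numpy import array

# contrived discrete distribution: values and their probabilities
x = array([1, 2, 3])
p = array([0.2, 0.5, 0.3])
# probability weighted sum of values
Ex = (x * p).sum()
print(Ex)  # 2.1
```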

In simple cases, such as the flipping of a coin or rolling a die, the probability of each event is just as likely. Therefore, the expected value can be calculated as the sum of all values multiplied by the reciprocal of the number of values.

E[X] = sum(x1, x2, x3, ..., xn) . 1/n

In statistics, the mean, or more technically the arithmetic mean or sample mean, can be estimated from a sample of examples drawn from the domain. It is confusing because mean, average, and expected value are used interchangeably.

In the abstract, the mean is denoted by the lower case Greek letter mu and is calculated from the sample of observations, rather than all possible values.

mu = sum(x1, x2, x3, ..., xn) . 1/n

Or, written more compactly:

mu = sum(x . P(x))

Where x is the vector of observations and P(x) is the calculated probability for each value.

When calculated for a specific variable, such as x, the mean is denoted as a lower case variable name with a line above, called x-bar.

x_bar = sum from 1 to n (xi) . 1/n

The arithmetic mean can be calculated for a vector or matrix in NumPy by using the mean() function.

The example below defines a 6-element vector and calculates the mean.

```python
from numpy import array
from numpy import mean

v = array([1, 2, 3, 4, 5, 6])
print(v)
result = mean(v)
print(result)
```

Running the example first prints the defined vector and the mean of the values in the vector.

```
[1 2 3 4 5 6]
3.5
```

The mean function can calculate the row or column means of a matrix by specifying the axis argument and the value 0 or 1 respectively.

The example below defines a 2×6 matrix and calculates both column and row means.

```python
from numpy import array
from numpy import mean

M = array([[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]])
print(M)
col_mean = mean(M, axis=0)
print(col_mean)
row_mean = mean(M, axis=1)
print(row_mean)
```

Running the example first prints the defined matrix, then the calculated column and row mean values.

```
[[1 2 3 4 5 6]
 [1 2 3 4 5 6]]

[ 1.  2.  3.  4.  5.  6.]

[ 3.5  3.5]
```

In probability, the variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean.

The variance is denoted as the function Var() on the variable.

Var[X]

Variance is calculated as the average squared difference of each value in the distribution from the expected value. Or the expected squared difference from the expected value.

Var[X] = E[(X - E[X])^2]

Assuming the expected value of the variable has been calculated (E[X]), the variance of the random variable can be calculated as the sum of the squared difference of each example from the expected value multiplied by the probability of that value.

Var[X] = sum (p(x1) . (x1 - E[X])^2, p(x2) . (x2 - E[X])^2, ..., p(xn) . (xn - E[X])^2)

If the probability of each example in the distribution is equal, variance calculation can drop the individual probabilities and multiply the sum of squared differences by the reciprocal of the number of examples in the distribution.

Var[X] = sum ((x1 - E[X])^2, (x2 - E[X])^2, ...,(xn - E[X])^2) . 1/n

In statistics, the variance can be estimated from a sample of examples drawn from the domain.

In the abstract, the sample variance is denoted by the lower case sigma with a 2 superscript indicating the units are squared, not that you must square the final value. The sum of the squared differences is multiplied by the reciprocal of the number of examples minus 1 to correct for a bias.

sigma^2 = sum from 1 to n ( (xi - mu)^2 ) . 1 / (n - 1)

In NumPy, the variance can be calculated for a vector or a matrix using the var() function. By default, the var() function calculates the population variance. To calculate the sample variance, you must set the ddof argument to the value 1.

The example below defines a 6-element vector and calculates the sample variance.

```python
from numpy import array
from numpy import var

v = array([1, 2, 3, 4, 5, 6])
print(v)
result = var(v, ddof=1)
print(result)
```

Running the example first prints the defined vector and then the calculated sample variance of the values in the vector.

```
[1 2 3 4 5 6]
3.5
```

The var function can calculate the row or column variances of a matrix by specifying the axis argument and the value 0 or 1 respectively, the same as the mean function above.

The example below defines a 2×6 matrix and calculates both column and row sample variances.

```python
from numpy import array
from numpy import var

M = array([[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]])
print(M)
col_var = var(M, ddof=1, axis=0)
print(col_var)
row_var = var(M, ddof=1, axis=1)
print(row_var)
```

Running the example first prints the defined matrix and then the column and row sample variance values.

```
[[1 2 3 4 5 6]
 [1 2 3 4 5 6]]

[ 0.  0.  0.  0.  0.  0.]

[ 3.5  3.5]
```

The standard deviation is calculated as the square root of the variance and is denoted as lowercase “s”.

s = sqrt(sigma^2)

To keep with this notation, sometimes the variance is indicated as s^2, with 2 as a superscript, again showing that the units are squared.

NumPy also provides a function for calculating the standard deviation directly via the std() function. As with the var() function, the ddof argument must be set to 1 to calculate the unbiased sample standard deviation, and column and row standard deviations can be calculated by setting the axis argument to 0 and 1 respectively.

The example below demonstrates how to calculate the sample standard deviation for the rows and columns of a matrix.

```python
from numpy import array
from numpy import std

M = array([[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]])
print(M)
col_std = std(M, ddof=1, axis=0)
print(col_std)
row_std = std(M, ddof=1, axis=1)
print(row_std)
```

Running the example first prints the defined matrix and then the column and row sample standard deviation values.

```
[[1 2 3 4 5 6]
 [1 2 3 4 5 6]]

[ 0.  0.  0.  0.  0.  0.]

[ 1.87082869  1.87082869]
```

In probability, covariance is the measure of the joint probability for two random variables. It describes how the two variables change together.

It is denoted as the function cov(X, Y), where X and Y are the two random variables being considered.

cov(X,Y)

Covariance is calculated as the expected value or average of the product of the differences of each random variable from their expected values, where E[X] is the expected value for X and E[Y] is the expected value of Y.

cov(X, Y) = E[(X - E[X]) . (Y - E[Y])]

Assuming the expected values for X and Y have been calculated, the covariance can be calculated as the sum of the difference of x values from their expected value multiplied by the difference of the y values from their expected values multiplied by the reciprocal of the number of examples in the population.

cov(X, Y) = sum (x - E[X]) * (y - E[Y]) * 1/n

In statistics, the sample covariance can be calculated in the same way, although with a bias correction, the same as with the variance.

cov(X, Y) = sum (x - E[X]) * (y - E[Y]) * 1/(n - 1)
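To make the bias-corrected formula concrete, here is a minimal sketch that computes the sample covariance by hand for two contrived vectors (the same values used in the NumPy example further below):

```python
from numpy import array, mean

x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
n = len(x)
# sum of products of deviations, with the n - 1 bias correction
cov_xy = ((x - mean(x)) * (y - mean(y))).sum() / (n - 1)
print(cov_xy)  # -7.5
```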

The sign of the covariance can be interpreted as whether the two variables increase together (positive) or decrease together (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that the two variables are uncorrelated, although a zero covariance alone does not guarantee that they are independent.

NumPy does not have a function to calculate the covariance between two variables directly. Instead, it has a function for calculating a covariance matrix called cov() that we can use to retrieve the covariance. By default, the cov() function will calculate the unbiased or sample covariance between the provided random variables.

The example below defines two vectors of equal length with one increasing and one decreasing. We would expect the covariance between these variables to be negative.

We access just the covariance for the two variables as the [0,1] element of the square covariance matrix returned.

```python
from numpy import array
from numpy import cov

x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(x)
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
print(y)
Sigma = cov(x, y)[0, 1]
print(Sigma)
```

Running the example first prints the two vectors followed by the covariance for the values in the two vectors. The value is negative, as we expected.

```
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-7.5
```

The covariance can be normalized to a score between -1 and 1 to make the magnitude interpretable by dividing it by the standard deviations of X and Y. The result is called the correlation of the variables, also called the Pearson correlation coefficient, named for the developer of the method.

r = cov(X, Y) / (sX . sY)

Where r is the correlation coefficient of X and Y, cov(X, Y) is the sample covariance of X and Y and sX and sY are the standard deviations of X and Y respectively.
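As a rough sketch of this normalization, reusing the contrived vectors from above together with NumPy's cov() and std():

```python
from numpy import array, cov, std

x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
# sample covariance divided by the product of sample standard deviations
r = cov(x, y)[0, 1] / (std(x, ddof=1) * std(y, ddof=1))
print(r)  # -1.0
```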

NumPy provides the corrcoef() function for calculating the correlation between two variables directly. Like cov(), it returns a matrix, in this case a correlation matrix. As with the results from cov(), we can access just the correlation of interest as the [0,1] element of the returned square matrix.

```python
from numpy import array
from numpy import corrcoef

x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(x)
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
print(y)
corr = corrcoef(x, y)[0, 1]
print(corr)
```

Running the example first prints the two defined vectors followed by the correlation coefficient. We can see that the vectors are maximally negatively correlated as we designed.

```
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-1.0
```

The covariance matrix is a square and symmetric matrix that describes the covariance between two or more random variables.

The diagonal of the covariance matrix contains the variances of each of the random variables.

A covariance matrix is a generalization of the covariance of two variables and captures the way in which all variables in the dataset may change together.

The covariance matrix is denoted as the uppercase Greek letter Sigma. The covariance for each pair of random variables is calculated as above.

Sigma = E[(X - E[X]) . (Y - E[Y])]

Where:

Sigma(ij) = cov(Xi, Xj)

And X is a matrix where each column represents a random variable.

The covariance matrix provides a useful tool for separating the structured relationships in a matrix of random variables. This can be used to decorrelate variables or applied as a transform to other variables. It is a key element used in the Principal Component Analysis data reduction method, or PCA for short.

The covariance matrix can be calculated in NumPy using the cov() function. By default, this function will calculate the sample covariance matrix.

The cov() function can be called with a single matrix containing columns on which to calculate the covariance matrix, or two arrays, such as one for each variable.
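For the single-matrix form, a quick sketch might look like this; the 9x2 layout below is just the same contrived values arranged as columns, and the transpose is needed because cov() treats rows as variables by default:

```python
from numpy import array, cov

# 9 x 2 data matrix: each column is one variable
data = array([
    [1, 9], [2, 8], [3, 7],
    [4, 6], [5, 5], [6, 4],
    [7, 3], [8, 2], [9, 1]])
Sigma = cov(data.T)
print(Sigma)
```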

Below is an example that defines two 9-element vectors and calculates the unbiased covariance matrix from them.

```python
from numpy import array
from numpy import cov

x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(x)
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
print(y)
Sigma = cov(x, y)
print(Sigma)
```

Running the example first prints the two vectors and then the calculated covariance matrix.

The values of the arrays were contrived such that as one variable increases, the other decreases. We would expect to see a negative sign on the covariance for these two variables, and this is what we see in the covariance matrix.

```
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]

[[ 7.5 -7.5]
 [-7.5  7.5]]
```

The covariance matrix is used widely in linear algebra and the intersection of linear algebra and statistics called multivariate analysis. We have only had a small taste in this post.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Explore each example using your own small contrived data.
- Load data from a CSV file and apply each operation to the data columns.
- Write your own functions to implement each statistical operation.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Applied Multivariate Statistical Analysis, 2012.
- Applied Multivariate Statistical Analysis, 2015.
- Chapter 12 Linear Algebra in Probability & Statistics, Introduction to Linear Algebra, Fifth Edition, 2016.
- Chapter 3, Probability and Information Theory, Deep Learning, 2016.

- NumPy Statistics Functions
- numpy.mean() API
- numpy.var() API
- numpy.std() API
- numpy.cov() API
- numpy.corrcoef() API

- Expected value on Wikipedia
- Mean on Wikipedia
- Variance on Wikipedia
- Standard deviation on Wikipedia
- Covariance on Wikipedia
- Sample mean and covariance
- Pearson correlation coefficient
- Covariance matrix on Wikipedia
- Estimation of covariance matrices on Wikipedia

In this tutorial, you discovered how fundamental statistical operations work and how to implement them using NumPy with notation and terminology from linear algebra.

Specifically, you learned:

- What the expected value, average, and mean are and how to calculate them.
- What the variance and standard deviation are and how to calculate them.
- What the covariance, correlation, and covariance matrix are and how to calculate them.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Expected Value, Variance, and Covariance with NumPy appeared first on Machine Learning Mastery.

]]>