The post Basics of Mathematical Notation for Machine Learning appeared first on Machine Learning Mastery.

]]>Often, all it takes is one term or one fragment of notation in an equation to completely derail your understanding of the entire procedure. This can be extremely frustrating, especially for machine learning beginners coming from the world of development.

You can make great progress if you know a few basic areas of mathematical notation and some tricks for working through the description of machine learning methods in papers and books.

In this tutorial, you will discover the basics of mathematical notation that you may come across when reading descriptions of techniques in machine learning.

After completing this tutorial, you will know:

- Notation for arithmetic, including variations of multiplication, exponents, roots, and logarithms.
- Notation for sequences and sets including indexing, summation, and set membership.
- 5 Techniques you can use to get help if you are struggling with mathematical notation.

Let’s get started.

**Update May/2018**: Added images for some notations to make the explanations clearer.

This tutorial is divided into 7 parts; they are:

- The Frustration with Math Notation
- Arithmetic Notation
- Greek Alphabet
- Sequence Notation
- Set Notation
- Other Notation
- Getting More Help

Are there other areas of basic math notation required for machine learning that you think I missed?

Let me know in the comments below.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

You will encounter mathematical notation when reading about machine learning algorithms.

For example, notation may be used to:

- Describe an algorithm.
- Describe data preparation.
- Describe results.
- Describe a test harness.
- Describe implications.

These descriptions may be in research papers, textbooks, blog posts, and elsewhere.

Often the terms are well defined, but there are also mathematical notation norms that you may not be familiar with.

All it takes is one term or one equation that you do not understand and your understanding of the entire method will be lost. I’ve suffered this problem myself many times, and it is incredibly frustrating!

In this tutorial, we will review some basic mathematical notation that will help you when reading descriptions of machine learning methods.

In this section, we will go over some less obvious notations for basic arithmetic as well as a few concepts you may have forgotten since school.

The notation for basic arithmetic is as you would write it. For example:

- Addition: 1 + 1 = 2
- Subtraction: 2 – 1 = 1
- Multiplication: 2 x 2 = 4
- Division: 2 / 2 = 1

Most mathematical operations have a sister operation that performs the inverse operation; for example, subtraction is the inverse of addition and division is the inverse of multiplication.

We often want to describe operations abstractly to separate them from specific data or specific implementations.

For this reason we see heavy use of algebra: that is uppercase and/or lowercase letters or words to represents terms or concepts in mathematical notation. It is also common to use letters from the Greek alphabet.

Each sub-field of math may have reserved letters: that is terms or letters that always mean the same thing. Nevertheless, algebraic terms should be defined as part of the description and if they are not, it may just be a poor description, not your fault.

Multiplication is a common notation and has a few short hands.

Often a little “x” or an asterisk “*” is used to represent multiplication:

c = a x b c = a * b

You may see a dot notation used; for example:

c = a . b

Which is the same as:

c = a * b

Alternately, you may see no operation and no white space separation between previously defined terms; for example:

c = ab

Which again is the same thing.

An exponent is a number raised to a power.

The notation is written as the original number, or the base, with a second number, or the exponent, shown as a superscript; for example:

2^3

Which would be calculated as 2 multiplied by itself 3 times, or cubing:

2 x 2 x 2 = 8

A number raised to the power 2 to is said to be its square.

2^2 = 2 x 2 = 4

The square of a number can be inverted by calculating the square root. This is shown using the notation of a number and with a tick above, I will use the “sqrt()” function here for simplicity.

sqrt(4) = 2

Here, we know the result and the exponent and we wish to find the base.

In fact, the root operation can be used to inverse any exponent, it just so happens that the default square root assumes an exponent of 2, represented by a subscript 2 in front of the square root tick.

For example, we can invert the cubing of a number by taking the cube root (note, the 3 is not a multiplication here, it is notation before the tick of the root sign):

2^3 = 8 3 sqrt(8) = 2

When we raise 10 to an integer exponent, we often call this an order of magnitude.

10^2 = 10 x 10 or 100

Another way to reverse this operation is by calculating the logarithm of the result 100 assuming a base of 10; in notation this is written as log10().

log10(100) = 2

Here, we know the result and the base and wish to find the exponent.

This allows us to move up and down orders of magnitude very easily. Taking the logarithm assuming the base of 2 is also commonly used, given the use of binary arithmetic used in computers. For example:

2^6 = 64 log2(64) = 6

Another popular logarithm is to assume the natural base called e. The e is reserved and is a special number or a constant called Euler’s number (pronounced “*oy-ler*“) that refers to a value with practically infinite precision.

e = 2.71828...

Raising e to a power is called a natural exponential function:

e^2 = 7.38905...

It can be inverted using the natural logarithm, which is denoted as ln():

ln(7.38905...) = 2

Without going into detail, the natural exponent and natural logarithm prove useful throughout mathematics to abstractly describe the continuous growth of some systems, e.g. systems that grow exponentially such as compound interest.

Greek letters are used throughout mathematical notation for variables, constants, functions, and more.

For example, in statistics we talk about the mean using the lowercase Greek letter mu, and the standard deviation as the lowercase Greek letter sigma. In linear regression, we talk about the coefficients as the lowercase letter beta. And so on.

It is useful to know all of the uppercase and lowercase Greek letters and how to pronounce them.

When I was a grad student, I printed the Greek alphabet and stuck it on my computer monitor so that I could memorize it. A useful trick!

Below is the full Greek alphabet.

The Wikipedia page titled “Greek letters used in mathematics, science, and engineering” is also a useful guide as it lists common uses for each Greek letter in different sub-fields of math and science.

Machine learning notation often describes an operation on a sequence.

A sequence may be an array of data or a list of terms.

A key to reading notation for sequences is the notation of indexing elements in the sequence.

Often the notation will specify the beginning and end of the sequence, such as 1 to n, where n will be the extent or length of the sequence.

Items in the sequence are index by a variable such as i, j, k as a subscript. This is just like array notation.

For example, a_i is the i^th element of the sequence a.

If the sequence is two dimensional, two indices may be used; for example:

b_{i,j} is the i,j^th element of the sequence b.

Mathematical operations can be performed over a sequence.

Two operations are performed on sequences so often that they have their own shorthand: the sum and the multiplication.

The sum over a sequence is denoted as the uppercase Greek letter sigma. It is specified with the variable and start of the sequence summation below the sigma (e.g. i = 1) and the index of the end of the summation above the sigma (e.g. n).

Sigma i = 1, n a_i

This is the sum of the sequence a starting at element 1 to element n.

The multiplication over a sequence is denoted as the uppercase Greek letter pi. It is specified in the same way as the sequence summation with the beginning and end of the operation below and above the letter respectively.

Pi i = 1, n a_i

This is the product of the sequence a starting at element 1 to element n.

A set is a group of unique items.

We may see set notation used when defining terms in machine learning.

A common set you may see is a set of numbers, such as a term defined as being within the set of integers or the set of real numbers.

Some common sets of numbers you may see include:

- Set of all natural numbers: N
- Set of all integers: Z
- Set of all real numbers: R

There are other sets; see Special sets on Wikipedia.

We often talk about real values or real numbers when defining terms rather than floating point values, which are really discrete creations for operations in computers.

It is common to see set membership in definitions of terms.

Set membership is denoted as a symbol that looks like an uppercase “E”.

a E R

Which means a is defined as being a member of the set R or the set of real numbers.

There is also a host of set operations; two common set operations include:

- Union, or aggregation: A U B
- Intersection, or overlap: A ^ B

Learn more about sets on Wikipedia.

There is other notation that you may come across.

I try to lay some of it out in this section.

It is common to define a method in the abstract and then define it again as a specific implementation with separate notation.

For example, if we are estimating a variable x, we may represent it using a notation that modifies the x; for example:

The same notation may have a different meaning in a different context, such as use on different objects or sub-fields of mathematics. For example, a common point of confusion is |x|, which, depending on context, can mean:

- |x|: The absolute or positive value of x.
- |x|: The length of the vector x.
- |x|: The cardinality of the set x.

This tutorial only covered the basics of mathematical notation. There are some subfields of mathematics that are more relevant to machine learning and should be reviewed in more detail. They are:

- Linear Algebra.
- Statistics.
- Probability.
- Calculus.

And perhaps a little bit of multivariate analysis and information theory.

Are there areas of mathematical notation that you think are missing from this post?

Let me know in the comments below.

This section lists some tips that you can use when you are struggling with mathematical notation in machine learning.

People wrote the paper or book you are reading.

People that can make mistakes, make omissions, and even make things confusing because they don’t fully understand what they are writing.

Relax the constraints of the notation you are reading slightly and think about the intent of the author. What are they trying to get across?

Perhaps you can even contact the author via email, Twitter, Facebook, LinkedIn, etc., and seek clarification. Remember that academics want other people to understand and use their work (mostly).

Wikipedia has lists of notation which can help narrow down on the meaning or intent of the notation you are reading.

Two places I recommend you start are:

- List of mathematical symbols on Wikipedia
- Greek letters used in mathematics, science, and engineering on Wikipedia

Mathematical operations are just functions on data.

Map everything you’re reading to pseudocode with variables, for-loops, and more.

You might want to use a scripting language as you go, along with small arrays of contrived data or even an Excel spreadsheet.

As your reading and understanding of the technique improves, your code-sketch of the technique will make more sense, and at the end you will have a mini prototype to play with.

I never used to take much stock in this approach until I saw an academic sketch out a very complex paper in a few lines of matlab with some contrived data. It knocked my socks off because I believed the system had to be coded completely and run with a “real” dataset and that the only option was to get the original code and data. I was very wrong. Also, looking back, the guy was gifted.

I now use this method all the time and sketch techniques in Python.

There is a trick I use when I’m trying to understand a new technique.

I find and read all the papers that reference the paper I’m reading with the new technique.

Reading other academics interpretation and re-explanation of the technique can often clarify my misunderstandings in the original description.

Not always though. Sometimes it can muddy the waters and introduce misleading explanations or new notation. But more often than not, it helps. After circling back to the original paper and re-reading it, I can often find cases where subsequent papers have actually made errors and misinterpretations of the original method.

There are places online where people love to explain math to others. Seriously!

Consider taking a screenshot of the notation you are struggling with, write out the full reference or link to it, and post it and your area of misunderstanding to a question-and-answer site.

Two great places to start are:

What are your tricks for working through mathematical notation?

Let me know in the comments below?

This section provides more resources on the topic if you are looking to go deeper.

- Section 0.1. Reading Mathematics [PDF], Vector Calculus, Linear Algebra, and Differential Forms, 2009.
- The Language and Grammar of Mathematics [PDF], Timothy Gowers
- Understanding Mathematics, a guide, Peter Alfeld.

In this tutorial, you discovered the basics of mathematical notation that you may come across when reading descriptions of techniques in machine learning.

Specifically, you learned:

- Notation for arithmetic, including variations of multiplication, exponents, roots, and logarithms.
- Notation for sequences and sets, including indexing, summation, and set membership.
- 5 Techniques you can use to get help if you are struggling with mathematical notation.

Are you struggling with mathematical notation?

Did any of the notation or tips in this post help?

Let me know in the comments below.

The post Basics of Mathematical Notation for Machine Learning appeared first on Machine Learning Mastery.

]]>The post Linear Algebra for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>Linear algebra is a field of mathematics that is universally agreed to be a prerequisite for a deeper understanding of machine learning.

Although linear algebra is a large field with many esoteric theories and findings, the nuts and bolts tools and notations taken from the field are required for machine learning practitioners. With a solid foundation of what linear algebra is, it is possible to focus on just the good or relevant parts.

In this crash course, you will discover how you can get started and confidently read and implement linear algebra notation used in machine learning with Python in 7 days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

**Update Mar/2018**: Fixed a small typo in the SVD lesson.

Before we get started, let’s make sure you are in the right place.

This course is for developers that may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.

You may know some basic NumPy for array manipulation. - You want to learn linear algebra to deepen your understanding and application of machine learning.

You do NOT need to know:

- You do not need to be a math wiz!
- You do not need to be a machine learning expert!

This crash course will take you from a developer that knows a little machine learning to a developer who can navigate the basics of linear algebra.

Note: This crash course assumes you have a working Python3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into 7 lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the 7 lessons that will get you started and productive with linear algebra for machine learning in Python:

**Lesson 01**: Linear Algebra for Machine Learning**Lesson 02**: Linear Algebra**Lesson 03**: Vectors**Lesson 04**: Matrices**Lesson 05**: Matrix Types and Operations**Lesson 06**: Matrix Factorization**Lesson 07**: Singular-Value Decomposition

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the linear algebra and the NumPy API and the best-of-breed tools in Python (hint: I have all of the answers directly on this blog; use the search box).

I do provide more help in the form of links to related posts because I want you to build up some confidence and inertia.

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed out tutorials, see my book on the topic titled “Basics of Linear Algebra for Machine Learning“.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this lesson, you will discover the 5 reasons why a machine learning practitioner should deepen their understanding of linear algebra.

You need to be able to read and write vector and matrix notation. Algorithms are described in books, papers, and on websites using vector and matrix notation.

In partnership with the notation of linear algebra are the arithmetic operations performed. You need to know how to add, subtract, and multiply scalars, vectors, and matrices.

You must learn linear algebra in order to be able to learn statistics. Especially multivariate statistics. In order to be able to read and interpret statistics, you must learn the notation and operations of linear algebra. Modern statistics uses both the notation and tools of linear algebra to describe the tools and techniques of statistical methods. From vectors for the means and variances of data, to covariance matrices that describe the relationships between multiple Gaussian variables.

Building on notation and arithmetic is the idea of matrix factorization, also called matrix decomposition.You need to know how to factorize a matrix and what it means. Matrix factorization is a key tool in linear algebra and used widely as an element of many more complex operations in both linear algebra (such as the matrix inverse) and machine learning (least squares).

You need to know how to use matrix factorization to solve linear least squares. Problems of this type can be framed as the minimization of squared error, called least squares, and can be recast in the language of linear algebra, called linear least squares. Linear least squares problems can be solved efficiently on computers using matrix operations such as matrix factorization.

If I could give one more reason, it would be: because it is fun. Seriously.

For this lesson, you must list 3 reasons why you, personally, want to learn linear algebra.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover a concise definition of linear algebra.

In this lesson, you will discover a concise definition of linear algebra.

Linear algebra is a branch of mathematics, but the truth of it is that linear algebra is the mathematics of data. Matrices and vectors are the language of data.

Linear algebra is about linear combinations. That is, using arithmetic on columns of numbers called vectors and 2D arrays of numbers called matrices, to create new columns and arrays of numbers.

The application of linear algebra in computers is often called numerical linear algebra.

It is more than just the implementation of linear algebra operations in code libraries; it also includes the careful handling of the problems of applied mathematics, such as working with the limited floating point precision of digital computers.

As linear algebra is the mathematics of data, the tools of linear algebra are used in many domains.

- Matrices in Engineering, such as a line of springs.
- Graphs and Networks, such as analyzing networks.
- Markov Matrices, Population, and Economics, such as population growth.
- Linear Programming, the simplex optimization method.
- Fourier Series: Linear Algebra for Functions, used widely in signal processing.
- Linear Algebra for Statistics and Probability, such as least squares for regression.
- Computer Graphics, such as the various translation, scaling and rotation of images.

For this lesson, you must find five quotes from research papers, blogs, or books that define the field of linear algebra.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover vectors and simple vector arithmetic.

In this lesson, you will discover vectors and simple vector arithmetic.

A vector is a tuple of one or more values called scalars.

Vectors are often represented using a lowercase character such as “v”; for example:

v = (v1, v2, v3)

Where v1, v2, v3 are scalar values, often real values.

We can represent a vector in Python as a NumPy array.

A NumPy array can be created from a list of numbers. For example, below we define a vector with the length of 3 and the integer values 1, 2, and 3.

# create a vector from numpy import array v = array([1, 2, 3]) print(v)

Two vectors of equal length can be multiplied together.

c = a * b

As with addition and subtraction, this operation is performed element-wise to result in a new vector of the same length.

a * b = (a1 * b1, a2 * b2, a3 * b3)

We can perform this operation directly in NumPy.

# multiply vectors from numpy import array a = array([1, 2, 3]) print(a) b = array([1, 2, 3]) print(b) c = a * b print(c)

For this lesson, you must implement other vector arithmetic operations such as addition, division, subtraction, and the vector dot product.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover matrices and simple matrix arithmetic.

In this lesson, you will discover matrices and simple matrix arithmetic.

A matrix is a two-dimensional array of scalars with one or more columns and one or more rows.

The notation for a matrix is often an uppercase letter, such as A, and entries are referred to by their two-dimensional subscript of row (i) and column (j), such as aij. For example:

A = ((a11, a12), (a21, a22), (a31, a32))

We can represent a matrix in Python using a two-dimensional NumPy array.

A NumPy array can be constructed given a list of lists. For example, below is a 2 row, 3 column matrix.

# create matrix from numpy import array A = array([[1, 2, 3], [4, 5, 6]]) print(A)

Two matrices with the same dimensions can be added together to create a new third matrix.

C = A + B

The scalar elements in the resulting matrix are calculated as the addition of the elements in each of the matrices being added.

We can implement this in python using the plus operator directly on the two NumPy arrays.

# add matrices from numpy import array A = array([[1, 2, 3], [4, 5, 6]]) print(A) B = array([[1, 2, 3], [4, 5, 6]]) print(B) C = A + B print(C)

Matrix multiplication, also called the matrix dot product, is more complicated than the previous operations and involves a rule, as not all matrices can be multiplied together.

C = A * B

The rule for matrix multiplication is as follows: The number of columns (n) in the first matrix (A) must equal the number of rows (m) in the second matrix (B).

For example, matrix A has the dimensions m rows and n columns and matrix B has the dimensions n and k. The n columns in A and n rows b are equal. The result is a new matrix with m rows and k columns.

C(m,k) = A(m,n) * B(n,k)

The intuition for the matrix multiplication is that we are calculating the dot product between each row in matrix A with each column in matrix B. For example, we can step down rows of column A and multiply each with column 1 in B to give the scalar values in column 1 of C.

The matrix multiplication operation can be implemented in NumPy using the dot() function.

# matrix dot product from numpy import array A = array([[1, 2], [3, 4], [5, 6]]) print(A) B = array([[1, 2], [3, 4]]) print(B) C = A.dot(B) print(C)

For this lesson, you must implement more matrix arithmetic operations such as subtraction, division, the Hadamard product, and vector-matrix multiplication.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover the different types of matrices and matrix operations.

In this lesson, you will discover the different types of matrices and matrix operations.

A defined matrix can be transposed, which creates a new matrix with the number of columns and rows flipped.

This is denoted by the superscript “T” next to the matrix.

C = A^T

We can transpose a matrix in NumPy by calling the T attribute.

# transpose matrix from numpy import array A = array([[1, 2], [3, 4], [5, 6]]) print(A) C = A.T print(C)

The operation of inverting a matrix is indicated by a -1 superscript next to the matrix; for example, A^-1. The result of the operation is referred to as the inverse of the original matrix; for example, B is the inverse of A.

B = A^-1

Not all matrices are invertible.

A matrix can be inverted in NumPy using the inv() function.

# invert matrix from numpy import array from numpy.linalg import inv # define matrix A = array([[1.0, 2.0], [3.0, 4.0]]) print(A) # invert matrix B = inv(A) print(B)

A square matrix is a matrix where the number of rows (n) equals the number of columns (m).

n = m

The square matrix is contrasted with the rectangular matrix where the number of rows and columns are not equal.

A symmetric matrix is a type of square matrix where the top-right triangle is the same as the bottom-left triangle.

To be symmetric, the axis of symmetry is always the main diagonal of the matrix, from the top left to the bottom right.

A symmetric matrix is always square and equal to its own transpose.

M = M^T

A triangular matrix is a type of square matrix that has all values in the upper-right or lower-left of the matrix with the remaining elements filled with zero values.

A triangular matrix with values only above the main diagonal is called an upper triangular matrix. Whereas, a triangular matrix with values only below the main diagonal is called a lower triangular matrix.

A diagonal matrix is one where values outside of the main diagonal have a zero value, where the main diagonal is taken from the top left of the matrix to the bottom right.

A diagonal matrix is often denoted with the variable D and may be represented as a full matrix or as a vector of values on the main diagonal.

For this lesson, you must develop examples for other matrix operations such as the determinant, trace, and rank.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover matrix factorization.

In this lesson, you will discover the basics of matrix factorization, also called matrix decomposition.

A matrix decomposition is a way of reducing a matrix into its constituent parts.

It is an approach that can simplify more complex matrix operations that can be performed on the decomposed matrix rather than on the original matrix itself.

A common analogy for matrix decomposition is the factoring of numbers, such as the factoring of 25 into 5 x 5. For this reason, matrix decomposition is also called matrix factorization. Like factoring real values, there are many ways to decompose a matrix, hence there are a range of different matrix decomposition techniques.

The LU decomposition is for square matrices and decomposes a matrix into L and U components.

A = L . U

Where A is the square matrix that we wish to decompose, L is the lower triangle matrix, and U is the upper triangle matrix. A variation of this decomposition that is numerically more stable to solve in practice is called the LUP decomposition, or the LU decomposition with partial pivoting.

A = P . L . U

The rows of the parent matrix are re-ordered to simplify the decomposition process and the additional P matrix specifies a way to permute the result or return the result to the original order. There are also other variations of the LU.

The LU decomposition is often used to simplify the solving of systems of linear equations, such as finding the coefficients in a linear regression.

The LU decomposition can be implemented in Python with the lu() function. More specifically, this function calculates an LPU decomposition.

# LU decomposition from numpy import array from scipy.linalg import lu # define a square matrix A = array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) print(A) # LU decomposition P, L, U = lu(A) print(P) print(L) print(U) # reconstruct B = P.dot(L).dot(U) print(B)

For this lesson, you must implement small examples of other simple methods for matrix factorization, such as the QR decomposition, the Cholesky decomposition, and the eigendecomposition.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover the Singular-Value Decomposition method for matrix factorization.

In this lesson, you will discover the Singular-Value Decomposition method for matrix factorization.

Singular-Value Decomposition

The Singular-Value Decomposition, or SVD for short, is a matrix decomposition method for reducing a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler.

A = U . Sigma . V^T

Where A is the real m x n matrix that we wish to decompose, U is an m x m matrix, Sigma (often represented by the uppercase Greek letter Sigma) is an m x n diagonal matrix, and V^T is the transpose of an n x n matrix where T is a superscript.

The SVD can be calculated by calling the svd() function.

The function takes a matrix and returns the U, Sigma, and V^T elements. The Sigma diagonal matrix is returned as a vector of singular values. The V matrix is returned in a transposed form, e.g. V.T.

# Singular-value decomposition from numpy import array from scipy.linalg import svd # define a matrix A = array([[1, 2], [3, 4], [5, 6]]) print(A) # SVD U, s, V = svd(A) print(U) print(s) print(V)

For this lesson, you must list 5 applications of the SVD.

Bonus points if you can demonstrate each with a small example in Python.

Post your answer in the comments below. I would love to see what you discover.

This was the final lesson in the mini-course.

(

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The importance of linear algebra to applied machine learning.
- What linear algebra is all about.
- What a vector is and how to perform vector arithmetic.
- What a matrix is and how to perform matrix arithmetic, including matrix multiplication.
- A suite of types of matrices, their properties, and advanced operations involving matrices.
- Matrix factorization methods and the LU decomposition method in detail.
- The popular Singular-Value decomposition method used in machine learning.

This is just the beginning of your journey with linear algebra for machine learning. Keep practicing and developing your skills.

Take the next step and check out my book on Linear Algebra for Machine Learning.

*How Did You Do with The Mini-Course?*

Did you enjoy this crash course?

*Do you have any questions? Were there any sticking points?*

Let me know. Leave a comment below.

The post Linear Algebra for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>The post Computational Linear Algebra for Coders Review appeared first on Machine Learning Mastery.

]]>It is an area that requires some previous experience of linear algebra and is focused on both the performance and precision of the operations. The company fast.ai released a free course titled “*Computational Linear Algebra*” on the topic of numerical linear algebra that includes Python notebooks and video lectures recorded at the University of San Francisco.

In this post, you will discover the fast.ai free course on computational linear algebra.

After reading this post, you will know:

- The motivation and prerequisites for the course.
- An overview of the topics covered in the course.
- Who exactly this course is a good fit for, and who it is not.

Let’s get started.

The course “*Computational Linear Algebra for Coders*” is a free online course provided by fast.ai. They are a company dedicated to providing free education resources related to deep learning.

The course was originally taught in 2017 by Rachel Thomas at the University of San Francisco as part of a masters degree program. Rachel Thomas is a professor at the University of San Francisco and co-founder of fast.ai and has a Ph.D. in mathematics.

The focus of the course is numerical methods for linear algebra. This is the application of matrix algebra on computers and addresses all of the concerns around the implementation and use of the methods such as performance and precision.

This course is focused on the question: How do we do matrix computations with acceptable speed and acceptable accuracy?

The course uses Python with examples using NumPy, scikit-learn, numba, pytorch, and more.

The material is taught using a top-down approach, much like MachineLearningMastery, intended to give a feeling for how to do things, before explaining how the methods work.

Knowing how these algorithms are implemented will allow you to better combine and utilize them, and will make it possible for you to customize them if needed.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The course does assume familiarity with linear algebra.

This includes topics such as vectors, matrices, operations such as matrix multiplication and transforms.

The course is not for novices to the field of linear algebra.

Three references are suggested for you to review prior to taking the course if you are new or rusty with linear algebra. They are:

- 3Blue 1Brown Essence of Linear Algebra, Video Course
- Immersive Linear Algebra, Interactive Textbook
- Chapter 2 of Deep Learning, 2016.

Further, while working through the course, references are provided as needed.

Two general reference texts are suggested up front. They are the following textbooks:

- Numerical Linear Algebra, 1997.
- Numerical Methods, 2012.

This section provides a summary of the 8 (9) parts to the course. They are:

- 0. Course Logistics
- 1. Why are we here?
- 2. Topic Modeling with NMF and SVD
- 3. Background Removal with Robust PCA
- 4. Compressed Sensing with Robust Regression
- 5. Predicting Health Outcomes with Linear Regressions
- 6. How to Implement Linear Regression
- 7. PageRank with Eigen Decompositions
- 8. Implementing QR Factorization

Really, there are only 8 parts to the course as the first is just administration details for the students that took the course at the University of San Francisco.

In this section, we will step through the 9 parts of the course and summarize their contents and topics covered to give you a feel for what to expect and to see whether it is a good fit for you.

This first lecture is not really part of the course.

It provides an introduction to the lecturer, the material, the way it will be taught, and the expectations of the student in the masters program.

I’ll be using a top-down teaching method, which is different from how most math courses operate. Typically, in a bottom-up approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures. The problems with this are that students often lose motivation, don’t have a sense of the “big picture”, and don’t know what they’ll need.

The topics covered in this lecture are:

- Lecturer background
- Teaching Approach
- Importance of Technical Writing
- List of Excellent Technical Blogs
- Linear Algebra Review Resources

Videos and Notebook:

This part introduces the motivation for the course, and touches on the importance of matrix factorization: the importance of the performance and accuracy of these calculations and some example applications.

Matrices are everywhere, anything that can be put in an Excel spreadsheet is a matrix, and language and pictures can be represented as matrices as well.

A great point made in this lecture is how the whole class of matrix factorization methods and one specific method, the QR decomposition, were reported as being among the top 10 most important algorithms of the 20th century.

A list of the Top 10 Algorithms of science and engineering during the 20th century includes: the matrix decompositions approach to linear algebra. It also includes the QR algorithm

The topics covered in this lecture are:

- Matrix and Tensor Products
- Matrix Decompositions
- Accuracy
- Memory use
- Speed
- Parallelization & Vectorization

Videos and Notebook:

This part focuses on the use of matrix factorization in the application to topic modeling for text, specifically the Singular Value Decomposition method, or SVD.

Useful in this part are the comparisons of calculating the methods from scratch or with NumPy and with the scikit-learn library.

Topic modeling is a great way to get started with matrix factorizations.

The topics covered in this lecture are:

- Topic Frequency-Inverse Document Frequency (TF-IDF)
- Singular Value Decomposition (SVD)
- Non-negative Matrix Factorization (NMF)
- Stochastic Gradient Descent (SGD)
- Intro to PyTorch
- Truncated SVD

Videos and Notebook:

- Computational Linear Algebra 2: Topic Modelling with SVD & NMF
- Computational Linear Algebra 3: Review, New Perspective on NMF, & Randomized SVD
- Notebook

This part focuses on the Principal Component Analysis method, or PCA, that uses the eigendecomposition and multivariate statistics.

The focus is on using PCA on image data such as separating background from foreground to isolate changes. This part also introduces the LU decomposition from scratch.

When dealing with high-dimensional data sets, we often leverage on the fact that the data has low intrinsic dimensionality in order to alleviate the curse of dimensionality and scale (perhaps it lies in a low-dimensional subspace or lies on a low-dimensional manifold).

The topics covered in this lecture are:

- Load and View Video Data
- SVD
- Principal Component Analysis (PCA)
- L1 Norm Induces Sparsity
- Robust PCA
- LU factorization
- Stability of LU
- LU factorization with Pivoting
- History of Gaussian Elimination
- Block Matrix Multiplication

Videos and Notebook:

- Computational Linear Algebra 3: Review, New Perspective on NMF, & Randomized SVD
- Computational Linear Algebra 4: Randomized SVD & Robust PCA
- Computational Linear Algebra 5: Robust PCA & LU Factorization
- Notebook

This part introduces the important concepts of broadcasting used in NumPy arrays (and elsewhere) and sparse matrices that crop up a lot in machine learning.

The application focus of this part is the use of robust PCA for background removal in CT scans.

The term broadcasting describes how arrays with different shapes are treated during arithmetic operations. The term broadcasting was first used by Numpy, although is now used in other libraries such as Tensorflow and Matlab; the rules can vary by library.

The topics covered in this lecture are:

- Broadcasting
- Sparse matrices
- CT Scans and Compressed Sensing
- L1 and L2 regression

Videos and Notebook:

- Computational Linear Algebra 6: Block Matrix Mult, Broadcasting, & Sparse Storage
- Computational Linear Algebra 7: Compressed Sensing for CT Scans
- Notebook

This part focuses on the development of linear regression models demonstrated with scikit-learn.

The Numba library is also used to demonstrate how to speed up the matrix operations involved.

We would like to speed this up. We will use Numba, a Python library that compiles code directly to C.

The topics covered in this lecture are:

- Linear regression in sklearn
- Polynomial Features
- Speeding up with Numba
- Regularization and Noise

Videos and Notebook:

- Computational Linear Algebra 8: Numba, Polynomial Features, How to Implement Linear Regression
- Notebook

This part looks at how to solve linear least squares for linear regression using a suite of different matrix factorization methods. Results are compared to the implementation in scikit-learn.

Linear regression via QR has been recommended by numerical analysts as the standard method for years. It is natural, elegant, and good for “daily use”.

The topics covered in this lecture are:

- How did Scikit Learn do it?
- Naive solution
- Normal equations and Cholesky factorization
- QR factorization
- SVD
- Timing Comparison
- Conditioning & Stability
- Full vs Reduced Factorizations
- Matrix Inversion is Unstable

Videos and Notebook:

- Computational Linear Algebra 8: Numba, Polynomial Features, How to Implement Linear Regression
- Notebook

This part introduces the eigendecomposition and the implementation and application of the PageRank algorithm to a Wikipedia links dataset.

The QR algorithm uses something called the QR decomposition. Both are important, so don’t get them confused.

The topics covered in this lecture are:

- SVD
- DBpedia Dataset
- Power Method
- QR Algorithm
- Two-phase approach to finding eigenvalues
- Arnoldi Iteration

Videos and Notebook:

- Computational Linear Algebra 9: PageRank with Eigen Decompositions
- Computational Linear Algebra 10: QR Algorithm to find Eigenvalues, Implementing QR Decomposition
- Notebook

This final part introduces three ways to implement the QR decomposition from scratch and compares the precision and performance of each method.

We used QR factorization in computing eigenvalues and to compute least squares regression. It is an important building block in numerical linear algebra.

The topics covered in this lecture are:

- Gram-Schmidt
- Householder
- Stability Examples

Videos and Notebook:

- Computational Linear Algebra 10: QR Algorithm to find Eigenvalues, Implementing QR Decomposition
- Notebook

I think the course is excellent.

A fun walk through numerical linear algebra with a focus on applications and executable code.

The course delivers on the promise of focusing on the practical concerns of matrix operations such as memory, speed, and precision or numerical stability. The course begins with a careful look at issues of floating point precision and overflow.

Throughout the course, frequently comparisons are made between methods in terms of execution speed.

This course is not an introduction to linear algebra for developers, and if that is the expectation going in, you may be left behind.

The course does assume a reasonable fluency with the basics of linear algebra, notation, and operations. And it does not hide this assumption up front.

I don’t think this course is required if you are interested in deep learning or learning more about the linear algebra operations used in deep learning methods.

If you are implementing matrix algebra methods in your own work and you’re looking to get more out of them, I would highly recommend this course.

I would also recommend this course if you are generally interested in the practical implications of matrix algebra.

This section provides more resources on the topic if you are looking to go deeper.

- New fast.ai course: Computational Linear Algebra
- Computational Linear Algebra on GitHub
- Computational Linear Algebra Video Lectures
- Community Forums

- 3Blue 1Brown Essence of Linear Algebra, Video Course
- Immersive Linear Algebra, Interactive Textbook
- Chapter 2 of Deep Learning
- Numerical Linear Algebra, 1997.
- Numerical Methods, 2012.

In this post, you discovered the fast.ai free course on computational linear algebra.

Specifically, you learned:

- The motivation and prerequisites for the course.
- An overview of the topics covered in the course.
- Who exactly this course is a good fit for, and who it is not.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Computational Linear Algebra for Coders Review appeared first on Machine Learning Mastery.

]]>The post Linear Algebra for Deep Learning appeared first on Machine Learning Mastery.

]]>Generally, an understanding of linear algebra (or parts thereof) is presented as a prerequisite for machine learning. Although important, this area of mathematics is seldom covered by computer science or software engineering degree programs.

In their seminal textbook on deep learning, Ian Goodfellow and others present chapters covering the prerequisite mathematical concepts for deep learning, including a chapter on linear algebra.

In this post, you will discover the crash course in linear algebra for deep learning presented in the de facto textbook on deep learning.

After reading this post, you will know:

- The topics suggested as prerequisites for deep learning by experts in the field.
- The progression through these topics and their culmination.
- Suggestions for how to get the most out of the chapter as a crash course in linear algebra.

Let’s get started.

The book “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville is the de facto textbook for deep learning.

In the book, the authors provide a part titled “*Applied Math and Machine Learning Basics*” intended to provide the background in applied mathematics and machine learning required to understand the deep learning material presented in the rest of the book.

This part of the book includes four chapters; they are:

- Linear Algebra
- Probability and Information Theory
- Numerical Computation
- Machine Learning Basics

Given the expertise of the authors of the book, it is fair to say that the chapter on linear algebra provides a well reasoned set of prerequisites for deep learning, and perhaps more generally much of machine learning.

This part of the book introduces the basic mathematical concepts needed to understand deep learning.

— Page 30, Deep Learning, 2016.

Therefore, we can use the topics covered in the chapter on linear algebra as a guide to the topics you may be expected to be familiar with as a deep learning and machine learning practitioner.

Linear algebra is less likely to be covered in computer science courses than other types of math, such as discrete mathematics. This is specifically called out by the authors.

Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it.

— Page 31, Deep Learning, 2016.

We can take that the topics in this chapter are also laid out in a way tailored for computer science graduates with little to no prior exposure.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The chapter on linear algebra is divided into 12 sections.

As a first step, it is useful to use this as a high-level road map. The complete list of sections from the chapter are listed below.

- Scalars, Vectors, Matrices and Tensors
- Multiplying Matrices and Vectors
- Identity and Inverse Matrices
- Linear Dependence and Span
- Norms
- Special Kinds of Matrices and Vectors
- Eigendecomposition
- Singular Value Decomposition
- The Moore-Penrose Pseudoinverse
- The Trace Operator
- The Determinant
- Example: Principal Components Analysis

There’s not much value in enumerating the specifics covered in each section as the topics are mostly self explanatory, if familiar.

A reading of the chapter shows a progression in concepts and methods from the most primitive (vectors and matrices) to the derivation of the principal components analysis (known as PCA), a method used in machine learning.

It is a clean progression and well designed. Topics are presented with textual descriptions and consistent notation, allowing the reader to see exactly how elements come together through matrix factorization, the pseudoinverse, and ultimately PCA.

The focus is on the application of the linear algebra operations rather than theory. Although, no worked examples are given of any of the operations.

Finally, the derivation of PCA is perhaps a bit much. A beginner may want to skip this full derivation, or perhaps reduce it to the application of some of the elements learned throughout the chapter (e.g. eigendecomposition).

One area I would like to have seen covered is linear least squares and the use of various matrix algebra methods used to solve it, such as directly, LU, QR decomposition, and SVD. This might be more of a general machine learning perspective and less a deep learning perspective, and I can see why it was excluded.

The authors also suggest two other texts to consult if further depth in linear algebra is required.

They are:

- The Matrix Cookbook, Petersen and Pedersen, 2006.
- Linear Algebra, Shilov, 1977.

The Matrix Cookbook is a free PDF filled with the notations and equations of practically any matrix operation you can conceive.

These pages are a collection of facts (identities, approximations, inequalities, relations, …) about matrices and matters relating to them. It is collected in this form for the convenience of anyone who wants a quick desktop reference.

— page 2, The Matrix Cookbook, 2006.

Linear Algebra by Georgi Shilov is a classic and well regarded textbook on the topics designed for undergraduate students.

This book is intended as a text for undergraduate students majoring in mathematics and physics.

— Page v, Linear Algebra, 1977.

If you are a machine learning practitioner looking to use this chapter as a linear algebra crash course, then I would make a few recommendations to make the topics more concrete:

- Implement each operation in Python using NumPy functions on small contrived data.
- Implement each operation manually in Python without NumPy functions.
- Apply key operations, such as the factorization methods (eigendecomposition and SVD) and PCA to real but small datasets loaded from CSV.
- Create a cheat sheet of notation that you can use as a quick reference going forward.
- Research and list examples of each operation/topic used in machine learning papers or texts.

Did you take on any of these suggestions?

List your results in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- The Matrix Cookbook, Petersen and Pedersen, 2006.
- Linear Algebra, Shilov, 1977.

In this post, you discovered the crash course in linear algebra for deep learning presented in the de facto textbook on deep learning.

Specifically, you learned:

- The topics suggested as prerequisites for deep learning by experts in the field.
- The progression through these topics and their culmination.
- Suggestions for how to get the most out of the chapter as a crash course in linear algebra.

Did you read this chapter of the Deep Learning book? What did you think of it?

Let me know in the comments below.

The post Linear Algebra for Deep Learning appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Sparse Matrices for Machine Learning appeared first on Machine Learning Mastery.

]]>Large sparse matrices are common in general and especially in applied machine learning, such as in data that contains counts, data encodings that map categories to counts, and even in whole subfields of machine learning such as natural language processing.

It is computationally expensive to represent and work with sparse matrices as though they are dense, and much improvement in performance can be achieved by using representations and operations that specifically handle the matrix sparsity.

In this tutorial, you will discover sparse matrices, the issues they present, and how to work with them directly in Python.

After completing this tutorial, you will know:

- That sparse matrices contain mostly zero values and are distinct from dense matrices.
- The myriad of areas where you are likely to encounter sparse matrices in data, data preparation, and sub-fields of machine learning.
- That there are many efficient ways to store and work with sparse matrices and SciPy provides implementations that you can use directly.

Let’s get started.

This tutorial is divided into 5 parts; they are:

- Sparse Matrix
- Problems with Sparsity
- Sparse Matrices in Machine Learning
- Working with Sparse Matrices
- Sparse Matrices in Python

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

A sparse matrix is a matrix that is comprised of mostly zero values.

Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices.

A matrix is sparse if many of its coefficients are zero. The interest in sparsity arises because its exploitation can lead to enormous computational savings and because many large matrix problems that occur in practice are sparse.

— Page 1, Direct Methods for Sparse Matrices, Second Edition, 2017.

The sparsity of a matrix can be quantified with a score, which is the number of zero values in the matrix divided by the total number of elements in the matrix.

sparsity = count zero elements / total elements

Below is an example of a small 3 x 6 sparse matrix.

1, 0, 0, 1, 0, 0 A = (0, 0, 2, 0, 0, 1) 0, 0, 0, 2, 0, 0

The example has 13 zero values of the 18 elements in the matrix, giving this matrix a sparsity score of 0.722 or about 72%.

Sparse matrices can cause problems with regards to space and time complexity.

Very large matrices require a lot of memory, and some very large matrices that we wish to work with are sparse.

In practice, most large matrices are sparse — almost all entries are zeros.

— Page 465, Introduction to Linear Algebra, Fifth Edition, 2016.

An example of a very large matrix that is too large to be stored in memory is a link matrix that shows the links from one website to another.

An example of a smaller sparse matrix might be a word or term occurrence matrix for words in one book against all known words in English.

In both cases, the matrix contained is sparse with many more zero values than data values. The problem with representing these sparse matrices as dense matrices is that memory is required and must be allocated for each 32-bit or even 64-bit zero value in the matrix.

This is clearly a waste of memory resources as those zero values do not contain any information.

Assuming a very large sparse matrix can be fit into memory, we will want to perform operations on this matrix.

Simply, if the matrix contains mostly zero-values, i.e. no data, then performing operations across this matrix may take a long time where the bulk of the computation performed will involve adding or multiplying zero values together.

It is wasteful to use general methods of linear algebra on such problems, because most of the O(N^3) arithmetic operations devoted to solving the set of equations or inverting the matrix involve zero operands.

— Page 75, Numerical Recipes: The Art of Scientific Computing, Third Edition, 2007.

This is a problem of increased time complexity of matrix operations that increases with the size of the matrix.

This problem is compounded when we consider that even trivial machine learning methods may require many operations on each row, column, or even across the entire matrix, resulting in vastly longer execution times.

Sparse matrices turn up a lot in applied machine learning.

In this section, we will look at some common examples to motivate you to be aware of the issues of sparsity.

Sparse matrices come up in some specific types of data, most notably observations that record the occurrence or count of an activity.

Three examples include:

- Whether or not a user has watched a movie in a movie catalog.
- Whether or not a user has purchased a product in a product catalog.
- Count of the number of listens of a song in a song catalog.

Sparse matrices come up in encoding schemes used in the preparation of data.

Three common examples include:

- One-hot encoding, used to represent categorical data as sparse binary vectors.
- Count encoding, used to represent the frequency of words in a vocabulary for a document
- TF-IDF encoding, used to represent normalized word frequency scores in a vocabulary.

Some areas of study within machine learning must develop specialized methods to address sparsity directly as the input data is almost always sparse.

Three examples include:

- Natural language processing for working with documents of text.
- Recommender systems for working with product usage within a catalog.
- Computer vision when working with images that contain lots of black pixels.

If there are 100,000 words in the language model, then the feature vector has length 100,000, but for a short email message almost all the features will have count zero.

— Page 22, Artificial Intelligence: A Modern Approach, Third Edition, 2009.

The solution to representing and working with sparse matrices is to use an alternate data structure to represent the sparse data.

The zero values can be ignored and only the data or non-zero values in the sparse matrix need to be stored or acted upon.

There are multiple data structures that can be used to efficiently construct a sparse matrix; three common examples are listed below.

**Dictionary of Keys**. A dictionary is used where a row and column index is mapped to a value.**List of Lists**. Each row of the matrix is stored as a list, with each sublist containing the column index and the value.**Coordinate List**. A list of tuples is stored with each tuple containing the row index, column index, and the value.

There are also data structures that are more suitable for performing efficient operations; two commonly used examples are listed below.

**Compressed Sparse Row**. The sparse matrix is represented using three one-dimensional arrays for the non-zero values, the extents of the rows, and the column indexes.**Compressed Sparse Column**. The same as the Compressed Sparse Row method except the column indices are compressed and read first before the row indices.

The Compressed Sparse Row, also called CSR for short, is often used to represent sparse matrices in machine learning given the efficient access and matrix multiplication that it supports.

SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix.

Many linear algebra NumPy and SciPy functions that operate on NumPy arrays can transparently operate on SciPy sparse arrays. Further, machine learning libraries that use NumPy data structures can also operate transparently on SciPy sparse arrays, such as scikit-learn for general machine learning and Keras for deep learning.

A dense matrix stored in a NumPy array can be converted into a sparse matrix using the CSR representation by calling the *csr_matrix()* function.

In the example below, we define a 3 x 6 sparse matrix as a dense array, convert it to a CSR sparse representation, and then convert it back to a dense array by calling the *todense()* function.

# dense to sparse from numpy import array from scipy.sparse import csr_matrix # create dense matrix A = array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]]) print(A) # convert to sparse matrix (CSR method) S = csr_matrix(A) print(S) # reconstruct dense matrix B = S.todense() print(B)

Running the example first prints the defined dense array, followed by the CSR representation, and then the reconstructed dense matrix.

[[1 0 0 1 0 0] [0 0 2 0 0 1] [0 0 0 2 0 0]] (0, 0) 1 (0, 3) 1 (1, 2) 2 (1, 5) 1 (2, 3) 2 [[1 0 0 1 0 0] [0 0 2 0 0 1] [0 0 0 2 0 0]]

NumPy does not provide a function to calculate the sparsity of a matrix.

Nevertheless, we can calculate it easily by first finding the density of the matrix and subtracting it from one. The number of non-zero elements in a NumPy array can be given by the *count_nonzero()* function and the total number of elements in the array can be given by the size property of the array. Array sparsity can therefore be calculated as

sparsity = 1.0 - count_nonzero(A) / A.size

The example below demonstrates how to calculate the sparsity of an array.

# calculate sparsity from numpy import array from numpy import count_nonzero # create dense matrix A = array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]]) print(A) # calculate sparsity sparsity = 1.0 - count_nonzero(A) / A.size print(sparsity)

Running the example first prints the defined sparse matrix followed by the sparsity of the matrix.

[[1 0 0 1 0 0] [0 0 2 0 0 1] [0 0 0 2 0 0]] 0.7222222222222222

This section lists some ideas for extending the tutorial that you may wish to explore.

- Develop your own examples for converting a dense array to sparse and calculating sparsity.
- Develop an example for the each sparse matrix representation method supported by SciPy.
- Select one sparsity representation method and implement it yourself from scratch.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Linear Algebra, Fifth Edition, 2016.
- Section 2.7 Sparse Linear Systems, Numerical Recipes: The Art of Scientific Computing, Third Edition, 2007.
- Artificial Intelligence: A Modern Approach, Third Edition, 2009.
- Direct Methods for Sparse Matrices, Second Edition, 2017.

- Sparse matrices (scipy.sparse) API
- scipy.sparse.csr_matrix() API
- numpy.count_nonzero() API
- numpy.ndarray.size API

In this tutorial, you discovered sparse matrices, the issues they present, and how to work with them directly in Python.

Specifically, you learned:

- That sparse matrices contain mostly zero values and are distinct from dense matrices.
- The myriad of areas where you are likely to encounter sparse matrices in data, data preparation, and sub-fields of machine learning.
- That there are many efficient ways to store and work with sparse matrices and SciPy provides implementations that you can use directly.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Sparse Matrices for Machine Learning appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Broadcasting with NumPy Arrays appeared first on Machine Learning Mastery.

]]>A way to overcome this is to duplicate the smaller array so that it is the dimensionality and size as the larger array. This is called array broadcasting and is available in NumPy when performing array arithmetic, which can greatly reduce and simplify your code.

In this tutorial, you will discover the concept of array broadcasting and how to implement it in NumPy.

After completing this tutorial, you will know:

- The problem of arithmetic with arrays with different sizes.
- The solution of broadcasting and common examples in one and two dimensions.
- The rule of array broadcasting and when broadcasting fails.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Limitation with Array Arithmetic
- Array Broadcasting
- Broadcasting in NumPy
- Limitations of Broadcasting

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

You can perform arithmetic directly on NumPy arrays, such as addition and subtraction.

For example, two arrays can be added together to create a new array where the values at each index are added together.

For example, an array a can be defined as [1, 2, 3] and array b can be defined as [1, 2, 3] and adding together will result in a new array with the values [2, 4, 6].

a = [1, 2, 3] b = [1, 2, 3] c = a + b c = [1 + 1, 2 + 2, 3 + 3]

Strictly, arithmetic may only be performed on arrays that have the same dimensions and dimensions with the same size.

This means that a one-dimensional array with the length of 10 can only perform arithmetic with another one-dimensional array with the length 10.

This limitation on array arithmetic is quite limiting indeed. Thankfully, NumPy provides a built-in workaround to allow arithmetic between arrays with differing sizes.

Broadcasting is the name given to the method that NumPy uses to allow array arithmetic between arrays with a different shape or size.

Although the technique was developed for NumPy, it has also been adopted more broadly in other numerical computational libraries, such as Theano, TensorFlow, and Octave.

Broadcasting solves the problem of arithmetic between arrays of differing shapes by in effect replicating the smaller array along the last mismatched dimension.

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

— Broadcasting, SciPy.org

NumPy does not actually duplicate the smaller array; instead, it makes memory and computationally efficient use of existing structures in memory that in effect achieve the same result.

The concept has also permeated linear algebra notation to simplify the explanation of simple operations.

In the context of deep learning, we also use some less conventional notation. We allow the addition of matrix and a vector, yielding another matrix: C = A + b, where Ci,j = Ai,j + bj. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row before doing the addition. This implicit copying of b to many locations is called broadcasting.

— Page 34, Deep Learning, 2016.

We can make broadcasting concrete by looking at three examples in NumPy.

The examples in this section are not exhaustive, but instead are common to the types of broadcasting you may see or implement.

A single value or scalar can be used in arithmetic with a one-dimensional array.

For example, we can imagine a one-dimensional array “a” with three values [a1, a2, a3] added to a scalar “b”.

a = [a1, a2, a3] b

The scalar will need to be broadcast across the one-dimensional array by duplicating the value it 2 more times.

b = [b1, b2, b3]

The two one-dimensional arrays can then be added directly.

c = a + b c = [a1 + b1, a2 + b2, a3 + b3]

The example below demonstrates this in NumPy.

# scalar and one-dimensional from numpy import array a = array([1, 2, 3]) print(a) b = 2 print(b) c = a + b print(c)

Running the example first prints the defined one-dimensional array, then the scalar, followed by the result where the scalar is added to each value in the array.

[1 2 3] 2 [3 4 5]

A scalar value can be used in arithmetic with a two-dimensional array.

For example, we can imagine a two-dimensional array “A” with 2 rows and 3 columns added to the scalar “b”.

a11, a12, a13 A = (a21, a22, a23) b

The scalar will need to be broadcast across each row of the two-dimensional array by duplicating it 5 more times.

b11, b12, b13 B = (b21, b22, b23)

The two two-dimensional arrays can then be added directly.

C = A + B a11 + b11, a12 + b12, a13 + b13 C = (a21 + b21, a22 + b22, a23 + b23)

The example below demonstrates this in NumPy.

# scalar and two-dimensional from numpy import array A = array([[1, 2, 3], [1, 2, 3]]) print(A) b = 2 print(b) C = A + b print(C)

Running the example first prints the defined two-dimensional array, then the scalar, then the result of the addition with the value “2” added to each value in the array.

[[1 2 3] [1 2 3]] 2 [[3 4 5] [3 4 5]]

A one-dimensional array can be used in arithmetic with a two-dimensional array.

For example, we can imagine a two-dimensional array “A” with 2 rows and 3 columns added to a one-dimensional array “b” with 3 values.

a11, a12, a13 A = (a21, a22, a23) b = (b1, b2, b3)

The one-dimensional array is broadcast across each row of the two-dimensional array by creating a second copy to result in a new two-dimensional array “B”.

b11, b12, b13 B = (b21, b22, b23)

The two two-dimensional arrays can then be added directly.

C = A + B a11 + b11, a12 + b12, a13 + b13 C = (a21 + b21, a22 + b22, a23 + b23)

Below is a worked example in NumPy.

# one-dimensional and two-dimensional from numpy import array A = array([[1, 2, 3], [1, 2, 3]]) print(A) b = array([1, 2, 3]) print(b) C = A + b print(C)

Running the example first prints the defined two-dimensional array, then the defined one-dimensional array, followed by the result C where in effect each value in the two-dimensional array is doubled.

[[1 2 3] [1 2 3]] [1 2 3] [[2 4 6] [2 4 6]]

Broadcasting is a handy shortcut that proves very useful in practice when working with NumPy arrays.

That being said, it does not work for all cases, and in fact imposes a strict rule that must be satisfied for broadcasting to be performed.

Arithmetic, including broadcasting, can only be performed when the shape of each dimension in the arrays are equal or one has the dimension size of 1. The dimensions are considered in reverse order, starting with the trailing dimension; for example, looking at columns before rows in a two-dimensional case.

This make more sense when we consider that NumPy will in effect pad missing dimensions with a size of “1” when comparing arrays.

Therefore, the comparison between a two-dimensional array “A” with 2 rows and 3 columns and a vector “b” with 3 elements:

A.shape = (2 x 3) b.shape = (3)

In effect, this becomes a comparison between:

A.shape = (2 x 3) b.shape = (1 x 3)

This same notion applies to the comparison between a scalar that is treated as an array with the required number of dimensions:

A.shape = (2 x 3) b.shape = (1)

This becomes a comparison between:

A.shape = (2 x 3) b.shape = (1 x 1)

When the comparison fails, the broadcast cannot be performed, and an error is raised.

The example below attempts to broadcast a two-element array to a 2 x 3 array. This comparison is in effect:

A.shape = (2 x 3) b.shape = (1 x 2)

We can see that the last dimensions (columns) do not match and we would expect the broadcast to fail.

The example below demonstrates this in NumPy.

# broadcasting error from numpy import array A = array([[1, 2, 3], [1, 2, 3]]) print(A.shape) b = array([1, 2]) print(b.shape) C = A + b print(C)

Running the example first prints the shapes of the arrays then raises an error when attempting to broadcast, as we expected.

(2, 3) (2,) ValueError: operands could not be broadcast together with shapes (2,3) (2,)

This section lists some ideas for extending the tutorial that you may wish to explore.

- Create three new and different examples of broadcasting with NumPy arrays.
- Implement your own broadcasting function for manually broadcasting in one and two-dimensional cases.
- Benchmark NumPy broadcasting and your own custom broadcasting functions with one and two dimensional cases with very large arrays.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 2, Deep Learning, 2016.

- Broadcasting, NumPy API, SciPy.org
- Broadcasting semantics in TensorFlow
- Array Broadcasting in numpy, EricsBroadcastingDoc
- Broadcasting, Theano
- Broadcasting arrays in Numpy, 2015.
- Broadcasting in Octave

In this tutorial, you discovered the concept of array broadcasting and how to implement in NumPy.

Specifically, you learned:

- The problem of arithmetic with arrays with different sizes.
- The solution of broadcasting and common examples in one and two dimensions.
- The rule of array broadcasting and when broadcasting fails.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Broadcasting with NumPy Arrays appeared first on Machine Learning Mastery.

]]>The post 10 Examples of Linear Algebra in Machine Learning appeared first on Machine Learning Mastery.

]]>It is a key foundation to the field of machine learning, from notations used to describe the operation of algorithms to the implementation of algorithms in code.

Although linear algebra is integral to the field of machine learning, the tight relationship is often left unexplained or explained using abstract concepts such as vector spaces or specific matrix operations.

In this post, you will discover 10 common examples of machine learning that you may be familiar with that use, require and are really best understood using linear algebra.

After reading this post, you will know:

- The use of linear algebra structures when working with data, such as tabular datasets and images.
- Linear algebra concepts when working with data preparation, such as one hot encoding and dimensionality reduction.
- The ingrained use of linear algebra notation and methods in sub-fields such as deep learning, natural language processing, and recommender systems.

Let’s get started.

In this post, we will review 10 obvious and concrete examples of linear algebra in machine learning.

I tried to pick examples that you may be familiar with or have even worked with before. They are:

- Dataset and Data Files
- Images and Photographs
- One-Hot Encoding
- Linear Regression
- Regularization
- Principal Component Analysis
- Singular-Value Decomposition
- Latent Semantic Analysis
- Recommender Systems
- Deep Learning

Do you have your own favorite example of linear algebra in machine learning?

Let me know in the comments below.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In machine learning, you fit a model on a dataset.

This is the table-like set of numbers where each row represents an observation and each column represents a feature of the observation.

For example, below is a snippet of the Iris flowers dataset:

5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5.0,3.6,1.4,0.2,Iris-setosa

This data is in fact a matrix: a key data structure in linear algebra.

Further, when you split the data into inputs and outputs to fit a supervised machine learning model, such as the measurements and the flower species, you have a matrix (X) and a vector (y). The vector is another key data structure in linear algebra.

Each row has the same length, i.e. the same number of columns, therefore we can say that the data is vectorized where rows can be provided to a model one at a time or in a batch and the model can be pre-configured to expect rows of a fixed width.

Perhaps you are more used to working with images or photographs in computer vision applications.

Each image that you work with is itself a table structure with a width and height and one pixel value in each cell for black and white images or 3 pixel values in each cell for a color image.

A photo is yet another example of a matrix from linear algebra.

Operations on the image, such as cropping, scaling, shearing, and so on are all described using the notation and operations of linear algebra.

Sometimes you work with categorical data in machine learning.

Perhaps the class labels for classification problems, or perhaps categorical input variables.

It is common to encode categorical variables to make them easier to work with and learn by some techniques. A popular encoding for categorical variables is the one hot encoding.

A one hot encoding is where a table is created to represent the variable with one column for each category and a row for each example in the dataset. A check, or one-value, is added in the column for the categorical value for a given row, and a zero-value is added to all other columns.

For example, the color variable with the 3 rows:

red green blue ...

Might be encoded as:

red, green, blue 1, 0, 0 0, 1, 0 0, 0, 1 ...

Each row is encoded as a binary vector, a vector with zero or one values and this is an example of a sparse representation, a whole sub-field of linear algebra.

Linear regression is an old method from statistics for describing the relationships between variables.

It is often used in machine learning for predicting numerical values in simpler regression problems.

There are many ways to describe and solve the linear regression problem, i.e. finding a set of coefficients that when multiplied by each of the input variables and added together results in the best prediction of the output variable.

If you have used a machine learning tool or library, the most common way of solving linear regression is via a least squares optimization that is solved using matrix factorization methods from linear regression, such as an LU decomposition or a singular-value decomposition, or SVD.

Even the common way of summarizing the linear regression equation uses linear algebra notation:

y = A . b

Where y is the output variable A is the dataset and b are the model coefficients.

In applied machine learning, we often seek the simplest possible models that achieve the best skill on our problem.

Simpler models are often better at generalizing from specific examples to unseen data.

In many methods that involve coefficients, such as regression methods and artificial neural networks, simpler models are often characterized by models that have smaller coefficient values.

A technique that is often used to encourage a model to minimize the size of coefficients while it is being fit on data is called regularization. Common implementations include the L2 and L1 forms of regularization.

Both of these forms of regularization are in fact a measure of the magnitude or length of the coefficients as a vector and are methods lifted directly from linear algebra called the vector norm.

Often, a dataset has many columns, perhaps tens, hundreds, thousands, or more.

Modeling data with many features is challenging, and models built from data that include irrelevant features are often less skillful than models trained from the most relevant data.

It is hard to know which features of the data are relevant and which are not.

Methods for automatically reducing the number of columns of a dataset are called dimensionality reduction, and perhaps the most popular method is called the principal component analysis, or PCA for short.

This method is used in machine learning to create projections of high-dimensional data for both visualization and for training models.

The core of the PCA method is a matrix factorization method from linear algebra. The eigendecomposition can be used and more robust implementations may use the singular-value decomposition, or SVD.

Another popular dimensionality reduction method is the singular-value decomposition method, or SVD for short.

As mentioned, and as the name of the method suggests, it is a matrix factorization method from the field of linear algebra.

It has wide use in linear algebra and can be used directly in applications such as feature selection, visualization, noise reduction, and more.

We will see two more cases below of using the SVD in machine learning.

In the sub-field of machine learning for working with text data called natural language processing, it is common to represent documents as large matrices of word occurrences.

For example, the columns of the matrix may be the known words in the vocabulary and rows may be sentences, paragraphs, pages, or documents of text with cells in the matrix marked as the count or frequency of the number of times the word occurred.

This is a sparse matrix representation of the text. Matrix factorization methods, such as the singular-value decomposition can be applied to this sparse matrix, which has the effect of distilling the representation down to its most relevant essence. Documents processed in this way are much easier to compare, query, and use as the basis for a supervised machine learning model.

This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also known by the name Latent Semantic Indexing, or LSI.

Predictive modeling problems that involve the recommendation of products are called recommender systems, a sub-field of machine learning.

Examples include the recommendation of books based on previous purchases and purchases by customers like you on Amazon, and the recommendation of movies and TV shows to watch based on your viewing history and viewing history of subscribers like you on Netflix.

The development of recommender systems is primarily concerned with linear algebra methods. A simple example is in the calculation of the similarity between sparse customer behavior vectors using distance measures such as Euclidean distance or dot products.

Matrix factorization methods like the singular-value decomposition are used widely in recommender systems to distill item and user data to their essence for querying and searching and comparison.

Artificial neural networks are nonlinear machine learning algorithms that are inspired by elements of the information processing in the brain and have proven effective at a range of problems, not the least of which is predictive modeling.

Deep learning is the recent resurgence in the use of artificial neural networks with newer methods and faster hardware that allow for the development and training of larger and deeper (more layers) networks on very large datasets. Deep learning methods are routinely achieving state-of-the-art results on a range of challenging problems such as machine translation, photo captioning, speech recognition, and much more.

At their core, the execution of neural networks involves linear algebra data structures multiplied and added together. Scaled up to multiple dimensions, deep learning methods work with vectors, matrices, and even tensors of inputs and coefficients, where a tensor is a matrix with more than two dimensions.

Linear algebra is central to the description of deep learning methods via matrix notation to the implementation of deep learning methods such as Google’s TensorFlow Python library that has the word “tensor” in its name.

In this post, you discovered 10 common examples of machine learning that you may be familiar with that use and require linear algebra.

Specifically, you learned:

- The use of linear algebra structures when working with data such as tabular datasets and images.
- Linear algebra concepts when working with data preparation such as one hot encoding and dimensionality reduction.
- The ingrained use of linear algebra notation and methods in sub-fields such as deep learning, natural language processing, and recommender systems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 10 Examples of Linear Algebra in Machine Learning appeared first on Machine Learning Mastery.

]]>The post No Bullshit Guide To Linear Algebra Review appeared first on Machine Learning Mastery.

]]>Most are textbooks targeted at undergraduate students and are full of theoretical digressions that are barely relevant and mostly distracting to a beginner or practitioner to the field.

In this post, you will discover the book “No bullshit guide to linear algebra” that provides a gentle introduction to the field of linear algebra and assumes no prior mathematical knowledge.

After reading this post, you will know:

- About the goals and benefits of the book to a beginner or practitioner.
- The contents of the book and general topics presented in each chapter.
- A selected reading list targeted for machine learning practitioners looking to get up to speed fast.

Let’s get started.

The book provides an introduction to linear algebra, comparable to an undergraduate university course on the subject.

The key approach of the book is no crap and straight to the point. This means a laser focus on a given operation or technique and no (or few) detours or digressions.

The book was written by Ivan Savov, the second edition of which was released in 2017. Ivan has an undergraduate degree in electrical engineering and a Masters and Ph.D. in physics and has worked for the last 15 years as a private tutor for math and physics. He knows the subject and where students encounter difficulties.

What makes this an excellent book for the machine learning practitioner is that the book is self-contained. It does not assume any prior mathematics background and all prerequisite math, which is minimal, is covered in the first chapter titled “*Math fundamentals*.”

It is the perfect book if you have never studied linear algebra, or if you studied it in school decades ago and have forgotten practically everything.

Another aspect that makes this book great for machine learning practitioners is that it includes exercises.

Each section ends with a few pop-quiz style questions.

Each chapter ends with a problem set for you to work through.

Finally, Appendix A provides answers to all exercises in the book.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

This section provides a summary of the table of contents of the book.

**Math fundamentals**. Covers the prerequisite math topics required to start learning linear algebra. Topics include numbers, functions, trigonometry, complex numbers, and set notation.**Intro to linear algebra**. An introduction into vector and matrix algebra, the very foundation of linear algebra. Topics include vector and matrix operations and linearity.**Computational linear algebra**. This chapter covers the issues that you will encounter when you start to implement linear algebra and must deal with the operations at any kind of scale. Topics include matrix equations, matrix multiplication, and determinants. Some Python examples are given.**Geometric aspects of linear algebra**. Covers the geometric intuition for vector algebra, which is quite common. Topics include lines and planes, projections and vector spaces.**Linear transformations**. Covers the core fiber of linear algebra as Ivan describes it. Introduces linear transformations.**Theoretical linear algebra**. Covers the last steps of matrix algebra prior to applications. Covers topics such as matrix factorization methods, types of matrices, and more.**Applications**. This chapter covers an impressive list of applications of linear algebra to a range of domains from electronics, graphs, computer graphics, and more. An impressive chapter to make the methods learned throughout the book concrete.**Probability theory**. Provides a crash course on probability theory in the context of linear algebra including Markov chains and the PageRank algorithm.**Quantum mechanics**. Provides a crash course into quantum mechanics through the lens of linear algebra, a specialty area of the authors.

The book is excellent, and I recommend reading it from cover-to-cover, if you’re really into it.

But, as a machine learning practitioner, you do not need to read it all.

Below is a list of selected reading from the book that I recommend to get on top of linear algebra fast:

**Concept Maps**. Page v. A collection of mind-map type diagrams are provided directly after the table of contents that show how the concepts in the book, and, in fact, the concepts in the field of linear algebra, relate. If you are a visual thinker, these may help fit the pieces together.- Section 1.15,
**Vectors**. Page 69. Provides a terse introduction to vectors, prior to any vector algebra. Useful background. - Chapter 2,
**Intro to Linear Algebra**. Pages 101-130. Read this whole chapter. It covers:- Definitions of terms in linear algebra.
- Vector operations such as arithmetic and vector norm.
- Matrix operations such as arithmetic and dot product.
- Linearity and what exactly this key concept means in linear algebra
- Overview of how the different aspects of linear algebra (geometric, theory, etc.) relate.

- Section 3.2
**Matrix Equations**. Page 147. Includes explanations and clear diagrams for calculating matrix operations, not least the must-know matrix multiplication - Section 6.1
**Eigenvalues and eigenvectors**. Page 262. Provides an introduction to the eigendecomposition that is used as a key operation in methods such as the principal component analysis. - Section 6.2
**Special types of matrices**. Page 275. Provides an introduction to various different types of matrices such as diagonal, symmetric, orthogonal, and more. - Section 6.6
**Matrix Decompositions**. Page 295. An introduction matrix factorization methods, re-covering the eigendecomposition, but also covering the LU, QR, and Singular-Value decomposition. - Section 7.7
**Least squares approximate solutions**. Page 241. An introduction to the matrix formulation of least squares called linear least squares. - Appendix B,
**Notation**. A summary of math and linear algebra notation.

This section provides more resources on the topic if you are looking to go deeper.

- No Bullshit Guide To Linear Algebra on Amazon
- Mini Reference Publisher Homepage
- Ivan Savov on Twitter
- Linear algebra explained in four pages, 2013.

In this post, you discovered the book “No Bullshit Guide To Linear Algebra” that provides a gentle introduction to the field of linear algebra and assumes no prior mathematical knowledge.

Specifically, you learned:

- About the goals and benefits of the book to a beginner or practitioner.
- The contents of the book and general topics presented in each chapter.
- A selected reading list targeted for machine learning practitioners looking to get up to speed fast.

Have you read this book? What did you think?

Let me know in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post No Bullshit Guide To Linear Algebra Review appeared first on Machine Learning Mastery.

]]>The post How to Solve Linear Regression Using Linear Algebra appeared first on Machine Learning Mastery.

]]>It is a staple of statistics and is often considered a good introductory machine learning method. It is also a method that can be reformulated using matrix notation and solved using matrix operations.

In this tutorial, you will discover the matrix formulation of linear regression and how to solve it using direct and matrix factorization methods.

After completing this tutorial, you will know:

- Linear regression and the matrix reformulation with the normal equations.
- How to solve linear regression using a QR matrix decomposition.
- How to solve linear regression using SVD and the pseudoinverse.

Let’s get started.

This tutorial is divided into 6 parts; they are:

- Linear Regression
- Matrix Formulation of Linear Regression
- Linear Regression Dataset
- Solve Directly
- Solve via QR Decomposition
- Solve via Singular-Value Decomposition

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Linear regression is a method for modeling the relationship between two scalar values: the input variable x and the output variable y.

The model assumes that y is a linear function or a weighted sum of the input variable.

y = f(x)

Or, stated with the coefficients.

y = b0 + b1 . x1

The model can also be used to model an output variable given multiple input variables called multivariate linear regression (below, brackets were added for readability).

y = b0 + (b1 . x1) + (b2 . x2) + ...

The objective of creating a linear regression model is to find the values for the coefficient values (b) that minimize the error in the prediction of the output variable y.

Matrix Formulation of Linear Regression

Linear regression can be stated using Matrix notation; for example:

y = X . b

Or, without the dot notation.

y = Xb

Where X is the input data and each column is a data feature, b is a vector of coefficients and y is a vector of output variables for each row in X.

x11, x12, x13 X = (x21, x22, x23) x31, x32, x33 x41, x42, x43 b1 b = (b2) b3 y1 y = (y2) y3 y4

Reformulated, the problem becomes a system of linear equations where the b vector values are unknown. This type of system is referred to as overdetermined because there are more equations than there are unknowns, i.e. each coefficient is used on each row of data.

It is a challenging problem to solve analytically because there are multiple inconsistent solutions, e.g. multiple possible values for the coefficients. Further, all solutions will have some error because there is no line that will pass nearly through all points, therefore the approach to solving the equations must be able to handle that.

The way this is typically achieved is by finding a solution where the values for b in the model minimize the squared error. This is called linear least squares.

||X . b - y||^2 = sum i=1 to m ( sum j=1 to n Xij . bj - yi)^2

This formulation has a unique solution as long as the input columns are independent (e.g. uncorrelated).

We cannot always get the error e = b – Ax down to zero. When e is zero, x is an exact solution to Ax = b. When the length of e is as small as possible, xhat is a least squares solution.

— Page 219, Introduction to Linear Algebra, Fifth Edition, 2016.

In matrix notation, this problem is formulated using the so-named normal equation:

X^T . X . b = X^T . y

This can be re-arranged in order to specify the solution for b as:

b = (X^T . X)^-1 . X^T . y

This can be solved directly, although given the presence of the matrix inverse can be numerically challenging or unstable.

In order to explore the matrix formulation of linear regression, let’s first define a dataset as a context.

We will use a simple 2D dataset where the data is easy to visualize as a scatter plot and models are easy to visualize as a line that attempts to fit the data points.

The example below defines a 5×2 matrix dataset, splits it into X and y components, and plots the dataset as a scatter plot.

from numpy import array from matplotlib import pyplot data = array([ [0.05, 0.12], [0.18, 0.22], [0.31, 0.35], [0.42, 0.38], [0.5, 0.49], ]) print(data) X, y = data[:,0], data[:,1] X = X.reshape((len(X), 1)) # plot dataset pyplot.scatter(X, y) pyplot.show()

Running the example first prints the defined dataset.

[[ 0.05 0.12] [ 0.18 0.22] [ 0.31 0.35] [ 0.42 0.38] [ 0.5 0.49]]

A scatter plot of the dataset is then created showing that a straight line cannot fit this data exactly.

The first approach is to attempt to solve the regression problem directly.

That is, given X, what are the set of coefficients b that when multiplied by X will give y. As we saw in a previous section, the normal equations define how to calculate b directly.

b = (X^T . X)^-1 . X^T . y

This can be calculated directly in NumPy using the inv() function for calculating the matrix inverse.

b = inv(X.T.dot(X)).dot(X.T).dot(y)

Once the coefficients are calculated, we can use them to predict outcomes given X.

yhat = X.dot(b)

Putting this together with the dataset defined in the previous section, the complete example is listed below.

# solve directly from numpy import array from numpy.linalg import inv from matplotlib import pyplot data = array([ [0.05, 0.12], [0.18, 0.22], [0.31, 0.35], [0.42, 0.38], [0.5, 0.49], ]) X, y = data[:,0], data[:,1] X = X.reshape((len(X), 1)) # linear least squares b = inv(X.T.dot(X)).dot(X.T).dot(y) print(b) # predict using coefficients yhat = X.dot(b) # plot data and predictions pyplot.scatter(X, y) pyplot.plot(X, yhat, color='red') pyplot.show()

Running the example performs the calculation and prints the coefficient vector b.

[ 1.00233226]

A scatter plot of the dataset is then created with a line plot for the model, showing a reasonable fit to the data.

A problem with this approach is the matrix inverse that is both computationally expensive and numerically unstable. An alternative approach is to use a matrix decomposition to avoid this operation. We will look at two examples in the following sections.

The QR decomposition is an approach of breaking a matrix down into its constituent elements.

A = Q . R

Where A is the matrix that we wish to decompose, Q a matrix with the size m x m, and R is an upper triangle matrix with the size m x n.

The QR decomposition is a popular approach for solving the linear least squares equation.

Stepping over all of the derivation, the coefficients can be found using the Q and R elements as follows:

b = R^-1 . Q.T . y

The approach still involves a matrix inversion, but in this case only on the simpler R matrix.

The QR decomposition can be found using the qr() function in NumPy. The calculation of the coefficients in NumPy looks as follows:

# QR decomposition Q, R = qr(X) b = inv(R).dot(Q.T).dot(y)

Tying this together with the dataset, the complete example is listed below.

# least squares via QR decomposition from numpy import array from numpy.linalg import inv from numpy.linalg import qr from matplotlib import pyplot data = array([ [0.05, 0.12], [0.18, 0.22], [0.31, 0.35], [0.42, 0.38], [0.5, 0.49], ]) X, y = data[:,0], data[:,1] X = X.reshape((len(X), 1)) # QR decomposition Q, R = qr(X) b = inv(R).dot(Q.T).dot(y) print(b) # predict using coefficients yhat = X.dot(b) # plot data and predictions pyplot.scatter(X, y) pyplot.plot(X, yhat, color='red') pyplot.show()

Running the example first prints the coefficient solution and plots the data with the model.

[ 1.00233226]

The QR decomposition approach is more computationally efficient and more numerically stable than calculating the normal equation directly, but does not work for all data matrices.

The Singular-Value Decomposition, or SVD for short, is a matrix decomposition method like the QR decomposition.

X = U . Sigma . V^*

Where A is the real n x m matrix that we wish to decompose, U is a m x m matrix, Sigma (often represented by the uppercase Greek letter Sigma) is an m x n diagonal matrix, and V^* is the conjugate transpose of an n x n matrix where * is a superscript.

Unlike the QR decomposition, all matrices have an SVD decomposition. As a basis for solving the system of linear equations for linear regression, SVD is more stable and the preferred approach.

Once decomposed, the coefficients can be found by calculating the pseudoinverse of the input matrix X and multiplying that by the output vector y.

b = X^+ . y

Where the pseudoinverse is calculated as following:

X^+ = U . D^+ . V^T

Where X^+ is the pseudoinverse of X and the + is a superscript, D^+ is the pseudoinverse of the diagonal matrix Sigma and V^T is the transpose of V^*.

Matrix inversion is not defined for matrices that are not square. […] When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions.

— Page 46, Deep Learning, 2016.

We can get U and V from the SVD operation. D^+ can be calculated by creating a diagonal matrix from Sigma and calculating the reciprocal of each non-zero element in Sigma.

s11, 0, 0 Sigma = ( 0, s22, 0) 0, 0, s33 1/s11, 0, 0 D = ( 0, 1/s22, 0) 0, 0, 1/s33

We can calculate the SVD, then the pseudoinverse manually. Instead, NumPy provides the function pinv() that we can use directly.

The complete example is listed below.

# least squares via SVD with pseudoinverse from numpy import array from numpy.linalg import pinv from matplotlib import pyplot data = array([ [0.05, 0.12], [0.18, 0.22], [0.31, 0.35], [0.42, 0.38], [0.5, 0.49], ]) X, y = data[:,0], data[:,1] X = X.reshape((len(X), 1)) # calculate coefficients b = pinv(X).dot(y) print(b) # predict using coefficients yhat = X.dot(b) # plot data and predictions pyplot.scatter(X, y) pyplot.plot(X, yhat, color='red') pyplot.show()

Running the example prints the coefficient and plots the data with a red line showing the predictions from the model.

[ 1.00233226]

In fact, NumPy provides a function to replace these two steps in the lstsq() function that you can use directly.

This section lists some ideas for extending the tutorial that you may wish to explore.

- Implement linear regression using the built-in lstsq() NumPy function
- Test each linear regression on your own small contrived dataset.
- Load a tabular dataset and test each linear regression method and compare the results.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Section 7.7 Least squares approximate solutions. No Bullshit Guide To Linear Algebra, 2017.
- Section 4.3 Least Squares Approximations, Introduction to Linear Algebra, Fifth Edition, 2016.
- Lecture 11, Least Squares Problems, Numerical Linear Algebra, 1997.
- Chapter 5, Orthogonalization and Least Squares, Matrix Computations, 2012.
- Chapter 12, Singular-Value and Jordan Decompositions, Linear Algebra and Matrix Analysis for Statistics, 2014.
- Section 2.9 The Moore-Penrose Pseudoinverse, Deep Learning, 2016.
- Section 15.4 General Linear Least Squares, Numerical Recipes: The Art of Scientific Computing, Third Edition, 2007.

- numpy.linalg.inv() API
- numpy.linalg.qr() API
- numpy.linalg.svd() API
- numpy.diag() API
- numpy.linalg.pinv() API
- numpy.linalg.lstsq() API

- Linear regression on Wikipedia
- Least squares on Wikipedia
- Linear least squares (mathematics) on Wikipedia
- Overdetermined system on Wikipedia
- QR decomposition on Wikipedia
- Singular-value decomposition on Wikipedia
- Moore–Penrose inverse

In this tutorial, you discovered the matrix formulation of linear regression and how to solve it using direct and matrix factorization methods.

Specifically, you learned:

- Linear regression and the matrix reformulation with the normal equations.
- How to solve linear regression using a QR matrix decomposition.
- How to solve linear regression using SVD and the pseudoinverse.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Solve Linear Regression Using Linear Algebra appeared first on Machine Learning Mastery.

]]>The post How to Calculate Principal Component Analysis (PCA) from Scratch in Python appeared first on Machine Learning Mastery.

]]>It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.

In this tutorial, you will discover the Principal Component Analysis machine learning method for dimensionality reduction and how to implement it from scratch in Python.

After completing this tutorial, you will know:

- The procedure for calculating the Principal Component Analysis and how to choose principal components.
- How to calculate the Principal Component Analysis from scratch in NumPy.
- How to calculate the Principal Component Analysis for reuse on more data in scikit-learn.

Let’s get started.

**Update Apr/2018**: Fixed typo in the explaination of the sklearn PCA attributes. Thanks kris.

This tutorial is divided into 3 parts; they are:

- Principal Component Analysis
- Manually Calculate Principal Component Analysis
- Reusable Principal Component Analysis

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of data.

It can be thought of as a projection method where data with m-columns (features) is projected into a subspace with m or fewer columns, whilst retaining the essence of the original data.

The PCA method can be described and implemented using the tools of linear algebra.

PCA is an operation applied to a dataset, represented by an n x m matrix A that results in a projection of A which we will call B. Let’s walk through the steps of this operation.

a11, a12 A = (a21, a22) a31, a32 B = PCA(A)

The first step is to calculate the mean values of each column.

M = mean(A)

or

(a11 + a21 + a31) / 3 M(m11, m12) = (a12 + a22 + a32) / 3

Next, we need to center the values in each column by subtracting the mean column value.

C = A - M

The next step is to calculate the covariance matrix of the centered matrix C.

Correlation is a normalized measure of the amount and direction (positive or negative) that two columns change together. Covariance is a generalized and unnormalized version of correlation across multiple columns. A covariance matrix is a calculation of covariance of a given matrix with covariance scores for every column with every other column, including itself.

V = cov(C)

Finally, we calculate the eigendecomposition of the covariance matrix V. This results in a list of eigenvalues and a list of eigenvectors.

values, vectors = eig(V)

The eigenvectors represent the directions or components for the reduced subspace of B, whereas the eigenvalues represent the magnitudes for the directions. For more on this topic, see the post:

The eigenvectors can be sorted by the eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for A.

If all eigenvalues have a similar value, then we know that the existing representation may already be reasonably compressed or dense and that the projection may offer little. If there are eigenvalues close to zero, they represent components or axes of B that may be discarded.

A total of m or less components must be selected to comprise the chosen subspace. Ideally, we would select k eigenvectors, called principal components, that have the k largest eigenvalues.

B = select(values, vectors)

Other matrix decomposition methods can be used such as Singular-Value Decomposition, or SVD. As such, generally the values are referred to as singular values and the vectors of the subspace are referred to as principal components.

Once chosen, data can be projected into the subspace via matrix multiplication.

P = B^T . A

Where A is the original data that we wish to project, B^T is the transpose of the chosen principal components and P is the projection of A.

This is called the covariance method for calculating the PCA, although there are alternative ways to to calculate it.

There is no pca() function in NumPy, but we can easily calculate the Principal Component Analysis step-by-step using NumPy functions.

The example below defines a small 3×2 matrix, centers the data in the matrix, calculates the covariance matrix of the centered data, and then the eigendecomposition of the covariance matrix. The eigenvectors and eigenvalues are taken as the principal components and singular values and used to project the original data.

from numpy import array from numpy import mean from numpy import cov from numpy.linalg import eig # define a matrix A = array([[1, 2], [3, 4], [5, 6]]) print(A) # calculate the mean of each column M = mean(A.T, axis=1) print(M) # center columns by subtracting column means C = A - M print(C) # calculate covariance matrix of centered matrix V = cov(C.T) print(V) # eigendecomposition of covariance matrix values, vectors = eig(V) print(vectors) print(values) # project data P = vectors.T.dot(C.T) print(P.T)

Running the example first prints the original matrix, then the eigenvectors and eigenvalues of the centered covariance matrix, followed finally by the projection of the original matrix.

Interestingly, we can see that only the first eigenvector is required, suggesting that we could project our 3×2 matrix onto a 3×1 matrix with little loss.

[[1 2] [3 4] [5 6]] [[ 0.70710678 -0.70710678] [ 0.70710678 0.70710678]] [ 8. 0.] [[-2.82842712 0. ] [ 0. 0. ] [ 2.82842712 0. ]]

We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.

When creating the class, the number of components can be specified as a parameter.

The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function.

Once fit, the eigenvalues and principal components can be accessed on the PCA class via the *explained_variance_* and *components_* attributes.

The example below demonstrates using this class by first creating an instance, fitting it on a 3×2 matrix, accessing the values and vectors of the projection, and transforming the original data.

# Principal Component Analysis from numpy import array from sklearn.decomposition import PCA # define a matrix A = array([[1, 2], [3, 4], [5, 6]]) print(A) # create the PCA instance pca = PCA(2) # fit on data pca.fit(A) # access values and vectors print(pca.components_) print(pca.explained_variance_) # transform data B = pca.transform(A) print(B)

Running the example first prints the 3×2 data matrix, then the principal components and values, followed by the projection of the original matrix.

We can see, that with some very minor floating point rounding that we achieve the same principal components, singular values, and projection as in the previous example.

[[1 2] [3 4] [5 6]] [[ 0.70710678 0.70710678] [ 0.70710678 -0.70710678]] [ 8.00000000e+00 2.25080839e-33] [[ -2.82842712e+00 2.22044605e-16] [ 0.00000000e+00 0.00000000e+00] [ 2.82842712e+00 -2.22044605e-16]]

This section lists some ideas for extending the tutorial that you may wish to explore.

- Re-run the examples with your own small contrived matrix values.
- Load a dataset and calculate the PCA on it and compare the results from the two methods.
- Search for and locate 10 examples where PCA has been used in machine learning papers.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Section 7.3 Principal Component Analysis (PCA by the SVD), Introduction to Linear Algebra, Fifth Edition, 2016.
- Section 2.12 Example: Principal Components Analysis, Deep Learning, 2016.

- Principal Component Analysis with numpy, 2011.
- PCA and image compression with numpy, 2011.
- Implementing a Principal Component Analysis (PCA), 2014.

In this tutorial, you discovered the Principal Component Analysis machine learning method for dimensionality reduction.

Specifically, you learned:

- The procedure for calculating the Principal Component Analysis and how to choose principal components.
- How to calculate the Principal Component Analysis from scratch in NumPy.
- How to calculate the Principal Component Analysis for reuse on more data in scikit-learn.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Calculate Principal Component Analysis (PCA) from Scratch in Python appeared first on Machine Learning Mastery.

]]>