The post Face Recognition using Principal Component Analysis appeared first on Machine Learning Mastery.

]]>In this tutorial, we will see how we can build a primitive face recognition system with some simple linear algebra technique such as principal component analysis.

After completing this tutorial, you will know:

- The development of eigenface technique
- How to use principal component analysis to extract characteristic images from an image dataset
- How to express any image as a weighted sum of the characteristic images
- How to compare the similarity of images from the weight of principal components

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Image and Face Recognition
- Overview of Eigenface
- Implementing Eigenface

In computer, pictures are represented as a matrix of pixels, with each pixel a particular color coded in some numerical values. It is natural to ask if computer can read the picture and understand what it is, and if so, whether we can describe the logic using matrix mathematics. To be less ambitious, people try to limit the scope of this problem to identifying human faces. An early attempt for face recognition is to consider the matrix as a high dimensional detail and we infer a lower dimension information vector from it, then try to recognize the person in lower dimension. It was necessary in the old time because the computer was not powerful and the amount of memory is very limited. However, by exploring how to **compress** image to a much smaller size, we developed a skill to compare if two images are portraying the same human face even if the pictures are not identical.

In 1987, a paper by Sirovich and Kirby considered the idea that all pictures of human face to be a weighted sum of a few “key pictures”. Sirovich and Kirby called these key pictures the “eigenpictures”, as they are the eigenvectors of the covariance matrix of the mean-subtracted pictures of human faces. In the paper they indeed provided the algorithm of principal component analysis of the face picture dataset in its matrix form. And the weights used in the weighted sum indeed correspond to the projection of the face picture into each eigenpicture.

In 1991, a paper by Turk and Pentland coined the term “eigenface”. They built on top of the idea of Sirovich and Kirby and use the weights and eigenpictures as characteristic features to recognize faces. The paper by Turk and Pentland laid out a memory-efficient way to compute the eigenpictures. It also proposed an algorithm on how the face recognition system can operate, including how to update the system to include new faces and how to combine it with a video capture system. The same paper also pointed out that the concept of eigenface can help reconstruction of partially obstructed picture.

Before we jump into the code, let’s outline the steps in using eigenface for face recognition, and point out how some simple linear algebra technique can help the task.

Assume we have a bunch of pictures of human faces, all in the same pixel dimension (e.g., all are r×c grayscale images). If we get M different pictures and **vectorize** each picture into L=r×c pixels, we can present the entire dataset as a L×M matrix (let’s call it matrix $A$), where each element in the matrix is the pixel’s grayscale value.

Recall that principal component analysis (PCA) can be applied to any matrix, and the result is a number of vectors called the **principal components**. Each principal component has the length same as the column length of the matrix. The different principal components from the same matrix are orthogonal to each other, meaning that the vector dot-product of any two of them is zero. Therefore the various principal components constructed a vector space for which each column in the matrix can be represented as a linear combination (i.e., weighted sum) of the principal components.

The way it is done is to first take $C=A – a$ where $a$ is the mean vector of the matrix $A$. So $C$ is the matrix that subtract each column of $A$ with the mean vector $a$. Then the covariance matrix is

$$S = C\cdot C^T$$

from which we find its eigenvectors and eigenvalues. The principal components are these eigenvectors in decreasing order of the eigenvalues. Because matrix $S$ is a L×L matrix, we may consider to find the eigenvectors of a M×M matrix $C^T\cdot C$ instead as the eigenvector $v$ for $C^T\cdot C$ can be transformed into eigenvector $u$ of $C\cdot C^T$ by $u=C\cdot v$, except we usually prefer to write $u$ as normalized vector (i.e., norm of $u$ is 1).

The physical meaning of the principal component vectors of $A$, or equivalently the eigenvectors of $S=C\cdot C^T$, is that they are the key directions that we can construct the columns of matrix $A$. The relative importance of the different principal component vectors can be inferred from the corresponding eigenvalues. The greater the eigenvalue, the more useful (i.e., holds more information about $A$) the principal component vector. Hence we can keep only the first K principal component vectors. If matrix $A$ is the dataset for face pictures, the first K principal component vectors are the top K most important “face pictures”. We call them the **eigenface** picture.

For any given face picture, we can project its mean-subtracted version onto the eigenface picture using vector dot-product. The result is how close this face picture is related to the eigenface. If the face picture is totally unrelated to the eigenface, we would expect its result is zero. For the K eigenfaces, we can find K dot-product for any given face picture. We can present the result as **weights** of this face picture with respect to the eigenfaces. The weight is usually presented as a vector.

Conversely, if we have a weight vector, we can add up each eigenfaces subjected to the weight and reconstruct a new face. Let’s denote the eigenfaces as matrix $F$, which is a L×K matrix, and the weight vector $w$ is a column vector. Then for any $w$ we can construct the picture of a face as

$$z=F\cdot w$$

which $z$ is resulted as a column vector of length L. Because we are only using the top K principal component vectors, we should expect the resulting face picture is distorted but retained some facial characteristic.

Since the eigenface matrix is constant for the dataset, a varying weight vector $w$ means a varying face picture. Therefore we can expect the pictures of the same person would provide similar weight vectors, even if the pictures are not identical. As a result, we may make use of the distance between two weight vectors (such as the L2-norm) as a metric of how two pictures resemble.

Now we attempt to implement the idea of eigenface with numpy and scikit-learn. We will also make use of OpenCV to read picture files. You may need to install the relevant package with `pip`

command:

pip install opencv-python

The dataset we use are the ORL Database of Faces, which is quite of age but we can download it from Kaggle:

The file is a zip file of around 4MB. It has pictures of 40 persons and each person has 10 pictures. Total to 400 pictures. In the following we assumed the file is downloaded to the local directory and named as `attface.zip`

.

We may extract the zip file to get the pictures, or we can also make use of the `zipfile`

package in Python to read the contents from the zip file directly:

import cv2 import zipfile import numpy as np faces = {} with zipfile.ZipFile("attface.zip") as facezip: for filename in facezip.namelist(): if not filename.endswith(".pgm"): continue # not a face picture with facezip.open(filename) as image: # If we extracted files from zip, we can use cv2.imread(filename) instead faces[filename] = cv2.imdecode(np.frombuffer(image.read(), np.uint8), cv2.IMREAD_GRAYSCALE)

The above is to read every PGM file in the zip. PGM is a grayscale image file format. We extract each PGM file into a byte string through `image.read()`

and convert it into a numpy array of bytes. Then we use OpenCV to decode the byte string into an array of pixels using `cv2.imdecode()`

. The file format will be detected automatically by OpenCV. We save each picture into a Python dictionary `faces`

for later use.

Here we can take a look on these picture of human faces, using matplotlib:

... import matplotlib.pyplot as plt fig, axes = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,10)) faceimages = list(faces.values())[-16:] # take last 16 images for i in range(16): axes[i%4][i//4].imshow(faceimages[i], cmap="gray") plt.show()

We can also find the pixel size of each picture:

... faceshape = list(faces.values())[0].shape print("Face image shape:", faceshape)

Face image shape: (112, 92)

The pictures of faces are identified by their file name in the Python dictionary. We can take a peek on the filenames:

... print(list(faces.keys())[:5])

['s1/1.pgm', 's1/10.pgm', 's1/2.pgm', 's1/3.pgm', 's1/4.pgm']

and therefore we can put faces of the same person into the same class. There are 40 classes and totally 400 pictures:

... classes = set(filename.split("/")[0] for filename in faces.keys()) print("Number of classes:", len(classes)) print("Number of pictures:", len(faces))

Number of classes: 40 Number of pictures: 400

To illustrate the capability of using eigenface for recognition, we want to hold out some of the pictures before we generate our eigenfaces. We hold out all the pictures of one person as well as one picture for another person as our test set. The remaining pictures are vectorized and converted into a 2D numpy array:

... # Take classes 1-39 for eigenfaces, keep entire class 40 and # image 10 of class 39 as out-of-sample test facematrix = [] facelabel = [] for key,val in faces.items(): if key.startswith("s40/"): continue # this is our test set if key == "s39/10.pgm": continue # this is our test set facematrix.append(val.flatten()) facelabel.append(key.split("/")[0]) # Create facematrix as (n_samples,n_pixels) matrix facematrix = np.array(facematrix)

Now we can perform principal component analysis on this dataset matrix. Instead of computing the PCA step by step, we make use of the PCA function in scikit-learn, which we can easily retrieve all results we needed:

... # Apply PCA to extract eigenfaces from sklearn.decomposition import PCA pca = PCA().fit(facematrix)

We can identify how significant is each principal component from the explained variance ratio:

... print(pca.explained_variance_ratio_)

[1.77824822e-01 1.29057925e-01 6.67093882e-02 5.63561346e-02 5.13040312e-02 3.39156477e-02 2.47893586e-02 2.27967054e-02 1.95632067e-02 1.82678428e-02 1.45655853e-02 1.38626271e-02 1.13318896e-02 1.07267786e-02 9.68365599e-03 9.17860717e-03 8.60995215e-03 8.21053028e-03 7.36580634e-03 7.01112888e-03 6.69450840e-03 6.40327943e-03 5.98295099e-03 5.49298705e-03 5.36083980e-03 4.99408106e-03 4.84854321e-03 4.77687371e-03 ... 1.12203331e-04 1.11102187e-04 1.08901471e-04 1.06750318e-04 1.05732991e-04 1.01913786e-04 9.98164783e-05 9.85530209e-05 9.51582720e-05 8.95603083e-05 8.71638147e-05 8.44340263e-05 7.95894118e-05 7.77912922e-05 7.06467912e-05 6.77447444e-05 2.21225931e-32]

or we can simply make up a moderate number, say, 50, and consider these many principal component vectors as the eigenface. For convenience, we extract the eigenface from PCA result and store it as a numpy array. Note that the eigenfaces are stored as rows in a matrix. We can convert it back to 2D if we want to display it. In below, we show some of the eigenfaces to see how they look like:

... # Take the first K principal components as eigenfaces n_components = 50 eigenfaces = pca.components_[:n_components] # Show the first 16 eigenfaces fig, axes = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,10)) for i in range(16): axes[i%4][i//4].imshow(eigenfaces[i].reshape(faceshape), cmap="gray") plt.show()

From this picture, we can see eigenfaces are blurry faces, but indeed each eigenfaces holds some facial characteristics that can be used to build a picture.

Since our goal is to build a face recognition system, we first calculate the weight vector for each input picture:

... # Generate weights as a KxN matrix where K is the number of eigenfaces and N the number of samples weights = eigenfaces @ (facematrix - pca.mean_).T

The above code is using matrix multiplication to replace loops. It is roughly equivalent to the following:

... weights = [] for i in range(facematrix.shape[0]): weight = [] for j in range(n_components): w = eigenfaces[j] @ (facematrix[i] - pca.mean_) weight.append(w) weights.append(weight)

Up to here, our face recognition system has been completed. We used pictures of 39 persons to build our eigenface. We use the test picture that belongs to one of these 39 persons (the one held out from the matrix that trained the PCA model) to see if it can successfully recognize the face:

... # Test on out-of-sample image of existing class query = faces["s39/10.pgm"].reshape(1,-1) query_weight = eigenfaces @ (query - pca.mean_).T euclidean_distance = np.linalg.norm(weights - query_weight, axis=0) best_match = np.argmin(euclidean_distance) print("Best match %s with Euclidean distance %f" % (facelabel[best_match], euclidean_distance[best_match])) # Visualize fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(query.reshape(faceshape), cmap="gray") axes[0].set_title("Query") axes[1].imshow(facematrix[best_match].reshape(faceshape), cmap="gray") axes[1].set_title("Best match") plt.show()

Above, we first subtract the vectorized image by the average vector that retrieved from the PCA result. Then we compute the projection of this mean-subtracted vector to each eigenface and take it as the weight for this picture. Afterwards, we compare the weight vector of the picture in question to that of each existing picture and find the one with the smallest L2 distance as the best match. We can see that it indeed can successfully find the closest match in the same class:

Best match s39 with Euclidean distance 1559.997137

and we can visualize the result by comparing the closest match side by side:

We can try again with the picture of the 40th person that we held out from the PCA. We would never get it correct because it is a new person to our model. However, we want to see how wrong it can be as well as the value in the distance metric:

... # Test on out-of-sample image of new class query = faces["s40/1.pgm"].reshape(1,-1) query_weight = eigenfaces @ (query - pca.mean_).T euclidean_distance = np.linalg.norm(weights - query_weight, axis=0) best_match = np.argmin(euclidean_distance) print("Best match %s with Euclidean distance %f" % (facelabel[best_match], euclidean_distance[best_match])) # Visualize fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(query.reshape(faceshape), cmap="gray") axes[0].set_title("Query") axes[1].imshow(facematrix[best_match].reshape(faceshape), cmap="gray") axes[1].set_title("Best match") plt.show()

We can see that it’s best match has a greater L2 distance:

Best match s5 with Euclidean distance 2690.209330

but we can see that the mistaken result has some resemblance to the picture in question:

In the paper by Turk and Petland, it is suggested that we set up a threshold for the L2 distance. If the best match’s distance is less than the threshold, we would consider the face is recognized to be the same person. If the distance is above the threshold, we claim the picture is someone we never saw even if a best match can be find numerically. In this case, we may consider to include this as a new person into our model by remembering this new weight vector.

Actually, we can do one step further, to generate new faces using eigenfaces, but the result is not very realistic. In below, we generate one using random weight vector and show it side by side with the “average face”:

... # Visualize the mean face and random face fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(pca.mean_.reshape(faceshape), cmap="gray") axes[0].set_title("Mean face") random_weights = np.random.randn(n_components) * weights.std() newface = random_weights @ eigenfaces + pca.mean_ axes[1].imshow(newface.reshape(faceshape), cmap="gray") axes[1].set_title("Random face") plt.show()

How good is eigenface? It is surprisingly overachieved for the simplicity of the model. However, Turk and Pentland tested it with various conditions. It found that its accuracy was “an average of 96% with light variation, 85% with orientation variation, and 64% with size variation.” Hence it may not be very practical as a face recognition system. After all, the picture as a matrix will be distorted a lot in the principal component domain after zoom-in and zoom-out. Therefore the modern alternative is to use convolution neural network, which is more tolerant to various transformations.

Putting everything together, the following is the complete code:

import zipfile import cv2 import numpy as np import matplotlib.pyplot as plt from sklearn.decomposition import PCA # Read face image from zip file on the fly faces = {} with zipfile.ZipFile("attface.zip") as facezip: for filename in facezip.namelist(): if not filename.endswith(".pgm"): continue # not a face picture with facezip.open(filename) as image: # If we extracted files from zip, we can use cv2.imread(filename) instead faces[filename] = cv2.imdecode(np.frombuffer(image.read(), np.uint8), cv2.IMREAD_GRAYSCALE) # Show sample faces using matplotlib fig, axes = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,10)) faceimages = list(faces.values())[-16:] # take last 16 images for i in range(16): axes[i%4][i//4].imshow(faceimages[i], cmap="gray") print("Showing sample faces") plt.show() # Print some details faceshape = list(faces.values())[0].shape print("Face image shape:", faceshape) classes = set(filename.split("/")[0] for filename in faces.keys()) print("Number of classes:", len(classes)) print("Number of images:", len(faces)) # Take classes 1-39 for eigenfaces, keep entire class 40 and # image 10 of class 39 as out-of-sample test facematrix = [] facelabel = [] for key,val in faces.items(): if key.startswith("s40/"): continue # this is our test set if key == "s39/10.pgm": continue # this is our test set facematrix.append(val.flatten()) facelabel.append(key.split("/")[0]) # Create a NxM matrix with N images and M pixels per image facematrix = np.array(facematrix) # Apply PCA and take first K principal components as eigenfaces pca = PCA().fit(facematrix) n_components = 50 eigenfaces = pca.components_[:n_components] # Show the first 16 eigenfaces fig, axes = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,10)) for i in range(16): axes[i%4][i//4].imshow(eigenfaces[i].reshape(faceshape), cmap="gray") print("Showing the eigenfaces") plt.show() # Generate weights as a KxN matrix where K is the number of eigenfaces and N the number of samples weights = eigenfaces @ (facematrix - pca.mean_).T print("Shape of the weight matrix:", weights.shape) # Test on out-of-sample image of existing class query = faces["s39/10.pgm"].reshape(1,-1) query_weight = eigenfaces @ (query - pca.mean_).T euclidean_distance = np.linalg.norm(weights - query_weight, axis=0) best_match = np.argmin(euclidean_distance) print("Best match %s with Euclidean distance %f" % (facelabel[best_match], euclidean_distance[best_match])) # Visualize fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(query.reshape(faceshape), cmap="gray") axes[0].set_title("Query") axes[1].imshow(facematrix[best_match].reshape(faceshape), cmap="gray") axes[1].set_title("Best match") plt.show() # Test on out-of-sample image of new class query = faces["s40/1.pgm"].reshape(1,-1) query_weight = eigenfaces @ (query - pca.mean_).T euclidean_distance = np.linalg.norm(weights - query_weight, axis=0) best_match = np.argmin(euclidean_distance) print("Best match %s with Euclidean distance %f" % (facelabel[best_match], euclidean_distance[best_match])) # Visualize fig, axes = plt.subplots(1,2,sharex=True,sharey=True,figsize=(8,6)) axes[0].imshow(query.reshape(faceshape), cmap="gray") axes[0].set_title("Query") axes[1].imshow(facematrix[best_match].reshape(faceshape), cmap="gray") axes[1].set_title("Best match") plt.show()

This section provides more resources on the topic if you are looking to go deeper.

- L. Sirovich and M. Kirby (1987). “Low-dimensional procedure for the characterization of human faces“.
*Journal of the Optical Society of America**A*. 4(3): 519–524. - M. Turk and A. Pentland (1991). “Eigenfaces for recognition“.
*Journal of Cognitive Neuroscience*. 3(1): 71–86.

- Introduction to Linear Algebra, Fifth Edition, 2016.

In this tutorial, you discovered how to build a face recognition system using eigenface, which is derived from principal component analysis.

Specifically, you learned:

- How to extract characteristic images from the image dataset using principal component analysis
- How to use the set of characteristic images to create a weight vector for any seen or unseen images
- How to use the weight vectors of different images to measure for their similarity, and apply this technique to face recognition
- How to generate a new random image from the characteristic images

The post Face Recognition using Principal Component Analysis appeared first on Machine Learning Mastery.

]]>The post Using Singular Value Decomposition to Build a Recommender System appeared first on Machine Learning Mastery.

]]>In this tutorial, we will see how a recommender system can be build just using linear algebra techniques.

After completing this tutorial, you will know:

- What has singular value decomposition done to a matrix
- How to interpret the result of singular value decomposition
- What data a single recommender system require, and how we can make use of SVD to analyze it
- How we can make use of the result from SVD to make recommendations

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Review of Singular Value Decomposition
- The Meaning of Singular Value Decomposition in Recommender System
- Implementing a Recommender System

Just like a number such as 24 can be decomposed as factors 24=2×3×4, a matrix can also be expressed as multiplication of some other matrices. Because matrices are arrays of numbers, they have their own rules of multiplication. Consequently, they have different ways of factorization, or known as **decomposition**. QR decomposition or LU decomposition are common examples. Another example is **singular value decomposition**, which has no restriction to the shape or properties of the matrix to be decomposed.

Singular value decomposition assumes a matrix $M$ (for example, a $m\times n$ matrix) is decomposed as

$$

M = U\cdot \Sigma \cdot V^T

$$

where $U$ is a $m\times m$ matrix, $\Sigma$ is a diagonal matrix of $m\times n$, and $V^T$ is a $n\times n$ matrix. The diagonal matrix $\Sigma$ is an interesting one, which it can be non-square but only the entries on the diagonal could be non-zero. The matrices $U$ and $V^T$ are **orthonormal** matrices. Meaning the columns of $U$ or rows of $V$ are (1) orthogonal to each other and are (2) unit vectors. Vectors are orthogonal to each other if any two vectors’ dot product is zero. A vector is unit vector if its L2-norm is 1. Orthonormal matrix has the property that its transpose is its inverse. In other words, since $U$ is an orthonormal matrix, $U^T = U^{-1}$ or $U\cdot U^T=U^T\cdot U=I$, where $I$ is the identity matrix.

Singular value decomposition gets its name from the diagonal entries on $\Sigma$, which are called the singular values of matrix $M$. They are in fact, the square root of the eigenvalues of matrix $M\cdot M^T$. Just like a number factorized into primes, the singular value decomposition of a matrix reveals a lot about the structure of that matrix.

But actually what described above is called the **full SVD**. There is another version called **reduced SVD** or **compact SVD**. We still have to write $M = U\cdot\Sigma\cdot V^T$ but we have $\Sigma$ a $r\times r$ square diagonal matrix with $r$ the **rank** of matrix $M$, which is usually less than or equal to the smaller of $m$ and $n$. The matrix $U$ is than a $m\times r$ matrix and $V^T$ is a $r\times n$ matrix. Because matrices $U$ and $V^T$ are non-square, they are called **semi-orthonormal**, meaning $U^T\cdot U=I$ and $V^T\cdot V=I$, with $I$ in both case a $r\times r$ identity matrix.

If the matrix $M$ is rank $r$, than we can prove that the matrices $M\cdot M^T$ and $M^T\cdot M$ are both rank $r$. In singular value decomposition (the reduced SVD), the columns of matrix $U$ are eigenvectors of $M\cdot M^T$ and the rows of matrix $V^T$ are eigenvectors of $M^T\cdot M$. What’s interesting is that $M\cdot M^T$ and $M^T\cdot M$ are potentially in different size (because matrix $M$ can be non-square shape), but they have the same set of eigenvalues, which are the square of values on the diagonal of $\Sigma$.

This is why the result of singular value decomposition can reveal a lot about the matrix $M$.

Imagine we collected some book reviews such that books are columns and people are rows, and the entries are the ratings that a person gave to a book. In that case, $M\cdot M^T$ would be a table of person-to-person which the entries would mean the sum of the ratings one person gave match with another one. Similarly $M^T\cdot M$ would be a table of book-to-book which entries are the sum of the ratings received match with that received by another book. What can be the hidden connection between people and books? That could be the genre, or the author, or something of similar nature.

Let’s see how we can make use of the result from SVD to build a recommender system. Firstly, let’s download the dataset from this link (caution: it is 600MB big)

This dataset is the “Social Recommendation Data” from “Recommender Systems and Personalization Datasets“. It contains the reviews given by users on books on Librarything. What we are interested are the number of “stars” a user given to a book.

If we open up this tar file we will see a large file named “reviews.json”. We can extract it, or read the included file on the fly. First three lines of reviews.json are shown below:

import tarfile # Read downloaded file from: # http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz with tarfile.open("lthing_data.tar.gz") as tar: print("Files in tar archive:") tar.list() with tar.extractfile("lthing_data/reviews.json") as file: count = 0 for line in file: print(line) count += 1 if count > 3: break

The above will print:

Files in tar archive: ?rwxr-xr-x julian/julian 0 2016-09-30 17:58:55 lthing_data/ ?rw-r--r-- julian/julian 4824989 2014-01-02 13:55:12 lthing_data/edges.txt ?rw-rw-r-- julian/julian 1604368260 2016-09-30 17:58:25 lthing_data/reviews.json b"{'work': '3206242', 'flags': [], 'unixtime': 1194393600, 'stars': 5.0, 'nhelpful': 0, 'time': 'Nov 7, 2007', 'comment': 'This a great book for young readers to be introduced to the world of Middle Earth. ', 'user': 'van_stef'}\n" b"{'work': '12198649', 'flags': [], 'unixtime': 1333756800, 'stars': 5.0, 'nhelpful': 0, 'time': 'Apr 7, 2012', 'comment': 'Help Wanted: Tales of On The Job Terror from Evil Jester Press is a fun and scary read. This book is edited by Peter Giglio and has short stories by Joe McKinney, Gary Brandner, Henry Snider and many more. As if work wasnt already scary enough, this book gives you more reasons to be scared. Help Wanted is an excellent anthology that includes some great stories by some master storytellers.\\nOne of the stories includes Agnes: A Love Story by David C. Hayes, which tells the tale of a lawyer named Jack who feels unappreciated at work and by his wife so he starts a relationship with a photocopier. They get along well until the photocopier starts wanting the lawyer to kill for it. The thing I liked about this story was how the author makes you feel sorry for Jack. His two co-workers are happily married and love their jobs while Jack is married to a paranoid alcoholic and he hates and works at a job he cant stand. You completely understand how he can fall in love with a copier because he is a lonely soul that no one understands except the copier of course.\\nAnother story in Help Wanted is Work Life Balance by Jeff Strand. In this story a man works for a company that starts to let their employees do what they want at work. It starts with letting them come to work a little later than usual, then the employees are allowed to hug and kiss on the job. Things get really out of hand though when the company starts letting employees carry knives and stab each other, as long as it doesnt interfere with their job. This story is meant to be more funny then scary but still has its scary moments. Jeff Strand does a great job mixing humor and horror in this story.\\nAnother good story in Help Wanted: On The Job Terror is The Chapel Of Unrest by Stephen Volk. This is a gothic horror story that takes place in the 1800s and has to deal with an undertaker who has the duty of capturing and embalming a ghoul who has been eating dead bodies in a graveyard. Stephen Volk through his use of imagery in describing the graveyard, the chapel and the clothes of the time, transports you into an 1800s gothic setting that reminded me of Bram Stokers Dracula.\\nOne more story in this anthology that I have to mention is Expulsion by Eric Shapiro which tells the tale of a mad man going into a office to kill his fellow employees. This is a very short but very powerful story that gets you into the mind of a disgruntled employee but manages to end on a positive note. Though there were stories I didnt like in Help Wanted, all in all its a very good anthology. I highly recommend this book ', 'user': 'dwatson2'}\n" b"{'work': '12533765', 'flags': [], 'unixtime': 1352937600, 'nhelpful': 0, 'time': 'Nov 15, 2012', 'comment': 'Magoon, K. (2012). Fire in the streets. New York: Simon and Schuster/Aladdin. 336 pp. ISBN: 978-1-4424-2230-8. (Hardcover); $16.99.\\nKekla Magoon is an author to watch (http://www.spicyreads.org/Author_Videos.html- scroll down). One of my favorite books from 2007 is Magoons The Rock and the River. At the time, I mentioned in reviews that we have very few books that even mention the Black Panther Party, let alone deal with them in a careful, thorough way. Fire in the Streets continues the story Magoon began in her debut book. While her familys financial fortunes drip away, not helped by her mothers drinking and assortment of boyfriends, the Panthers provide a very real respite for Maxie. Sam is still dealing with the death of his brother. Maxies relationship with Sam only serves to confuse and upset them both. Her friends, Emmalee and Patrice, are slowly drifting away. The Panther Party is the only thing that seems to make sense and she basks in its routine and consistency. She longs to become a full member of the Panthers and constantly battles with her Panther brother Raheem over her maturity and ability to do more than office tasks. Maxie wants to have her own gun. When Maxie discovers that there is someone working with the Panthers that is leaking information to the government about Panther activity, Maxie investigates. Someone is attempting to destroy the only place that offers her shelter. Maxie is determined to discover the identity of the traitor, thinking that this will prove her worth to the organization. However, the truth is not simple and it is filled with pain. Unfortunately we still do not have many teen books that deal substantially with the Democratic National Convention in Chicago, the Black Panther Party, and the social problems in Chicago that lead to the civil unrest. Thankfully, Fire in the Streets lives up to the standard Magoon set with The Rock and the River. Readers will feel like they have stepped back in time. Magoons factual tidbits add journalistic realism to the story and only improves the atmosphere. Maxie has spunk. Readers will empathize with her Atlas-task of trying to hold onto her world. Fire in the Streets belongs in all middle school and high school libraries. While readers are able to read this story independently of The Rock and the River, I strongly urge readers to read both and in order. Magoons recognition by the Coretta Scott King committee and the NAACP Image awards are NOT mistakes!', 'user': 'edspicer'}\n" b'{\'work\': \'12981302\', \'flags\': [], \'unixtime\': 1364515200, \'stars\': 4.0, \'nhelpful\': 0, \'time\': \'Mar 29, 2013\', \'comment\': "Well, I definitely liked this book better than the last in the series. There was less fighting and more story. I liked both Toni and Ricky Lee and thought they were pretty good together. The banter between the two was sweet and often times funny. I enjoyed seeing some of the past characters and of course it\'s always nice to be introduced to new ones. I just wonder how many more of these books there will be. At least two hopefully, one each for Rory and Reece. ", \'user\': \'amdrane2\'}\n'

Each line in reviews.json is a record. We are going to extract the “user”, “work”, and “stars” field of each record as long as there are no missing data among these three. Despite the name, the records are not well-formed JSON strings (most notably it uses single quote rather than double quote). Therefore, we cannot use `json`

package from Python but to use `ast`

to decode such string:

... import ast reviews = [] with tarfile.open("lthing_data.tar.gz") as tar: with tar.extractfile("lthing_data/reviews.json") as file: for line in file: record = ast.literal_eval(line.decode("utf8")) if any(x not in record for x in ['user', 'work', 'stars']): continue reviews.append([record['user'], record['work'], record['stars']]) print(len(reviews), "records retrieved")

1387209 records retrieved

Now we should make a matrix of how different users rate each book. We make use of the pandas library to help convert the data we collected into a table:

... import pandas as pd reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"]) print(reviews.head())

user work stars 0 van_stef 3206242 5.0 1 dwatson2 12198649 5.0 2 amdrane2 12981302 4.0 3 Lila_Gustavus 5231009 3.0 4 skinglist 184318 2.0

As an example, we try not to use all data in order to save time and memory. Here we consider only those users who reviewed more than 50 books and also those books who are reviewed by more than 50 users. This way, we trimmed our dataset to less than 15% of its original size:

... # Look for the users who reviewed more than 50 books usercount = reviews[["work","user"]].groupby("user").count() usercount = usercount[usercount["work"] >= 50] print(usercount.head())

work user 84 -Eva- 602 06nwingert 370 1983mk 63 1dragones 194

... # Look for the books who reviewed by more than 50 users workcount = reviews[["work","user"]].groupby("work").count() workcount = workcount[workcount["user"] >= 50] print(workcount.head())

user work 10000 106 10001 53 1000167 186 10001797 53 10005525 134

... # Keep only the popular books and active users reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)] print(reviews)

user work stars 0 van_stef 3206242 5.0 6 justine 3067 4.5 18 stephmo 1594925 4.0 19 Eyejaybee 2849559 5.0 35 LisaMaria_C 452949 4.5 ... ... ... ... 1387161 connie53 1653 4.0 1387177 BruderBane 24623 4.5 1387192 StuartAston 8282225 4.0 1387202 danielx 9759186 4.0 1387206 jclark88 8253945 3.0 [205110 rows x 3 columns]

Then we can make use of “pivot table” function in pandas to convert this into a matrix:

... reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0)

The result is a matrix of 5593 rows and 2898 columns

Here we represented 5593 users and 2898 books in a matrix. Then we apply the SVD (this will take a while):

... from numpy.linalg import svd matrix = reviewmatrix.values u, s, vh = svd(matrix, full_matrices=False)

By default, the `svd()`

returns a full singular value decomposition. We choose a reduced version so we can use smaller matrices to save memory. The columns of `vh`

correspond to the books. We can based on vector space model to find which book are most similar to the one we are looking at:

... import numpy as np def cosine_similarity(v,u): return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u)) highest_similarity = -np.inf highest_sim_col = -1 for col in range(1,vh.shape[1]): similarity = cosine_similarity(vh[:,0], vh[:,col]) if similarity > highest_similarity: highest_similarity = similarity highest_sim_col = col print("Column %d is most similar to column 0" % highest_sim_col)

And in the above example, we try to find the book that is best match to to first column. The result is:

Column 906 is most similar to column 0

In a recommendation system, when a user picked a book, we may show her a few other books that are similar to the one she picked based on the cosine distance as calculated above.

Depends on the dataset, we may use truncated SVD to reduce the dimension of matrix `vh`

. In essence, this means we are removing several rows on `vh`

that the corresponding singular values in `s`

are small, before we use it to compute the similarity. This would likely make the prediction more accurate as those less important features of a book are removed from consideration.

Note that, in the decomposition $M=U\cdot\Sigma\cdot V^T$ we know the rows of $U$ are the users and columns of $V^T$ are books, we cannot identify what are the meanings of the columns of $U$ or rows of $V^T$ (an equivalently, that of $\Sigma$). We know they could be genres, for example, that provide some underlying connections between the users and the books but we cannot be sure what exactly are they. However, this does not stop us from using them as **features** in our recommendation system.

Tying all together, the following is the complete code:

import tarfile import ast import pandas as pd import numpy as np # Read downloaded file from: # http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz with tarfile.open("lthing_data.tar.gz") as tar: print("Files in tar archive:") tar.list() print("\nSample records:") with tar.extractfile("lthing_data/reviews.json") as file: count = 0 for line in file: print(line) count += 1 if count > 3: break # Collect records reviews = [] with tarfile.open("lthing_data.tar.gz") as tar: with tar.extractfile("lthing_data/reviews.json") as file: for line in file: try: record = ast.literal_eval(line.decode("utf8")) except: print(line.decode("utf8")) raise if any(x not in record for x in ['user', 'work', 'stars']): continue reviews.append([record['user'], record['work'], record['stars']]) print(len(reviews), "records retrieved") # Print a few sample of what we collected reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"]) print(reviews.head()) # Look for the users who reviewed more than 50 books usercount = reviews[["work","user"]].groupby("user").count() usercount = usercount[usercount["work"] >= 50] # Look for the books who reviewed by more than 50 users workcount = reviews[["work","user"]].groupby("work").count() workcount = workcount[workcount["user"] >= 50] # Keep only the popular books and active users reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)] print("\nSubset of data:") print(reviews) # Convert records into user-book review score matrix reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0) matrix = reviewmatrix.values # Singular value decomposition u, s, vh = np.linalg.svd(matrix, full_matrices=False) # Find the highest similarity def cosine_similarity(v,u): return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u)) highest_similarity = -np.inf highest_sim_col = -1 for col in range(1,vh.shape[1]): similarity = cosine_similarity(vh[:,0], vh[:,col]) if similarity > highest_similarity: highest_similarity = similarity highest_sim_col = col print("Column %d (book id %s) is most similar to column 0 (book id %s)" % (highest_sim_col, reviewmatrix.columns[col], reviewmatrix.columns[0]) )

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Linear Algebra, Fifth Edition, 2016.

In this tutorial, you discovered how to build a recommender system using singular value decomposition.

Specifically, you learned:

- What a singular value decomposition mean to a matrix
- How to interpret the result of a singular value decomposition
- Find similarity from the columns of matrix $V^T$ obtained from singular value decomposition, and make recommendations based on the similarity

The post Using Singular Value Decomposition to Build a Recommender System appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Vector Space Models appeared first on Machine Learning Mastery.

]]>In this tutorial, we will see what is a vector space model and what it can do.

After completing this tutorial, you will know:

- What is a vector space model and the properties of cosine similarity
- How cosine similarity can help you compare two vectors
- What is the difference between cosine similarity and L2 distance

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Vector space and cosine formula
- Using vector space model for similarity
- Common use of vector space models and cosine distance

A vector space is a mathematical term that defines some vector operations. In layman’s term, we can imagine it is a $n$-dimensional metric space where each point is represented by a $n$-dimensional vector. In this space, we can do any vector addition or scalar-vector multiplications.

It is useful to consider a vector space because it is useful to represent things as a vector. For example in machine learning, we usually have a data point with multiple features. Therefore, it is convenient for us to represent a data point as a vector.

With a vector, we can compute its **norm**. The most common one is the L2-norm or the length of the vector. With two vectors in the same vector space, we can find their difference. Assume it is a 3-dimensional vector space, the two vectors are $(x_1, x_2, x_3)$ and $(y_1, y_2, y_3)$. Their difference is the vector $(y_1-x_1, y_2-x_2, y_3-x_3)$, and the L2-norm of the difference is the **distance** or more precisely the Euclidean distance between those two vectors:

$$

\sqrt{(y_1-x_1)^2+(y_2-x_2)^2+(y_3-x_3)^2}

$$

Besides distance, we can also consider the **angle** between two vectors. If we consider the vector $(x_1, x_2, x_3)$ as a line segment from the point $(0,0,0)$ to $(x_1,x_2,x_3)$ in the 3D coordinate system, then there is another line segment from $(0,0,0)$ to $(y_1,y_2, y_3)$. They make an angle at their intersection:

The angle between the two line segments can be found using the cosine formula:

$$

\cos \theta = \frac{a\cdot b} {\lVert a\rVert_2\lVert b\rVert_2}

$$

where $a\cdot b$ is the vector dot-product and $\lVert a\rVert_2$ is the L2-norm of vector $a$. This formula arises from considering the dot-product as the projection of vector $a$ onto the direction as pointed by vector $b$. The nature of cosine tells that, as the angle $\theta$ increases from 0 to 90 degrees, cosine decreases from 1 to 0. Sometimes we would call $1-\cos\theta$ the **cosine distance** because it runs from 0 to 1 as the two vectors are moving further away from each other. This is an important property that we are going to exploit in the vector space model.

Let’s look at an example of how the vector space model is useful.

World Bank collects various data about countries and regions in the world. While every country is different, we can try to compare countries under vector space model. For convenience, we will use the `pandas_datareader`

module in Python to read data from World Bank. You may install `pandas_datareader`

using `pip`

or `conda`

command:

pip install pandas_datareader

The data series collected by World Bank are named by an identifier. For example, “SP.URB.TOTL” is the total urban population of a country. Many of the series are yearly. When we download a series, we have to put in the start and end years. Usually the data are not updated on time. Hence it is best to look at the data a few years back rather than the most recent year to avoid missing data.

In below, we try to collect some economic data of *every* country in 2010:

from pandas_datareader import wb import pandas as pd pd.options.display.width = 0 names = [ "NE.EXP.GNFS.CD", # Exports of goods and services (current US$) "NE.IMP.GNFS.CD", # Imports of goods and services (current US$) "NV.AGR.TOTL.CD", # Agriculture, forestry, and fishing, value added (current US$) "NY.GDP.MKTP.CD", # GDP (current US$) "NE.RSB.GNFS.CD", # External balance on goods and services (current US$) ] df = wb.download(country="all", indicator=names, start=2010, end=2010).reset_index() countries = wb.get_countries() non_aggregates = countries[countries["region"] != "Aggregates"].name df_nonagg = df[df["country"].isin(non_aggregates)].dropna() print(df_nonagg)

country year NE.EXP.GNFS.CD NE.IMP.GNFS.CD NV.AGR.TOTL.CD NY.GDP.MKTP.CD NE.RSB.GNFS.CD 50 Albania 2010 3.337089e+09 5.792189e+09 2.141580e+09 1.192693e+10 -2.455100e+09 51 Algeria 2010 6.197541e+10 5.065473e+10 1.364852e+10 1.612073e+11 1.132067e+10 54 Angola 2010 5.157282e+10 3.568226e+10 5.179055e+09 8.379950e+10 1.589056e+10 55 Antigua and Barbuda 2010 9.142222e+08 8.415185e+08 1.876296e+07 1.148700e+09 7.270370e+07 56 Argentina 2010 8.020887e+10 6.793793e+10 3.021382e+10 4.236274e+11 1.227093e+10 .. ... ... ... ... ... ... ... 259 Venezuela, RB 2010 1.121794e+11 6.922736e+10 2.113513e+10 3.931924e+11 4.295202e+10 260 Vietnam 2010 8.347359e+10 9.299467e+10 2.130649e+10 1.159317e+11 -9.521076e+09 262 West Bank and Gaza 2010 1.367300e+09 5.264300e+09 8.716000e+08 9.681500e+09 -3.897000e+09 264 Zambia 2010 7.503513e+09 6.256989e+09 1.909207e+09 2.026556e+10 1.246524e+09 265 Zimbabwe 2010 3.569254e+09 6.440274e+09 1.157187e+09 1.204166e+10 -2.871020e+09 [174 rows x 7 columns]

In the above we obtained some economic metrics of each country in 2010. The function `wb.download()`

will download the data from World Bank and return a pandas dataframe. Similarly `wb.get_countries()`

will get the name of the countries and regions as identified by World Bank, which we will use this to filter out the non-countries aggregates such as “East Asia” and “World”. Pandas allows filtering rows by boolean indexing, which `df["country"].isin(non_aggregates)`

gives a boolean vector of which row is in the list of `non_aggregates`

and based on that, `df[df["country"].isin(non_aggregates)]`

selects only those. For various reasons not all countries will have all data. Hence we use `dropna()`

to remove those with missing data. In practice, we may want to apply some imputation techniques instead of merely removing them. But as an example, we proceed with the 174 remaining data points.

To better illustrate the idea rather than hiding the actual manipulation in pandas or numpy functions, we first extract the data for each country as a vector:

... vectors = {} for rowid, row in df_nonagg.iterrows(): vectors[row["country"]] = row[names].values print(vectors)

{'Albania': array([3337088824.25553, 5792188899.58985, 2141580308.0144, 11926928505.5231, -2455100075.33431], dtype=object), 'Algeria': array([61975405318.205, 50654732073.2396, 13648522571.4516, 161207310515.42, 11320673244.9655], dtype=object), 'Angola': array([51572818660.8665, 35682259098.1843, 5179054574.41704, 83799496611.2004, 15890559562.6822], dtype=object), ... 'West Bank and Gaza': array([1367300000.0, 5264300000.0, 871600000.0, 9681500000.0, -3897000000.0], dtype=object), 'Zambia': array([7503512538.82554, 6256988597.27752, 1909207437.82702, 20265559483.8548, 1246523941.54802], dtype=object), 'Zimbabwe': array([3569254400.0, 6440274000.0, 1157186600.0, 12041655200.0, -2871019600.0], dtype=object)}

The Python dictionary we created has the name of each country as a key and the economic metrics as a numpy array. There are 5 metrics, hence each is a vector of 5 dimensions.

What this helps us is that, we can use the vector representation of each country to see how similar it is to another. Let’s try both the L2-norm of the difference (the Euclidean distance) and the cosine distance. We pick one country, such as Australia, and compare it to all other countries on the list based on the selected economic metrics.

... import numpy as np euclid = {} cosine = {} target = "Australia" for country in vectors: vecA = vectors[target] vecB = vectors[country] dist = np.linalg.norm(vecA - vecB) cos = (vecA @ vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB)) euclid[country] = dist # Euclidean distance cosine[country] = 1-cos # cosine distance

In the for-loop above, we set `vecA`

as the vector of the target country (i.e., Australia) and `vecB`

as that of the other country. Then we compute the L2-norm of their difference as the Euclidean distance between the two vectors. We also compute the cosine similarity using the formula and minus it from 1 to get the cosine distance. With more than a hundred countries, we can see which one has the shortest Euclidean distance to Australia:

... import pandas as pd df_distance = pd.DataFrame({"euclid": euclid, "cos": cosine}) print(df_distance.sort_values(by="euclid").head())

euclid cos Australia 0.000000e+00 -2.220446e-16 Mexico 1.533802e+11 7.949549e-03 Spain 3.411901e+11 3.057903e-03 Turkey 3.798221e+11 3.502849e-03 Indonesia 4.083531e+11 7.417614e-03

By sorting the result, we can see that Mexico is the closest to Australia under Euclidean distance. However, with cosine distance, it is Colombia the closest to Australia.

... df_distance.sort_values(by="cos").head()

euclid cos Australia 0.000000e+00 -2.220446e-16 Colombia 8.981118e+11 1.720644e-03 Cuba 1.126039e+12 2.483993e-03 Italy 1.088369e+12 2.677707e-03 Argentina 7.572323e+11 2.930187e-03

To understand why the two distances give different result, we can observe how the three countries’ metric compare to each other:

... print(df_nonagg[df_nonagg.country.isin(["Mexico", "Colombia", "Australia"])])

country year NE.EXP.GNFS.CD NE.IMP.GNFS.CD NV.AGR.TOTL.CD NY.GDP.MKTP.CD NE.RSB.GNFS.CD 59 Australia 2010 2.270501e+11 2.388514e+11 2.518718e+10 1.146138e+12 -1.180129e+10 91 Colombia 2010 4.682683e+10 5.136288e+10 1.812470e+10 2.865631e+11 -4.536047e+09 176 Mexico 2010 3.141423e+11 3.285812e+11 3.405226e+10 1.057801e+12 -1.443887e+10

From this table, we see that the metrics of Australia and Mexico are very close to each other in magnitude. However, if you compare the ratio of each metric within the same country, it is Colombia that match Australia better. In fact from the cosine formula, we can see that

$$

\cos \theta = \frac{a\cdot b} {\lVert a\rVert_2\lVert b\rVert_2} = \frac{a}{\lVert a\rVert_2} \cdot \frac{b} {\lVert b\rVert_2}

$$

which means the cosine of the angle between the two vector is the dot-product of the corresponding vectors after they were normalized to length of 1. Hence cosine distance is virtually applying a scaler to the data before computing the distance.

Putting these altogether, the following is the complete code

from pandas_datareader import wb import numpy as np import pandas as pd pd.options.display.width = 0 # Download data from World Bank names = [ "NE.EXP.GNFS.CD", # Exports of goods and services (current US$) "NE.IMP.GNFS.CD", # Imports of goods and services (current US$) "NV.AGR.TOTL.CD", # Agriculture, forestry, and fishing, value added (current US$) "NY.GDP.MKTP.CD", # GDP (current US$) "NE.RSB.GNFS.CD", # External balance on goods and services (current US$) ] df = wb.download(country="all", indicator=names, start=2010, end=2010).reset_index() # We remove aggregates and keep only countries with no missing data countries = wb.get_countries() non_aggregates = countries[countries["region"] != "Aggregates"].name df_nonagg = df[df["country"].isin(non_aggregates)].dropna() # Extract vector for each country vectors = {} for rowid, row in df_nonagg.iterrows(): vectors[row["country"]] = row[names].values # Compute the Euclidean and cosine distances euclid = {} cosine = {} target = "Australia" for country in vectors: vecA = vectors[target] vecB = vectors[country] dist = np.linalg.norm(vecA - vecB) cos = (vecA @ vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB)) euclid[country] = dist # Euclidean distance cosine[country] = 1-cos # cosine distance # Print the results df_distance = pd.DataFrame({"euclid": euclid, "cos": cosine}) print("Closest by Euclidean distance:") print(df_distance.sort_values(by="euclid").head()) print() print("Closest by Cosine distance:") print(df_distance.sort_values(by="cos").head()) # Print the detail metrics print() print("Detail metrics:") print(df_nonagg[df_nonagg.country.isin(["Mexico", "Colombia", "Australia"])])

Vector space models are common in information retrieval systems. We can present documents (e.g., a paragraph, a long passage, a book, or even a sentence) as vectors. This vector can be as simple as counting of the words that the document contains (i.e., a bag-of-word model) or a complicated embedding vector (e.g., Doc2Vec). Then a query to find the most relevant document can be answered by ranking all documents by the cosine distance. Cosine distance should be used because we do not want to favor longer or shorter documents, but to focus on what it contains. Hence we leverage the normalization comes with it to consider how relevant are the documents to the query rather than how many times the words on the query are mentioned in a document.

If we consider each word in a document as a feature and compute the cosine distance, it is the “hard” distance because we do not care about words with similar meanings (e.g. “document” and “passage” have similar meanings but not “distance”). Embedding vectors such as word2vec would allow us to consider the ontology. Computing the cosine distance with the meaning of words considered is the “**soft cosine distance**“. Libraries such as gensim provides a way to do this.

Another use case of the cosine distance and vector space model is in computer vision. Imagine the task of recognizing hand gesture, we can make certain parts of the hand (e.g. five fingers) the key points. Then with the (x,y) coordinates of the key points lay out as a vector, we can compare with our existing database to see which cosine distance is the closest and determine which hand gesture it is. We need cosine distance because everyone’s hand has a different size. We do not want that to affect our decision on what gesture it is showing.

As you may imagine, there are much more examples you can use this technique.

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Linear Algebra, Fifth Edition, 2016.
- Introduction to Information Retrieval, 2008.

In this tutorial, you discovered the vector space model for measuring the similarities of vectors.

Specifically, you learned:

- How to construct a vector space model
- How to compute the cosine similarity and hence the cosine distance between two vectors in the vector space model
- How to interpret the difference between cosine distance and other distance metrics such as Euclidean distance
- What are the use of the vector space model

The post A Gentle Introduction to Vector Space Models appeared first on Machine Learning Mastery.

]]>The post Principal Component Analysis for Visualization appeared first on Machine Learning Mastery.

]]>In this tutorial, you will discover how to visualize data using PCA, as well as using visualization to help determining the parameter for dimensionality reduction.

After completing this tutorial, you will know:

- How to use visualize a high dimensional data
- What is explained variance in PCA
- Visually observe the explained variance from the result of PCA of high dimensional data

Let’s get started.

This tutorial is divided into two parts; they are:

- Scatter plot of high dimensional data
- Visualizing the explained variance

For this tutorial, we assume that you are already familiar with:

- How to Calculate Principal Component Analysis (PCA) from Scratch in Python

- Principal Component Analysis for Dimensionality Reduction in Python

Visualization is a crucial step to get insights from data. We can learn from the visualization that whether a pattern can be observed and hence estimate which machine learning model is suitable.

It is easy to depict things in two dimension. Normally a scatter plot with x- and y-axis are in two dimensional. Depicting things in three dimensional is a bit challenging but not impossible. In matplotlib, for example, can plot in 3D. The only problem is on paper or on screen, we can only look at a 3D plot at one viewport or projection at a time. In matplotlib, this is controlled by the degree of elevation and azimuth. Depicting things in four or five dimensions is impossible because we live in a three-dimensional world and have no idea of how things in such a high dimension would look like.

This is where a dimensionality reduction technique such as PCA comes into play. We can reduce the dimension to two or three so we can visualize it. Let’s start with an example.

We start with the wine dataset, which is a classification dataset with 13 features (i.e., the dataset is 13 dimensional) and 3 classes. There are 178 samples:

from sklearn.datasets import load_wine winedata = load_wine() X, y = winedata['data'], winedata['target'] print(X.shape) print(y.shape)

(178, 13) (178,)

Among the 13 features, we can pick any two and plot with matplotlib (we color-coded the different classes using the `c`

argument):

... import matplotlib.pyplot as plt plt.scatter(X[:,1], X[:,2], c=y) plt.show()

or we can also pick any three and show in 3D:

... ax = fig.add_subplot(projection='3d') ax.scatter(X[:,1], X[:,2], X[:,3], c=y) plt.show()

But this doesn’t reveal much of how the data looks like, because majority of the features are not shown. We now resort to principal component analysis:

... from sklearn.decomposition import PCA pca = PCA() Xt = pca.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names'])) plt.show()

Here we transform the input data `X`

by PCA into `Xt`

. We consider only the first two columns, which contain the most information, and plot it in two dimensional. We can see that the purple class is quite distinctive, but there is still some overlap. If we scale the data before PCA, the result would be different:

... from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline pca = PCA() pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)]) Xt = pipe.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names'])) plt.show()

Because PCA is sensitive to the scale, if we normalized each feature by `StandardScaler`

we can see a better result. Here the different classes are more distinctive. By looking at this plot, we are confident that a simple model such as SVM can classify this dataset in high accuracy.

Putting these together, the following is the complete code to generate the visualizations:

from sklearn.datasets import load_wine from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline import matplotlib.pyplot as plt # Load dataset winedata = load_wine() X, y = winedata['data'], winedata['target'] print("X shape:", X.shape) print("y shape:", y.shape) # Show any two features plt.figure(figsize=(8,6)) plt.scatter(X[:,1], X[:,2], c=y) plt.xlabel(winedata["feature_names"][1]) plt.ylabel(winedata["feature_names"][2]) plt.title("Two particular features of the wine dataset") plt.show() # Show any three features fig = plt.figure(figsize=(10,8)) ax = fig.add_subplot(projection='3d') ax.scatter(X[:,1], X[:,2], X[:,3], c=y) ax.set_xlabel(winedata["feature_names"][1]) ax.set_ylabel(winedata["feature_names"][2]) ax.set_zlabel(winedata["feature_names"][3]) ax.set_title("Three particular features of the wine dataset") plt.show() # Show first two principal components without scaler pca = PCA() plt.figure(figsize=(8,6)) Xt = pca.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names'])) plt.xlabel("PC1") plt.ylabel("PC2") plt.title("First two principal components") plt.show() # Show first two principal components with scaler pca = PCA() pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)]) plt.figure(figsize=(8,6)) Xt = pipe.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names'])) plt.xlabel("PC1") plt.ylabel("PC2") plt.title("First two principal components after scaling") plt.show()

If we apply the same method on a different dataset, such as MINST handwritten digits, the scatterplot is not showing distinctive boundary and therefore it needs a more complicated model such as neural network to classify:

from sklearn.datasets import load_digits from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline import matplotlib.pyplot as plt digitsdata = load_digits() X, y = digitsdata['data'], digitsdata['target'] pca = PCA() pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)]) plt.figure(figsize=(8,6)) Xt = pipe.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(digitsdata['target_names'])) plt.show()

PCA in essence is to rearrange the features by their linear combinations. Hence it is called a feature extraction technique. One characteristic of PCA is that the first principal component holds the most information about the dataset. The second principal component is more informative than the third, and so on.

To illustrate this idea, we can remove the principal components from the original dataset in steps and see how the dataset looks like. Let’s consider a dataset with fewer features, and show two features in a plot:

from sklearn.datasets import load_iris irisdata = load_iris() X, y = irisdata['data'], irisdata['target'] plt.figure(figsize=(8,6)) plt.scatter(X[:,0], X[:,1], c=y) plt.show()

This is the iris dataset which has only four features. The features are in comparable scales and hence we can skip the scaler. With a 4-features data, the PCA can produce at most 4 principal components:

... pca = PCA().fit(X) print(pca.components_)

[[ 0.36138659 -0.08452251 0.85667061 0.3582892 ] [ 0.65658877 0.73016143 -0.17337266 -0.07548102] [-0.58202985 0.59791083 0.07623608 0.54583143] [-0.31548719 0.3197231 0.47983899 -0.75365743]]

For example, the first row is the first principal axis on which the first principal component is created. For any data point $p$ with features $p=(a,b,c,d)$, since the principal axis is denoted by the vector $v=(0.36,-0.08,0.86,0.36)$, the first principal component of this data point has the value $0.36 \times a – 0.08 \times b + 0.86 \times c + 0.36\times d$ on the principal axis. Using vector dot product, this value can be denoted by

$$

p \cdot v

$$

Therefore, with the dataset $X$ as a 150 $\times$ 4 matrix (150 data points, each has 4 features), we can map each data point into to the value on this principal axis by matrix-vector multiplication:

$$

X \cdot v

$$

and the result is a vector of length 150. Now if we remove from each data point the corresponding value along the principal axis vector, that would be

$$

X – (X \cdot v) \cdot v^T

$$

where the transposed vector $v^T$ is a row and $X\cdot v$ is a column. The product $(X \cdot v) \cdot v^T$ follows matrix-matrix multiplication and the result is a $150\times 4$ matrix, same dimension as $X$.

If we plot the first two feature of $(X \cdot v) \cdot v^T$, it looks like this:

... # Remove PC1 Xmean = X - X.mean(axis=0) value = Xmean @ pca.components_[0] pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1) Xremove = X - pc1 plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.show()

The numpy array `Xmean`

is to shift the features of `X`

to centered at zero. This is required for PCA. Then the array `value`

is computed by matrix-vector multiplication.

The array `value`

is the magnitude of each data point mapped on the principal axis. So if we multiply this value to the principal axis vector we get back an array `pc1`

. Removing this from the original dataset `X`

, we get a new array `Xremove`

. In the plot we observed that the points on the scatter plot crumbled together and the cluster of each class is less distinctive than before. This means we removed a lot of information by removing the first principal component. If we repeat the same process again, the points are further crumbled:

... # Remove PC2 value = Xmean @ pca.components_[1] pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1) Xremove = Xremove - pc2 plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.show()

This looks like a straight line but actually not. If we repeat once more, all points collapse into a straight line:

... # Remove PC3 value = Xmean @ pca.components_[2] pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1) Xremove = Xremove - pc3 plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.show()

The points all fall on a straight line because we removed three principal components from the data where there are only four features. Hence our data matrix becomes **rank 1**. You can try repeat once more this process and the result would be all points collapse into a single point. The amount of information removed in each step as we removed the principal components can be found by the corresponding **explained variance ratio** from the PCA:

... print(pca.explained_variance_ratio_)

[0.92461872 0.05306648 0.01710261 0.00521218]

Here we can see, the first component explained 92.5% variance and the second component explained 5.3% variance. If we removed the first two principal components, the remaining variance is only 2.2%, hence visually the plot after removing two components looks like a straight line. In fact, when we check with the plots above, not only we see the points are crumbled, but the range in the x- and y-axes are also smaller as we removed the components.

In terms of machine learning, we can consider using only one single feature for classification in this dataset, namely the first principal component. We should expect to achieve no less than 90% of the original accuracy as using the full set of features:

... from sklearn.model_selection import train_test_split from sklearn.metrics import f1_score from collections import Counter from sklearn.svm import SVC X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train) print("Using all features, accuracy: ", clf.score(X_test, y_test)) print("Using all features, F1: ", f1_score(y_test, clf.predict(X_test), average="macro")) mean = X_train.mean(axis=0) X_train2 = X_train - mean X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1) clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train) X_test2 = X_test - mean X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1) print("Using PC1, accuracy: ", clf.score(X_test2, y_test)) print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2), average="macro"))

Using all features, accuracy: 1.0 Using all features, F1: 1.0 Using PC1, accuracy: 0.96 Using PC1, F1: 0.9645191409897292

The other use of the explained variance is on compression. Given the explained variance of the first principal component is large, if we need to store the dataset, we can store only the the projected values on the first principal axis ($X\cdot v$), as well as the vector $v$ of the principal axis. Then we can approximately reproduce the original dataset by multiplying them:

$$

X \approx (X\cdot v) \cdot v^T

$$

In this way, we need storage for only one value per data point instead of four values for four features. The approximation is more accurate if we store the projected values on multiple principal axes and add up multiple principal components.

Putting these together, the following is the complete code to generate the visualizations:

from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn.decomposition import PCA from sklearn.metrics import f1_score from sklearn.svm import SVC import matplotlib.pyplot as plt # Load iris dataset irisdata = load_iris() X, y = irisdata['data'], irisdata['target'] plt.figure(figsize=(8,6)) plt.scatter(X[:,0], X[:,1], c=y) plt.xlabel(irisdata["feature_names"][0]) plt.ylabel(irisdata["feature_names"][1]) plt.title("Two features from the iris dataset") plt.show() # Show the principal components pca = PCA().fit(X) print("Principal components:") print(pca.components_) # Remove PC1 Xmean = X - X.mean(axis=0) value = Xmean @ pca.components_[0] pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1) Xremove = X - pc1 plt.figure(figsize=(8,6)) plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.xlabel(irisdata["feature_names"][0]) plt.ylabel(irisdata["feature_names"][1]) plt.title("Two features from the iris dataset after removing PC1") plt.show() # Remove PC2 Xmean = X - X.mean(axis=0) value = Xmean @ pca.components_[1] pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1) Xremove = Xremove - pc2 plt.figure(figsize=(8,6)) plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.xlabel(irisdata["feature_names"][0]) plt.ylabel(irisdata["feature_names"][1]) plt.title("Two features from the iris dataset after removing PC1 and PC2") plt.show() # Remove PC3 Xmean = X - X.mean(axis=0) value = Xmean @ pca.components_[2] pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1) Xremove = Xremove - pc3 plt.figure(figsize=(8,6)) plt.scatter(Xremove[:,0], Xremove[:,1], c=y) plt.xlabel(irisdata["feature_names"][0]) plt.ylabel(irisdata["feature_names"][1]) plt.title("Two features from the iris dataset after removing PC1 to PC3") plt.show() # Print the explained variance ratio print("Explainedd variance ratios:") print(pca.explained_variance_ratio_) # Split data X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) # Run classifer on all features clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train) print("Using all features, accuracy: ", clf.score(X_test, y_test)) print("Using all features, F1: ", f1_score(y_test, clf.predict(X_test), average="macro")) # Run classifier on PC1 mean = X_train.mean(axis=0) X_train2 = X_train - mean X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1) clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train) X_test2 = X_test - mean X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1) print("Using PC1, accuracy: ", clf.score(X_test2, y_test)) print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2), average="macro"))

This section provides more resources on the topic if you are looking to go deeper.

- How to Calculate Principal Component Analysis (PCA) from Scratch in Python
- Principal Component Analysis for Dimensionality Reduction in Python

- scikit-learn toy datasets
- scikit-learn iris dataset
- scikit-learn wine dataset
- matplotlib scatter API
- The mplot3d toolkit

In this tutorial, you discovered how to visualize data using principal component analysis.

Specifically, you learned:

- Visualize a high dimensional dataset in 2D using PCA
- How to use the plot in PCA dimensions to help choosing an appropriate machine learning model
- How to observe the explained variance ratio of PCA
- What the explained variance ratio means for machine learning

The post Principal Component Analysis for Visualization appeared first on Machine Learning Mastery.

]]>The post How to Set Axis for Rows and Columns in NumPy appeared first on Machine Learning Mastery.

]]>They are particularly useful for representing data as vectors and matrices in machine learning.

Data in NumPy arrays can be accessed directly via column and row indexes, and this is reasonably straightforward. Nevertheless, sometimes we must perform operations on arrays of data such as sum or mean of values by row or column and this requires the axis of the operation to be specified.

Unfortunately, the column-wise and row-wise operations on NumPy arrays do not match our intuitions gained from row and column indexing, and this can cause confusion for beginners and seasoned machine learning practitioners alike. Specifically, operations like sum can be performed **column-wise using axis=0** and **row-wise using axis=1**.

In this tutorial, you will discover how to access and operate on NumPy arrays by row and by column.

After completing this tutorial, you will know:

- How to define NumPy arrays with rows and columns of data.
- How to access values in NumPy arrays by row and column indexes.
- How to perform operations on NumPy arrays by row and column axis.

Let’s get started.

This tutorial is divided into three parts; they are:

- NumPy Array With Rows and Columns
- Rows and Columns of Data in NumPy Arrays
- NumPy Array Operations By Row and Column
- Axis=None Array-Wise Operation
- Axis=0 Column-Wise Operation
- Axis=1 Row-Wise Operation

Before we dive into the NumPy array axis, let’s refresh our knowledge of NumPy arrays.

Typically in Python, we work with lists of numbers or lists of lists of numbers. For example, we can define a two-dimensional matrix of two rows of three numbers as a list of numbers as follows:

... # define data as a list data = [[1,2,3], [4,5,6]]

A NumPy array allows us to define and operate upon vectors and matrices of numbers in an efficient manner, e.g. a lot more efficient than simply Python lists. NumPy arrays are called NDArrays and can have virtually any number of dimensions, although, in machine learning, we are most commonly working with 1D and 2D arrays (or 3D arrays for images).

For example, we can convert our list of lists matrix to a NumPy array via the asarray() function:

... # convert to a numpy array data = asarray(data)

We can print the array directly and expect to see two rows of numbers, where each row has three numbers or columns.

... # summarize the array content print(data)

We can summarize the dimensionality of an array by printing the “*shape*” property, which is a tuple, where the number of values in the tuple defines the number of dimensions, and the integer in each position defines the size of the dimension.

For example, we expect the shape of our array to be (2,3) for two rows and three columns.

... # summarize the array shape print(data.shape)

Tying this all together, a complete example is listed below.

# create and summarize a numpy array from numpy import asarray # define data as a list data = [[1,2,3], [4,5,6]] # convert to a numpy array data = asarray(data) # summarize the array content print(data) # summarize the array shape print(data.shape)

Running the example defines our data as a list of lists, converts it to a NumPy array, then prints the data and shape.

We can see that when the array is printed, it has the expected shape of two rows with three columns. We can then see that the printed shape matches our expectations.

[[1 2 3] [4 5 6]] (2, 3)

For more on the basics of NumPy arrays, see the tutorial:

So far, so good.

But how do we access data in the array by row or column? More importantly, how can we perform operations on the array by-row or by-column?

Let’s take a closer look at these questions.

The “*shape*” property summarizes the dimensionality of our data.

Importantly, the first dimension defines the number of rows and the second dimension defines the number of columns. For example (2,3) defines an array with two rows and three columns, as we saw in the last section.

We can enumerate each row of data in an array by enumerating from index 0 to the first dimension of the array shape, e.g. shape[0]. We can access data in the array via the row and column index.

For example, data[0, 0] is the value at the first row and the first column, whereas data[0, :] is the values in the first row and all columns, e.g. the complete first row in our matrix.

The example below enumerates all rows in the data and prints each in turn.

# enumerate rows in a numpy array from numpy import asarray # define data as a list data = [[1,2,3], [4,5,6]] # convert to a numpy array data = asarray(data) # step through rows for row in range(data.shape[0]): print(data[row, :])

As expected, the results show the first row of data, then the second row of data.

[1 2 3] [4 5 6]

We can achieve the same effect for columns.

That is, we can enumerate data by columns. For example, data[:, 0] accesses all rows for the first column. We can enumerate all columns from column 0 to the final column defined by the second dimension of the “*shape*” property, e.g. shape[1].

The example below demonstrates this by enumerating all columns in our matrix.

# enumerate columns in a numpy array from numpy import asarray # define data as a list data = [[1,2,3], [4,5,6]] # convert to a numpy array data = asarray(data) # step through columns for col in range(data.shape[1]): print(data[:, col])

Running the example enumerates and prints each column in the matrix.

Given that the matrix has three columns, we can see that the result is that we print three columns, each as a one-dimensional vector. That is column 1 (index 0) that has values 1 and 4, column 2 (index 1) that has values 2 and 5, and column 3 (index 2) that has values 3 and 6.

It just looks funny because our columns don’t look like columns; they are turned on their side, rather than vertical.

[1 4] [2 5] [3 6]

Now we know how to access data in a numpy array by column and by row.

So far, so good, but what about operations on the array by column and array? That’s next.

We often need to perform operations on NumPy arrays by column or by row.

For example, we may need to sum values or calculate a mean for a matrix of data by row or by column.

This can be achieved by using the *sum()* or *mean()* NumPy function and specifying the “*axis*” on which to perform the operation.

We can specify the axis as the dimension across which the operation is to be performed, and this dimension does not match our intuition based on how we interpret the “*shape*” of the array and how we index data in the array.

**As such, this causes maximum confusion for beginners**.

That is, **axis=0** will perform the operation column-wise and **axis=1** will perform the operation row-wise. We can also specify the axis as None, which will perform the operation for the entire array.

In summary:

**axis=None**: Apply operation array-wise.**axis=0**: Apply operation column-wise, across all rows for each column.**axis=1**: Apply operation row-wise, across all columns for each row.

Let’s make this concrete with a worked example.

We will sum values in our array by each of the three axes.

Setting the **axis=None** when performing an operation on a NumPy array will perform the operation for the entire array.

This is often the default for most operations, such as sum, mean, std, and so on.

... # sum data by array result = data.sum(axis=None)

The example below demonstrates summing all values in an array, e.g. an array-wise operation.

# sum values array-wise from numpy import asarray # define data as a list data = [[1,2,3], [4,5,6]] # convert to a numpy array data = asarray(data) # summarize the array content print(data) # sum data by array result = data.sum(axis=None) # summarize the result print(result)

Running the example first prints the array, then performs the sum operation array-wise and prints the result.

We can see the array has six values that would sum to 21 if we add them manually and that the result of the sum operation performed array-wise matches this expectation.

[[1 2 3] [4 5 6]] 21

Setting the **axis=0** when performing an operation on a NumPy array will perform the operation column-wise, that is, across all rows for each column.

... # sum data by column result = data.sum(axis=0)

For example, given our data with two rows and three columns:

Data = [[1, 2, 3], 4, 5, 6]]

We expect a sum column-wise with axis=0 will result in three values, one for each column, as follows:

**Column 1**: 1 + 4 = 5**Column 2**: 2 + 5 = 7**Column 3**: 3 + 6 = 9

The example below demonstrates summing values in the array by column, e.g. a column-wise operation.

# sum values column-wise from numpy import asarray # define data as a list data = [[1,2,3], [4,5,6]] # convert to a numpy array data = asarray(data) # summarize the array content print(data) # sum data by column result = data.sum(axis=0) # summarize the result print(result)

Running the example first prints the array, then performs the sum operation column-wise and prints the result.

We can see the array has six values with two rows and three columns as expected; we can then see the column-wise operation result in a vector with three values, one for the sum of each column matching our expectation.

[[1 2 3] [4 5 6]] [5 7 9]

Setting the *axis=1* when performing an operation on a NumPy array will perform the operation row-wise, that is across all columns for each row.

... # sum data by row result = data.sum(axis=1)

For example, given our data with two rows and three columns:

Data = [[1, 2, 3], 4, 5, 6]]

We expect a sum row-wise with axis=1 will result in two values, one for each row, as follows:

**Row 1**: 1 + 2 + 3 = 6**Row 2**: 4 + 5 + 6 = 15

The example below demonstrates summing values in the array by row, e.g. a row-wise operation.

# sum values row-wise from numpy import asarray # define data as a list data = [[1,2,3], [4,5,6]] # convert to a numpy array data = asarray(data) # summarize the array content print(data) # sum data by row result = data.sum(axis=1) # summarize the result print(result)

Running the example first prints the array, then performs the sum operation row-wise and prints the result.

We can see the array has six values with two rows and three columns as expected; we can then see the row-wise operation result in a vector with two values, one for the sum of each row matching our expectation.

[[1 2 3] [4 5 6]] [ 6 15]

We now have a concrete idea of how to set axis appropriately when performing operations on our NumPy arrays.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to NumPy Arrays in Python
- How to Index, Slice and Reshape NumPy Arrays for Machine Learning
- A Gentle Introduction to Broadcasting with NumPy Arrays

In this tutorial, you discovered how to access and operate on NumPy arrays by row and by column.

Specifically, you learned:

- How to define NumPy arrays with rows and columns of data.
- How to access values in NumPy arrays by row and column indexes.
- How to perform operations on NumPy arrays by row and column axis.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Set Axis for Rows and Columns in NumPy appeared first on Machine Learning Mastery.

]]>The post What Is Argmax in Machine Learning? appeared first on Machine Learning Mastery.

]]>For example, you may see “*argmax*” or “*arg max*” used in a research paper used to describe an algorithm. You may also be instructed to use the argmax function in your algorithm implementation.

This may be the first time that you encounter the argmax function and you may wonder what it is and how it works.

In this tutorial, you will discover the argmax function and how it is used in machine learning.

After completing this tutorial, you will know:

- Argmax is an operation that finds the argument that gives the maximum value from a target function.
- Argmax is most commonly used in machine learning for finding the class with the largest predicted probability.
- Argmax can be implemented manually, although the argmax() NumPy function is preferred in practice.

**Kick-start your project** with my new book Linear Algebra for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is Argmax?
- How Is Argmax Used in Machine Learning?
- How to Implement Argmax in Python

Argmax is a mathematical function.

It is typically applied to another function that takes an argument. For example, given a function *g()* that takes the argument *x*, the *argmax* operation of that function would be described as follows:

- result = argmax(g(x))

The *argmax* function returns the argument or arguments (*arg*) for the target function that returns the maximum (*max*) value from the target function.

Consider the example where *g(x)* is calculated as the square of the *x* value and the domain or extent of input values (*x*) is limited to integers from 1 to 5:

- g(1) = 1^2 = 1
- g(2) = 2^2 = 4
- g(3) = 3^2 = 9
- g(4) = 4^2 = 16
- g(5) = 5^2 = 25

We can intuitively see that the argmax for the function *g(x)* is 5.

That is, the argument (*x*) to the target function *g()* that results in the largest value from the target function (25) is 5. Argmax provides a shorthand for specifying this argument in an abstract way without knowing what the value might be in a specific case.

- argmax(g(x)) = 5

Note that this is not the *max()* of the values returned from function. This would be 25.

It is also not the *max()* of the arguments, although in this case the argmax and max of the arguments is the same, e.g. 5. The *argmax()* is 5 because g returns the largest value (25) when 5 is provided, not because 5 is the largest argument.

Typically, “*argmax*” is written as two separate words, e.g. “*arg max*“. For example:

- result = arg max(g(x))

It is also common to use the arg max function as an operation without brackets surrounding the target function. This is often how you will see the operation written and used in a research paper or textbook. For example:

- result = arg max g(x)

You can also use a similar operation to find the arguments to the target function that result in the minimum value from the target function, called *argmin* or “*arg min*.”

The argmax function is used throughout the field of mathematics and machine learning.

Nevertheless, there are specific situations where you will see argmax used in applied machine learning and may need to implement it yourself.

The most common situation for using argmax that you will encounter in applied machine learning is in finding the index of an array that results in the largest value.

Recall that an array is a list or vector of numbers.

It is common for multi-class classification models to predict a vector of probabilities (or probability-like values), with one probability for each class label. The probabilities represent the likelihood that a sample belongs to each of the class labels.

The predicted probabilities are ordered such that the predicted probability at index 0 belongs to the first class, the predicted probability at index 1 belongs to the second class, and so on.

Often, a single class label prediction is required from a set of predicted probabilities for a multi-class classification problem.

This conversion from a vector of predicted probabilities to a class label is most often described using the argmax operation and most often implemented using the argmax function.

Let’s make this concrete with an example.

Consider a multi-class classification problem with three classes: “*red*“, “*blue*,” and “*green*.” The class labels are mapped to integer values for modeling, as follows:

- red = 0
- blue = 1
- green = 2

Each class label integer values maps to an index of a 3-element vector that may be predicted by a model specifying the likelihood that an example belongs to each class.

Consider a model has made one prediction for an input sample and predicted the following vector of probabilities:

- yhat = [0.4, 0.5, 0.1]

We can see that the example has a 40 percent probability of belonging to red, a 50 percent probability of belonging to blue, and a 10 percent probability of belonging to green.

We can apply the argmax function to the vector of probabilities. The vector is the function, the output of the function is the probabilities, and the input to the function is a vector element index or an array index.

- arg max yhat

We can intuitively see that in this case, the argmax of the vector of predicted probabilities (yhat) is 1, as the probability at array index 1 is the largest value.

Note that this is not the max() of the probabilities, which would be 0.5. Also note that this is not the max of the arguments, which would be 2. Instead it is the argument that results in the maximum value, e.g. 1 that results in 0.5.

- arg max yhat = 1

We can then map this integer value back to a class label, which would be “*blue*.”

- arg max yhat = “blue”

The argmax function can be implemented in Python for a given vector of numbers.

First, we can define a function called *argmax()* that enumerates a provided vector and returns the index with the largest value.

The complete example is listed below.

# argmax function def argmax(vector): index, value = 0, vector[0] for i,v in enumerate(vector): if v > value: index, value = i,v return index # define vector vector = [0.4, 0.5, 0.1] # get argmax result = argmax(vector) print('arg max of %s: %d' % (vector, result))

Running the example prints the argmax of our test data used in the previous section, which in this case is an index of 1.

arg max of [0.4, 0.5, 0.1]: 1

Thankfully, there is a built-in version of the argmax() function provided with the NumPy library.

This is the version that you should use in practice.

The example below demonstrates the *argmax()* NumPy function on the same vector of probabilities.

# numpy implementation of argmax from numpy import argmax # define vector vector = [0.4, 0.5, 0.1] # get argmax result = argmax(vector) print('arg max of %s: %d' % (vector, result))

Running the example prints an index of 1, as is expected.

arg max of [0.4, 0.5, 0.1]: 1

It is more likely that you will have a collection of predicted probabilities for multiple samples.

This would be stored as a matrix with rows of predicted probabilities and each column representing a class label. The desired result of an argmax on this matrix would be a vector with one index (or class label integer) for each row of predictions.

This can be achieved with the *argmax()* NumPy function by setting the “*axis*” argument. By default, the argmax would be calculated for the entire matrix, returning a single number. Instead, we can set the axis value to 1 and calculate the argmax across the columns for each row of data.

The example below demonstrates this with a matrix of four rows of predicted probabilities for the three class labels.

# numpy implementation of argmax from numpy import argmax from numpy import asarray # define vector probs = asarray([[0.4, 0.5, 0.1], [0.0, 0.0, 1.0], [0.9, 0.0, 0.1], [0.3, 0.3, 0.4]]) print(probs.shape) # get argmax result = argmax(probs, axis=1) print(result)

Running the example first prints the shape of the matrix of predicted probabilities, confirming we have four rows with three columns per row.

The argmax of the matrix is then calculated and printed as a vector, showing four values. This is what we expect, where each row results in a single argmax value or index with the largest probability.

(4, 3) [1 2 0 2]

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the argmax function and how it is used in machine learning.

Specifically, you learned:

- Argmax is an operation that finds the argument that gives the maximum value from a target function.
- Argmax is most commonly used in machine learning for finding the class with the largest predicted probability.
- Argmax can be implemented manually, although the argmax() NumPy function is preferred in practice.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post What Is Argmax in Machine Learning? appeared first on Machine Learning Mastery.

]]>The post Basics of Mathematical Notation for Machine Learning appeared first on Machine Learning Mastery.

]]>Often, all it takes is one term or one fragment of notation in an equation to completely derail your understanding of the entire procedure. This can be extremely frustrating, especially for machine learning beginners coming from the world of development.

You can make great progress if you know a few basic areas of mathematical notation and some tricks for working through the description of machine learning methods in papers and books.

In this tutorial, you will discover the basics of mathematical notation that you may come across when reading descriptions of techniques in machine learning.

After completing this tutorial, you will know:

- Notation for arithmetic, including variations of multiplication, exponents, roots, and logarithms.
- Notation for sequences and sets including indexing, summation, and set membership.
- 5 Techniques you can use to get help if you are struggling with mathematical notation.

**Kick-start your project** with my new book Linear Algebra for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update May/2018**: Added images for some notations to make the explanations clearer.

This tutorial is divided into 7 parts; they are:

- The Frustration with Math Notation
- Arithmetic Notation
- Greek Alphabet
- Sequence Notation
- Set Notation
- Other Notation
- Getting More Help

Are there other areas of basic math notation required for machine learning that you think I missed?

Let me know in the comments below.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

You will encounter mathematical notation when reading about machine learning algorithms.

For example, notation may be used to:

- Describe an algorithm.
- Describe data preparation.
- Describe results.
- Describe a test harness.
- Describe implications.

These descriptions may be in research papers, textbooks, blog posts, and elsewhere.

Often the terms are well defined, but there are also mathematical notation norms that you may not be familiar with.

All it takes is one term or one equation that you do not understand and your understanding of the entire method will be lost. I’ve suffered this problem myself many times, and it is incredibly frustrating!

In this tutorial, we will review some basic mathematical notation that will help you when reading descriptions of machine learning methods.

In this section, we will go over some less obvious notations for basic arithmetic as well as a few concepts you may have forgotten since school.

The notation for basic arithmetic is as you would write it. For example:

- Addition: 1 + 1 = 2
- Subtraction: 2 – 1 = 1
- Multiplication: 2 x 2 = 4
- Division: 2 / 2 = 1

Most mathematical operations have a sister operation that performs the inverse operation; for example, subtraction is the inverse of addition and division is the inverse of multiplication.

We often want to describe operations abstractly to separate them from specific data or specific implementations.

For this reason we see heavy use of algebra: that is uppercase and/or lowercase letters or words to represents terms or concepts in mathematical notation. It is also common to use letters from the Greek alphabet.

Each sub-field of math may have reserved letters: that is terms or letters that always mean the same thing. Nevertheless, algebraic terms should be defined as part of the description and if they are not, it may just be a poor description, not your fault.

Multiplication is a common notation and has a few short hands.

Often a little “x” or an asterisk “*” is used to represent multiplication:

c = a x b c = a * b

You may see a dot notation used; for example:

c = a . b

Which is the same as:

c = a * b

Alternately, you may see no operation and no white space separation between previously defined terms; for example:

c = ab

Which again is the same thing.

An exponent is a number raised to a power.

The notation is written as the original number, or the base, with a second number, or the exponent, shown as a superscript; for example:

2^3

Which would be calculated as 2 multiplied by itself 3 times, or cubing:

2 x 2 x 2 = 8

A number raised to the power 2 to is said to be its square.

2^2 = 2 x 2 = 4

The square of a number can be inverted by calculating the square root. This is shown using the notation of a number and with a tick above, I will use the “sqrt()” function here for simplicity.

sqrt(4) = 2

Here, we know the result and the exponent and we wish to find the base.

In fact, the root operation can be used to inverse any exponent, it just so happens that the default square root assumes an exponent of 2, represented by a subscript 2 in front of the square root tick.

For example, we can invert the cubing of a number by taking the cube root (note, the 3 is not a multiplication here, it is notation before the tick of the root sign):

2^3 = 8 3 sqrt(8) = 2

When we raise 10 to an integer exponent, we often call this an order of magnitude.

10^2 = 10 x 10 or 100

Another way to reverse this operation is by calculating the logarithm of the result 100 assuming a base of 10; in notation this is written as log10().

log10(100) = 2

Here, we know the result and the base and wish to find the exponent.

This allows us to move up and down orders of magnitude very easily. Taking the logarithm assuming the base of 2 is also commonly used, given the use of binary arithmetic used in computers. For example:

2^6 = 64 log2(64) = 6

Another popular logarithm is to assume the natural base called e. The e is reserved and is a special number or a constant called Euler’s number (pronounced “*oy-ler*“) that refers to a value with practically infinite precision.

e = 2.71828...

Raising e to a power is called a natural exponential function:

e^2 = 7.38905...

It can be inverted using the natural logarithm, which is denoted as ln():

ln(7.38905...) = 2

Without going into detail, the natural exponent and natural logarithm prove useful throughout mathematics to abstractly describe the continuous growth of some systems, e.g. systems that grow exponentially such as compound interest.

Greek letters are used throughout mathematical notation for variables, constants, functions, and more.

For example, in statistics we talk about the mean using the lowercase Greek letter mu, and the standard deviation as the lowercase Greek letter sigma. In linear regression, we talk about the coefficients as the lowercase letter beta. And so on.

It is useful to know all of the uppercase and lowercase Greek letters and how to pronounce them.

When I was a grad student, I printed the Greek alphabet and stuck it on my computer monitor so that I could memorize it. A useful trick!

Below is the full Greek alphabet.

The Wikipedia page titled “Greek letters used in mathematics, science, and engineering” is also a useful guide as it lists common uses for each Greek letter in different sub-fields of math and science.

Machine learning notation often describes an operation on a sequence.

A sequence may be an array of data or a list of terms.

A key to reading notation for sequences is the notation of indexing elements in the sequence.

Often the notation will specify the beginning and end of the sequence, such as 1 to n, where n will be the extent or length of the sequence.

Items in the sequence are index by a variable such as i, j, k as a subscript. This is just like array notation.

For example, a_i is the i^th element of the sequence a.

If the sequence is two dimensional, two indices may be used; for example:

b_{i,j} is the i,j^th element of the sequence b.

Mathematical operations can be performed over a sequence.

Two operations are performed on sequences so often that they have their own shorthand: the sum and the multiplication.

The sum over a sequence is denoted as the uppercase Greek letter sigma. It is specified with the variable and start of the sequence summation below the sigma (e.g. i = 1) and the index of the end of the summation above the sigma (e.g. n).

Sigma i = 1, n a_i

This is the sum of the sequence a starting at element 1 to element n.

The multiplication over a sequence is denoted as the uppercase Greek letter pi. It is specified in the same way as the sequence summation with the beginning and end of the operation below and above the letter respectively.

Pi i = 1, n a_i

This is the product of the sequence a starting at element 1 to element n.

A set is a group of unique items.

We may see set notation used when defining terms in machine learning.

A common set you may see is a set of numbers, such as a term defined as being within the set of integers or the set of real numbers.

Some common sets of numbers you may see include:

- Set of all natural numbers: N
- Set of all integers: Z
- Set of all real numbers: R

There are other sets; see Special sets on Wikipedia.

We often talk about real values or real numbers when defining terms rather than floating point values, which are really discrete creations for operations in computers.

It is common to see set membership in definitions of terms.

Set membership is denoted as a symbol that looks like an uppercase “E”.

a E R

Which means a is defined as being a member of the set R or the set of real numbers.

There is also a host of set operations; two common set operations include:

- Union, or aggregation: A U B
- Intersection, or overlap: A ^ B

Learn more about sets on Wikipedia.

There is other notation that you may come across.

I try to lay some of it out in this section.

It is common to define a method in the abstract and then define it again as a specific implementation with separate notation.

For example, if we are estimating a variable x, we may represent it using a notation that modifies the x; for example:

The same notation may have a different meaning in a different context, such as use on different objects or sub-fields of mathematics. For example, a common point of confusion is |x|, which, depending on context, can mean:

- |x|: The absolute or positive value of x.
- |x|: The length of the vector x.
- |x|: The cardinality of the set x.

This tutorial only covered the basics of mathematical notation. There are some subfields of mathematics that are more relevant to machine learning and should be reviewed in more detail. They are:

- Linear Algebra.
- Statistics.
- Probability.
- Calculus.

And perhaps a little bit of multivariate analysis and information theory.

Are there areas of mathematical notation that you think are missing from this post?

Let me know in the comments below.

This section lists some tips that you can use when you are struggling with mathematical notation in machine learning.

People wrote the paper or book you are reading.

People that can make mistakes, make omissions, and even make things confusing because they don’t fully understand what they are writing.

Relax the constraints of the notation you are reading slightly and think about the intent of the author. What are they trying to get across?

Perhaps you can even contact the author via email, Twitter, Facebook, LinkedIn, etc., and seek clarification. Remember that academics want other people to understand and use their work (mostly).

Wikipedia has lists of notation which can help narrow down on the meaning or intent of the notation you are reading.

Two places I recommend you start are:

- List of mathematical symbols on Wikipedia
- Greek letters used in mathematics, science, and engineering on Wikipedia

Mathematical operations are just functions on data.

Map everything you’re reading to pseudocode with variables, for-loops, and more.

You might want to use a scripting language as you go, along with small arrays of contrived data or even an Excel spreadsheet.

As your reading and understanding of the technique improves, your code-sketch of the technique will make more sense, and at the end you will have a mini prototype to play with.

I never used to take much stock in this approach until I saw an academic sketch out a very complex paper in a few lines of matlab with some contrived data. It knocked my socks off because I believed the system had to be coded completely and run with a “real” dataset and that the only option was to get the original code and data. I was very wrong. Also, looking back, the guy was gifted.

I now use this method all the time and sketch techniques in Python.

There is a trick I use when I’m trying to understand a new technique.

I find and read all the papers that reference the paper I’m reading with the new technique.

Reading other academics interpretation and re-explanation of the technique can often clarify my misunderstandings in the original description.

Not always though. Sometimes it can muddy the waters and introduce misleading explanations or new notation. But more often than not, it helps. After circling back to the original paper and re-reading it, I can often find cases where subsequent papers have actually made errors and misinterpretations of the original method.

There are places online where people love to explain math to others. Seriously!

Consider taking a screenshot of the notation you are struggling with, write out the full reference or link to it, and post it and your area of misunderstanding to a question-and-answer site.

Two great places to start are:

What are your tricks for working through mathematical notation?

Let me know in the comments below?

This section provides more resources on the topic if you are looking to go deeper.

- Section 0.1. Reading Mathematics [PDF], Vector Calculus, Linear Algebra, and Differential Forms, 2009.
- The Language and Grammar of Mathematics [PDF], Timothy Gowers
- Understanding Mathematics, a guide, Peter Alfeld.

In this tutorial, you discovered the basics of mathematical notation that you may come across when reading descriptions of techniques in machine learning.

Specifically, you learned:

- Notation for arithmetic, including variations of multiplication, exponents, roots, and logarithms.
- Notation for sequences and sets, including indexing, summation, and set membership.
- 5 Techniques you can use to get help if you are struggling with mathematical notation.

Are you struggling with mathematical notation?

Did any of the notation or tips in this post help?

Let me know in the comments below.

The post Basics of Mathematical Notation for Machine Learning appeared first on Machine Learning Mastery.

]]>The post Linear Algebra for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>Linear algebra is a field of mathematics that is universally agreed to be a prerequisite for a deeper understanding of machine learning.

Although linear algebra is a large field with many esoteric theories and findings, the nuts and bolts tools and notations taken from the field are required for machine learning practitioners. With a solid foundation of what linear algebra is, it is possible to focus on just the good or relevant parts.

In this crash course, you will discover how you can get started and confidently read and implement linear algebra notation used in machine learning with Python in 7 days.

This is a big and important post. You might want to bookmark it.

**Kick-start your project** with my new book Linear Algebra for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Mar/2018**: Fixed a small typo in the SVD lesson.

Before we get started, let’s make sure you are in the right place.

This course is for developers that may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.

You may know some basic NumPy for array manipulation. - You want to learn linear algebra to deepen your understanding and application of machine learning.

You do NOT need to know:

- You do not need to be a math wiz!
- You do not need to be a machine learning expert!

This crash course will take you from a developer that knows a little machine learning to a developer who can navigate the basics of linear algebra.

Note: This crash course assumes you have a working Python3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into 7 lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the 7 lessons that will get you started and productive with linear algebra for machine learning in Python:

**Lesson 01**: Linear Algebra for Machine Learning**Lesson 02**: Linear Algebra**Lesson 03**: Vectors**Lesson 04**: Matrices**Lesson 05**: Matrix Types and Operations**Lesson 06**: Matrix Factorization**Lesson 07**: Singular-Value Decomposition

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the linear algebra and the NumPy API and the best-of-breed tools in Python (hint: I have all of the answers directly on this blog; use the search box).

I do provide more help in the form of links to related posts because I want you to build up some confidence and inertia.

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed out tutorials, see my book on the topic titled “Basics of Linear Algebra for Machine Learning“.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this lesson, you will discover the 5 reasons why a machine learning practitioner should deepen their understanding of linear algebra.

You need to be able to read and write vector and matrix notation. Algorithms are described in books, papers, and on websites using vector and matrix notation.

In partnership with the notation of linear algebra are the arithmetic operations performed. You need to know how to add, subtract, and multiply scalars, vectors, and matrices.

You must learn linear algebra in order to be able to learn statistics. Especially multivariate statistics. In order to be able to read and interpret statistics, you must learn the notation and operations of linear algebra. Modern statistics uses both the notation and tools of linear algebra to describe the tools and techniques of statistical methods. From vectors for the means and variances of data, to covariance matrices that describe the relationships between multiple Gaussian variables.

Building on notation and arithmetic is the idea of matrix factorization, also called matrix decomposition.You need to know how to factorize a matrix and what it means. Matrix factorization is a key tool in linear algebra and used widely as an element of many more complex operations in both linear algebra (such as the matrix inverse) and machine learning (least squares).

You need to know how to use matrix factorization to solve linear least squares. Problems of this type can be framed as the minimization of squared error, called least squares, and can be recast in the language of linear algebra, called linear least squares. Linear least squares problems can be solved efficiently on computers using matrix operations such as matrix factorization.

If I could give one more reason, it would be: because it is fun. Seriously.

For this lesson, you must list 3 reasons why you, personally, want to learn linear algebra.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover a concise definition of linear algebra.

In this lesson, you will discover a concise definition of linear algebra.

Linear algebra is a branch of mathematics, but the truth of it is that linear algebra is the mathematics of data. Matrices and vectors are the language of data.

Linear algebra is about linear combinations. That is, using arithmetic on columns of numbers called vectors and 2D arrays of numbers called matrices, to create new columns and arrays of numbers.

The application of linear algebra in computers is often called numerical linear algebra.

It is more than just the implementation of linear algebra operations in code libraries; it also includes the careful handling of the problems of applied mathematics, such as working with the limited floating point precision of digital computers.

As linear algebra is the mathematics of data, the tools of linear algebra are used in many domains.

- Matrices in Engineering, such as a line of springs.
- Graphs and Networks, such as analyzing networks.
- Markov Matrices, Population, and Economics, such as population growth.
- Linear Programming, the simplex optimization method.
- Fourier Series: Linear Algebra for Functions, used widely in signal processing.
- Linear Algebra for Statistics and Probability, such as least squares for regression.
- Computer Graphics, such as the various translation, scaling and rotation of images.

For this lesson, you must find five quotes from research papers, blogs, or books that define the field of linear algebra.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover vectors and simple vector arithmetic.

In this lesson, you will discover vectors and simple vector arithmetic.

A vector is a tuple of one or more values called scalars.

Vectors are often represented using a lowercase character such as “v”; for example:

v = (v1, v2, v3)

Where v1, v2, v3 are scalar values, often real values.

We can represent a vector in Python as a NumPy array.

A NumPy array can be created from a list of numbers. For example, below we define a vector with the length of 3 and the integer values 1, 2, and 3.

# create a vector from numpy import array v = array([1, 2, 3]) print(v)

Two vectors of equal length can be multiplied together.

c = a * b

As with addition and subtraction, this operation is performed element-wise to result in a new vector of the same length.

a * b = (a1 * b1, a2 * b2, a3 * b3)

We can perform this operation directly in NumPy.

# multiply vectors from numpy import array a = array([1, 2, 3]) print(a) b = array([1, 2, 3]) print(b) c = a * b print(c)

For this lesson, you must implement other vector arithmetic operations such as addition, division, subtraction, and the vector dot product.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover matrices and simple matrix arithmetic.

In this lesson, you will discover matrices and simple matrix arithmetic.

A matrix is a two-dimensional array of scalars with one or more columns and one or more rows.

The notation for a matrix is often an uppercase letter, such as A, and entries are referred to by their two-dimensional subscript of row (i) and column (j), such as aij. For example:

A = ((a11, a12), (a21, a22), (a31, a32))

We can represent a matrix in Python using a two-dimensional NumPy array.

A NumPy array can be constructed given a list of lists. For example, below is a 2 row, 3 column matrix.

# create matrix from numpy import array A = array([[1, 2, 3], [4, 5, 6]]) print(A)

Two matrices with the same dimensions can be added together to create a new third matrix.

C = A + B

The scalar elements in the resulting matrix are calculated as the addition of the elements in each of the matrices being added.

We can implement this in python using the plus operator directly on the two NumPy arrays.

# add matrices from numpy import array A = array([[1, 2, 3], [4, 5, 6]]) print(A) B = array([[1, 2, 3], [4, 5, 6]]) print(B) C = A + B print(C)

Matrix multiplication, also called the matrix dot product, is more complicated than the previous operations and involves a rule, as not all matrices can be multiplied together.

C = A * B

The rule for matrix multiplication is as follows: The number of columns (n) in the first matrix (A) must equal the number of rows (m) in the second matrix (B).

For example, matrix A has the dimensions m rows and n columns and matrix B has the dimensions n and k. The n columns in A and n rows b are equal. The result is a new matrix with m rows and k columns.

C(m,k) = A(m,n) * B(n,k)

The intuition for the matrix multiplication is that we are calculating the dot product between each row in matrix A with each column in matrix B. For example, we can step down rows of column A and multiply each with column 1 in B to give the scalar values in column 1 of C.

The matrix multiplication operation can be implemented in NumPy using the dot() function.

# matrix dot product from numpy import array A = array([[1, 2], [3, 4], [5, 6]]) print(A) B = array([[1, 2], [3, 4]]) print(B) C = A.dot(B) print(C)

For this lesson, you must implement more matrix arithmetic operations such as subtraction, division, the Hadamard product, and vector-matrix multiplication.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover the different types of matrices and matrix operations.

In this lesson, you will discover the different types of matrices and matrix operations.

A defined matrix can be transposed, which creates a new matrix with the number of columns and rows flipped.

This is denoted by the superscript “T” next to the matrix.

C = A^T

We can transpose a matrix in NumPy by calling the T attribute.

# transpose matrix from numpy import array A = array([[1, 2], [3, 4], [5, 6]]) print(A) C = A.T print(C)

The operation of inverting a matrix is indicated by a -1 superscript next to the matrix; for example, A^-1. The result of the operation is referred to as the inverse of the original matrix; for example, B is the inverse of A.

B = A^-1

Not all matrices are invertible.

A matrix can be inverted in NumPy using the inv() function.

# invert matrix from numpy import array from numpy.linalg import inv # define matrix A = array([[1.0, 2.0], [3.0, 4.0]]) print(A) # invert matrix B = inv(A) print(B)

A square matrix is a matrix where the number of rows (n) equals the number of columns (m).

n = m

The square matrix is contrasted with the rectangular matrix where the number of rows and columns are not equal.

A symmetric matrix is a type of square matrix where the top-right triangle is the same as the bottom-left triangle.

To be symmetric, the axis of symmetry is always the main diagonal of the matrix, from the top left to the bottom right.

A symmetric matrix is always square and equal to its own transpose.

M = M^T

A triangular matrix is a type of square matrix that has all values in the upper-right or lower-left of the matrix with the remaining elements filled with zero values.

A triangular matrix with values only above the main diagonal is called an upper triangular matrix. Whereas, a triangular matrix with values only below the main diagonal is called a lower triangular matrix.

A diagonal matrix is one where values outside of the main diagonal have a zero value, where the main diagonal is taken from the top left of the matrix to the bottom right.

A diagonal matrix is often denoted with the variable D and may be represented as a full matrix or as a vector of values on the main diagonal.

For this lesson, you must develop examples for other matrix operations such as the determinant, trace, and rank.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover matrix factorization.

In this lesson, you will discover the basics of matrix factorization, also called matrix decomposition.

A matrix decomposition is a way of reducing a matrix into its constituent parts.

It is an approach that can simplify more complex matrix operations that can be performed on the decomposed matrix rather than on the original matrix itself.

A common analogy for matrix decomposition is the factoring of numbers, such as the factoring of 25 into 5 x 5. For this reason, matrix decomposition is also called matrix factorization. Like factoring real values, there are many ways to decompose a matrix, hence there are a range of different matrix decomposition techniques.

The LU decomposition is for square matrices and decomposes a matrix into L and U components.

A = L . U

Where A is the square matrix that we wish to decompose, L is the lower triangle matrix, and U is the upper triangle matrix. A variation of this decomposition that is numerically more stable to solve in practice is called the LUP decomposition, or the LU decomposition with partial pivoting.

A = P . L . U

The rows of the parent matrix are re-ordered to simplify the decomposition process and the additional P matrix specifies a way to permute the result or return the result to the original order. There are also other variations of the LU.

The LU decomposition is often used to simplify the solving of systems of linear equations, such as finding the coefficients in a linear regression.

The LU decomposition can be implemented in Python with the lu() function. More specifically, this function calculates an LPU decomposition.

# LU decomposition from numpy import array from scipy.linalg import lu # define a square matrix A = array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) print(A) # LU decomposition P, L, U = lu(A) print(P) print(L) print(U) # reconstruct B = P.dot(L).dot(U) print(B)

For this lesson, you must implement small examples of other simple methods for matrix factorization, such as the QR decomposition, the Cholesky decomposition, and the eigendecomposition.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover the Singular-Value Decomposition method for matrix factorization.

In this lesson, you will discover the Singular-Value Decomposition method for matrix factorization.

The Singular-Value Decomposition, or SVD for short, is a matrix decomposition method for reducing a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler.

A = U . Sigma . V^T

Where A is the real m x n matrix that we wish to decompose, U is an m x m matrix, Sigma (often represented by the uppercase Greek letter Sigma) is an m x n diagonal matrix, and V^T is the transpose of an n x n matrix where T is a superscript.

The SVD can be calculated by calling the svd() function.

The function takes a matrix and returns the U, Sigma, and V^T elements. The Sigma diagonal matrix is returned as a vector of singular values. The V matrix is returned in a transposed form, e.g. V.T.

# Singular-value decomposition from numpy import array from scipy.linalg import svd # define a matrix A = array([[1, 2], [3, 4], [5, 6]]) print(A) # SVD U, s, V = svd(A) print(U) print(s) print(V)

For this lesson, you must list 5 applications of the SVD.

Bonus points if you can demonstrate each with a small example in Python.

Post your answer in the comments below. I would love to see what you discover.

This was the final lesson in the mini-course.

(

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The importance of linear algebra to applied machine learning.
- What linear algebra is all about.
- What a vector is and how to perform vector arithmetic.
- What a matrix is and how to perform matrix arithmetic, including matrix multiplication.
- A suite of types of matrices, their properties, and advanced operations involving matrices.
- Matrix factorization methods and the LU decomposition method in detail.
- The popular Singular-Value decomposition method used in machine learning.

This is just the beginning of your journey with linear algebra for machine learning. Keep practicing and developing your skills.

Take the next step and check out my book on Linear Algebra for Machine Learning.

*How Did You Do with The Mini-Course?*

Did you enjoy this crash course?

*Do you have any questions? Were there any sticking points?*

Let me know. Leave a comment below.

The post Linear Algebra for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

]]>The post Computational Linear Algebra for Coders Review appeared first on Machine Learning Mastery.

]]>It is an area that requires some previous experience of linear algebra and is focused on both the performance and precision of the operations. The company fast.ai released a free course titled “*Computational Linear Algebra*” on the topic of numerical linear algebra that includes Python notebooks and video lectures recorded at the University of San Francisco.

In this post, you will discover the fast.ai free course on computational linear algebra.

After reading this post, you will know:

- The motivation and prerequisites for the course.
- An overview of the topics covered in the course.
- Who exactly this course is a good fit for, and who it is not.

**Kick-start your project** with my new book Linear Algebra for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

The course “*Computational Linear Algebra for Coders*” is a free online course provided by fast.ai. They are a company dedicated to providing free education resources related to deep learning.

The course was originally taught in 2017 by Rachel Thomas at the University of San Francisco as part of a masters degree program. Rachel Thomas is a professor at the University of San Francisco and co-founder of fast.ai and has a Ph.D. in mathematics.

The focus of the course is numerical methods for linear algebra. This is the application of matrix algebra on computers and addresses all of the concerns around the implementation and use of the methods such as performance and precision.

This course is focused on the question: How do we do matrix computations with acceptable speed and acceptable accuracy?

The course uses Python with examples using NumPy, scikit-learn, numba, pytorch, and more.

The material is taught using a top-down approach, much like MachineLearningMastery, intended to give a feeling for how to do things, before explaining how the methods work.

Knowing how these algorithms are implemented will allow you to better combine and utilize them, and will make it possible for you to customize them if needed.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The course does assume familiarity with linear algebra.

This includes topics such as vectors, matrices, operations such as matrix multiplication and transforms.

The course is not for novices to the field of linear algebra.

Three references are suggested for you to review prior to taking the course if you are new or rusty with linear algebra. They are:

- 3Blue 1Brown Essence of Linear Algebra, Video Course
- Immersive Linear Algebra, Interactive Textbook
- Chapter 2 of Deep Learning, 2016.

Further, while working through the course, references are provided as needed.

Two general reference texts are suggested up front. They are the following textbooks:

- Numerical Linear Algebra, 1997.
- Numerical Methods, 2012.

This section provides a summary of the 8 (9) parts to the course. They are:

- 0. Course Logistics
- 1. Why are we here?
- 2. Topic Modeling with NMF and SVD
- 3. Background Removal with Robust PCA
- 4. Compressed Sensing with Robust Regression
- 5. Predicting Health Outcomes with Linear Regressions
- 6. How to Implement Linear Regression
- 7. PageRank with Eigen Decompositions
- 8. Implementing QR Factorization

Really, there are only 8 parts to the course as the first is just administration details for the students that took the course at the University of San Francisco.

In this section, we will step through the 9 parts of the course and summarize their contents and topics covered to give you a feel for what to expect and to see whether it is a good fit for you.

This first lecture is not really part of the course.

It provides an introduction to the lecturer, the material, the way it will be taught, and the expectations of the student in the masters program.

I’ll be using a top-down teaching method, which is different from how most math courses operate. Typically, in a bottom-up approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures. The problems with this are that students often lose motivation, don’t have a sense of the “big picture”, and don’t know what they’ll need.

The topics covered in this lecture are:

- Lecturer background
- Teaching Approach
- Importance of Technical Writing
- List of Excellent Technical Blogs
- Linear Algebra Review Resources

Videos and Notebook:

This part introduces the motivation for the course, and touches on the importance of matrix factorization: the importance of the performance and accuracy of these calculations and some example applications.

Matrices are everywhere, anything that can be put in an Excel spreadsheet is a matrix, and language and pictures can be represented as matrices as well.

A great point made in this lecture is how the whole class of matrix factorization methods and one specific method, the QR decomposition, were reported as being among the top 10 most important algorithms of the 20th century.

A list of the Top 10 Algorithms of science and engineering during the 20th century includes: the matrix decompositions approach to linear algebra. It also includes the QR algorithm

The topics covered in this lecture are:

- Matrix and Tensor Products
- Matrix Decompositions
- Accuracy
- Memory use
- Speed
- Parallelization & Vectorization

Videos and Notebook:

This part focuses on the use of matrix factorization in the application to topic modeling for text, specifically the Singular Value Decomposition method, or SVD.

Useful in this part are the comparisons of calculating the methods from scratch or with NumPy and with the scikit-learn library.

Topic modeling is a great way to get started with matrix factorizations.

The topics covered in this lecture are:

- Topic Frequency-Inverse Document Frequency (TF-IDF)
- Singular Value Decomposition (SVD)
- Non-negative Matrix Factorization (NMF)
- Stochastic Gradient Descent (SGD)
- Intro to PyTorch
- Truncated SVD

Videos and Notebook:

- Computational Linear Algebra 2: Topic Modelling with SVD & NMF
- Computational Linear Algebra 3: Review, New Perspective on NMF, & Randomized SVD
- Notebook

This part focuses on the Principal Component Analysis method, or PCA, that uses the eigendecomposition and multivariate statistics.

The focus is on using PCA on image data such as separating background from foreground to isolate changes. This part also introduces the LU decomposition from scratch.

When dealing with high-dimensional data sets, we often leverage on the fact that the data has low intrinsic dimensionality in order to alleviate the curse of dimensionality and scale (perhaps it lies in a low-dimensional subspace or lies on a low-dimensional manifold).

The topics covered in this lecture are:

- Load and View Video Data
- SVD
- Principal Component Analysis (PCA)
- L1 Norm Induces Sparsity
- Robust PCA
- LU factorization
- Stability of LU
- LU factorization with Pivoting
- History of Gaussian Elimination
- Block Matrix Multiplication

Videos and Notebook:

- Computational Linear Algebra 3: Review, New Perspective on NMF, & Randomized SVD
- Computational Linear Algebra 4: Randomized SVD & Robust PCA
- Computational Linear Algebra 5: Robust PCA & LU Factorization
- Notebook

This part introduces the important concepts of broadcasting used in NumPy arrays (and elsewhere) and sparse matrices that crop up a lot in machine learning.

The application focus of this part is the use of robust PCA for background removal in CT scans.

The term broadcasting describes how arrays with different shapes are treated during arithmetic operations. The term broadcasting was first used by Numpy, although is now used in other libraries such as Tensorflow and Matlab; the rules can vary by library.

The topics covered in this lecture are:

- Broadcasting
- Sparse matrices
- CT Scans and Compressed Sensing
- L1 and L2 regression

Videos and Notebook:

- Computational Linear Algebra 6: Block Matrix Mult, Broadcasting, & Sparse Storage
- Computational Linear Algebra 7: Compressed Sensing for CT Scans
- Notebook

This part focuses on the development of linear regression models demonstrated with scikit-learn.

The Numba library is also used to demonstrate how to speed up the matrix operations involved.

We would like to speed this up. We will use Numba, a Python library that compiles code directly to C.

The topics covered in this lecture are:

- Linear regression in sklearn
- Polynomial Features
- Speeding up with Numba
- Regularization and Noise

Videos and Notebook:

- Computational Linear Algebra 8: Numba, Polynomial Features, How to Implement Linear Regression
- Notebook

This part looks at how to solve linear least squares for linear regression using a suite of different matrix factorization methods. Results are compared to the implementation in scikit-learn.

Linear regression via QR has been recommended by numerical analysts as the standard method for years. It is natural, elegant, and good for “daily use”.

The topics covered in this lecture are:

- How did Scikit Learn do it?
- Naive solution
- Normal equations and Cholesky factorization
- QR factorization
- SVD
- Timing Comparison
- Conditioning & Stability
- Full vs Reduced Factorizations
- Matrix Inversion is Unstable

Videos and Notebook:

- Computational Linear Algebra 8: Numba, Polynomial Features, How to Implement Linear Regression
- Notebook

This part introduces the eigendecomposition and the implementation and application of the PageRank algorithm to a Wikipedia links dataset.

The QR algorithm uses something called the QR decomposition. Both are important, so don’t get them confused.

The topics covered in this lecture are:

- SVD
- DBpedia Dataset
- Power Method
- QR Algorithm
- Two-phase approach to finding eigenvalues
- Arnoldi Iteration

Videos and Notebook:

- Computational Linear Algebra 9: PageRank with Eigen Decompositions
- Computational Linear Algebra 10: QR Algorithm to find Eigenvalues, Implementing QR Decomposition
- Notebook

This final part introduces three ways to implement the QR decomposition from scratch and compares the precision and performance of each method.

We used QR factorization in computing eigenvalues and to compute least squares regression. It is an important building block in numerical linear algebra.

The topics covered in this lecture are:

- Gram-Schmidt
- Householder
- Stability Examples

Videos and Notebook:

- Computational Linear Algebra 10: QR Algorithm to find Eigenvalues, Implementing QR Decomposition
- Notebook

I think the course is excellent.

A fun walk through numerical linear algebra with a focus on applications and executable code.

The course delivers on the promise of focusing on the practical concerns of matrix operations such as memory, speed, and precision or numerical stability. The course begins with a careful look at issues of floating point precision and overflow.

Throughout the course, frequently comparisons are made between methods in terms of execution speed.

This course is not an introduction to linear algebra for developers, and if that is the expectation going in, you may be left behind.

The course does assume a reasonable fluency with the basics of linear algebra, notation, and operations. And it does not hide this assumption up front.

I don’t think this course is required if you are interested in deep learning or learning more about the linear algebra operations used in deep learning methods.

If you are implementing matrix algebra methods in your own work and you’re looking to get more out of them, I would highly recommend this course.

I would also recommend this course if you are generally interested in the practical implications of matrix algebra.

This section provides more resources on the topic if you are looking to go deeper.

- New fast.ai course: Computational Linear Algebra
- Computational Linear Algebra on GitHub
- Computational Linear Algebra Video Lectures
- Community Forums

- 3Blue 1Brown Essence of Linear Algebra, Video Course
- Immersive Linear Algebra, Interactive Textbook
- Chapter 2 of Deep Learning
- Numerical Linear Algebra, 1997.
- Numerical Methods, 2012.

In this post, you discovered the fast.ai free course on computational linear algebra.

Specifically, you learned:

- The motivation and prerequisites for the course.
- An overview of the topics covered in the course.
- Who exactly this course is a good fit for, and who it is not.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Computational Linear Algebra for Coders Review appeared first on Machine Learning Mastery.

]]>The post Linear Algebra for Deep Learning appeared first on Machine Learning Mastery.

]]>Generally, an understanding of linear algebra (or parts thereof) is presented as a prerequisite for machine learning. Although important, this area of mathematics is seldom covered by computer science or software engineering degree programs.

In their seminal textbook on deep learning, Ian Goodfellow and others present chapters covering the prerequisite mathematical concepts for deep learning, including a chapter on linear algebra.

In this post, you will discover the crash course in linear algebra for deep learning presented in the de facto textbook on deep learning.

After reading this post, you will know:

- The topics suggested as prerequisites for deep learning by experts in the field.
- The progression through these topics and their culmination.
- Suggestions for how to get the most out of the chapter as a crash course in linear algebra.

**Kick-start your project** with my new book Linear Algebra for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

The book “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville is the de facto textbook for deep learning.

In the book, the authors provide a part titled “*Applied Math and Machine Learning Basics*” intended to provide the background in applied mathematics and machine learning required to understand the deep learning material presented in the rest of the book.

This part of the book includes four chapters; they are:

- Linear Algebra
- Probability and Information Theory
- Numerical Computation
- Machine Learning Basics

Given the expertise of the authors of the book, it is fair to say that the chapter on linear algebra provides a well reasoned set of prerequisites for deep learning, and perhaps more generally much of machine learning.

This part of the book introduces the basic mathematical concepts needed to understand deep learning.

— Page 30, Deep Learning, 2016.

Therefore, we can use the topics covered in the chapter on linear algebra as a guide to the topics you may be expected to be familiar with as a deep learning and machine learning practitioner.

Linear algebra is less likely to be covered in computer science courses than other types of math, such as discrete mathematics. This is specifically called out by the authors.

Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it.

— Page 31, Deep Learning, 2016.

We can take that the topics in this chapter are also laid out in a way tailored for computer science graduates with little to no prior exposure.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The chapter on linear algebra is divided into 12 sections.

As a first step, it is useful to use this as a high-level road map. The complete list of sections from the chapter are listed below.

- Scalars, Vectors, Matrices and Tensors
- Multiplying Matrices and Vectors
- Identity and Inverse Matrices
- Linear Dependence and Span
- Norms
- Special Kinds of Matrices and Vectors
- Eigendecomposition
- Singular Value Decomposition
- The Moore-Penrose Pseudoinverse
- The Trace Operator
- The Determinant
- Example: Principal Components Analysis

There’s not much value in enumerating the specifics covered in each section as the topics are mostly self explanatory, if familiar.

A reading of the chapter shows a progression in concepts and methods from the most primitive (vectors and matrices) to the derivation of the principal components analysis (known as PCA), a method used in machine learning.

It is a clean progression and well designed. Topics are presented with textual descriptions and consistent notation, allowing the reader to see exactly how elements come together through matrix factorization, the pseudoinverse, and ultimately PCA.

The focus is on the application of the linear algebra operations rather than theory. Although, no worked examples are given of any of the operations.

Finally, the derivation of PCA is perhaps a bit much. A beginner may want to skip this full derivation, or perhaps reduce it to the application of some of the elements learned throughout the chapter (e.g. eigendecomposition).

One area I would like to have seen covered is linear least squares and the use of various matrix algebra methods used to solve it, such as directly, LU, QR decomposition, and SVD. This might be more of a general machine learning perspective and less a deep learning perspective, and I can see why it was excluded.

The authors also suggest two other texts to consult if further depth in linear algebra is required.

They are:

- The Matrix Cookbook, Petersen and Pedersen, 2006.
- Linear Algebra, Shilov, 1977.

The Matrix Cookbook is a free PDF filled with the notations and equations of practically any matrix operation you can conceive.

These pages are a collection of facts (identities, approximations, inequalities, relations, …) about matrices and matters relating to them. It is collected in this form for the convenience of anyone who wants a quick desktop reference.

— page 2, The Matrix Cookbook, 2006.

Linear Algebra by Georgi Shilov is a classic and well regarded textbook on the topics designed for undergraduate students.

This book is intended as a text for undergraduate students majoring in mathematics and physics.

— Page v, Linear Algebra, 1977.

If you are a machine learning practitioner looking to use this chapter as a linear algebra crash course, then I would make a few recommendations to make the topics more concrete:

- Implement each operation in Python using NumPy functions on small contrived data.
- Implement each operation manually in Python without NumPy functions.
- Apply key operations, such as the factorization methods (eigendecomposition and SVD) and PCA to real but small datasets loaded from CSV.
- Create a cheat sheet of notation that you can use as a quick reference going forward.
- Research and list examples of each operation/topic used in machine learning papers or texts.

Did you take on any of these suggestions?

List your results in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- The Matrix Cookbook, Petersen and Pedersen, 2006.
- Linear Algebra, Shilov, 1977.

In this post, you discovered the crash course in linear algebra for deep learning presented in the de facto textbook on deep learning.

Specifically, you learned:

- The topics suggested as prerequisites for deep learning by experts in the field.
- The progression through these topics and their culmination.
- Suggestions for how to get the most out of the chapter as a crash course in linear algebra.

Did you read this chapter of the Deep Learning book? What did you think of it?

Let me know in the comments below.

The post Linear Algebra for Deep Learning appeared first on Machine Learning Mastery.

]]>