The post A Tour of Attention-Based Architectures appeared first on Machine Learning Mastery.
In this tutorial, you will discover the salient neural architectures that have been used in conjunction with attention.
After completing this tutorial, you will gain a better understanding of how the attention mechanism is incorporated into different neural architectures and for which purpose.
Let’s get started.
This tutorial is divided into four parts; they are:
The encoder-decoder architecture has been extensively applied to sequence-to-sequence (seq2seq) tasks for language processing. Examples of such tasks, within the domain of language processing, include machine translation and image captioning.
The earliest use of attention was as part of an RNN-based encoder-decoder framework to encode long input sentences [Bahdanau et al. 2015]. Consequently, attention has been most widely used with this architecture.
Within the context of machine translation, such a seq2seq task would involve the translation of an input sequence, $I = \{ A, B, C, <EOS> \}$, into an output sequence, $O = \{ W, X, Y, Z, <EOS> \}$, of a different length.
For an RNN-based encoder-decoder architecture without attention, unrolling each RNN would produce the following graph:
Here, the encoder reads the input sequence one word at a time, each time updating its internal state. It stops when it encounters the <EOS> symbol, which signals that the end of sequence has been reached. The hidden state generated by the encoder essentially contains a vector representation of the input sequence, which will then be processed by the decoder.
The decoder generates the output sequence one word at a time, taking the word at the previous time step ($t - 1$) as input to generate the next word in the output sequence. An <EOS> symbol at the decoding side signals that the decoding process has ended.
As we have previously mentioned, the problem with the encoder-decoder architecture without attention arises when sequences of different length and complexity are represented by a fixed-length vector, potentially resulting in the decoder missing important information.
To circumvent this problem, an attention-based architecture introduces an attention mechanism between the encoder and decoder.
Here, the attention mechanism ($\phi$) learns a set of attention weights that capture the relationship between the encoded vectors (v) and the hidden state of the decoder (h), to generate a context vector (c) through a weighted sum of all the hidden states of the encoder. In doing so, the decoder would have access to the entire input sequence, with specific focus on the input information that is most relevant for generating the output.
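The mechanics can be sketched in a few lines of NumPy; the dot-product scoring and the toy vectors below are illustrative choices, not from any particular paper:

```python
import numpy as np

def attention_context(encoder_states, decoder_state):
    """Toy attention: score each encoder state against the decoder state
    with a dot product, softmax the scores into attention weights, and
    return the weighted sum of encoder states as the context vector."""
    scores = encoder_states @ decoder_state           # alignment scores, one per encoder state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax -> attention weights
    context = weights @ encoder_states                # weighted sum of encoder states
    return context, weights

# Three encoded vectors (v) and one decoder hidden state (h)
v = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h = np.array([1.0, 0.0])
c, w = attention_context(v, h)
```

Note how the weights sum to 1, so the context vector is a convex combination of the encoder states, focused on the states most aligned with the decoder's current state.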
The architecture of the transformer also implements an encoder and decoder, however, as opposed to the architectures that we have reviewed above, it does not rely on the use of recurrent neural networks. For this reason, we shall be reviewing this architecture and its variants separately.
The transformer architecture dispenses with recurrence entirely and instead relies solely on a self-attention (or intra-attention) mechanism.
In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d …
– Advanced Deep Learning with Python, 2019.
The self-attention mechanism relies on the use of queries, keys, and values, which are generated by multiplying the encoder’s representation of the same input sequence with different weight matrices. The transformer uses dot-product (or multiplicative) attention, where each query is matched against a database of keys by a dot-product operation in the process of generating the attention weights. These weights are then applied to the values to generate a final attention vector.
Intuitively, since all queries, keys, and values originate from the same input sequence, the self-attention mechanism captures the relationship between the different elements of the same sequence, highlighting those that are most relevant to one another.
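A minimal NumPy sketch of this computation follows; the weight matrices are random stand-ins for the learned projections, and the scaling by the square root of the key dimension matches the transformer's scaled dot-product attention:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Dot-product self-attention: queries, keys, and values all come
    from the same input sequence X, via different weight matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ V                                # one output row per sequence element

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # sequence of 4 elements, dimensionality 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```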
Since the transformer does not rely on RNNs, the positional information of each element in the sequence can be preserved by augmenting the encoder’s representation of each element with positional encoding. This means that the transformer architecture may also be applied to tasks where the information may not necessarily be related sequentially, such as for the computer vision tasks of image classification, segmentation or captioning.
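A sketch of the sinusoidal positional encoding proposed in the original transformer paper (the constant 10000 comes from that paper; the sequence length and model dimension here are arbitrary):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position gets a unique
    pattern of sines and cosines that is added to the element's
    representation, preserving order information without recurrence."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
```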
Transformers can capture global/long range dependencies between input and output, support parallel processing, require minimal inductive biases (prior knowledge), demonstrate scalability to large sequences and datasets, and allow domain-agnostic processing of multiple modalities (text, images, speech) using similar processing blocks.
Furthermore, several attention layers can be stacked in parallel in what has been termed as multi-head attention. Each head works in parallel over different linear transformations of the same input, and the outputs of the heads are then concatenated to produce the final attention result. The benefit of having a multi-head model is that each head can attend to different elements of the sequence.
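Multi-head attention can be sketched by running the single-head computation once per head over a different linear transformation of the input and concatenating the results; the head count and dimensions below are arbitrary:

```python
import numpy as np

def multi_head(X, heads):
    """Multi-head attention sketch: each head applies dot-product
    attention over its own linear transformation of X, and the
    head outputs are concatenated into the final result."""
    outputs = []
    for Wq, Wk, Wv in heads:                          # one (Wq, Wk, Wv) triple per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        s = Q @ K.T / np.sqrt(K.shape[1])
        w = np.exp(s - s.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        outputs.append(w @ V)
    return np.concatenate(outputs, axis=-1)           # concatenate the head outputs

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
out = multi_head(X, heads)                            # 2 heads of dimension 4 -> (4, 8)
```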
Some variants of the transformer architecture that address limitations of the vanilla model are:
A graph can be defined as a set of nodes (or vertices) that are linked by means of connections (or edges).
A graph is a versatile data structure that lends itself well to the way data is organized in many real-world scenarios.
– Advanced Deep Learning with Python, 2019.
Take, for example, a social network where users can be represented by nodes in a graph, and their relationships with friends by edges. Or a molecule, where the nodes would be the atoms, and the edges would represent the chemical bonds between them.
We can think of an image as a graph, where each pixel is a node, directly connected to its neighboring pixels …
– Advanced Deep Learning with Python, 2019.
Of particular interest are the Graph Attention Networks (GAT) that employ a self-attention mechanism within a graph convolutional network (GCN), where the latter updates the state vectors by performing a convolution over the nodes of the graph. The convolution operation is applied to the central node and the neighboring nodes by means of a weighted filter, to update the representation of the central node. The filter weights in a GCN can be fixed or learnable.
A GAT, in comparison, assigns weights to the neighboring nodes using attention scores.
The computation of these attention scores follows a similar procedure as in the methods for seq2seq tasks reviewed above: (1) alignment scores are first computed between the feature vectors of two neighboring nodes, from which (2) attention scores are computed by applying a softmax operation, and finally (3) an output feature vector for each node (equivalent to the context vector in a seq2seq task) can be computed by a weighted combination of the feature vectors of all neighbors.
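The three steps can be sketched for a single attention head as follows; for simplicity this uses a plain dot product as the alignment score, whereas GAT proper uses a learned attention vector with a LeakyReLU:

```python
import numpy as np

def gat_layer(H, A, W):
    """Simplified single-head graph attention: for each node, compute
    alignment scores with its neighbors, softmax them into attention
    scores, and combine the neighbors' transformed features."""
    Z = H @ W                                         # transformed node features
    n = len(H)
    out = np.zeros_like(Z)
    for i in range(n):
        nbrs = [j for j in range(n) if A[i, j]]       # neighborhood (incl. self if A[i, i] = 1)
        scores = np.array([Z[i] @ Z[j] for j in nbrs])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                          # attention scores over the neighborhood
        out[i] = sum(a * Z[j] for a, j in zip(alpha, nbrs))
    return out

# A 3-node graph with self-loops in the adjacency matrix
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = gat_layer(H, A, np.eye(2))
```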
Multi-head attention can be applied here too, in a very similar manner as to how it was proposed in the transformer architecture that we have previously seen. Each node in the graph would be assigned multiple heads, and their outputs averaged in the final layer.
Once the final output has been produced, it can be used as input to a subsequent task-specific layer. Tasks that can be solved on graphs include the classification of individual nodes into different groups (for example, predicting which of several clubs a person will decide to join); the classification of individual edges to determine whether an edge exists between two nodes (for example, predicting whether two persons in a social network might be friends); or even the classification of a full graph (for example, predicting whether a molecule is toxic).
In the encoder-decoder attention-based architectures that we have reviewed so far, the set of vectors that encode the input sequence can be considered as external memory, to which the encoder writes and from which the decoder reads. However, a limitation arises because the encoder can only write to this memory, and the decoder can only read.
Memory-Augmented Neural Networks (MANNs) are recent algorithms that aim to address this limitation.
The Neural Turing Machine (NTM) is one type of MANN. It consists of a neural network controller that takes an input to produce an output, and performs read and write operations to memory.
The operation performed by the read head is similar to the attention mechanism employed for seq2seq tasks, where an attention weight indicates the importance of the vector under consideration in forming the output.
A read head always reads the full memory matrix, but it does so by attending to different memory vectors with different intensities.
– Advanced Deep Learning with Python, 2019.
The output of a read operation is then defined by a weighted sum of the memory vectors.
The write head also makes use of an attention vector, together with erase and add vectors. A memory location is erased based on the values in the attention and erase vectors, and information is written via the add vector.
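A sketch of the read and write operations, assuming the memory is a matrix with one row per memory slot (the function names and toy values are illustrative):

```python
import numpy as np

def ntm_read(memory, w):
    """NTM read: a weighted sum over all memory rows, with attention weights w."""
    return w @ memory

def ntm_write(memory, w, erase, add):
    """NTM write: each row is first partially erased (scaled down by its
    attention weight times the erase vector), then the add vector is
    blended in with the same attention weights."""
    M = memory * (1 - np.outer(w, erase))
    return M + np.outer(w, add)

M = np.ones((4, 3))                                   # 4 memory slots of size 3
w = np.array([0.7, 0.1, 0.1, 0.1])                    # attention over the slots
r = ntm_read(M, w)
M2 = ntm_write(M, w, erase=np.ones(3), add=np.zeros(3))
```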
Examples of applications for MANNs include question answering and chat bots, where an external memory stores a large database of sequences (or facts) that the neural network taps into. The role of the attention mechanism is crucial in selecting facts from the database that are more relevant than others for the task at hand.
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you discovered the salient neural architectures that have been used in conjunction with attention.
Specifically, you gained a better understanding of how the attention mechanism is incorporated into different neural architectures and for which purpose.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
The post Understanding Simple Recurrent Neural Networks In Keras appeared first on Machine Learning Mastery.
After completing this tutorial, you will know:
Let’s get started.
This tutorial is divided into two parts; they are:
It is assumed that you have a basic understanding of RNNs before you start implementing them. An Introduction To Recurrent Neural Networks And The Math That Powers Them gives you a quick overview of RNNs.
Let’s now get right down to the implementation part.
To start the implementation of RNNs, let’s add the import section.
from pandas import read_csv
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import math
import matplotlib.pyplot as plt
The function below returns a model that includes a SimpleRNN layer and a Dense layer for learning sequential data. The input_shape parameter specifies (time_steps x features). We’ll simplify everything and use univariate data, i.e., one feature only; the time_steps are discussed below.
def create_RNN(hidden_units, dense_units, input_shape, activation):
    model = Sequential()
    model.add(SimpleRNN(hidden_units, input_shape=input_shape, activation=activation[0]))
    model.add(Dense(units=dense_units, activation=activation[1]))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

demo_model = create_RNN(2, 1, (3,1), activation=['linear', 'linear'])
The object demo_model is returned with 2 hidden units created via the SimpleRNN layer and 1 dense unit created via the Dense layer. The input_shape is set at 3×1, and a linear activation function is used in both layers for simplicity. Just to recall, the linear activation function $f(x) = x$ makes no change to its input. The network looks as follows:
If we have $m$ hidden units ($m=2$ in the above case), then:
Let’s look at the above weights. Note: As the weights are initialized randomly, the results pasted here will be different from yours. The important thing is to learn what the structure of each object being used looks like and how it interacts with others to produce the final output.
wx = demo_model.get_weights()[0]
wh = demo_model.get_weights()[1]
bh = demo_model.get_weights()[2]
wy = demo_model.get_weights()[3]
by = demo_model.get_weights()[4]

print('wx = ', wx, ' wh = ', wh, ' bh = ', bh, ' wy =', wy, 'by = ', by)
wx =  [[ 0.18662322 -1.2369459 ]]
wh =  [[ 0.86981213 -0.49338293]
 [ 0.49338293  0.8698122 ]]
bh =  [0. 0.]
wy = [[-0.4635998]
 [ 0.6538409]]
by =  [0.]
Now let’s do a simple experiment to see how the SimpleRNN and Dense layers produce an output. Keep this figure in view.
We’ll input x for three time steps and let the network generate an output. The values of the hidden units at time steps 1, 2, and 3 will be computed. $h_0$ is initialized to the zero vector. The output $o_3$ is computed from $h_3$ and $w_y$. An activation function is not required, as we are using linear units.
x = np.array([1, 2, 3])
# Reshape the input to the required sample_size x time_steps x features
x_input = np.reshape(x, (1, 3, 1))
y_pred_model = demo_model.predict(x_input)

m = 2
h0 = np.zeros(m)
h1 = np.dot(x[0], wx) + np.dot(h0, wh) + bh
h2 = np.dot(x[1], wx) + np.dot(h1, wh) + bh
h3 = np.dot(x[2], wx) + np.dot(h2, wh) + bh
o3 = np.dot(h3, wy) + by

print('h1 = ', h1, 'h2 = ', h2, 'h3 = ', h3)
print("Prediction from network ", y_pred_model)
print("Prediction from our computation ", o3)
h1 =  [[ 0.18662322 -1.23694587]] h2 =  [[-0.07471441 -3.64187904]] h3 =  [[-1.30195881 -6.84172557]]
Prediction from network  [[-3.8698118]]
Prediction from our computation  [[-3.86981216]]
Now that we understand how the SimpleRNN and Dense layers are put together, let’s run a complete RNN on a simple time series dataset. We’ll need to follow these steps:
The following function reads the train and test data from a given URL and splits it into a given percentage of train and test data. It returns one-dimensional arrays for train and test data after scaling the data between 0 and 1 using MinMaxScaler from scikit-learn.
# Parameter split_percent defines the ratio of training examples
def get_train_test(url, split_percent=0.8):
    df = read_csv(url, usecols=[1], engine='python')
    data = np.array(df.values.astype('float32'))
    scaler = MinMaxScaler(feature_range=(0, 1))
    data = scaler.fit_transform(data).flatten()
    n = len(data)
    # Point for splitting data into train and test
    split = int(n*split_percent)
    train_data = data[range(split)]
    test_data = data[split:]
    return train_data, test_data, data

sunspots_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-sunspots.csv'
train_data, test_data, data = get_train_test(sunspots_url)
The next step is to prepare the data for Keras model training. The input array should be shaped as: total_samples x time_steps x features.
There are many ways of preparing time series data for training. We’ll create input rows with non-overlapping time steps. An example for time_steps = 2 is shown in the figure below. Here time_steps denotes the number of previous time steps to use for predicting the next value of the time series data.
The following function get_XY() takes a one-dimensional array as input and converts it to the required input X and target Y arrays. We’ll use 12 time_steps for the sunspots dataset, i.e., one year of monthly observations. You can experiment with other values of time_steps.
# Prepare the input X and target Y
def get_XY(dat, time_steps):
    # Indices of target array
    Y_ind = np.arange(time_steps, len(dat), time_steps)
    Y = dat[Y_ind]
    # Prepare X
    rows_x = len(Y)
    X = dat[range(time_steps*rows_x)]
    X = np.reshape(X, (rows_x, time_steps, 1))
    return X, Y

time_steps = 12
trainX, trainY = get_XY(train_data, time_steps)
testX, testY = get_XY(test_data, time_steps)
For this step, we can reuse the create_RNN() function that was defined above.
model = create_RNN(hidden_units=3, dense_units=1, input_shape=(time_steps,1),
                   activation=['tanh', 'tanh'])
model.fit(trainX, trainY, epochs=20, batch_size=1, verbose=2)
The function print_error() computes the root mean squared error between the actual values and the predicted values.
def print_error(trainY, testY, train_predict, test_predict):
    # Error of predictions
    train_rmse = math.sqrt(mean_squared_error(trainY, train_predict))
    test_rmse = math.sqrt(mean_squared_error(testY, test_predict))
    # Print RMSE
    print('Train RMSE: %.3f RMSE' % (train_rmse))
    print('Test RMSE: %.3f RMSE' % (test_rmse))

# make predictions
train_predict = model.predict(trainX)
test_predict = model.predict(testX)
# Print error
print_error(trainY, testY, train_predict, test_predict)
Train RMSE: 0.058 RMSE
Test RMSE: 0.077 RMSE
The following function plots the actual target values and the predicted values. The red line separates the training and test data points.
# Plot the result
def plot_result(trainY, testY, train_predict, test_predict):
    actual = np.append(trainY, testY)
    predictions = np.append(train_predict, test_predict)
    rows = len(actual)
    plt.figure(figsize=(15, 6), dpi=80)
    plt.plot(range(rows), actual)
    plt.plot(range(rows), predictions)
    plt.axvline(x=len(trainY), color='r')
    plt.legend(['Actual', 'Predictions'])
    plt.xlabel('Observation number after given time steps')
    plt.ylabel('Sunspots scaled')
    plt.title('Actual and Predicted Values. The Red Line Separates The Training And Test Examples')

plot_result(trainY, testY, train_predict, test_predict)
The following plot is generated:
Given below is the entire code for this tutorial. Do try this out at your end and experiment with different hidden units and time steps. You can add a second SimpleRNN layer to the network and see how it behaves. You can also use the scaler object to rescale the data back to its normal range.
from pandas import read_csv
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import math
import matplotlib.pyplot as plt

# Parameter split_percent defines the ratio of training examples
def get_train_test(url, split_percent=0.8):
    df = read_csv(url, usecols=[1], engine='python')
    data = np.array(df.values.astype('float32'))
    scaler = MinMaxScaler(feature_range=(0, 1))
    data = scaler.fit_transform(data).flatten()
    n = len(data)
    # Point for splitting data into train and test
    split = int(n*split_percent)
    train_data = data[range(split)]
    test_data = data[split:]
    return train_data, test_data, data

# Prepare the input X and target Y
def get_XY(dat, time_steps):
    Y_ind = np.arange(time_steps, len(dat), time_steps)
    Y = dat[Y_ind]
    rows_x = len(Y)
    X = dat[range(time_steps*rows_x)]
    X = np.reshape(X, (rows_x, time_steps, 1))
    return X, Y

def create_RNN(hidden_units, dense_units, input_shape, activation):
    model = Sequential()
    model.add(SimpleRNN(hidden_units, input_shape=input_shape, activation=activation[0]))
    model.add(Dense(units=dense_units, activation=activation[1]))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

def print_error(trainY, testY, train_predict, test_predict):
    # Error of predictions
    train_rmse = math.sqrt(mean_squared_error(trainY, train_predict))
    test_rmse = math.sqrt(mean_squared_error(testY, test_predict))
    # Print RMSE
    print('Train RMSE: %.3f RMSE' % (train_rmse))
    print('Test RMSE: %.3f RMSE' % (test_rmse))

# Plot the result
def plot_result(trainY, testY, train_predict, test_predict):
    actual = np.append(trainY, testY)
    predictions = np.append(train_predict, test_predict)
    rows = len(actual)
    plt.figure(figsize=(15, 6), dpi=80)
    plt.plot(range(rows), actual)
    plt.plot(range(rows), predictions)
    plt.axvline(x=len(trainY), color='r')
    plt.legend(['Actual', 'Predictions'])
    plt.xlabel('Observation number after given time steps')
    plt.ylabel('Sunspots scaled')
    plt.title('Actual and Predicted Values. The Red Line Separates The Training And Test Examples')

sunspots_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-sunspots.csv'
time_steps = 12
train_data, test_data, data = get_train_test(sunspots_url)
trainX, trainY = get_XY(train_data, time_steps)
testX, testY = get_XY(test_data, time_steps)
# Create model and train
model = create_RNN(hidden_units=3, dense_units=1, input_shape=(time_steps,1),
                   activation=['tanh', 'tanh'])
model.fit(trainX, trainY, epochs=20, batch_size=1, verbose=2)
# make predictions
train_predict = model.predict(trainX)
test_predict = model.predict(testX)
# Print error
print_error(trainY, testY, train_predict, test_predict)
# Plot result
plot_result(trainY, testY, train_predict, test_predict)
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you discovered simple recurrent neural networks and how to implement them in Keras.
Specifically, you learned:
Do you have any questions about RNNs discussed in this post? Ask your questions in the comments below and I will do my best to answer.
The post How to Learn Python for Machine Learning appeared first on Machine Learning Mastery.
In this post, you will discover the right way to learn a programming language and how to get help. After reading this post, you will know:
Let’s get started.
There are many ways to learn a language, whether a natural language like English or a programming language like Python. Babies learn a language by listening and mimicking; slowly, once they have learned the patterns and some vocabulary, they can make up their own sentences. By contrast, when college students learn Latin, they probably start with grammar rules: singular and plural, indicative and subjunctive, nominative and accusative. Then they can build up a Latin sentence.
Similarly, when learning Python or any programming language, you can either read other people’s code, try to understand it, and then modify it, or you can learn the language rules and build up a program from scratch. The latter is beneficial if your ultimate goal is to work on the language itself, such as writing a Python interpreter. But usually, the former approach gets you results faster.
My suggestion is to learn from examples first, but strengthen your foundation in the language by revisiting its rules from time to time. Let’s look at an example from Wikipedia:
def secant_method(f, x0, x1, iterations):
    """Return the root calculated using the secant method."""
    for i in range(iterations):
        x2 = x1 - f(x1) * (x1 - x0) / float(f(x1) - f(x0))
        x0, x1 = x1, x2
    return x2

def f_example(x):
    return x ** 2 - 612

root = secant_method(f_example, 10, 30, 5)

print("Root: {}".format(root))  # Root: 24.738633748750722
This Python code implements the secant method to find a root of a function. If you are new to Python, look at the example and see how much you can understand. If you have prior knowledge of other programming languages, you can probably guess that def defines a function. If you do not, you might feel confused, and it would be best to start with a beginner’s book on programming to learn the concepts of functions, variables, loops, and so on.
The next thing you might try is to modify the function. For example, what if we used Newton’s method instead of the secant method to find the root? You may guess how to modify the equation on line 4 to do it. What about the bisection method? You would need to add a statement such as if f(x2) > 0 to decide which way to go. If we look at the function f_example, we see the symbol **. This is the exponent operator, meaning $x$ raised to the power of 2 there. But should it be $x^2 - 612$ or $x^{2-612}$? You would need to go back and check the language manual for the operator precedence hierarchy.
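A quick way to settle the precedence question is to try it in the interpreter; ** binds more tightly than the binary minus, so the expression means $x^2 - 612$:

```python
x = 3
print(x ** 2 - 612)    # parsed as (x ** 2) - 612, prints -603
print(x ** (2 - 612))  # explicit parentheses change the meaning: a tiny positive float
```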
Therefore, even from a short example like this, you can learn a lot about the language’s features. By learning from more examples, you can deduce the syntax, get used to the idiomatic way of coding, and get some work done even if you cannot explain every detail.
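As a sketch of the Newton’s-method modification mentioned above (the derivative argument and the iteration count here are our own choices, not part of the Wikipedia example):

```python
def newton_method(f, df, x0, iterations):
    """Newton's method: instead of a secant through two points,
    use the derivative df at the current point to step toward the root."""
    x = x0
    for _ in range(iterations):
        x = x - f(x) / df(x)
    return x

def f_example(x):
    return x ** 2 - 612

# The derivative of x**2 - 612 is 2*x
root = newton_method(f_example, lambda x: 2 * x, 10, 8)
```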
If you have decided to learn Python, it is inevitable that you will learn from a book. Picking up any beginner’s book on Python from your local library should work. While you read, keep the bigger picture of your learning goal in mind. Do some exercises as you read: try out the code from the book and make up your own. It is not a bad idea to skip some pages; reading a book cover to cover may not be the most efficient way to learn. Avoid drilling too deep into a single topic, because doing so will make you lose track of the bigger goal of using Python to do useful things. Topics such as multithreading, network sockets, and object-oriented programming can be treated as advanced topics to pick up later.
Python is a language decoupled from its interpreter or compiler, so different interpreters may behave a bit differently. The standard interpreter from python.org is CPython, also called the reference implementation. A common alternative is PyPy. Regardless of which one you use, you should learn Python 3 rather than Python 2, as the latter is an obsolete dialect. But bear in mind that Python gained its momentum with Python 2, and you may still see quite a lot of Python 2 programs around.
If you cannot go to the library to pick up a printed book, you can make use of some online resources instead. I would highly recommend beginners read The Python Tutorial. It is short but guides you through different aspects of the language. It gives you a glance at what Python can do and how to do it.
After the tutorial, you should probably keep the Python Language Reference and the Python Library Reference handy. You will refer to them from time to time to check syntax and look up function usage. Do not force yourself to remember every function.
Python comes built in on macOS, but you may want to install a newer version. On Windows, it is common to see people using Anaconda instead of installing just the Python interpreter. But if you feel it is too much hassle to install an IDE and the Python programming environment, you might consider using Google Colab. It allows you to write Python programs in a “notebook” format. Indeed, many machine learning projects are developed in Jupyter notebooks, as they allow us to quickly explore different approaches to a problem and visually verify the result.
You can also use the online shell at https://www.python.org/shell/ to try out a short snippet. The downside compared to Google Colab is that you cannot save your work.
When you start from an example you saw in a book and modify it, you might break the code and make it fail to run. This is especially true of machine learning examples, where many lines of code cover everything from data collection, preprocessing, building a model, training, and validation to prediction and finally presenting the result in a visual manner. When you see an error from your code, the first thing to do is pinpoint the few lines that caused it. Try to check the output of each step to make sure it is in the correct format. Or try to roll back your code to see which change introduced the error.
It is important to make mistakes and learn from them. As you try out syntax and learn your way around, you will encounter error messages from time to time. Try to make sense of them; then it will be easier to figure out what caused the error. Almost always, if the error comes from a library you are using, double-check your syntax against the library’s documentation.
If you are still confused, try searching for the error on the internet. If you are using Google, one trick is to put the entire error message in a pair of double quotes when you search. Sometimes a search on StackOverflow may give you better answers.
Here I list some pointers for a beginner. As mentioned above, the Python Tutorial is a good start. This is especially true at the time of writing, when Python 3.9 rolled out recently and some new syntax was introduced. Printed books are usually not as up to date as the official tutorial online.
There are many primer-level books for Python. Some short ones that I know of are:
For a bit more advanced learner, you may want to see more examples to get something done. A cookbook style book might help a lot as you can learn not only the syntax and language tricks, but also the different libraries that can get things done.
In this post, you learned how one should study Python and the resources that can help you start. A goal-oriented approach to study can help you get results quicker, but as always, you need to invest significant time before you become proficient.
The post An Introduction To Recurrent Neural Networks And The Math That Powers Them appeared first on Machine Learning Mastery.
After completing this tutorial, you will know:
Let’s get started.
This tutorial is divided into two parts; they are:
For this tutorial, it is assumed that you are already familiar with artificial neural networks and the backpropagation algorithm. If not, you can go through this very nice tutorial, Calculus in Action: Neural Networks, by Stefania Cristina. The tutorial also explains how the gradient-based backpropagation algorithm is used to train a neural network.
A recurrent neural network (RNN) is a special type of artificial neural network adapted to work with time series data or data that involves sequences. Ordinary feedforward neural networks are only meant for data points that are independent of each other. However, if we have data in a sequence such that one data point depends upon the previous data point, we need to modify the neural network to incorporate the dependencies between these data points. RNNs have the concept of “memory” that helps them store the states or information of previous inputs to generate the next output of the sequence.
A simple RNN has a feedback loop as shown in the first diagram of the above figure. The feedback loop shown in the gray rectangle can be unrolled in 3 time steps to produce the second network of the above figure. Of course, you can vary the architecture so that the network unrolls $k$ time steps. In the figure, the following notation is used:
We can unfold the network for $k$ time steps to get the output at time step $k+1$. The unfolded network is very similar to a feedforward neural network. The rectangle in the unfolded network shows an operation taking place. So, for example, with an activation function $f$, the hidden state at time $t+1$ is computed as:

$$
h_{t+1} = f(x_t, h_t, w_x, w_h) = f(w_x \cdot x_t + w_h \cdot h_t + b_h)
$$
The output $y$ at time $t$ is computed as:
$$
y_{t} = f(h_t, w_y) = f(w_y \cdot h_t + b_y)
$$
Here, $\cdot$ is the dot product.
Hence, in the feedforward pass of an RNN, the network computes the values of the hidden units and the output after $k$ time steps. The weights associated with the network are shared temporally. Each recurrent layer has two sets of weights: one for the input and one for the hidden unit. The last feedforward layer, which computes the final output for the $k$th time step, is just like an ordinary layer of a traditional feedforward network.
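This forward pass can be sketched in NumPy to make the temporal weight sharing explicit; the weight names follow the notation above, and the random values are illustrative:

```python
import numpy as np

def rnn_forward(x_seq, wx, wh, bh, wy, by, f=np.tanh):
    """Unrolled forward pass: the same (wx, wh, bh) are reused at
    every time step, i.e., the weights are shared temporally."""
    h = np.zeros(wh.shape[0])                         # h_0 initialized to the zero vector
    for x_t in x_seq:
        h = f(wx * x_t + wh @ h + bh)                 # hidden-state update at each step
    return wy @ h + by                                # final feedforward layer

rng = np.random.default_rng(2)
m = 2                                                 # number of hidden units
wx, wh, bh = rng.normal(size=m), rng.normal(size=(m, m)), np.zeros(m)
wy, by = rng.normal(size=m), 0.0
y = rnn_forward([0.5, -0.1, 0.3], wx, wh, bh, wy, by)
```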
We can use any activation function we like in the recurrent neural network. Common choices are:
The backpropagation algorithm of an artificial neural network is modified to include the unfolding in time to train the weights of the network. This algorithm is based on computing the gradient vector and is called backpropagation through time, or the BPTT algorithm for short. The pseudo-code for training is given below. The value of $k$ can be selected by the user for training. In the pseudo-code below, $p_t$ is the target value at time step $t$:
There are different types of recurrent neural networks with varying architectures. Some examples are:
Here there is a single $(x_t, y_t)$ pair. Traditional neural networks employ a one to one architecture.
In one to many networks, a single input at $x_t$ can produce multiple outputs, e.g., $(y_{t0}, y_{t1}, y_{t2})$. Music generation is an example area, where one to many networks are employed.
In this case many inputs from different time steps produce a single output. For example, $(x_t, x_{t+1}, x_{t+2})$ can produce a single output $y_t$. Such networks are employed in sentiment analysis or emotion detection, where the class label depends upon a sequence of words.
There are many possibilities for many to many. An example is shown above, where two inputs produce three outputs. Many to many networks are applied in machine translation, e.g., English-to-French translation systems, or vice versa.
RNNs have various advantages such as:
The disadvantages are:
There are different variations of RNNs that are being applied practically in machine learning problems:
In BRNN, inputs from future time steps are used to improve the accuracy of the network. It is like having knowledge of the first and last words of a sentence to predict the middle words.
These networks are designed to handle the vanishing gradient problem. They have a reset and update gate. These gates determine which information is to be retained for future predictions.
LSTMs were also designed to address the vanishing gradient problem in RNNs. LSTMs use three gates, called the input, output, and forget gates. Similar to the GRU, these gates determine which information to retain.
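For reference, a common formulation of the LSTM updates (standard notation, not specific to this post) makes the role of the three gates explicit; here $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The forget gate $f_t$ controls how much of the previous cell state $c_{t-1}$ is carried forward, which is what lets gradients flow over long spans of time.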
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you discovered recurrent neural networks and their various architectures.
Specifically, you learned:
Do you have any questions about RNNs discussed in this post? Ask your questions in the comments below and I will do my best to answer.
The post An Introduction To Recurrent Neural Networks And The Math That Powers Them appeared first on Machine Learning Mastery.
The post Training-validation-test split and cross-validation done right appeared first on Machine Learning Mastery.
This is why we have cross validation. In scikit-learn, there is a family of functions that help us do this. But quite often, we see cross validation used improperly, or the result of cross validation not being interpreted correctly.
In this tutorial, you will discover the correct procedure to use cross validation and a dataset to select the best models for a project.
After completing this tutorial, you will know:
Let’s get started.
This tutorial is divided into three parts:
The outcome of machine learning is a model that can do prediction. The most common cases are the classification model and the regression model; the former predicts the class membership of an input, and the latter predicts the value of a dependent variable based on the input. In either case, however, we have a variety of models to choose from. Classification models, for instance, include decision trees, support vector machines, and neural networks, to name a few. Any one of these depends on some hyperparameters. Therefore, we have to decide on a number of settings before we start training a model.
If we have two candidate models based on our intuition, and we want to pick one to use in our project, how should we pick?
There are some standard metrics we can generally use. In regression problems, we commonly use one of the following:
and in case of classification problems, frequently used metrics consists of:
The metrics page from scikit-learn has a longer, but not exhaustive, list of common evaluations put into different categories. If we have a sample dataset and want to train a model on it, we can use one of these metrics to evaluate how well the model performs.
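As a sketch of how such metrics are computed with scikit-learn (the toy predictions below are made up purely for illustration):

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score

# regression: compare predicted values against the true ones
y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.2]
print(mean_squared_error(y_true, y_pred))   # approximately 0.02
print(mean_absolute_error(y_true, y_pred))  # approximately 0.1333

# classification: fraction of correctly predicted class labels
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```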
However, there is a problem; for the sample dataset, we only evaluated the model once. Assuming we correctly separated the dataset into a training set and a test set, fitted the model on the training set, and evaluated it on the test set, we obtained only a single sample point of evaluation with one test set. How can we be sure it is an accurate evaluation, rather than a value too low or too high by chance? If we have two models and found that one model is better than the other based on the evaluation, how can we know this is also not by chance?
The reason we are concerned about this is to avoid surprisingly low accuracy when the model is deployed and used on entirely new data in the future.
The solution to this problem is the training-validation-test split.
The model is initially fit on a training data set, […] Successively, the fitted model is used to predict the responses for the observations in a second data set called the validation data set. […] Finally, the test data set is a data set used to provide an unbiased evaluation of a final model fit on the training data set. If the data in the test data set has never been used in training (for example in cross-validation), the test data set is also called a holdout data set.
— “Training, validation, and test sets”, Wikipedia
The reason for such practice, lies in the concept of preventing data leakage.
“What gets measured gets improved.”, or as Goodhart’s law puts it, “When a measure becomes a target, it ceases to be a good measure.” If we use one set of data to choose a model, the model we chose, with certainty, will do well on the same set of data under the same evaluation metric. However, what we should care about is the evaluation metric on the unseen data instead.
Therefore, we need to keep a slice of data from the entire model selection and training process, and save it for the final evaluation. This slice of data is the “final exam” to our model and the exam questions must not be seen by the model before. Precisely, this is the workflow of how the data is being used:
In steps 1 and 2, we do not want to evaluate each candidate model only once. Instead, we prefer to evaluate each model multiple times with different datasets and take the average score for our decision in step 3. If we have the luxury of vast amounts of data, this could be done easily. Otherwise, we can use the trick of k-fold to resample the same dataset multiple times and pretend they are different. As we are evaluating each model, or hyperparameter, the model has to be trained from scratch each time, without reusing the training results from previous attempts. We call this process cross validation.
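As a minimal sketch of the k-fold trick with scikit-learn's KFold, one small dataset is resampled into five train/validation splits:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)
for train_idx, val_idx in KFold(n_splits=5).split(X):
    # each fold trains on 8 samples and validates on the 2 held out
    print("train:", train_idx, "validate:", val_idx)
```

Every sample serves as validation data exactly once across the five folds, which is what lets us average five scores from one dataset.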
From the result of cross validation, we can conclude whether one model is better than another. Since the cross validation is done on a smaller dataset, we may want to retrain the model again, once we have a decision on the model. The reason is the same as that for why we need to use k-fold in cross-validation; we do not have a lot of data, and the smaller dataset we used previously, had a part of it held out for validation. We believe combining the training and validation dataset can produce a better model. This is what would occur in step 4.
The dataset for evaluation in step 5, and the one we used in cross validation, are different because we do not want data leakage. If they were the same, we would see the same score as we have already seen from the cross validation. Or even worse, the test score was guaranteed to be good because it was part of the data we used to train the chosen model and we have adapted the model for that test dataset.
Once we finished the training, we want to (1) compare this model to our previous evaluation and (2) estimate how it will perform if we deploy it.
We make use of the test dataset that was never used in previous steps to evaluate the performance. Because this is unseen data, it can help us evaluate the generalization, or out-of-sample, error. This should simulate what the model will do when we deploy it. If there is overfitting, we would expect the error to be high at this evaluation.
Similarly, we do not expect this evaluation score to be very different from that we obtained from cross validation in the previous step, if we did the model training correctly. This can serve as a confirmation for our model selection.
In the following, we fabricate a regression problem to illustrate what a model selection workflow should look like.
First, we use numpy to generate a dataset:
...
# Generate data and plot
N = 300
x = np.linspace(0, 7*np.pi, N)
smooth = 1 + 0.5*np.sin(x)
y = smooth + 0.2*np.random.randn(N)
We generate a sine curve and add some noise into it. Essentially, the data is
$$y=1 + 0.5\sin(x) + \epsilon$$
for some small noise signal $\epsilon$. The data looks like the following:
Then we perform a train-test split and hold out the test set until we finish our final model. Because we are going to use scikit-learn models for regression, and they assume the input x to be a two-dimensional array, we reshape it here first. Also, to make the effect of model selection more pronounced, we do not shuffle the data in the split. In reality, this is usually not a good idea.
...
# Train-test split, intentionally use shuffle=False
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
In the next step, we create two models for regression. They are namely quadratic:
$$y = c + b\times x + a\times x^2$$
and linear:
$$y = b + a\times x$$
There is no polynomial regression in scikit-learn, but we can make use of PolynomialFeatures combined with LinearRegression to achieve that. PolynomialFeatures(2) will convert the input $x$ into $1, x, x^2$, and linear regression on these three will find us the coefficients $a, b, c$ in the formula above.
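We can quickly verify what PolynomialFeatures(2) produces for a single input (assuming the default include_bias=True):

```python
from sklearn.preprocessing import PolynomialFeatures

# a single input x=2 expands to the columns 1, x, x^2
print(PolynomialFeatures(2).fit_transform([[2.0]]))  # [[1. 2. 4.]]
```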
...
# Create two models: Quadratic and linear regression
polyreg = make_pipeline(PolynomialFeatures(2), LinearRegression(fit_intercept=False))
linreg = LinearRegression()
The next step is to use only the training set and apply k-fold cross validation to each of the two models:
...
# Cross-validation
scoring = "neg_root_mean_squared_error"
polyscores = cross_validate(polyreg, X_train, y_train, scoring=scoring, return_estimator=True)
linscores = cross_validate(linreg, X_train, y_train, scoring=scoring, return_estimator=True)
The function cross_validate() returns a Python dictionary like the following:
{'fit_time': array([0.00177097, 0.00117302, 0.00219226, 0.0015142 , 0.00126314]),
 'score_time': array([0.00054097, 0.0004108 , 0.00086379, 0.00092077, 0.00043106]),
 'estimator': [Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                               ('linearregression', LinearRegression(fit_intercept=False))]),
               Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                               ('linearregression', LinearRegression(fit_intercept=False))]),
               Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                               ('linearregression', LinearRegression(fit_intercept=False))]),
               Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                               ('linearregression', LinearRegression(fit_intercept=False))]),
               Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                               ('linearregression', LinearRegression(fit_intercept=False))])],
 'test_score': array([-1.00421665, -0.53397399, -0.47742336, -0.41834582, -0.68043053])}
in which the key test_score holds the score for each fold. We are using the negative root mean squared error for cross validation; the higher the score, the smaller the error, and hence the better the model.
The above is from the quadratic model. The corresponding test score from the linear model is as follows:
array([-0.43401194, -0.52385836, -0.42231028, -0.41532203, -0.43441137])
By comparing the average score, we found that the linear model performs better than the quadratic model.
...
# Which one is better? Linear and polynomial
print(linscores["test_score"].mean())
print(polyscores["test_score"].mean())
print(linscores["test_score"].mean() - polyscores["test_score"].mean())
Linear regression score: -0.4459827970437929
Polynomial regression score: -0.6228780695994603
Difference: 0.17689527255566745
Before we proceed to train our model of choice, we can illustrate what happened. Taking the first cross-validation iteration as an example, we can see that the coefficients of the quadratic regression are as follows:
...
# Let's show the coefficients of the first fitted polynomial regression
# This starts from the constant term and in ascending order of powers
print(polyscores["estimator"][0].steps[1][1].coef_)
[-0.03190358 0.20818594 -0.00937904]
This means our fitted quadratic model is:
$$y=-0.0319 + 0.2082\times x - 0.0094\times x^2$$
and the coefficients of the linear regression at the first iteration of its cross-validation are
...
# And show the coefficients of the first-fitted linear regression
print(linscores["estimator"][0].intercept_, linscores["estimator"][0].coef_)
0.856999187854241 [-0.00918622]
which means the fitted linear model is
$$y = 0.8570 - 0.0092\times x$$
We can see how they look like in a plot:
...
# Plot and compare
plt.plot(x, y)
plt.plot(x, smooth)
plt.plot(x, polyscores["estimator"][0].predict(X))
plt.plot(x, linscores["estimator"][0].predict(X))
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Here, the red line is the linear regression while the green line is from the quadratic regression. We can see that the quadratic curve is far off from the input data (blue curve) at the two ends.
Since we decided to use the linear model for regression, we need to re-train the model and test it using our held-out test data.
...
# Retrain the model and evaluate
linreg.fit(X_train, y_train)
print("Test set RMSE:", mean_squared_error(y_test, linreg.predict(X_test), squared=False))
print("Mean validation RMSE:", -linscores["test_score"].mean())
Test set RMSE: 0.4403109417232645
Mean validation RMSE: 0.4459827970437929
Here, since scikit-learn clones a new model on every iteration of cross validation, the model we created remains untrained after cross validation, so we can fit it directly. Otherwise, we would need to reset the model by cloning a new one using linreg = sklearn.base.clone(linreg). From the above, we see that we obtained a root mean squared error of 0.440 on our test set, while the score we obtained from cross validation is 0.446. This is not too much of a difference, and hence we conclude that this model should see an error of similar magnitude on new data.
Tying all these together, the complete example is listed below.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(42)

# Generate data and plot
N = 300
x = np.linspace(0, 7*np.pi, N)
smooth = 1 + 0.5*np.sin(x)
y = smooth + 0.2*np.random.randn(N)
plt.plot(x, y)
plt.plot(x, smooth)
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(0,2)
plt.show()

# Train-test split, intentionally use shuffle=False
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)

# Create two models: Polynomial and linear regression
degree = 2
polyreg = make_pipeline(PolynomialFeatures(degree), LinearRegression(fit_intercept=False))
linreg = LinearRegression()

# Cross-validation
scoring = "neg_root_mean_squared_error"
polyscores = cross_validate(polyreg, X_train, y_train, scoring=scoring, return_estimator=True)
linscores = cross_validate(linreg, X_train, y_train, scoring=scoring, return_estimator=True)

# Which one is better? Linear and polynomial
print("Linear regression score:", linscores["test_score"].mean())
print("Polynomial regression score:", polyscores["test_score"].mean())
print("Difference:", linscores["test_score"].mean() - polyscores["test_score"].mean())

print("Coefficients of polynomial regression and linear regression:")
# Let's show the coefficients of the first fitted polynomial regression
# This starts from the constant term and in ascending order of powers
print(polyscores["estimator"][0].steps[1][1].coef_)
# And show the coefficients of the first-fitted linear regression
print(linscores["estimator"][0].intercept_, linscores["estimator"][0].coef_)

# Plot and compare
plt.plot(x, y)
plt.plot(x, smooth)
plt.plot(x, polyscores["estimator"][0].predict(X))
plt.plot(x, linscores["estimator"][0].predict(X))
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Retrain the model and evaluate
import sklearn
linreg = sklearn.base.clone(linreg)
linreg.fit(X_train, y_train)
print("Test set RMSE:", mean_squared_error(y_test, linreg.predict(X_test), squared=False))
print("Mean validation RMSE:", -linscores["test_score"].mean())
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you discovered how to do training-validation-test split of dataset and perform k-fold cross validation to select a model correctly and how to retrain the model after the selection.
Specifically, you learned:
The post The Attention Mechanism from Scratch appeared first on Machine Learning Mastery.
In this tutorial, you will discover the attention mechanism and its implementation.
After completing this tutorial, you will know:
Let’s get started.
This tutorial is divided into three parts; they are:
The attention mechanism was introduced by Bahdanau et al. (2014), to address the bottleneck problem that arises with the use of a fixed-length encoding vector, where the decoder would have limited access to the information provided by the input. This is thought to become especially problematic for long and/or complex sequences, where the dimensionality of their representation would be forced to be the same as for shorter or simpler sequences.
We had seen that Bahdanau et al.’s attention mechanism is divided into the step-by-step computations of the alignment scores, the weights and the context vector:
$$e_{t,i} = a(\mathbf{s}_{t-1}, \mathbf{h}_i)$$
$$\alpha_{t,i} = \text{softmax}(e_{t,i})$$
$$\mathbf{c}_t = \sum_{i=1}^T \alpha_{t,i} \mathbf{h}_i$$
Bahdanau et al. had implemented an RNN for both the encoder and decoder.
However, the attention mechanism can be re-formulated into a general form that can be applied to any sequence-to-sequence (abbreviated to seq2seq) task, where the information may not necessarily be related in a sequential fashion.
In other words, the database doesn’t have to consist of the hidden RNN states at different steps, but could contain any kind of information instead.
– Advanced Deep Learning with Python, 2019.
The general attention mechanism makes use of three main components, namely the queries, $\mathbf{Q}$, the keys, $\mathbf{K}$, and the values, $\mathbf{V}$.
If we had to compare these three components to the attention mechanism as proposed by Bahdanau et al., then the query would be analogous to the previous decoder output, $\mathbf{s}_{t-1}$, while the values would be analogous to the encoded inputs, $\mathbf{h}_i$. In the Bahdanau attention mechanism, the keys and values are the same vector.
In this case, we can think of the vector $\mathbf{s}_{t-1}$ as a query executed against a database of key-value pairs, where the keys are vectors and the hidden states $\mathbf{h}_i$ are the values.
– Advanced Deep Learning with Python, 2019.
The general attention mechanism then performs the following computations:
$$e_{\mathbf{q},\mathbf{k}_i} = \mathbf{q} \cdot \mathbf{k}_i$$
$$\alpha_{\mathbf{q},\mathbf{k}_i} = \text{softmax}(e_{\mathbf{q},\mathbf{k}_i})$$
$$\text{attention}(\mathbf{q}, \mathbf{K}, \mathbf{V}) = \sum_i \alpha_{\mathbf{q},\mathbf{k}_i} \mathbf{v}_{\mathbf{k}_i}$$
Within the context of machine translation, each word in an input sentence would be attributed its own query, key and value vectors. These vectors are generated by multiplying the encoder’s representation of the specific word under consideration, with three different weight matrices that would have been generated during training.
In essence, when the generalized attention mechanism is presented with a sequence of words, it takes the query vector attributed to some specific word in the sequence and scores it against each key in the database. In doing so, it captures how the word under consideration relates to the others in the sequence. Then it scales the values according to the attention weights (computed from the scores), in order to retain focus on those words that are relevant to the query. In doing so, it produces an attention output for the word under consideration.
In this section, we will explore how to implement the general attention mechanism using the NumPy and SciPy libraries in Python.
For simplicity, we will initially calculate the attention for the first word in a sequence of four. We will then generalize the code to calculate an attention output for all four words in matrix form.
Hence, let’s start by first defining the word embeddings of the four different words for which we will be calculating the attention. In actual practice, these word embeddings would have been generated by an encoder, however for this particular example we shall be defining them manually.
from numpy import array

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
The next step generates the weight matrices, which we will eventually be multiplying to the word embeddings to generate the queries, keys and values. Here, we shall be generating these weight matrices randomly, however in actual practice these would have been learned during training.
...
# generating the weight matrices
random.seed(42) # to allow us to reproduce the same attention values
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
Notice how the number of rows of each of these matrices is equal to the dimensionality of the word embeddings (which in this case is three) to allow us to perform the matrix multiplication.
Subsequently, the query, key and value vectors for each word are generated by multiplying each word embedding by each of the weight matrices.
...
# generating the queries, keys and values
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V

query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V

query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V

query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V
Considering only the first word for the time being, the next step scores its query vector against all of the key vectors using a dot product operation.
...
# scoring the first query vector against all key vectors
scores = array([dot(query_1, key_1), dot(query_1, key_2),
                dot(query_1, key_3), dot(query_1, key_4)])
The score values are subsequently passed through a softmax operation to generate the weights. Before doing so, it is common practice to divide the score values by the square root of the dimensionality of the key vectors (in this case, three), to keep the gradients stable.
...
# computing the weights by a softmax operation
weights = softmax(scores / key_1.shape[0] ** 0.5)
Finally, the attention output is calculated by a weighted sum of all four value vectors.
...
# computing the attention by a weighted sum of the value vectors
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)
print(attention)
[0.98522025 1.74174051 0.75652026]
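As a quick sanity check (using illustrative score values rather than the exact ones above), note that the weights produced by the softmax always sum to one, so the attention output is a convex combination of the value vectors:

```python
import numpy as np
from scipy.special import softmax

scores = np.array([8.0, 2.0, 10.0, 2.0])   # made-up score values
weights = softmax(scores / 3 ** 0.5)       # scaled by sqrt(d_k) = sqrt(3)
print(weights.sum())                       # 1.0 up to floating point
```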
For faster processing, the same calculations can be implemented in matrix form to generate an attention output for all four words in one go:
from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)
[[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you discovered the attention mechanism and its implementation.
Specifically, you learned:
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
The post A Gentle Introduction to Particle Swarm Optimization appeared first on Machine Learning Mastery.
In this tutorial, you will learn the rationale of PSO and its algorithm with an example. After completing this tutorial, you will know:
Kick-start your project with my new book Optimization for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.

Particle Swarm Optimization was proposed by Kennedy and Eberhart in 1995. As mentioned in the original paper, sociobiologists believe a school of fish or a flock of birds that moves in a group “can profit from the experience of all other members”. In other words, while a bird is flying and searching randomly for food, for instance, all birds in the flock can share their discoveries and help the entire flock get the best hunt.
While we can simulate the movement of a flock of birds, we can also imagine that each bird helps us find the optimal solution in a high-dimensional solution space, and that the best solution found by the flock is the best solution in the space. This is a heuristic: we can never prove that the real global optimum can be found, and it usually is not. However, we often find that the solution found by PSO is quite close to the global optimum.
PSO is best used to find the maximum or minimum of a function defined on a multidimensional vector space. Assume we have a function $f(X)$ that produces a real value from a vector parameter $X$ (such as coordinate $(x,y)$ in a plane) and $X$ can take on virtually any value in the space (for example, $f(X)$ is the altitude and we can find one for any point on the plane), then we can apply PSO. The PSO algorithm will return the parameter $X$ it found that produces the minimum $f(X)$.
Let’s start with the following function
$$
f(x,y) = (x-3.14)^2 + (y-2.72)^2 + \sin(3x+1.41) + \sin(4y-1.73)
$$
As we can see from the plot above, this function looks like a curved egg carton. It is not a convex function and therefore it is hard to find its minimum because a local minimum found is not necessarily the global minimum.
So how can we find the minimum point of this function? For sure, we can resort to exhaustive search: if we check the value of $f(x,y)$ for every point on the plane, we can find the minimum point. Or, if we think it is too expensive to search every point, we can randomly sample some points on the plane and see which one gives the lowest value of $f(x,y)$. However, we also note from the shape of $f(x,y)$ that once we have found a point with a smaller value of $f(x,y)$, it is easier to find an even smaller value in its proximity.
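As a sketch of that random-sampling baseline (the sample count and seed are arbitrary choices), we can reuse the $f(x,y)$ defined above:

```python
import numpy as np

def f(x, y):
    "Objective function"
    return (x-3.14)**2 + (y-2.72)**2 + np.sin(3*x+1.41) + np.sin(4*y-1.73)

# sample 1000 random points in the region 0 <= x, y <= 5 and keep the best
rng = np.random.default_rng(0)
xs = rng.uniform(0, 5, 1000)
ys = rng.uniform(0, 5, 1000)
z = f(xs, ys)
best = z.argmin()
print(xs[best], ys[best], z[best])  # best of 1000 random samples
```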
This is what particle swarm optimization does. Similar to the flock of birds looking for food, we start with a number of random points on the plane (called particles) and let them look for the minimum point in random directions. At each step, every particle searches around the minimum point it has ever found, as well as around the minimum point found by the entire swarm of particles. After a certain number of iterations, we take the minimum point of the function to be the minimum point ever explored by this swarm of particles.
Assume we have $P$ particles, and we denote the position of particle $i$ at iteration $t$ as $X^i(t)$, which, in the example above, is the coordinate $X^i(t) = (x^i(t), y^i(t))$. Besides the position, we also have a velocity for each particle, denoted as $V^i(t)=(v_x^i(t), v_y^i(t))$. At the next iteration, the position of each particle is updated as
$$
X^i(t+1) = X^i(t)+V^i(t+1)
$$
or, equivalently,
$$
\begin{aligned}
x^i(t+1) &= x^i(t) + v_x^i(t+1) \\
y^i(t+1) &= y^i(t) + v_y^i(t+1)
\end{aligned}
$$
and at the same time, the velocities are also updated by the rule
$$
V^i(t+1) =
w V^i(t) + c_1r_1(pbest^i - X^i(t)) + c_2r_2(gbest - X^i(t))
$$
where $r_1$ and $r_2$ are random numbers between 0 and 1, constants $w$, $c_1$, and $c_2$ are parameters to the PSO algorithm, and $pbest^i$ is the position that gives the best $f(X)$ value ever explored by particle $i$ and $gbest$ is that explored by all the particles in the swarm.
Note that $pbest^i$ and $X^i(t)$ are two position vectors, and the difference $pbest^i - X^i(t)$ is a vector subtraction. Adding this difference to the original velocity $V^i(t)$ pulls the particle back toward the position $pbest^i$. The same applies to the difference $gbest - X^i(t)$.
We call the parameter $w$ the inertia weight constant. It is between 0 and 1 and determines how much the particle keeps on with its previous velocity (i.e., the speed and direction of the search). The parameters $c_1$ and $c_2$ are called the cognitive and the social coefficients, respectively. They control how much weight is given to refining the search result of the particle itself versus recognizing the search result of the swarm. We can consider these parameters as controlling the trade-off between exploration and exploitation.
The positions $pbest^i$ and $gbest$ are updated in each iteration to reflect the best position ever found thus far.
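To make the update rules concrete, here is a scalar sketch for a single particle, with made-up values for $w$, $c_1$, $c_2$ and fixed $r_1$, $r_2$ instead of fresh random draws, so the arithmetic is easy to follow:

```python
w, c1, c2 = 0.8, 0.1, 0.1   # inertia, cognitive, social coefficients
r1, r2 = 0.5, 0.5           # normally drawn uniformly from [0, 1]

x, v = 2.0, 0.1             # current position X^i(t) and velocity V^i(t)
pbest, gbest = 2.5, 3.0     # best positions found so far

v_new = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)  # V^i(t+1)
x_new = x + v_new                                    # X^i(t+1)
print(v_new, x_new)
```

With these numbers the new velocity is about 0.155, so the particle drifts toward the region where both $pbest$ and $gbest$ lie.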
One interesting property of this algorithm, which distinguishes it from other optimization algorithms, is that it does not depend on the gradient of the objective function. In gradient descent, for example, we look for the minimum of a function $f(X)$ by moving $X$ in the direction of $-\nabla f(X)$, as that is where the function decreases the fastest. For any particle at position $X$ at a given moment, how it moves does not depend on which direction is “downhill” but only on where $pbest$ and $gbest$ are. This makes PSO particularly suitable if differentiating $f(X)$ is difficult.
Another property of PSO is that it can be parallelized easily. As we are manipulating multiple particles to find the optimal solution, each particle can be updated in parallel, and we only need to collect the updated value of $gbest$ once per iteration. This makes the map-reduce architecture a perfect candidate to implement PSO.
Here we show how we can implement PSO to find the optimal solution.
For the same function as we showed above, we can first define it as a Python function and show it in a contour plot:
import numpy as np
import matplotlib.pyplot as plt

def f(x,y):
    "Objective function"
    return (x-3.14)**2 + (y-2.72)**2 + np.sin(3*x+1.41) + np.sin(4*y-1.73)

# Contour plot: with the global minimum shown as "X" on the plot
x, y = np.array(np.meshgrid(np.linspace(0,5,100), np.linspace(0,5,100)))
z = f(x, y)
x_min = x.ravel()[z.argmin()]
y_min = y.ravel()[z.argmin()]
plt.figure(figsize=(8,6))
plt.imshow(z, extent=[0, 5, 0, 5], origin='lower', cmap='viridis', alpha=0.5)
plt.colorbar()
plt.plot([x_min], [y_min], marker='x', markersize=5, color="white")
contours = plt.contour(x, y, z, 10, colors='black', alpha=0.4)
plt.clabel(contours, inline=True, fontsize=8, fmt="%.0f")
plt.show()
Here we plotted the function $f(x,y)$ in the region of $0\le x,y\le 5$. We can create 20 particles at random locations in this region, together with random velocities sampled over a normal distribution with mean 0 and standard deviation 0.1, as follows:
n_particles = 20
X = np.random.rand(2, n_particles) * 5
V = np.random.randn(2, n_particles) * 0.1
which we can show their position on the same contour plot:
From this, we can already find $gbest$ as the best position ever found by all the particles. Since the particles have not explored at all, their current position is also their $pbest^i$:
pbest = X
pbest_obj = f(X[0], X[1])
gbest = pbest[:, pbest_obj.argmin()]
gbest_obj = pbest_obj.min()
The vector pbest_obj is the best value of the objective function found by each particle. Similarly, gbest_obj is the best scalar value of the objective function ever found by the swarm. We are using the min() and argmin() functions here because we set this up as a minimization problem. The position of gbest is marked as a star below.
Let’s set $c_1=c_2=0.1$ and $w=0.8$. Then we can update the positions and velocities according to the formula we mentioned above, and then update $pbest^i$ and $gbest$ afterwards:
c1 = c2 = 0.1
w = 0.8

# One iteration
r = np.random.rand(2)
V = w * V + c1*r[0]*(pbest - X) + c2*r[1]*(gbest.reshape(-1,1)-X)
X = X + V
obj = f(X[0], X[1])
pbest[:, (pbest_obj >= obj)] = X[:, (pbest_obj >= obj)]
pbest_obj = np.array([pbest_obj, obj]).min(axis=0)
gbest = pbest[:, pbest_obj.argmin()]
gbest_obj = pbest_obj.min()
The following is the position after the first iteration. We mark the best position of each particle with a black dot to distinguish it from their current positions, which are shown in blue.
We can repeat the above code segment multiple times and see how the particles explore. This is the result after the second iteration:
and this is after the 5th iteration; note that the position of $gbest$, denoted by the star, has changed:
and after the 20th iteration, we are already very close to the optimum:
This is the animation showing how we found the optimal solution as the algorithm progressed. See if you can find some resemblance to the movement of a flock of birds:
So how close is our solution? In this particular example, the global minimum found by exhaustive search is at the coordinate $(3.182,3.131)$, and the one found by the PSO algorithm above is at $(3.185,3.130)$.
Most PSO implementations are variations of what we described above. In the example, we set PSO to run for a fixed number of iterations. It is trivial to choose the number of iterations dynamically in response to the progress. For example, we can make it stop once no update to the global best solution $gbest$ has been seen for a number of iterations.
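Such a stopping rule can be sketched as follows, with a hypothetical one_iteration() callable that performs one PSO update as shown earlier and returns the current value of the global best objective:

```python
def pso_with_patience(one_iteration, patience=10, max_iter=1000):
    "Stop once gbest has not improved for `patience` consecutive iterations."
    best_so_far = float("inf")
    stale = 0
    for i in range(max_iter):
        gbest_obj = one_iteration()            # one PSO update; returns current best
        if gbest_obj < best_so_far - 1e-12:    # meaningful improvement
            best_so_far = gbest_obj
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            break
    return best_so_far, i + 1
```

The tolerance 1e-12 guards against counting floating-point noise as an improvement; the right threshold depends on the scale of the objective.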
Research on PSO has mostly focused on how to determine the hyperparameters $w$, $c_1$, and $c_2$, or on varying their values as the algorithm progresses. For example, there are proposals that make the inertia weight linearly decreasing. There are also proposals that decrease the cognitive coefficient $c_1$ while increasing the social coefficient $c_2$, to bring more exploration at the beginning and more exploitation at the end. See, for example, Shi and Eberhart (1998) and Eberhart and Shi (2000).
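Such schedules are simple to sketch. The start and end values below are illustrative defaults, not the exact values from the cited papers:

```python
def pso_schedule(t, t_max, w_start=0.9, w_end=0.4, c_start=2.5, c_end=0.5):
    """Linearly decreasing inertia weight, with a cognitive coefficient c1 that
    decreases (less exploration) while the social coefficient c2 increases
    (more exploitation) over the course of t_max iterations."""
    frac = t / t_max
    w  = w_start + (w_end - w_start) * frac
    c1 = c_start + (c_end - c_start) * frac
    c2 = c_end + (c_start - c_end) * frac
    return w, c1, c2
```

Calling this once per iteration and passing the returned values into the velocity update replaces the fixed $w$, $c_1$, $c_2$ used earlier.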
It should be easy to see how we can change the above code to solve a higher-dimensional objective function, or to switch from minimization to maximization. The following is the complete example of finding the minimum point of the function $f(x,y)$ proposed above, together with the code to generate the plot animation:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def f(x,y):
    "Objective function"
    return (x-3.14)**2 + (y-2.72)**2 + np.sin(3*x+1.41) + np.sin(4*y-1.73)

# Compute and plot the function in 3D within [0,5]x[0,5]
x, y = np.array(np.meshgrid(np.linspace(0,5,100), np.linspace(0,5,100)))
z = f(x, y)

# Find the global minimum
x_min = x.ravel()[z.argmin()]
y_min = y.ravel()[z.argmin()]

# Hyper-parameters of the algorithm
c1 = c2 = 0.1
w = 0.8

# Create particles
n_particles = 20
np.random.seed(100)
X = np.random.rand(2, n_particles) * 5
V = np.random.randn(2, n_particles) * 0.1

# Initialize data
pbest = X
pbest_obj = f(X[0], X[1])
gbest = pbest[:, pbest_obj.argmin()]
gbest_obj = pbest_obj.min()

def update():
    "Function to do one iteration of particle swarm optimization"
    global V, X, pbest, pbest_obj, gbest, gbest_obj
    # Update params
    r1, r2 = np.random.rand(2)
    V = w * V + c1*r1*(pbest - X) + c2*r2*(gbest.reshape(-1,1)-X)
    X = X + V
    obj = f(X[0], X[1])
    pbest[:, (pbest_obj >= obj)] = X[:, (pbest_obj >= obj)]
    pbest_obj = np.array([pbest_obj, obj]).min(axis=0)
    gbest = pbest[:, pbest_obj.argmin()]
    gbest_obj = pbest_obj.min()

# Set up base figure: The contour map
fig, ax = plt.subplots(figsize=(8,6))
fig.set_tight_layout(True)
img = ax.imshow(z, extent=[0, 5, 0, 5], origin='lower', cmap='viridis', alpha=0.5)
fig.colorbar(img, ax=ax)
ax.plot([x_min], [y_min], marker='x', markersize=5, color="white")
contours = ax.contour(x, y, z, 10, colors='black', alpha=0.4)
ax.clabel(contours, inline=True, fontsize=8, fmt="%.0f")
pbest_plot = ax.scatter(pbest[0], pbest[1], marker='o', color='black', alpha=0.5)
p_plot = ax.scatter(X[0], X[1], marker='o', color='blue', alpha=0.5)
p_arrow = ax.quiver(X[0], X[1], V[0], V[1], color='blue', width=0.005,
                    angles='xy', scale_units='xy', scale=1)
gbest_plot = plt.scatter([gbest[0]], [gbest[1]], marker='*', s=100,
                         color='black', alpha=0.4)
ax.set_xlim([0,5])
ax.set_ylim([0,5])

def animate(i):
    "Steps of PSO: algorithm update and show in plot"
    title = 'Iteration {:02d}'.format(i)
    # Update params
    update()
    # Set picture
    ax.set_title(title)
    pbest_plot.set_offsets(pbest.T)
    p_plot.set_offsets(X.T)
    p_arrow.set_offsets(X.T)
    p_arrow.set_UVC(V[0], V[1])
    gbest_plot.set_offsets(gbest.reshape(1,-1))
    return ax, pbest_plot, p_plot, p_arrow, gbest_plot

anim = FuncAnimation(fig, animate, frames=list(range(1,50)), interval=500,
                     blit=False, repeat=True)
anim.save("PSO.gif", dpi=120, writer="imagemagick")

print("PSO found best solution at f({})={}".format(gbest, gbest_obj))
print("Global optimal at f({})={}".format([x_min,y_min], f(x_min,y_min)))
These are the original papers that proposed the particle swarm optimization, and the early research on refining its hyperparameters:
In this tutorial we learned:
As particle swarm optimization has few hyperparameters and is very permissive on the objective function, it can be used to solve a wide range of problems.
The post A Gentle Introduction to Particle Swarm Optimization appeared first on Machine Learning Mastery.
]]>The post What is Attention? appeared first on Machine Learning Mastery.
]]>In this tutorial, you will discover an overview of attention and its application in machine learning.
After completing this tutorial, you will know:
Let’s get started.
This tutorial is divided into two parts; they are:
Attention is a widely investigated concept that has often been studied in conjunction with arousal, alertness and engagement with one’s surroundings.
In its most generic form, attention could be described as merely an overall level of alertness or ability to engage with surroundings.
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
Visual attention is one of the areas that is most often studied from both the neuroscientific and psychological perspectives.
When a subject is presented with different images, the eye movements that the subject performs can reveal the salient image parts that the subject’s attention is mostly attracted to. In their review on computational models for visual attention, Itti and Koch (2001) mention that such salient image parts are often characterised by visual attributes that include intensity contrast, oriented edges, corners and junctions, and motion. The human brain attends to these salient visual features at different neuronal stages.
Neurons at the earliest stages are tuned to simple visual attributes such as intensity contrast, colour opponency, orientation, direction and velocity of motion, or stereo disparity at several spatial scales. Neuronal tuning becomes increasingly more specialized with the progression from low-level to high-level visual areas, such that higher-level visual areas include neurons that respond only to corners or junctions, shape-from-shading cues or views of specific real-world objects.
Interestingly, research has also observed that different subjects tend to be attracted to the same salient visual cues.
Research has also discovered several forms of interaction between memory and attention. Since the human brain has a limited memory capacity, selecting which information to store becomes crucial in making the best use of the limited resources. The human brain does so by relying on attention, such that it dynamically stores in memory the information that the human subject is paying most attention to.
The implementation of the attention mechanism in artificial neural networks does not necessarily track the biological and psychological mechanisms of the human brain. Rather, it is the ability to dynamically highlight and use the salient parts of the information at hand, in a similar manner as it does in the human brain, that makes attention such an attractive concept in machine learning.
An attention-based system is thought to consist of three components:
- A process that “reads” raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position.
- A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.
- A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).
– Page 491, Deep Learning, 2017.
Let’s take the encoder-decoder framework as an example, since it was within such a framework that the attention mechanism was first introduced.
If we are processing an input sequence of words, then this will first be fed into an encoder, which will output a vector for every element in the sequence. This corresponds to the first component of our attention-based system, as explained above.
A list of these vectors (the second component of the attention-based system above), together with the decoder’s previous hidden states, will be exploited by the attention mechanism to dynamically highlight which of the input information will be used to generate the output.
At each time step, the attention mechanism, then, takes the previous hidden state of the decoder and the list of encoded vectors, and uses them to generate unnormalized score values that indicate how well the elements of the input sequence align with the current output. Since the generated score values need to make relative sense in terms of their importance, they are normalized by passing them through a softmax function to generate the weights. Following the softmax normalization, all of the weight values will lie in the interval [0, 1] and will add up to 1, which means that they can be interpreted as probabilities. Finally, the encoded vectors are scaled by the computed weights to generate a context vector. This attention process forms the third component of the attention-based system above. It is this context vector that is, then, fed into the decoder to generate a translated output.
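The sequence of steps just described (scoring, softmax normalization, and the weighted sum) can be sketched in a few lines of NumPy. We use a simple dot-product score for illustration; Bahdanau et al. computed the scores with a small learned network:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(42)
encoded = rng.standard_normal((4, 8))   # 4 input words, 8-dim encoded vectors
s_prev  = rng.standard_normal(8)        # decoder's previous hidden state

scores  = encoded @ s_prev              # unnormalized alignment scores, one per word
weights = softmax(scores)               # in [0, 1] and summing to 1
context = weights @ encoded             # weighted sum of the encoded vectors
```

The context vector has the same dimensionality as the encoded vectors, and is recomputed at every decoding step as s_prev changes.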
This type of artificial attention is thus a form of iterative re-weighting. Specifically, it dynamically highlights different components of a pre-processed input as they are needed for output generation. This makes it flexible and context dependent, like biological attention.
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
The process implemented by a system that incorporates an attention mechanism contrasts with one that does not. In the latter, the encoder would generate a fixed-length vector irrespective of the input’s length or complexity. In the absence of a mechanism that highlights the salient information across the entirety of the input, the decoder would only have access to the limited information encoded within the fixed-length vector. This could potentially result in the decoder missing important information.
The attention mechanism was initially proposed to process sequences of words in machine translation, which have an implied temporal aspect to them. However, it can be generalized to process information that can be static, and not necessarily related in a sequential fashion, such as in the context of image processing. We will be seeing how this generalization can be achieved in a separate tutorial.
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you discovered an overview of attention and its application in machine learning.
Specifically, you learned:
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
The post What is Attention? appeared first on Machine Learning Mastery.
]]>The post A Bird’s Eye View of Research on Attention appeared first on Machine Learning Mastery.
]]>In this tutorial, you will discover an overview of the research advances on attention.
After completing this tutorial, you will know:
Let’s get started.
This tutorial is divided into two parts; they are:
Research on attention finds its origin in the field of psychology.
The scientific study of attention began in psychology, where careful behavioral experimentation can give rise to precise demonstrations of the tendencies and abilities of attention in different circumstances.
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
Observations derived from such studies could help researchers infer the mental processes underlying such behavioral patterns.
While the different fields of psychology, neuroscience and, more recently, machine learning, have all produced their own definitions of attention, there is one core quality that is of great significance to all:
Attention is the flexible control of limited computational resources.
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
With this in mind, the following sections review the role of attention in revolutionizing the field of machine learning.
The concept of attention in machine learning is very loosely inspired by the psychological mechanisms of attention in the human brain.
The use of attention mechanisms in artificial neural networks came about — much like the apparent need for attention in the brain — as a means of making neural systems more flexible.
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
The idea is to be able to work with an artificial neural network that can perform well on tasks where the input may be of variable length, size or structure, or even handle several different tasks. It is in this spirit that attention mechanisms in machine learning are said to be inspired by psychology, rather than because they replicate the biology of the human brain.
In the form of attention originally developed for ANNs, attention mechanisms worked within an encoder-decoder framework and in the context of sequence models …
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
The task of the encoder is to generate a vector representation of the input, whereas the task of the decoder is to transform this vector representation into an output. The attention mechanism connects the two.
There have been different propositions of neural network architectures that implement attention mechanisms, which are also tied to the specific applications in which they find their use. Natural Language Processing (NLP) and computer vision are among the most popular applications.
An early application for attention in NLP was that of machine translation, where the goal was to translate an input sentence in a source language, to an output sentence in a target language. Within this context, the encoder would generate a set of context vectors, one for each word in the source sentence. The decoder, on the other hand, would read the context vectors to generate an output sentence in the target language, one word at a time.
In the traditional encoder-decoder framework without attention, the encoder produced a fixed-length vector that was independent of the length or features of the input and static during the course of decoding.
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
Representing the input by a fixed-length vector was especially problematic for long sequences or sequences that were complex in structure, since the dimensionality of their representation was forced to be the same as for shorter or simpler sequences.
For example, in some languages, such as Japanese, the last word might be very important to predict the first word, while translating English to French might be easier as the order of the sentences (how the sentence is organized) is more similar to each other.
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
This created a bottleneck, whereby the decoder had limited access to the information provided by the input: only that which was available within the fixed-length encoding vector. On the other hand, preserving the length of the input sequence during the encoding process could make it possible for the decoder to utilize its most relevant parts in a flexible manner.
The latter is how the attention mechanism operates.
Attention helps determine which of these vectors should be used to generate the output. Because the output sequence is dynamically generated one element at a time, attention can dynamically highlight different encoded vectors at each time point. This allows the decoder to flexibly utilize the most relevant parts of the input sequence.
– Page 186, Deep Learning Essentials, 2018.
One of the earliest works in machine translation that sought to address the bottleneck problem created by fixed-length vectors was by Bahdanau et al. (2014). In their work, Bahdanau et al. employed Recurrent Neural Networks (RNNs) for both the encoding and decoding tasks: the encoder employs a bi-directional RNN to generate a sequence of annotations, each containing a summary of both preceding and succeeding words, which can be mapped into a context vector through a weighted sum; the decoder then generates an output based on these annotations and the hidden states of another RNN. Since the context vector is computed by a weighted sum of the annotations, Bahdanau et al.’s attention mechanism is an example of soft attention.
Another of the earliest works was by Sutskever et al. (2014), who, alternatively, made use of multilayered Long Short-Term Memory (LSTM) to encode a vector representing the input sequence, and another LSTM to decode the vector into a target sequence.
Luong et al. (2015) introduced the idea of global versus local attention. In their work, they described a global attention model as one that, when deriving the context vector, considers all the hidden states of the encoder. The computation of the global context vector is, therefore, based upon a weighted average of all the words in the source sequence. Luong et al. mention that this is computationally expensive and could potentially make global attention difficult to apply to long sequences. Local attention is proposed to address this problem by focusing on a smaller subset of the words in the source sequence, per target word. Luong et al. explain that local attention trades off the soft and hard attentional models of Xu et al. (2016) (we will refer to this paper again in the next section), being less computationally expensive than soft attention but easier to train than hard attention.
More recently, Vaswani et al. (2017) proposed an entirely different architecture that has steered the field of machine translation in a new direction. Termed the Transformer, their architecture dispenses with recurrence and convolutions entirely and instead implements a self-attention mechanism. Words in the source sequence are first encoded in parallel to generate key, query and value representations. The keys and queries are combined to generate attention weightings that capture how each word relates to the others in the sequence. These attention weightings are then used to scale the values, in order to retain focus on the important words and drown out the irrelevant ones.
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
– Attention Is All You Need, 2017.
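As a rough sketch of this computation for a single attention head (random matrices stand in for the learned projections):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    "Scaled dot-product self-attention over a sequence X of shape (n, d_model)."
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # how each word relates to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                           # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))     # 5 tokens, model dimension 16
Wq, Wk, Wv = (rng.standard_normal((16, 16)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # one output vector per token, shape (5, 16)
```

The division by $\sqrt{d_k}$ is the “scaled” part of scaled dot-product attention; without it, the softmax would saturate for large key dimensions.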
At the time, the proposed Transformer architecture established a new state-of-the-art on English-to-German and English-to-French translation tasks, and was reportedly also faster to train than architectures based on recurrent or convolutional layers. Subsequently, the method called BERT by Devlin et al. (2019) built on Vaswani et al.’s work by proposing a multi-layer bi-directional architecture.
As we shall be seeing shortly, the uptake of the Transformer architecture was not only rapid in the domain of NLP, but in the computer vision domain too.
In computer vision, attention has found its way into several applications, such as in the domains of image classification, image segmentation and image captioning.
If we had to reframe the encoder-decoder model to the task of image captioning, as an example, then the encoder can be a Convolutional Neural Network (CNN) that captures the salient visual cues in the images into a vector representation, whereas the decoder can be an RNN or LSTM that transforms the vector representation into an output.
Also, as in the neuroscience literature, these attentional processes can be divided into spatial and feature-based attention.
– Attention in Psychology, Neuroscience, and Machine Learning, 2020.
In spatial attention, different spatial locations are attributed different weights; however, these same weights are retained across all feature channels at the different spatial locations.
One of the fundamental image captioning approaches working with spatial attention has been proposed by Xu et al. (2016). Their model incorporates a CNN as an encoder that extracts a set of feature vectors (or annotation vectors), with each vector corresponding to a different part of the image to allow the decoder to focus selectively on specific image parts. The decoder is an LSTM that generates a caption based on a context vector, the previous hidden state, and the previously generated words. Xu et al. investigate the use of hard attention as an alternative to soft attention in computing their context vector. Here, soft attention places weights softly on all patches of the source image, whereas hard attention attends to a single patch alone while disregarding the rest. They report that, in their work, hard attention performs better.
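The difference between the two can be sketched directly: soft attention takes a weighted average over all annotation vectors, while hard attention samples a single one according to the same weights. The weights below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
annotations = rng.standard_normal((6, 4))   # 6 image regions, 4-dim feature vectors
weights = np.array([0.05, 0.1, 0.6, 0.1, 0.1, 0.05])  # attention weights, sum to 1

soft_context = weights @ annotations        # soft: expectation over all patches
idx = rng.choice(len(weights), p=weights)   # hard: sample one patch index
hard_context = annotations[idx]             # attend to that patch alone
```

The sampling step makes hard attention non-differentiable, which is why it requires techniques such as reinforcement learning to train, whereas soft attention can be trained with ordinary backpropagation.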
Feature attention, in comparison, permits individual feature maps to be attributed their own weight values. One such example, also applied to image captioning, is the encoder-decoder framework of Chen et al. (2018), which incorporates spatial and channel-wise attentions in the same CNN.
Similarly to how the Transformer has quickly become the standard architecture for NLP tasks, it has also been recently taken up and adapted by the computer vision community.
The earliest work to do so was by Dosovitskiy et al. (2020), who applied their Vision Transformer (ViT) to an image classification task. They argued that the long-standing reliance on CNNs for image classification was not necessary, and that the same task could be accomplished by a pure transformer. Dosovitskiy et al. reshape an input image into a sequence of flattened 2D image patches, which they subsequently embed by a trainable linear projection to generate the patch embeddings. These patch embeddings, together with position embeddings to retain positional information, are fed into the encoder part of the Transformer architecture, whose output is subsequently fed into a Multilayer Perceptron (MLP) for classification.
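The patch-extraction step can be sketched as follows. The 224x224 image size and 16x16 patches match the common ViT setup, while the projection dimension here is illustrative:

```python
import numpy as np

def patchify(image, patch):
    "Split an image (H, W, C) into a sequence of flattened 2D patches."
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)      # (nH, nW, patch, patch, C)
    return patches.reshape(-1, patch * patch * C)   # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
seq = patchify(img, 16)          # (196, 768): a 14x14 grid of 16x16x3 patches
E = rng.standard_normal((768, 128)) * 0.01
embeddings = seq @ E             # stands in for the trainable linear projection
```

In the real model, position embeddings are added to these patch embeddings before the sequence enters the Transformer encoder.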
Inspired by ViT, and the fact that attention-based architectures are an intuitive choice for modelling long-range contextual relationships in video, we develop several transformer-based models for video classification.
– ViViT: A Video Vision Transformer, 2021.
Arnab et al. (2021) subsequently extended the ViT model to ViViT, which exploits the spatiotemporal information contained within videos for the task of video classification. Their method explores different approaches to extracting the spatiotemporal data, such as sampling and embedding each frame independently, or extracting non-overlapping tubelets (image patches that span several frames, creating a tube) and embedding each one in turn. They also investigate different methods of factorising the spatial and temporal dimensions of the input video, for increased efficiency and scalability.
Further to its first application for image classification, the Vision Transformer is already being applied to several other computer vision domains, such as to action localization, gaze estimation, and image generation. This surge of interest among computer vision practitioners suggests an exciting near future, where we’ll be seeing more adaptations and applications of the Transformer architecture.
This section provides more resources on the topic if you are looking to go deeper.
Example Applications:
In this tutorial, you discovered an overview of the research advances on attention.
Specifically, you learned:
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
The post A Bird’s Eye View of Research on Attention appeared first on Machine Learning Mastery.
]]>The post Lagrange Multiplier Approach with Inequality Constraints appeared first on Machine Learning Mastery.
]]>In this tutorial, you will discover the method of Lagrange multipliers applied to find the local minimum or maximum of a function when inequality constraints are present, optionally together with equality constraints.
After completing this tutorial, you will know
Let’s get started.
For this tutorial, we assume that you already have reviewed:
as well as
You can review these concepts by clicking on the links above.
Extending from our previous post, a constrained optimization problem can be generally considered as
$$
\begin{aligned}
\min && f(X) \\
\textrm{subject to} && g(X) &= 0 \\
&& h(X) &\ge 0 \\
&& k(X) &\le 0
\end{aligned}
$$
where $X$ is a scalar or vector value. Here, $g(X)=0$ is the equality constraint, and $h(X)\ge 0$, $k(X)\le 0$ are inequality constraints. Note that we always use $\ge$ and $\le$, rather than $\gt$ and $\lt$, in optimization problems because the former define a closed set from which we should look for the value of $X$. There can be many constraints of each type in an optimization problem.
The equality constraints are easy to handle but the inequality constraints are not. Therefore, one way to make it easier to tackle is to convert the inequalities into equalities, by introducing slack variables:
$$
\begin{aligned}
\min && f(X) \\
\textrm{subject to} && g(X) &= 0 \\
&& h(X) - s^2 &= 0 \\
&& k(X) + t^2 &= 0
\end{aligned}
$$
When something is negative, adding a certain positive quantity to it will make it equal to zero, and vice versa. That quantity is the slack variable; the $s^2$ and $t^2$ above are examples. We deliberately write these terms as $s^2$ and $t^2$ to denote that they must not be negative.
With the slack variables introduced, we can use the Lagrange multipliers approach to solve it, in which the Lagrangian is defined as:
$$
L(X, \lambda, \theta, \phi) = f(X) - \lambda g(X) - \theta (h(X)-s^2) - \phi (k(X)+t^2)
$$
It is useful to know that, for the optimal solution $X^*$ to the problem, each inequality constraint either holds with equality (in which case its slack variable is zero) or it does not. Inequality constraints whose equality holds are called active constraints; otherwise, they are inactive constraints. In this sense, you can consider the equality constraints to always be active.
The reason we need to know whether a constraint is active or not is the Karush-Kuhn-Tucker (KKT) conditions. Precisely, the KKT conditions describe what happens when $X^*$ is the optimal solution to a constrained optimization problem:
The most important of them is the complementary slackness condition. While we learned that an optimization problem with equality constraints can be solved using Lagrange multipliers, where the gradient of the Lagrangian is zero at the optimal solution, the complementary slackness condition extends this to the case of inequality constraints by saying that, at the optimal solution $X^*$, either the Lagrange multiplier is zero or the corresponding inequality constraint is active.
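For reference, the full set of KKT conditions for the problem above can be written as follows. This is one common statement; the sign conditions on the multipliers depend on how the Lagrangian is written, so it should be read together with the sign convention used in this post:

$$
\begin{aligned}
\nabla_X L &= 0 && \textrm{(stationarity)} \\
g(X^*) = 0,\quad h(X^*) \ge 0,\quad k(X^*) &\le 0 && \textrm{(primal feasibility)} \\
\theta \ge 0,\quad \phi &\ge 0 && \textrm{(dual feasibility)} \\
\theta\, h(X^*) = 0,\quad \phi\, k(X^*) &= 0 && \textrm{(complementary slackness)}
\end{aligned}
$$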
The complementary slackness condition helps us explore the different cases in solving the optimization problem. This is best explained with an example.
This is an example from finance. Suppose we have 1 dollar to split between two different investments, whose returns are modeled as a bivariate Gaussian distribution. How much should we invest in each to minimize the overall variance in return?
This optimization problem, also known as Markowitz mean-variance portfolio optimization, is formulated as:
$$
\begin{aligned}
\min && f(w_1, w_2) &= w_1^2\sigma_1^2+w_2^2\sigma_2^2+2w_1w_2\sigma_{12} \\
\textrm{subject to} && w_1+w_2 &= 1 \\
&& w_1 &\ge 0 \\
&& w_1 &\le 1
\end{aligned}
$$
in which the last two constraints bound the weight of each investment to between 0 and 1 dollar. Let’s assume $\sigma_1^2=0.25$, $\sigma_2^2=0.10$, and $\sigma_{12} = 0.15$. Then the Lagrangian function is defined as:
$$
\begin{aligned}
L(w_1,w_2,\lambda,\theta,\phi) =& 0.25w_1^2+0.1w_2^2+0.3w_1w_2 \\
&- \lambda(w_1+w_2-1) \\
&- \theta(w_1-s^2) - \phi(w_1-1+t^2)
\end{aligned}
$$
and we have the gradients:
$$
\begin{aligned}
\frac{\partial L}{\partial w_1} &= 0.5w_1+0.3w_2-\lambda-\theta-\phi \\
\frac{\partial L}{\partial w_2} &= 0.2w_2+0.3w_1-\lambda \\
\frac{\partial L}{\partial\lambda} &= 1-w_1-w_2 \\
\frac{\partial L}{\partial\theta} &= s^2-w_1 \\
\frac{\partial L}{\partial\phi} &= 1-w_1-t^2
\end{aligned}
$$
From this point onward, the complementary slackness condition has to be considered. We have two slack variables, $s$ and $t$, and the corresponding Lagrange multipliers are $\theta$ and $\phi$. For each, we have to consider whether the slack variable is zero (in which case the corresponding inequality constraint is active) or the Lagrange multiplier is zero (the constraint is inactive). There are four possible cases:
- Case 1: $\theta=0$ and $\phi=0$ (both inequality constraints inactive)
- Case 2: $s=0$ and $\phi=0$ (only $w_1\ge 0$ active)
- Case 3: $\theta=0$ and $t=0$ (only $w_1\le 1$ active)
- Case 4: $s=0$ and $t=0$ (both inequality constraints active)
For case 1, using $\partial L/\partial\lambda=0$, $\partial L/\partial w_1=0$ and $\partial L/\partial w_2=0$ we get
$$
\begin{align}
w_2 &= 1-w_1 \\
0.5w_1 + 0.3w_2 &= \lambda \\
0.3w_1 + 0.2w_2 &= \lambda
\end{align}
$$
from which we get $w_1=-1$, $w_2=2$, $\lambda=0.1$. But with $\partial L/\partial\theta=0$, we get $s^2=-1$, for which no solution exists ($s^2$ cannot be negative). Thus this case is infeasible.
For case 2, with $\partial L/\partial\theta=0$ we get $w_1=0$. Hence, from $\partial L/\partial\lambda=0$, we know $w_2=1$. With $\partial L/\partial w_2=0$, we find $\lambda=0.2$, and from $\partial L/\partial w_1=0$ we get $\theta=0.1$. In this case, the objective function is 0.1.
For case 3, with $\partial L/\partial\phi=0$ we get $w_1=1$. Hence, from $\partial L/\partial\lambda=0$, we know $w_2=0$. With $\partial L/\partial w_2=0$, we get $\lambda=0.3$, and from $\partial L/\partial w_1=0$ we get $\phi=0.2$. In this case, the objective function is 0.25.
For case 4, we get $w_1=0$ from $\partial L/\partial\theta=0$ but $w_1=1$ from $\partial L/\partial\phi=0$. Hence this case is infeasible.
Comparing the objective function from case 2 and case 3, we see that the value from case 2 is lower. Hence that is taken as our solution to the optimization problem, with the optimal solution attained at $w_1=0$, $w_2=1$.
As an exercise, you can retry the above with $\sigma_{12}=-0.15$. The solution is 0.0038, attained at $w_1=\frac{5}{13}$, with both inequality constraints inactive.
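The exercise can be verified symbolically. Below is a sketch with SymPy, again assuming the variance objective implied by the gradients (with the cross term now $2\sigma_{12}w_1w_2=-0.3w_1w_2$), solving only the case where both inequality constraints are inactive:

```python
from sympy import symbols, solve, Rational

w1, w2, lam = symbols("w1 w2 lambda")
# variance objective implied by the gradients above, with sigma_12 = -0.15
f = Rational(1, 4)*w1**2 + Rational(1, 10)*w2**2 + 2*Rational(-15, 100)*w1*w2
# case 1 (both inequality constraints inactive): stationarity plus budget constraint
sol = solve([f.diff(w1) - lam, f.diff(w2) - lam, w1 + w2 - 1], [w1, w2, lam], dict=True)[0]
print(sol[w1], f.subs(sol))   # 5/13 1/260  (and 1/260 is about 0.0038)
```

Since the cross term is negative, diversifying across both assets now pays off, and the minimizer moves to the interior of the feasible interval.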
This is an example from communication engineering. If we have a channel (say, a wireless bandwidth) in which the noise power is $N$ and the signal power is $S$, the channel capacity (in bits per second) is proportional to $\log_2(1+S/N)$. If we have $k$ similar channels, each with its own noise and signal levels, the total capacity of all channels is the sum $\sum_i \log_2(1+S_i/N_i)$.
Assume we are using a battery that can provide only 1 watt of power, and this power has to be distributed across the $k$ channels (with allocations denoted $p_1,\cdots,p_k$). Each channel may have a different attenuation, so the signal power on channel $i$ is discounted by a gain $g_i$. The maximum total capacity we can achieve using these $k$ channels is then formulated as an optimization problem:
$$
\begin{aligned}
\max && f(p_1,\cdots,p_k) &= \sum_{i=1}^k \log_2\left(1+\frac{g_ip_i}{n_i}\right) \\
\textrm{subject to} && \sum_{i=1}^k p_i &= 1 \\
&& p_1,\cdots,p_k &\ge 0
\end{aligned}
$$
For convenience of differentiation, note that $\log_2x=\log x/\log 2$ and $\log(1+g_ip_i/n_i)=\log(n_i+g_ip_i)-\log(n_i)$. Since a positive constant factor and additive constants do not change where the maximum is attained, the objective function can be replaced with
$$
f(p_1,\cdots,p_k) = \sum_{i=1}^k \log(n_i+g_ip_i)
$$
Assume we have $k=3$ channels with noise levels of 1.0, 0.9, and 1.0 respectively, and channel gains of 0.9, 0.8, and 0.7. The optimization problem is then
$$
\begin{aligned}
\max && f(p_1,p_2,p_3) &= \log(1+0.9p_1) + \log(0.9+0.8p_2) + \log(1+0.7p_3)\\
\textrm{subject to} && p_1+p_2+p_3 &= 1 \\
&& p_1,p_2,p_3 &\ge 0
\end{aligned}
$$
We have three inequality constraints here. The Lagrangian function is defined as
$$
\begin{aligned}
& L(p_1,p_2,p_3,\lambda,\theta_1,\theta_2,\theta_3) \\
=\ & \log(1+0.9p_1) + \log(0.9+0.8p_2) + \log(1+0.7p_3) \\
& - \lambda(p_1+p_2+p_3-1) \\
& - \theta_1(p_1-s_1^2) - \theta_2(p_2-s_2^2) - \theta_3(p_3-s_3^2)
\end{aligned}
$$
The gradient is therefore
$$
\begin{aligned}
\frac{\partial L}{\partial p_1} & = \frac{0.9}{1+0.9p_1}-\lambda-\theta_1 \\
\frac{\partial L}{\partial p_2} & = \frac{0.8}{0.9+0.8p_2}-\lambda-\theta_2 \\
\frac{\partial L}{\partial p_3} & = \frac{0.7}{1+0.7p_3}-\lambda-\theta_3 \\
\frac{\partial L}{\partial\lambda} &= 1-p_1-p_2-p_3 \\
\frac{\partial L}{\partial\theta_1} &= s_1^2-p_1 \\
\frac{\partial L}{\partial\theta_2} &= s_2^2-p_2 \\
\frac{\partial L}{\partial\theta_3} &= s_3^2-p_3
\end{aligned}
$$
But now we have three slack variables, so we have to consider $2^3=8$ cases: case 1 has all inequality constraints inactive ($\theta_1=\theta_2=\theta_3=0$); cases 2, 3, and 4 have exactly one of $p_3$, $p_2$, $p_1$ forced to zero respectively; cases 5, 6, and 7 have $p_2=p_3=0$, $p_1=p_3=0$, and $p_1=p_2=0$ respectively; and case 8 has $p_1=p_2=p_3=0$.
Immediately we can tell that case 8 is infeasible: forcing $p_1=p_2=p_3=0$ makes it impossible to satisfy $\partial L/\partial\lambda=0$, since the powers cannot sum to 1.
For case 1, we have
$$
\frac{0.9}{1+0.9p_1}=\frac{0.8}{0.9+0.8p_2}=\frac{0.7}{1+0.7p_3}=\lambda
$$
from $\partial L/\partial p_1=\partial L/\partial p_2=\partial L/\partial p_3=0$. Together with $p_3=1-p_1-p_2$ from $\partial L/\partial\lambda=0$, we find the solution to be $p_1=0.444$, $p_2=0.430$, $p_3=0.126$, and the objective function $f(p_1,p_2,p_3)=0.639$.
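As a quick check of these case-1 numbers: each stationarity equation can be rearranged to $p_i=1/\lambda-n_i/g_i$, and substituting into $p_1+p_2+p_3=1$ pins down $\lambda$ in closed form. A short verification in Python:

```python
import math

n = [1.0, 0.9, 1.0]
g = [0.9, 0.8, 0.7]

# rearranging g_i/(n_i + g_i*p_i) = lambda gives p_i = 1/lambda - n_i/g_i;
# summing over the three channels with p1 + p2 + p3 = 1 fixes 1/lambda
inv_lam = (1 + sum(ni / gi for ni, gi in zip(n, g))) / 3
p = [inv_lam - ni / gi for ni, gi in zip(n, g)]
fval = sum(math.log(ni + gi * pi) for ni, gi, pi in zip(n, g, p))
print([round(x, 3) for x in p], round(fval, 3))   # [0.444, 0.43, 0.126] 0.639
```

This is the classic "water-filling" structure: every active channel is filled up to the common level $1/\lambda$.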
For case 2, we have $p_3=0$ from $\partial L/\partial\theta_3=0$. Further, using $p_2=1-p_1$ from $\partial L/\partial\lambda=0$, and
$$
\frac{0.9}{1+0.9p_1}=\frac{0.8}{0.9+0.8p_2}=\lambda
$$
from $\partial L/\partial p_1=\partial L/\partial p_2=0$, we can solve for $p_1=0.507$ and $p_2=0.493$. The objective function $f(p_1,p_2,p_3)=0.634$.
Similarly, in case 3, $p_2=0$, and we solve for $p_1=0.659$ and $p_3=0.341$, with the objective function $f(p_1,p_2,p_3)=0.574$.
In case 4, we have $p_1=0$, $p_2=0.652$, $p_3=0.348$, and the objective function $f(p_1,p_2,p_3)=0.570$.
In case 5, we have $p_2=p_3=0$ and hence $p_1=1$. The objective function is then $f(p_1,p_2,p_3)=0.536$.
Similarly, in case 6 and case 7, we have $p_2=1$ and $p_3=1$ respectively. The objective function attains 0.531 and 0.425 respectively.
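The eight-case bookkeeping can also be automated. Below is a minimal brute-force sketch (not part of the original derivation) that forces each subset of the $p_i$ to zero, solves the stationarity condition for the remaining channels in closed form, and keeps the best feasible case:

```python
import itertools
import math

n = [1.0, 0.9, 1.0]
g = [0.9, 0.8, 0.7]

def f(p):
    return sum(math.log(ni + gi * pi) for ni, gi, pi in zip(n, g, p))

best_p, best_val = None, -math.inf
for zeroed in itertools.product([False, True], repeat=3):
    free = [i for i in range(3) if not zeroed[i]]
    if not free:
        continue  # case 8: p1 = p2 = p3 = 0 cannot meet the power budget
    # stationarity g_i/(n_i + g_i*p_i) = lambda gives p_i = 1/lambda - n_i/g_i;
    # substituting into sum(p_i) = 1 over the free channels fixes 1/lambda
    inv_lam = (1 + sum(n[i] / g[i] for i in free)) / len(free)
    p = [0.0, 0.0, 0.0]
    for i in free:
        p[i] = inv_lam - n[i] / g[i]
    if min(p) < 0:
        continue  # a negative power means this case is infeasible
    if f(p) > best_val:
        best_p, best_val = p, f(p)

print([round(x, 3) for x in best_p], round(best_val, 3))
```

The winning case is the one with all three channels active, reproducing the hand calculation.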
Comparing all these cases, we find that the maximum value of the objective function is attained in case 1. Hence the solution to this optimization problem is $p_1=0.444$, $p_2=0.430$, $p_3=0.126$, with $f(p_1,p_2,p_3)=0.639$.
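The same answer can be cross-checked numerically, sidestepping the case analysis entirely by handing the $p_i\ge 0$ constraints to a solver as bounds. A sketch using SciPy's `minimize` (we minimize the negated objective, since SciPy minimizes by default):

```python
import numpy as np
from scipy.optimize import minimize

n = np.array([1.0, 0.9, 1.0])
g = np.array([0.9, 0.8, 0.7])

def neg_f(p):
    # negated objective: maximizing f is the same as minimizing -f
    return -np.sum(np.log(n + g * p))

res = minimize(
    neg_f,
    x0=np.full(3, 1/3),                # start from an even power split
    bounds=[(0, 1)] * 3,               # p_i >= 0 (and <= 1 by the budget)
    constraints={"type": "eq", "fun": lambda p: np.sum(p) - 1},
)
print(np.round(res.x, 3), round(-res.fun, 3))
```

The solver converges to the same allocation as case 1, with objective value 0.639.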
While we introduced slack variables into the Lagrangian function in the example above, some books may prefer not to add slack variables and instead restrict the Lagrange multipliers for inequality constraints to be non-negative. In that case, you may see the Lagrangian function written as
$$
L(X, \lambda, \theta, \phi) = f(X) - \lambda g(X) - \theta h(X) + \phi k(X)
$$
but with the requirement that $\theta\ge 0$ and $\phi\ge 0$.
The Lagrangian function is also useful in the primal-dual approach for finding the maximum or minimum. This is particularly helpful when the objective or the constraints are nonlinear, in which case the solution may not be found easily otherwise.
Some books that cover this topic are:
In this tutorial, you discovered how the method of Lagrange multipliers can be applied to inequality constraints. Specifically, you learned:
The post Lagrange Multiplier Approach with Inequality Constraints appeared first on Machine Learning Mastery.