The post Neural Network Models for Combined Classification and Regression appeared first on Machine Learning Mastery.

]]>A simple approach is to develop both regression and classification predictive models on the same data and use the models sequentially.

An alternative and often more effective approach is to develop a single neural network model that can predict both a numeric and class label value from the same input. This is called a **multi-output model** and can be relatively easy to develop and evaluate using modern deep learning libraries such as Keras and TensorFlow.

In this tutorial, you will discover how to develop a neural network for combined regression and classification predictions.

After completing this tutorial, you will know:

- Some prediction problems require predicting both numeric and class label values for each input example.
- How to develop separate regression and classification models for problems that require multiple outputs.
- How to develop and evaluate a neural network model capable of making simultaneous regression and classification predictions.

Let’s get started.

This tutorial is divided into three parts; they are:

- Single Model for Regression and Classification
- Separate Regression and Classification Models
- Abalone Dataset
- Regression Model
- Classification Model

- Combined Regression and Classification Models

It is common to develop a deep learning neural network model for a regression or classification problem, but on some predictive modeling tasks, we may want to develop a single model that can make both regression and classification predictions.

Regression refers to predictive modeling problems that involve predicting a numeric value given an input.

Classification refers to predictive modeling problems that involve predicting a class label or probability of class labels for a given input.

For more on the difference between classification and regression, see the tutorial:

There may be some problems where we want to predict both a numerical value and a classification value.

One approach to solving this problem is to develop a separate model for each prediction that is required.

The problem with this approach is that the predictions made by the separate models may diverge.

An alternate approach that can be used when using neural network models is to develop a single model capable of making separate predictions for a numeric and class output for the same input.

This is called a multi-output neural network model.

The benefit of this type of model is that we have a single model to develop and maintain instead of two models and that training and updating the model on both output types at the same time may offer more consistency in the predictions between the two output types.

We will develop a multi-output neural network model capable of making regression and classification predictions at the same time.

First, let’s select a dataset where this requirement makes sense and start by developing separate models for both regression and classification predictions.

In this section, we will start by selecting a real dataset where we may want regression and classification predictions at the same time, then develop separate models for each type of prediction.

We will use the “*abalone*” dataset.

Determining the age of an abalone is a time-consuming task and it is desirable to determine the age from physical details alone.

This is a dataset that describes the physical details of abalone and requires predicting the number of rings of the abalone, which is a proxy for the age of the creature.

You can learn more about the dataset from here:

The “*age*” can be predicted as both a numerical value (in years) or a class label (ordinal year as a class).

No need to download the dataset as we will download it automatically as part of the worked examples.

The dataset provides an example of a dataset where we may want both a numerical and classification of an input.

First, let’s develop an example to download and summarize the dataset.

# load and summarize the abalone dataset from pandas import read_csv from matplotlib import pyplot # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv' dataframe = read_csv(url, header=None) # summarize shape print(dataframe.shape) # summarize first few lines print(dataframe.head())

Running the example first downloads and summarizes the shape of the dataset.

We can see that there are 4,177 examples (rows) that we can use to train and evaluate a model and 9 features (columns) including the target variable.

We can see that all input variables are numeric except the first, which is a string value.

To keep data preparation simple, we will drop the first column from our models and focus on modeling the numeric input values.

(4177, 9) 0 1 2 3 4 5 6 7 8 0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15 1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7 2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9 3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10 4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7

We can use the data as the basis for developing separate regression and classification Multilayer Perceptron (MLP) neural network models.

**Note**: we are not trying to develop an optimal model for this dataset; instead we are demonstrating a specific technique: developing a model that can make both regression and classification predictions.

In this section, we will develop a regression MLP model for the abalone dataset.

First, we must separate the columns into input and output elements and drop the first column that contains string values.

We will also force all loaded columns to have a float type (expected by neural network models) and record the number of input features, which will need to be known by the model later.

... # split into input (X) and output (y) variables X, y = dataset[:, 1:-1], dataset[:, -1] X, y = X.astype('float'), y.astype('float') n_features = X.shape[1]

Next, we can split the dataset into a train and test dataset.

We will use a 67% random sample to train the model and the remaining 33% to evaluate the model.

... # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

We can then define an MLP neural network model.

The model will have two hidden layers, the first with 20 nodes and the second with 10 nodes, both using ReLU activation and “*he normal*” weight initialization (a good practice). The number of layers and nodes were chosen arbitrarily.

The output layer will have a single node for predicting a numeric value and a linear activation function.

... # define the keras model model = Sequential() model.add(Dense(20, input_dim=n_features, activation='relu', kernel_initializer='he_normal')) model.add(Dense(10, activation='relu', kernel_initializer='he_normal')) model.add(Dense(1, activation='linear'))

The model will be trained to minimize the mean squared error (MSE) loss function using the effective Adam version of stochastic gradient descent.

... # compile the keras model model.compile(loss='mse', optimizer='adam')

We will train the model for 150 epochs with a mini-batch size of 32 samples, again chosen arbitrarily.

... # fit the keras model on the dataset model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=2)

Finally, after the model is trained, we will evaluate it on the holdout test dataset and report the mean absolute error (MAE).

... # evaluate on test set yhat = model.predict(X_test) error = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % error)

Tying this all together, the complete example of an MLP neural network for the abalone dataset framed as a regression problem is listed below.

# regression mlp model for the abalone dataset from pandas import read_csv from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense from sklearn.metrics import mean_absolute_error from sklearn.model_selection import train_test_split # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv' dataframe = read_csv(url, header=None) dataset = dataframe.values # split into input (X) and output (y) variables X, y = dataset[:, 1:-1], dataset[:, -1] X, y = X.astype('float'), y.astype('float') n_features = X.shape[1] # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # define the keras model model = Sequential() model.add(Dense(20, input_dim=n_features, activation='relu', kernel_initializer='he_normal')) model.add(Dense(10, activation='relu', kernel_initializer='he_normal')) model.add(Dense(1, activation='linear')) # compile the keras model model.compile(loss='mse', optimizer='adam') # fit the keras model on the dataset model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=2) # evaluate on test set yhat = model.predict(X_test) error = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % error)

Running the example will prepare the dataset, fit the model, and report an estimate of model error.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an error of about 1.5 (rings).

... Epoch 145/150 88/88 - 0s - loss: 4.6130 Epoch 146/150 88/88 - 0s - loss: 4.6182 Epoch 147/150 88/88 - 0s - loss: 4.6277 Epoch 148/150 88/88 - 0s - loss: 4.6437 Epoch 149/150 88/88 - 0s - loss: 4.6166 Epoch 150/150 88/88 - 0s - loss: 4.6132 MAE: 1.554

So far so good.

Next, let’s look at developing a similar model for classification.

The abalone dataset can be framed as a classification problem where each “*ring*” integer is taken as a separate class label.

The example and model are much the same as the above example for regression, with a few important changes.

This requires first assigning a separate integer for each “*ring*” value, starting at 0 and ending at the total number of “*classes*” minus one.

This can be achieved using the LabelEncoder.

We can also record the total number of classes as the total number of unique encoded class values, which will be needed by the model later.

... # encode strings to integer y = LabelEncoder().fit_transform(y) n_class = len(unique(y))

After splitting the data into train and test sets as before, we can define the model and change the number of outputs from the model to equal the number of classes and use the softmax activation function, common for multi-class classification.

... # define the keras model model = Sequential() model.add(Dense(20, input_dim=n_features, activation='relu', kernel_initializer='he_normal')) model.add(Dense(10, activation='relu', kernel_initializer='he_normal')) model.add(Dense(n_class, activation='softmax'))

Given we have encoded class labels as integer values, we can fit the model by minimizing the sparse categorical cross-entropy loss function, appropriate for multi-class classification tasks with integer encoded class labels.

... # compile the keras model model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

After the model is fit on the training dataset as before, we can evaluate the performance of the model by calculating the classification accuracy on the hold-out test set.

... # evaluate on test set yhat = model.predict(X_test) yhat = argmax(yhat, axis=-1).astype('int') acc = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % acc)

Tying this all together, the complete example of an MLP neural network for the abalone dataset framed as a classification problem is listed below.

# classification mlp model for the abalone dataset from numpy import unique from numpy import argmax from pandas import read_csv from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense from sklearn.metrics import accuracy_score from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv' dataframe = read_csv(url, header=None) dataset = dataframe.values # split into input (X) and output (y) variables X, y = dataset[:, 1:-1], dataset[:, -1] X, y = X.astype('float'), y.astype('float') n_features = X.shape[1] # encode strings to integer y = LabelEncoder().fit_transform(y) n_class = len(unique(y)) # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # define the keras model model = Sequential() model.add(Dense(20, input_dim=n_features, activation='relu', kernel_initializer='he_normal')) model.add(Dense(10, activation='relu', kernel_initializer='he_normal')) model.add(Dense(n_class, activation='softmax')) # compile the keras model model.compile(loss='sparse_categorical_crossentropy', optimizer='adam') # fit the keras model on the dataset model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=2) # evaluate on test set yhat = model.predict(X_test) yhat = argmax(yhat, axis=-1).astype('int') acc = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % acc)

Running the example will prepare the dataset, fit the model, and report an estimate of model error.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 27%.

... Epoch 145/150 88/88 - 0s - loss: 1.9271 Epoch 146/150 88/88 - 0s - loss: 1.9265 Epoch 147/150 88/88 - 0s - loss: 1.9265 Epoch 148/150 88/88 - 0s - loss: 1.9271 Epoch 149/150 88/88 - 0s - loss: 1.9262 Epoch 150/150 88/88 - 0s - loss: 1.9260 Accuracy: 0.274

So far so good.

Next, let’s look at developing a combined model capable of both regression and classification predictions.

In this section, we can develop a single MLP neural network model that can make both regression and classification predictions for a single input.

This is called a multi-output model and can be developed using the functional Keras API.

For more on this functional API, which can be tricky for beginners, see the tutorials:

- TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras
- How to Use the Keras Functional API for Deep Learning

First, the dataset must be prepared.

We can prepare the dataset as we did before for classification, although we should save the encoded target variable with a separate name to differentiate it from the raw target variable values.

... # encode strings to integer y_class = LabelEncoder().fit_transform(y) n_class = len(unique(y_class))

We can then split the input, raw output, and encoded output variables into train and test sets.

... # split data into train and test sets X_train, X_test, y_train, y_test, y_train_class, y_test_class = train_test_split(X, y, y_class, test_size=0.33, random_state=1)

Next, we can define the model using the functional API.

The model takes the same number of inputs as before with the standalone models and uses two hidden layers configured in the same way.

... # input visible = Input(shape=(n_features,)) hidden1 = Dense(20, activation='relu', kernel_initializer='he_normal')(visible) hidden2 = Dense(10, activation='relu', kernel_initializer='he_normal')(hidden1)

We can then define two separate output layers that connect to the second hidden layer of the model.

The first is a regression output layer that has a single node and a linear activation function.

... # regression output out_reg = Dense(1, activation='linear')(hidden2)

The second is a classification output layer that has one node for each class being predicted and uses a softmax activation function.

... # classification output out_clas = Dense(n_class, activation='softmax')(hidden2)

We can then define the model with a single input layer and two output layers.

... # define model model = Model(inputs=visible, outputs=[out_reg, out_clas])

Given the two output layers, we can compile the model with two loss functions, mean squared error loss for the first (regression) output layer and sparse categorical cross-entropy for the second (classification) output layer.

... # compile the keras model model.compile(loss=['mse','sparse_categorical_crossentropy'], optimizer='adam')

We can also create a plot of the model for reference.

This requires that pydot and pygraphviz are installed. If this is a problem, you can comment out this line and the import statement for the *plot_model()* function.

... # plot graph of model plot_model(model, to_file='model.png', show_shapes=True)

Each time the model makes a prediction, it will predict two values.

Similarly, when training the model, it will need one target variable per sample for each output.

As such, we can train the model, carefully providing both the regression target and classification target data to each output of the model.

... # fit the keras model on the dataset model.fit(X_train, [y_train,y_train_class], epochs=150, batch_size=32, verbose=2)

The fit model can then make a regression and classification prediction for each example in the hold-out test set.

... # make predictions on test set yhat1, yhat2 = model.predict(X_test)

The first array can be used to evaluate the regression predictions via mean absolute error.

... # calculate error for regression model error = mean_absolute_error(y_test, yhat1) print('MAE: %.3f' % error)

The second array can be used to evaluate the classification predictions via classification accuracy.

... # evaluate accuracy for classification model yhat2 = argmax(yhat2, axis=-1).astype('int') acc = accuracy_score(y_test_class, yhat2) print('Accuracy: %.3f' % acc)

And that’s it.

Tying this together, the complete example of training and evaluating a multi-output model for combiner regression and classification predictions on the abalone dataset is listed below.

# mlp for combined regression and classification predictions on the abalone dataset from numpy import unique from numpy import argmax from pandas import read_csv from sklearn.metrics import mean_absolute_error from sklearn.metrics import accuracy_score from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder from tensorflow.keras.models import Model from tensorflow.keras.layers import Input from tensorflow.keras.layers import Dense from tensorflow.keras.utils import plot_model # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv' dataframe = read_csv(url, header=None) dataset = dataframe.values # split into input (X) and output (y) variables X, y = dataset[:, 1:-1], dataset[:, -1] X, y = X.astype('float'), y.astype('float') n_features = X.shape[1] # encode strings to integer y_class = LabelEncoder().fit_transform(y) n_class = len(unique(y_class)) # split data into train and test sets X_train, X_test, y_train, y_test, y_train_class, y_test_class = train_test_split(X, y, y_class, test_size=0.33, random_state=1) # input visible = Input(shape=(n_features,)) hidden1 = Dense(20, activation='relu', kernel_initializer='he_normal')(visible) hidden2 = Dense(10, activation='relu', kernel_initializer='he_normal')(hidden1) # regression output out_reg = Dense(1, activation='linear')(hidden2) # classification output out_clas = Dense(n_class, activation='softmax')(hidden2) # define model model = Model(inputs=visible, outputs=[out_reg, out_clas]) # compile the keras model model.compile(loss=['mse','sparse_categorical_crossentropy'], optimizer='adam') # plot graph of model plot_model(model, to_file='model.png', show_shapes=True) # fit the keras model on the dataset model.fit(X_train, [y_train,y_train_class], epochs=150, batch_size=32, verbose=2) # make predictions on test set yhat1, yhat2 = model.predict(X_test) # calculate error for regression model error = mean_absolute_error(y_test, yhat1) print('MAE: %.3f' % error) # evaluate accuracy for classification model yhat2 = argmax(yhat2, axis=-1).astype('int') acc = accuracy_score(y_test_class, yhat2) print('Accuracy: %.3f' % acc)

Running the example will prepare the dataset, fit the model, and report an estimate of model error.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

A plot of the multi-output model is created, clearly showing the regression (left) and classification (right) output layers connected to the second hidden layer of the model.

In this case, we can see that the model achieved both a reasonable error of about 1.495 (rings) and a similar accuracy as before of about 25.6%.

... Epoch 145/150 88/88 - 0s - loss: 6.5707 - dense_2_loss: 4.5396 - dense_3_loss: 2.0311 Epoch 146/150 88/88 - 0s - loss: 6.5753 - dense_2_loss: 4.5466 - dense_3_loss: 2.0287 Epoch 147/150 88/88 - 0s - loss: 6.5970 - dense_2_loss: 4.5723 - dense_3_loss: 2.0247 Epoch 148/150 88/88 - 0s - loss: 6.5640 - dense_2_loss: 4.5389 - dense_3_loss: 2.0251 Epoch 149/150 88/88 - 0s - loss: 6.6053 - dense_2_loss: 4.5827 - dense_3_loss: 2.0226 Epoch 150/150 88/88 - 0s - loss: 6.5754 - dense_2_loss: 4.5524 - dense_3_loss: 2.0230 MAE: 1.495 Accuracy: 0.256

This section provides more resources on the topic if you are looking to go deeper.

- Difference Between Classification and Regression in Machine Learning
- TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras
- Best Results for Standard Machine Learning Datasets
- How to Use the Keras Functional API for Deep Learning

In this tutorial, you discovered how to develop a neural network for combined regression and classification predictions.

Specifically, you learned:

- Some prediction problems require predicting both numeric and class label values for each input example.
- How to develop separate regression and classification models for problems that require multiple outputs.
- How to develop and evaluate a neural network model capable of making simultaneous regression and classification predictions.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Neural Network Models for Combined Classification and Regression appeared first on Machine Learning Mastery.

]]>The post Iterated Local Search From Scratch in Python appeared first on Machine Learning Mastery.

]]>It involves the repeated application of a local search algorithm to modified versions of a good solution found previously. In this way, it is like a clever version of the stochastic hill climbing with random restarts algorithm.

The intuition behind the algorithm is that random restarts can help to locate many local optima in a problem and that better local optima are often close to other local optima. Therefore modest perturbations to existing local optima may locate better or even best solutions to an optimization problem.

In this tutorial, you will discover how to implement the iterated local search algorithm from scratch.

After completing this tutorial, you will know:

- Iterated local search is a stochastic global search optimization algorithm that is a smarter version of stochastic hill climbing with random restarts.
- How to implement stochastic hill climbing with random restarts from scratch.
- How to implement and apply the iterated local search algorithm to a nonlinear objective function.

Let’s get started.

This tutorial is divided into five parts; they are:

- What Is Iterated Local Search
- Ackley Objective Function
- Stochastic Hill Climbing Algorithm
- Stochastic Hill Climbing With Random Restarts
- Iterated Local Search Algorithm

Iterated Local Search, or ILS for short, is a stochastic global search optimization algorithm.

It is related to or an extension of stochastic hill climbing and stochastic hill climbing with random starts.

It’s essentially a more clever version of Hill-Climbing with Random Restarts.

— Page 26, Essentials of Metaheuristics, 2011.

Stochastic hill climbing is a local search algorithm that involves making random modifications to an existing solution and accepting the modification only if it results in better results than the current working solution.

Local search algorithms in general can get stuck in local optima. One approach to address this problem is to restart the search from a new randomly selected starting point. The restart procedure can be performed many times and may be triggered after a fixed number of function evaluations or if no further improvement is seen for a given number of algorithm iterations. This algorithm is called stochastic hill climbing with random restarts.

The simplest possibility to improve upon a cost found by LocalSearch is to repeat the search from another starting point.

— Page 132, Handbook of Metaheuristics, 3rd edition 2019.

Iterated local search is similar to stochastic hill climbing with random restarts, except rather than selecting a random starting point for each restart, a point is selected based on a modified version of the best point found so far during the broader search.

The perturbation of the best solution so far is like a large jump in the search space to a new region, whereas the perturbations made by the stochastic hill climbing algorithm are much smaller, confined to a specific region of the search space.

The heuristic here is that you can often find better local optima near to the one you’re presently in, and walking from local optimum to local optimum in this way often outperforms just trying new locations entirely at random.

— Page 26, Essentials of Metaheuristics, 2011.

This allows the search to be performed at two levels. The hill climbing algorithm is the local search for getting the most out of a specific candidate solution or region of the search space, and the restart approach allows different regions of the search space to be explored.

In this way, the algorithm Iterated Local Search explores multiple local optima in the search space, increasing the likelihood of locating the global optima.

The Iterated Local Search was proposed for combinatorial optimization problems, such as the traveling salesman problem (TSP), although it can be applied to continuous function optimization by using different step sizes in the search space: smaller steps for the hill climbing and larger steps for the random restart.

Now that we are familiar with the Iterated Local Search algorithm, let’s explore how to implement the algorithm from scratch.

First, let’s define a channeling optimization problem as the basis for implementing the Iterated Local Search algorithm.

The Ackley function is an example of a multimodal objective function that has a single global optima and multiple local optima in which a local search might get stuck.

As such, a global optimization technique is required. It is a two-dimensional objective function that has a global optima at [0,0], which evaluates to 0.0.

The example below implements the Ackley and creates a three-dimensional surface plot showing the global optima and multiple local optima.

# ackley multimodal function from numpy import arange from numpy import exp from numpy import sqrt from numpy import cos from numpy import e from numpy import pi from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20 # define range for input r_min, r_max = -5.0, 5.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates the surface plot of the Ackley function showing the vast number of local optima.

We will use this as the basis for implementing and comparing a simple stochastic hill climbing algorithm, stochastic hill climbing with random restarts, and finally iterated local search.

We would expect a stochastic hill climbing algorithm to get stuck easily in local minima. We would expect stochastic hill climbing with restarts to find many local minima, and we would expect iterated local search to perform better than either method on this problem if configured appropriately.

Core to the Iterated Local Search algorithm is a local search, and in this tutorial, we will use the Stochastic Hill Climbing algorithm for this purpose.

The Stochastic Hill Climbing algorithm involves first generating a random starting point and current working solution, then generating perturbed versions of the current working solution and accepting them if they are better than the current working solution.

Given that we are working on a continuous optimization problem, a solution is a vector of values to be evaluated by the objective function, in this case, a point in a two-dimensional space bounded by -5 and 5.

We can generate a random point by sampling the search space with a uniform probability distribution. For example:

... # generate a random point in the search space solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

We can generate perturbed versions of a currently working solution using a Gaussian probability distribution with the mean of the current values in the solution and a standard deviation controlled by a hyperparameter that controls how far the search is allowed to explore from the current working solution.

We will refer to this hyperparameter as “*step_size*“, for example:

... # generate a perturbed version of a current working solution candidate = solution + randn(len(bounds)) * step_size

Importantly, we must check that generated solutions are within the search space.

This can be achieved with a custom function named *in_bounds()* that takes a candidate solution and the bounds of the search space and returns True if the point is in the search space, *False* otherwise.

# check if a point is within the bounds of the search def in_bounds(point, bounds): # enumerate all dimensions of the point for d in range(len(bounds)): # check if out of bounds for this dimension if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]: return False return True

This function can then be called during the hill climb to confirm that new points are in the bounds of the search space, and if not, new points can be generated.

Tying this together, the function *hillclimbing()* below implements the stochastic hill climbing local search algorithm. It takes the name of the objective function, bounds of the problem, number of iterations, and steps size as arguments and returns the best solution and its evaluation.

# hill climbing local search algorithm def hillclimbing(objective, bounds, n_iterations, step_size): # generate an initial point solution = None while solution is None or not in_bounds(solution, bounds): solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # evaluate the initial point solution_eval = objective(solution) # run the hill climb for i in range(n_iterations): # take a step candidate = None while candidate is None or not in_bounds(candidate, bounds): candidate = solution + randn(len(bounds)) * step_size # evaluate candidate point candidte_eval = objective(candidate) # check if we should keep the new point if candidte_eval <= solution_eval: # store the new point solution, solution_eval = candidate, candidte_eval # report progress print('>%d f(%s) = %.5f' % (i, solution, solution_eval)) return [solution, solution_eval]

We can test this algorithm on the Ackley function.

We will fix the seed for the pseudorandom number generator to ensure we get the same results each time the code is run.

The algorithm will be run for 1,000 iterations and a step size of 0.05 units will be used; both hyperparameters were chosen after a little trial and error.

At the end of the run, we will report the best solution found.

... # seed the pseudorandom number generator seed(1) # define range for input bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]]) # define the total iterations n_iterations = 1000 # define the maximum step size step_size = 0.05 # perform the hill climbing search best, score = hillclimbing(objective, bounds, n_iterations, step_size) print('Done!') print('f(%s) = %f' % (best, score))

Tying this together, the complete example of applying the stochastic hill climbing algorithm to the Ackley objective function is listed below.

# hill climbing search of the ackley objective function from numpy import asarray from numpy import exp from numpy import sqrt from numpy import cos from numpy import e from numpy import pi from numpy.random import randn from numpy.random import rand from numpy.random import seed # objective function def objective(v): x, y = v return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20 # check if a point is within the bounds of the search def in_bounds(point, bounds): # enumerate all dimensions of the point for d in range(len(bounds)): # check if out of bounds for this dimension if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]: return False return True # hill climbing local search algorithm def hillclimbing(objective, bounds, n_iterations, step_size): # generate an initial point solution = None while solution is None or not in_bounds(solution, bounds): solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # evaluate the initial point solution_eval = objective(solution) # run the hill climb for i in range(n_iterations): # take a step candidate = None while candidate is None or not in_bounds(candidate, bounds): candidate = solution + randn(len(bounds)) * step_size # evaluate candidate point candidte_eval = objective(candidate) # check if we should keep the new point if candidte_eval <= solution_eval: # store the new point solution, solution_eval = candidate, candidte_eval # report progress print('>%d f(%s) = %.5f' % (i, solution, solution_eval)) return [solution, solution_eval] # seed the pseudorandom number generator seed(1) # define range for input bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]]) # define the total iterations n_iterations = 1000 # define the maximum step size step_size = 0.05 # perform the hill climbing search best, score = hillclimbing(objective, bounds, n_iterations, step_size) print('Done!') print('f(%s) = %f' % (best, score))

Running the example performs the stochastic hill climbing search on the objective function. Each improvement found during the search is reported and the best solution is then reported at the end of the search.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see about 13 improvements during the search and a final solution of about f(-0.981, 1.965), resulting in an evaluation of about 5.381, which is far from f(0.0, 0.0) = 0.

>0 f([-0.85618854 2.1495965 ]) = 6.46986 >1 f([-0.81291816 2.03451957]) = 6.07149 >5 f([-0.82903902 2.01531685]) = 5.93526 >7 f([-0.83766043 1.97142393]) = 5.82047 >9 f([-0.89269139 2.02866012]) = 5.68283 >12 f([-0.8988359 1.98187164]) = 5.55899 >13 f([-0.9122303 2.00838942]) = 5.55566 >14 f([-0.94681334 1.98855174]) = 5.43024 >15 f([-0.98117198 1.94629146]) = 5.39010 >23 f([-0.97516403 1.97715161]) = 5.38735 >39 f([-0.98628044 1.96711371]) = 5.38241 >362 f([-0.9808789 1.96858459]) = 5.38233 >629 f([-0.98102417 1.96555308]) = 5.38194 Done! f([-0.98102417 1.96555308]) = 5.381939

Next, we will modify the algorithm to perform random restarts and see if we can achieve better results.

The Stochastic Hill Climbing With Random Restarts algorithm involves the repeated running of the Stochastic Hill Climbing algorithm and keeping track of the best solution found.

First, let’s modify the *hillclimbing()* function to take the starting point of the search rather than generating it randomly. This will help later when we implement the Iterated Local Search algorithm later.

# hill climbing local search algorithm def hillclimbing(objective, bounds, n_iterations, step_size, start_pt): # store the initial point solution = start_pt # evaluate the initial point solution_eval = objective(solution) # run the hill climb for i in range(n_iterations): # take a step candidate = None while candidate is None or not in_bounds(candidate, bounds): candidate = solution + randn(len(bounds)) * step_size # evaluate candidate point candidte_eval = objective(candidate) # check if we should keep the new point if candidte_eval <= solution_eval: # store the new point solution, solution_eval = candidate, candidte_eval return [solution, solution_eval]

Next, we can implement the random restart algorithm by repeatedly calling the *hillclimbing()* function a fixed number of times.

Each call, we will generate a new randomly selected starting point for the hill climbing search.

... # generate a random initial point for the search start_pt = None while start_pt is None or not in_bounds(start_pt, bounds): start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # perform a stochastic hill climbing search solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)

We can then inspect the result and keep it if it is better than any result of the search we have seen so far.

... # check for new best if solution_eval < best_eval: best, best_eval = solution, solution_eval print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))

Tying this together, the *random_restarts()* function implemented the stochastic hill climbing algorithm with random restarts.

# hill climbing with random restarts algorithm def random_restarts(objective, bounds, n_iter, step_size, n_restarts): best, best_eval = None, 1e+10 # enumerate restarts for n in range(n_restarts): # generate a random initial point for the search start_pt = None while start_pt is None or not in_bounds(start_pt, bounds): start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # perform a stochastic hill climbing search solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt) # check for new best if solution_eval < best_eval: best, best_eval = solution, solution_eval print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval)) return [best, best_eval]

We can then apply this algorithm to the Ackley objective function. In this case, we will limit the number of random restarts to 30, chosen arbitrarily.

The complete example is listed below.

# hill climbing search with random restarts of the ackley objective function from numpy import asarray from numpy import exp from numpy import sqrt from numpy import cos from numpy import e from numpy import pi from numpy.random import randn from numpy.random import rand from numpy.random import seed # objective function def objective(v): x, y = v return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20 # check if a point is within the bounds of the search def in_bounds(point, bounds): # enumerate all dimensions of the point for d in range(len(bounds)): # check if out of bounds for this dimension if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]: return False return True # hill climbing local search algorithm def hillclimbing(objective, bounds, n_iterations, step_size, start_pt): # store the initial point solution = start_pt # evaluate the initial point solution_eval = objective(solution) # run the hill climb for i in range(n_iterations): # take a step candidate = None while candidate is None or not in_bounds(candidate, bounds): candidate = solution + randn(len(bounds)) * step_size # evaluate candidate point candidte_eval = objective(candidate) # check if we should keep the new point if candidte_eval <= solution_eval: # store the new point solution, solution_eval = candidate, candidte_eval return [solution, solution_eval] # hill climbing with random restarts algorithm def random_restarts(objective, bounds, n_iter, step_size, n_restarts): best, best_eval = None, 1e+10 # enumerate restarts for n in range(n_restarts): # generate a random initial point for the search start_pt = None while start_pt is None or not in_bounds(start_pt, bounds): start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # perform a stochastic hill climbing search solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt) # check for new best if solution_eval < best_eval: best, best_eval = solution, solution_eval print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval)) return [best, best_eval] # seed the pseudorandom number generator seed(1) # define range for input bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]]) # define the total iterations n_iter = 1000 # define the maximum step size step_size = 0.05 # total number of random restarts n_restarts = 30 # perform the hill climbing search best, score = random_restarts(objective, bounds, n_iter, step_size, n_restarts) print('Done!') print('f(%s) = %f' % (best, score))

Running the example will perform a stochastic hill climbing with random restarts search for the Ackley objective function. Each time an improved overall solution is discovered, it is reported and the final best solution found by the search is summarized.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see three improvements during the search and that the best solution found was approximately f(0.002, 0.002), which evaluated to about 0.009, which is much better than a single run of the hill climbing algorithm.

Restart 0, best: f([-0.98102417 1.96555308]) = 5.38194 Restart 2, best: f([1.96522236 0.98120013]) = 5.38191 Restart 4, best: f([0.00223194 0.00258853]) = 0.00998 Done! f([0.00223194 0.00258853]) = 0.009978

Next, let’s look at how we can implement the iterated local search algorithm.

The Iterated Local Search algorithm is a modified version of the stochastic hill climbing with random restarts algorithm.

The important difference is that the starting point for each application of the stochastic hill climbing algorithm is a perturbed version of the best point found so far.

We can implement this algorithm by using the *random_restarts()* function as a starting point. Each restart iteration, we can generate a modified version of the best solution found so far instead of a random starting point.

This can be achieved by using a step size hyperparameter, much like is used in the stochastic hill climber. In this case, a larger step size value will be used given the need for larger perturbations in the search space.

... # generate an initial point as a perturbed version of the last best start_pt = None while start_pt is None or not in_bounds(start_pt, bounds): start_pt = best + randn(len(bounds)) * p_size

Tying this together, the *iterated_local_search()* function is defined below.

# iterated local search algorithm def iterated_local_search(objective, bounds, n_iter, step_size, n_restarts, p_size): # define starting point best = None while best is None or not in_bounds(best, bounds): best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # evaluate current best point best_eval = objective(best) # enumerate restarts for n in range(n_restarts): # generate an initial point as a perturbed version of the last best start_pt = None while start_pt is None or not in_bounds(start_pt, bounds): start_pt = best + randn(len(bounds)) * p_size # perform a stochastic hill climbing search solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt) # check for new best if solution_eval < best_eval: best, best_eval = solution, solution_eval print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval)) return [best, best_eval]

We can then apply the algorithm to the Ackley objective function. In this case, we will use a larger step size value of 1.0 for the random restarts, chosen after a little trial and error.

The complete example is listed below.

# iterated local search of the ackley objective function from numpy import asarray from numpy import exp from numpy import sqrt from numpy import cos from numpy import e from numpy import pi from numpy.random import randn from numpy.random import rand from numpy.random import seed # objective function def objective(v): x, y = v return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20 # check if a point is within the bounds of the search def in_bounds(point, bounds): # enumerate all dimensions of the point for d in range(len(bounds)): # check if out of bounds for this dimension if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]: return False return True # hill climbing local search algorithm def hillclimbing(objective, bounds, n_iterations, step_size, start_pt): # store the initial point solution = start_pt # evaluate the initial point solution_eval = objective(solution) # run the hill climb for i in range(n_iterations): # take a step candidate = None while candidate is None or not in_bounds(candidate, bounds): candidate = solution + randn(len(bounds)) * step_size # evaluate candidate point candidte_eval = objective(candidate) # check if we should keep the new point if candidte_eval <= solution_eval: # store the new point solution, solution_eval = candidate, candidte_eval return [solution, solution_eval] # iterated local search algorithm def iterated_local_search(objective, bounds, n_iter, step_size, n_restarts, p_size): # define starting point best = None while best is None or not in_bounds(best, bounds): best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # evaluate current best point best_eval = objective(best) # enumerate restarts for n in range(n_restarts): # generate an initial point as a perturbed version of the last best start_pt = None while start_pt is None or not in_bounds(start_pt, bounds): start_pt = best + randn(len(bounds)) * p_size # perform a stochastic hill climbing search solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt) # check for new best if solution_eval < best_eval: best, best_eval = solution, solution_eval print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval)) return [best, best_eval] # seed the pseudorandom number generator seed(1) # define range for input bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]]) # define the total iterations n_iter = 1000 # define the maximum step size s_size = 0.05 # total number of random restarts n_restarts = 30 # perturbation step size p_size = 1.0 # perform the hill climbing search best, score = iterated_local_search(objective, bounds, n_iter, s_size, n_restarts, p_size) print('Done!') print('f(%s) = %f' % (best, score))

Running the example will perform an Iterated Local Search of the Ackley objective function.

Each time an improved overall solution is discovered, it is reported and the final best solution found by the search is summarized at the end of the run.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see four improvements during the search and that the best solution found was two very small inputs that are close to zero, which evaluated to about 0.0003, which is better than either a single run of the hill climber or the hill climber with restarts.

Restart 0, best: f([-0.96775653 0.96853129]) = 3.57447 Restart 3, best: f([-4.50618519e-04 9.51020713e-01]) = 2.57996 Restart 5, best: f([ 0.00137423 -0.00047059]) = 0.00416 Restart 22, best: f([ 1.16431936e-04 -3.31358206e-06]) = 0.00033 Done! f([ 1.16431936e-04 -3.31358206e-06]) = 0.000330

This section provides more resources on the topic if you are looking to go deeper.

- Essentials of Metaheuristics, 2011.
- Handbook of Metaheuristics, 3rd edition 2019.

In this tutorial, you discovered how to implement the iterated local search algorithm from scratch.

Specifically, you learned:

- Iterated local search is a stochastic global search optimization algorithm that is a smarter version of stochastic hill climbing with random restarts.
- How to implement stochastic hill climbing with random restarts from scratch.
- How to implement and apply the iterated local search algorithm to a nonlinear objective function.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Iterated Local Search From Scratch in Python appeared first on Machine Learning Mastery.

]]>The post Develop a Neural Network for Woods Mammography Dataset appeared first on Machine Learning Mastery.

]]>One approach is to first inspect the dataset and develop ideas for what models might work, then explore the learning dynamics of simple models on the dataset, then finally develop and tune a model for the dataset with a robust test harness.

This process can be used to develop effective neural network models for classification and regression predictive modeling problems.

In this tutorial, you will discover how to develop a Multilayer Perceptron neural network model for the Wood’s Mammography classification dataset.

After completing this tutorial, you will know:

- How to load and summarize the Wood’s Mammography dataset and use the results to suggest data preparations and model configurations to use.
- How to explore the learning dynamics of simple MLP models on the dataset.
- How to develop robust estimates of model performance, tune model performance and make predictions on new data.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Woods Mammography Dataset
- Neural Network Learning Dynamics
- Robust Model Evaluation
- Final Model and Make Predictions

The first step is to define and explore the dataset.

We will be working with the “*mammography*” standard binary classification dataset, sometimes called “*Woods Mammography*“.

The dataset is credited to Kevin Woods, et al. and the 1993 paper titled “Comparative Evaluation Of Pattern Recognition Techniques For Detection Of Microcalcifications In Mammography.”

The focus of the problem is on detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram.

There are two classes and the goal is to distinguish between microcalcifications and non-microcalcifications using the features for a given segmented object.

**Non-microcalcifications**: negative case, or majority class.**Microcalcifications**: positive case, or minority class.

The Mammography dataset is a widely used standard machine learning dataset, used to explore and demonstrate many techniques designed specifically for imbalanced classification.

**Note**: To be crystal clear, we are NOT “*solving breast cancer*“. We are exploring a standard classification dataset.

Below is a sample of the first 5 rows of the dataset

0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223,'-1' 0.15549112,-0.16939038,0.67065219,-0.85955255,-0.37786573,-0.94572324,'-1' -0.78441482,-0.44365372,5.6747053,-0.85955255,-0.37786573,-0.94572324,'-1' 0.54608818,0.13141457,-0.45638679,-0.85955255,-0.37786573,-0.94572324,'-1' -0.10298725,-0.3949941,-0.14081588,0.97970269,-0.37786573,1.0135658,'-1' ...

You can learn more about the dataset here:

We can load the dataset as a pandas DataFrame directly from the URL; for example:

# load the mammography dataset and summarize the shape from pandas import read_csv # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv' # load the dataset df = read_csv(url, header=None) # summarize shape print(df.shape)

Running the example loads the dataset directly from the URL and reports the shape of the dataset.

In this case, we can confirm that the dataset has 7 variables (6 input and one output) and that the dataset has 11,183 rows of data.

This a modest sized dataset for a neural network and suggests that a small network would be appropriate.

It also suggests that using k-fold cross-validation would be a good idea given that it will give a more reliable estimate of model performance than a train/test split and because a single model will fit in seconds instead of hours or days with the largest datasets.

(11183, 7)

Next, we can learn more about the dataset by looking at summary statistics and a plot of the data.

# show summary statistics and plots of the mammography dataset from pandas import read_csv from matplotlib import pyplot # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv' # load the dataset df = read_csv(url, header=None) # show summary statistics print(df.describe()) # plot histograms df.hist() pyplot.show()

Running the example first loads the data before and then prints summary statistics for each variable.

We can see that the values are generally small with means close to zero.

0 1 ... 4 5 count 1.118300e+04 1.118300e+04 ... 1.118300e+04 1.118300e+04 mean 1.096535e-10 1.297595e-09 ... -1.120680e-09 1.459483e-09 std 1.000000e+00 1.000000e+00 ... 1.000000e+00 1.000000e+00 min -7.844148e-01 -4.701953e-01 ... -3.778657e-01 -9.457232e-01 25% -7.844148e-01 -4.701953e-01 ... -3.778657e-01 -9.457232e-01 50% -1.085769e-01 -3.949941e-01 ... -3.778657e-01 -9.457232e-01 75% 3.139489e-01 -7.649473e-02 ... -3.778657e-01 1.016613e+00 max 3.150844e+01 5.085849e+00 ... 2.361712e+01 1.949027e+00

A histogram plot is then created for each variable.

We can see that perhaps most variables have an exponential distribution, and perhaps variable 5 (the last input variable) is Gaussian with outliers/missing values.

We may have some benefit in using a power transform on each variable in order to make the probability distribution less skewed which will likely improve model performance.

It may be helpful to know how imbalanced the dataset actually is.

We can use the Counter object to count the number of examples in each class, then use those counts to summarize the distribution.

The complete example is listed below.

# summarize the class ratio of the mammography dataset from pandas import read_csv from collections import Counter # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv' # load the csv file as a data frame dataframe = read_csv(url, header=None) # summarize the class distribution target = dataframe.values[:,-1] counter = Counter(target) for k,v in counter.items(): per = v / len(target) * 100 print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example summarizes the class distribution, confirming the severe class imbalanced with approximately 98 percent for the majority class (no cancer) and approximately 2 percent for the minority class (cancer).

Class='-1', Count=10923, Percentage=97.675% Class='1', Count=260, Percentage=2.325%

This is helpful because if we use classification accuracy, then any model that achieves an accuracy less than about 97.7% does not have skill on this dataset.

Now that we are familiar with the dataset, let’s explore how we might develop a neural network model.

We will develop a Multilayer Perceptron (MLP) model for the dataset using TensorFlow.

We cannot know what model architecture of learning hyperparameters would be good or best for this dataset, so we must experiment and discover what works well.

Given that the dataset is small, a small batch size is probably a good idea, e.g. 16 or 32 rows. Using the Adam version of stochastic gradient descent is a good idea when getting started as it will automatically adapt the learning rate and works well on most datasets.

Before we evaluate models in earnest, it is a good idea to review the learning dynamics and tune the model architecture and learning configuration until we have stable learning dynamics, then look at getting the most out of the model.

We can do this by using a simple train/test split of the data and review plots of the learning curves. This will help us see if we are over-learning or under-learning; then we can adapt the configuration accordingly.

First, we must ensure all input variables are floating-point values and encode the target label as integer values 0 and 1.

... # ensure all data are floating point values X = X.astype('float32') # encode strings to integer y = LabelEncoder().fit_transform(y)

Next, we can split the dataset into input and output variables, then into 67/33 train and test sets.

We must ensure that the split is stratified by the class ensuring that the train and test sets have the same distribution of class labels as the main dataset.

... # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # split into train and test datasets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)

We can define a minimal MLP model.

In this case, we will use one hidden layer with 50 nodes and one output layer (chosen arbitrarily). We will use the ReLU activation function in the hidden layer and the “*he_normal*” weight initialization, as together, they are a good practice.

The output of the model is a sigmoid activation for binary classification and we will minimize binary cross-entropy loss.

... # define model model = Sequential() model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy')

We will fit the model for 300 training epochs (chosen arbitrarily) with a batch size of 32 because it is a modestly sized dataset.

We are fitting the model on raw data, which we think might be a good idea, but it is an important starting point.

... history = model.fit(X_train, y_train, epochs=300, batch_size=32, verbose=0, validation_data=(X_test,y_test))

At the end of training, we will evaluate the model’s performance on the test dataset and report performance as the classification accuracy.

... # predict test set yhat = model.predict_classes(X_test) # evaluate predictions score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score)

Finally, we will plot learning curves of the cross-entropy loss on the train and test sets during training.

... # plot learning curves pyplot.title('Learning Curves') pyplot.xlabel('Epoch') pyplot.ylabel('Cross Entropy') pyplot.plot(history.history['loss'], label='train') pyplot.plot(history.history['val_loss'], label='val') pyplot.legend() pyplot.show()

Tying this all together, the complete example of evaluating our first MLP on the cancer survival dataset is listed below.

# fit a simple mlp model on the mammography and review learning curves from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder from sklearn.metrics import accuracy_score from tensorflow.keras import Sequential from tensorflow.keras.layers import Dense from matplotlib import pyplot # load the dataset path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv' df = read_csv(path, header=None) # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # ensure all data are floating point values X = X.astype('float32') # encode strings to integer y = LabelEncoder().fit_transform(y) # split into train and test datasets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1) # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy') # fit the model history = model.fit(X_train, y_train, epochs=300, batch_size=32, verbose=0, validation_data=(X_test,y_test)) # predict test set yhat = model.predict_classes(X_test) # evaluate predictions score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score) # plot learning curves pyplot.title('Learning Curves') pyplot.xlabel('Epoch') pyplot.ylabel('Cross Entropy') pyplot.plot(history.history['loss'], label='train') pyplot.plot(history.history['val_loss'], label='val') pyplot.legend() pyplot.show()

Running the example first fits the model on the training dataset, then reports the classification accuracy on the test dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case we can see that the model performs better than a no-skill model, given that the accuracy is above about 97.7 percent, in this case achieving an accuracy of about 98.8 percent.

Accuracy: 0.988

Line plots of the loss on the train and test sets are then created.

We can see that the model quickly finds a good fit on the dataset and does not appear to be over or underfitting.

Now that we have some idea of the learning dynamics for a simple MLP model on the dataset, we can look at developing a more robust evaluation of model performance on the dataset.

The k-fold cross-validation procedure can provide a more reliable estimate of MLP performance, although it can be very slow.

This is because k models must be fit and evaluated. This is not a problem when the dataset size is small, such as the cancer survival dataset.

We can use the StratifiedKFold class and enumerate each fold manually, fit the model, evaluate it, and then report the mean of the evaluation scores at the end of the procedure.

... # prepare cross validation kfold = KFold(10) # enumerate splits scores = list() for train_ix, test_ix in kfold.split(X, y): # fit and evaluate the model... ... ... # summarize all scores print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

We can use this framework to develop a reliable estimate of MLP model performance with our base configuration, and even with a range of different data preparations, model architectures, and learning configurations.

It is important that we first developed an understanding of the learning dynamics of the model on the dataset in the previous section before using k-fold cross-validation to estimate the performance. If we started to tune the model directly, we might get good results, but if not, we might have no idea of why, e.g. that the model was over or under fitting.

If we make large changes to the model again, it is a good idea to go back and confirm that the model is converging appropriately.

The complete example of this framework to evaluate the base MLP model from the previous section is listed below.

# k-fold cross-validation of base model for the mammography dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder from sklearn.metrics import accuracy_score from tensorflow.keras import Sequential from tensorflow.keras.layers import Dense from matplotlib import pyplot # load the dataset path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv' df = read_csv(path, header=None) # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # ensure all data are floating point values X = X.astype('float32') # encode strings to integer y = LabelEncoder().fit_transform(y) # prepare cross validation kfold = StratifiedKFold(10, random_state=1) # enumerate splits scores = list() for train_ix, test_ix in kfold.split(X, y): # split data X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix] # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy') # fit the model model.fit(X_train, y_train, epochs=300, batch_size=32, verbose=0) # predict test set yhat = model.predict_classes(X_test) # evaluate predictions score = accuracy_score(y_test, yhat) print('>%.3f' % score) scores.append(score) # summarize all scores print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the model performance each iteration of the evaluation procedure and reports the mean and standard deviation of classification accuracy at the end of the run.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model achieved a mean accuracy of about 98.7 percent, which is pretty close to our rough estimate in the previous section.

This confirms our expectation that the base model configuration may work better than a naive model for this dataset

>0.987 >0.986 >0.989 >0.987 >0.986 >0.988 >0.989 >0.989 >0.983 >0.988 Mean Accuracy: 0.987 (0.002)

Next, let’s look at how we might fit a final model and use it to make predictions.

Once we choose a model configuration, we can train a final model on all available data and use it to make predictions on new data.

In this case, we will use the model with dropout and a small batch size as our final model.

We can prepare the data and fit the model as before, although on the entire dataset instead of a training subset of the dataset.

... # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # ensure all data are floating point values X = X.astype('float32') # encode strings to integer le = LabelEncoder() y = le.fit_transform(y) # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy')

We can then use this model to make predictions on new data.

First, we can define a row of new data.

... # define a row of new data row = [0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223]

Note: I took this row from the first row of the dataset and the expected label is a ‘-1’.

We can then make a prediction.

... # make prediction yhat = model.predict_classes([row])

Then invert the transform on the prediction, so we can use or interpret the result in the correct label (which is just an integer for this dataset).

... # invert transform to get label for class yhat = le.inverse_transform(yhat)

And in this case, we will simply report the prediction.

... # report prediction print('Predicted: %s' % (yhat[0]))

Tying this all together, the complete example of fitting a final model for the mammography dataset and using it to make a prediction on new data is listed below.

# fit a final model and make predictions on new data for the mammography dataset from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.metrics import accuracy_score from tensorflow.keras import Sequential from tensorflow.keras.layers import Dense from tensorflow.keras.layers import Dropout # load the dataset path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/mammography.csv' df = read_csv(path, header=None) # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # ensure all data are floating point values X = X.astype('float32') # encode strings to integer le = LabelEncoder() y = le.fit_transform(y) # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy') # fit the model model.fit(X, y, epochs=300, batch_size=32, verbose=0) # define a row of new data row = [0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223] # make prediction yhat = model.predict_classes([row]) # invert transform to get label for class yhat = le.inverse_transform(yhat) # report prediction print('Predicted: %s' % (yhat[0]))

Running the example fits the model on the entire dataset and makes a prediction for a single row of new data.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model predicted a “-1” label for the input row.

Predicted: '-1'

This section provides more resources on the topic if you are looking to go deeper.

- Imbalanced Classification Model to Detect Mammography Microcalcifications
- Best Results for Standard Machine Learning Datasets
- TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras
- A Gentle Introduction to k-fold Cross-Validation

In this tutorial, you discovered how to develop a Multilayer Perceptron neural network model for the Wood’s Mammography classification dataset.

Specifically, you learned:

- How to load and summarize the Wood’s Mammography dataset and use the results to suggest data preparations and model configurations to use.
- How to explore the learning dynamics of simple MLP models on the dataset.
- How to develop robust estimates of model performance, tune model performance and make predictions on new data.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Develop a Neural Network for Woods Mammography Dataset appeared first on Machine Learning Mastery.

]]>The post Tune XGBoost Performance With Learning Curves appeared first on Machine Learning Mastery.

]]>It can be challenging to configure the hyperparameters of XGBoost models, which often leads to using large grid search experiments that are both time consuming and computationally expensive.

An alternate approach to configuring **XGBoost** models is to evaluate the performance of the model each iteration of the algorithm during training and to plot the results as **learning curves**. These learning curve plots provide a diagnostic tool that can be interpreted and suggest specific changes to model hyperparameters that may lead to improvements in predictive performance.

In this tutorial, you will discover how to plot and interpret learning curves for XGBoost models in Python.

After completing this tutorial, you will know:

- Learning curves provide a useful diagnostic tool for understanding the training dynamics of supervised learning models like XGBoost.
- How to configure XGBoost to evaluate datasets each iteration and plot the results as learning curves.
- How to interpret and use learning curve plots to improve XGBoost model performance.

Let’s get started.

This tutorial is divided into four parts; they are:

- Extreme Gradient Boosting
- Learning Curves
- Plot XGBoost Learning Curve
- Tune XGBoost Model Using Learning Curves

**Gradient boosting** refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “gradient boosting,” as the loss gradient is minimized as the model is fit, much like a neural network.

For more on gradient boosting, see the tutorial:

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

The two main reasons to use XGBoost are execution speed and model performance.

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

Among the 29 challenge winning solutions 3 published at Kaggle’s blog during 2015, 17 solutions used XGBoost. […] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.

— XGBoost: A Scalable Tree Boosting System, 2016.

For more on XGBoost and how to install and use the XGBoost Python API, see the tutorial:

Now that we are familiar with what XGBoost is and why it is important, let’s take a closer look at learning curves.

Generally, a learning curve is a plot that shows time or experience on the x-axis and learning or improvement on the y-axis.

Learning curves are widely used in machine learning for algorithms that learn (optimize their internal parameters) incrementally over time, such as deep learning neural networks.

The metric used to evaluate learning could be maximizing, meaning that better scores (larger numbers) indicate more learning. An example would be classification accuracy.

It is more common to use a score that is minimizing, such as loss or error whereby better scores (smaller numbers) indicate more learning and a value of 0.0 indicates that the training dataset was learned perfectly and no mistakes were made.

During the training of a machine learning model, the current state of the model at each step of the training algorithm can be evaluated. It can be evaluated on the training dataset to give an idea of how well the model is “*learning*.” It can also be evaluated on a hold-out validation dataset that is not part of the training dataset. Evaluation on the validation dataset gives an idea of how well the model is “*generalizing*.”

It is common to create dual learning curves for a machine learning model during training on both the training and validation datasets.

The shape and dynamics of a learning curve can be used to diagnose the behavior of a machine learning model, and in turn, perhaps suggest the type of configuration changes that may be made to improve learning and/or performance.

There are three common dynamics that you are likely to observe in learning curves; they are:

- Underfit.
- Overfit.
- Good Fit.

Most commonly, learning curves are used to diagnose overfitting behavior of a model that can be addressed by tuning the hyperparameters of the model.

Overfitting refers to a model that has learned the training dataset too well, including the statistical noise or random fluctuations in the training dataset.

The problem with overfitting is that the more specialized the model becomes to training data, the less well it is able to generalize to new data, resulting in an increase in generalization error. This increase in generalization error can be measured by the performance of the model on the validation dataset.

For more on learning curves, see the tutorial:

Now that we are familiar with learning curves, let’s look at how we might plot learning curves for XGBoost models.

In this section, we will plot the learning curve for an XGBoost model.

First, we need a dataset to use as the basis for fitting and evaluating the model.

We will use a synthetic binary (two-class) classification dataset in this tutorial.

The make_classification() scikit-learn function can be used to create a synthetic classification dataset. In this case, we will use 50 input features (columns) and generate 10,000 samples (rows). The seed for the pseudo-random number generator is fixed to ensure the same base “*problem*” is used each time samples are generated.

The example below generates the synthetic classification dataset and summarizes the shape of the generated data.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example generates the data and reports the size of the input and output components, confirming the expected shape.

(10000, 50) (10000,)

Next, we can fit an XGBoost model on this dataset and plot learning curves.

First, we must split the dataset into one portion that will be used to train the model (train) and another portion that will not be used to train the model, but will be held back and used to evaluate the model each step of the training algorithm (test set or validation set).

... # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)

We can then define an XGBoost classification model with default hyperparameters.

... # define the model model = XGBClassifier()

Next, the model can be fit on the dataset.

In this case, we must specify to the training algorithm that we want it to evaluate the performance of the model on the train and test sets each iteration (e.g. after each new tree is added to the ensemble).

To do this we must specify the datasets to evaluate and the metric to evaluate.

The dataset must be specified as a list of tuples, where each tuple contains the input and output columns of a dataset and each element in the list is a different dataset to evaluate, e.g. the train and the test sets.

... # define the datasets to evaluate each iteration evalset = [(X_train, y_train), (X_test,y_test)]

There are many metrics we may want to evaluate, although given that it is a classification task, we will evaluate the log loss (cross-entropy) of the model which is a minimizing score (lower values are better).

This can be achieved by specifying the “*eval_metric*” argument when calling *fit()* and providing it the name of the metric we will evaluate ‘*logloss*‘. We can also specify the datasets to evaluate via the “*eval_set*” argument. The *fit()* function takes the training dataset as the first two arguments as per normal.

... # fit the model model.fit(X_train, y_train, eval_metric='logloss', eval_set=evalset)

Once the model is fit, we can evaluate its performance as the classification accuracy on the test dataset.

... # evaluate performance yhat = model.predict(X_test) score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score)

We can then retrieve the metrics calculated for each dataset via a call to the *evals_result()* function.

... # retrieve performance metrics results = model.evals_result()

This returns a dictionary organized first by dataset (‘*validation_0*‘ and ‘*validation_1*‘) and then by metric (‘*logloss*‘).

We can create line plots of metrics for each dataset.

... # plot learning curves pyplot.plot(results['validation_0']['logloss'], label='train') pyplot.plot(results['validation_1']['logloss'], label='test') # show the legend pyplot.legend() # show the plot pyplot.show()

And that’s it.

Tying all of this together, the complete example of fitting an XGBoost model on the synthetic classification task and plotting learning curves is listed below.

# plot learning curve of an xgboost model from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from xgboost import XGBClassifier from matplotlib import pyplot # define dataset X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1) # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1) # define the model model = XGBClassifier() # define the datasets to evaluate each iteration evalset = [(X_train, y_train), (X_test,y_test)] # fit the model model.fit(X_train, y_train, eval_metric='logloss', eval_set=evalset) # evaluate performance yhat = model.predict(X_test) score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score) # retrieve performance metrics results = model.evals_result() # plot learning curves pyplot.plot(results['validation_0']['logloss'], label='train') pyplot.plot(results['validation_1']['logloss'], label='test') # show the legend pyplot.legend() # show the plot pyplot.show()

Running the example fits the XGBoost model, retrieves the calculated metrics, and plots learning curves.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

First, the model performance is reported, showing that the model achieved a classification accuracy of about 94.5% on the hold-out test set.

Accuracy: 0.945

The plot shows learning curves for the train and test dataset where the x-axis is the number of iterations of the algorithm (or the number of trees added to the ensemble) and the y-axis is the logloss of the model. Each line shows the logloss per iteration for a given dataset.

From the learning curves, we can see that the performance of the model on the training dataset (blue line) is better or has lower loss than the performance of the model on the test dataset (orange line), as we might generally expect.

Now that we know how to plot learning curves for XGBoost models, let’s look at how we might use the curves to improve model performance.

We can use the learning curves as a diagnostic tool.

The curves can be interpreted and used as the basis for suggesting specific changes to the model configuration that might result in better performance.

The model and result in the previous section can be used as a baseline and starting point.

Looking at the plot, we can see that both curves are sloping down and suggest that more iterations (adding more trees) may result in a further decrease in loss.

Let’s try it out.

We can increase the number of iterations of the algorithm via the “*n_estimators*” hyperparameter that defaults to 100. Let’s increase it to 500.

... # define the model model = XGBClassifier(n_estimators=500)

The complete example is listed below.

# plot learning curve of an xgboost model from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from xgboost import XGBClassifier from matplotlib import pyplot # define dataset X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1) # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1) # define the model model = XGBClassifier(n_estimators=500) # define the datasets to evaluate each iteration evalset = [(X_train, y_train), (X_test,y_test)] # fit the model model.fit(X_train, y_train, eval_metric='logloss', eval_set=evalset) # evaluate performance yhat = model.predict(X_test) score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score) # retrieve performance metrics results = model.evals_result() # plot learning curves pyplot.plot(results['validation_0']['logloss'], label='train') pyplot.plot(results['validation_1']['logloss'], label='test') # show the legend pyplot.legend() # show the plot pyplot.show()

Running the example fits and evaluates the model and plots the learning curves of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that more iterations have resulted in a lift in accuracy from about 94.5% to about 95.8%.

Accuracy: 0.958

We can see from the learning curves that indeed the additional iterations of the algorithm caused the curves to continue to drop and then level out after perhaps 150 iterations, where they remain reasonably flat.

The long flat curves may suggest that the algorithm is learning too fast and we may benefit from slowing it down.

This can be achieved using the learning rate, which limits the contribution of each tree added to the ensemble. This can be controlled via the “*eta*” hyperparameter and defaults to the value of 0.3. We can try a smaller value, such as 0.05.

... # define the model model = XGBClassifier(n_estimators=500, eta=0.05)

The complete example is listed below.

# plot learning curve of an xgboost model from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from xgboost import XGBClassifier from matplotlib import pyplot # define dataset X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1) # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1) # define the model model = XGBClassifier(n_estimators=500, eta=0.05) # define the datasets to evaluate each iteration evalset = [(X_train, y_train), (X_test,y_test)] # fit the model model.fit(X_train, y_train, eval_metric='logloss', eval_set=evalset) # evaluate performance yhat = model.predict(X_test) score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score) # retrieve performance metrics results = model.evals_result() # plot learning curves pyplot.plot(results['validation_0']['logloss'], label='train') pyplot.plot(results['validation_1']['logloss'], label='test') # show the legend pyplot.legend() # show the plot pyplot.show()

Running the example fits and evaluates the model and plots the learning curves of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the smaller learning rate has made the accuracy worse, dropping from about 95.8% to about 95.1%.

Accuracy: 0.951

We can see from the learning curves that indeed learning has slowed right down. The curves suggest that we can continue to add more iterations and perhaps achieve better performance as the curves would have more opportunity to continue to decrease.

Let’s try increasing the number of iterations from 500 to 2,000.

... # define the model model = XGBClassifier(n_estimators=2000, eta=0.05)

The complete example is listed below.

# plot learning curve of an xgboost model from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from xgboost import XGBClassifier from matplotlib import pyplot # define dataset X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1) # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1) # define the model model = XGBClassifier(n_estimators=2000, eta=0.05) # define the datasets to evaluate each iteration evalset = [(X_train, y_train), (X_test,y_test)] # fit the model model.fit(X_train, y_train, eval_metric='logloss', eval_set=evalset) # evaluate performance yhat = model.predict(X_test) score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score) # retrieve performance metrics results = model.evals_result() # plot learning curves pyplot.plot(results['validation_0']['logloss'], label='train') pyplot.plot(results['validation_1']['logloss'], label='test') # show the legend pyplot.legend() # show the plot pyplot.show()

Running the example fits and evaluates the model and plots the learning curves of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that more iterations have given the algorithm more space to improve, achieving an accuracy of 96.1%, the best so far.

Accuracy: 0.961

The learning curves again show a stable convergence of the algorithm with a steep decrease and long flattening out.

We could repeat the process of decreasing the learning rate and increasing the number of iterations to see if further improvements are possible.

Another approach to slowing down learning is to add regularization in the form of reducing the number of samples and features (rows and columns) used to construct each tree in the ensemble.

In this case, we will try halving the number of samples and features respectively via the “*subsample*” and “*colsample_bytree*” hyperparameters.

... # define the model model = XGBClassifier(n_estimators=2000, eta=0.05, subsample=0.5, colsample_bytree=0.5)

The complete example is listed below.

# plot learning curve of an xgboost model from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from xgboost import XGBClassifier from matplotlib import pyplot # define dataset X, y = make_classification(n_samples=10000, n_features=50, n_informative=50, n_redundant=0, random_state=1) # split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1) # define the model model = XGBClassifier(n_estimators=2000, eta=0.05, subsample=0.5, colsample_bytree=0.5) # define the datasets to evaluate each iteration evalset = [(X_train, y_train), (X_test,y_test)] # fit the model model.fit(X_train, y_train, eval_metric='logloss', eval_set=evalset) # evaluate performance yhat = model.predict(X_test) score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score) # retrieve performance metrics results = model.evals_result() # plot learning curves pyplot.plot(results['validation_0']['logloss'], label='train') pyplot.plot(results['validation_1']['logloss'], label='test') # show the legend pyplot.legend() # show the plot pyplot.show()

Running the example fits and evaluates the model and plots the learning curves of model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the addition of regularization has resulted in a further improvement, bumping accuracy from about 96.1% to about 96.6%.

Accuracy: 0.966

The curves suggest that regularization has slowed learning and that perhaps increasing the number of iterations may result in further improvements.

This process can continue, and I am interested to see what you can come up with.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- Extreme Gradient Boosting (XGBoost) Ensemble in Python
- How to use Learning Curves to Diagnose Machine Learning Model Performance
- Avoid Overfitting By Early Stopping With XGBoost In Python

In this tutorial, you discovered how to plot and interpret learning curves for XGBoost models in Python.

Specifically, you learned:

- Learning curves provide a useful diagnostic tool for understanding the training dynamics of supervised learning models like XGBoost.
- How to configure XGBoost to evaluate datasets each iteration and plot the results as learning curves.
- How to interpret and use learning curve plots to improve XGBoost model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Tune XGBoost Performance With Learning Curves appeared first on Machine Learning Mastery.

]]>The post Two-Dimensional (2D) Test Functions for Function Optimization appeared first on Machine Learning Mastery.

]]>There are a large number of optimization algorithms and it is important to study and develop intuitions for optimization algorithms on simple and easy-to-visualize test functions.

**Two-dimensional functions** take two input values (x and y) and output a single evaluation of the input. They are among the simplest types of test functions to use when studying function optimization. The benefit of two-dimensional functions is that they can be visualized as a contour plot or surface plot that shows the topography of the problem domain with the optima and samples of the domain marked with points.

In this tutorial, you will discover standard two-dimensional functions you can use when studying function optimization.

Let’s get started.

A two-dimensional function is a function that takes two input variables and computes the objective value.

We can think of the two input variables as two axes on a graph, x and y. Each input to the function is a single point on the graph and the outcome of the function can be taken as the height on the graph.

This allows the function to be conceptualized as a surface and we can characterize the function based on the structure of the surface. For example, hills for input points that result in large relative outcomes of the objective function and valleys for input points that result in small relative outcomes of the objective function.

A surface may have one major feature or global optima, or it may have many with lots of places for an optimization to get stuck. The surface may be smooth, noisy, convex, and all manner of other properties that we may care about when testing optimization algorithms.

There are many different types of simple two-dimensional test functions we could use.

Nevertheless, there are standard test functions that are commonly used in the field of function optimization. There are also specific properties of test functions that we may wish to select when testing different algorithms.

We will explore a small number of simple two-dimensional test functions in this tutorial and organize them by their properties with two different groups; they are:

- Unimodal Functions
- Unimodal Function 1
- Unimodal Function 2
- Unimodal Function 3

- Multimodal Functions
- Multimodal Function 1
- Multimodal Function 2
- Multimodal Function 3

Each function will be presented using Python code with a function implementation of the target objective function and a sampling of the function that is shown as a surface plot.

All functions are presented as a minimization function, e.g. find the input that results in the minimum (smallest value) output of the function. Any maximizing function can be made a minimization function by adding a negative sign to all output. Similarly, any minimizing function can be made maximizing in the same way.

I did not invent these functions; they are taken from the literature. See the further reading section for references.

You can then choose and copy-paste the code one or more functions to use in your own project to study or compare the behavior of optimization algorithms.

Unimodal means that the function has a single global optima.

A unimodal function may or may not be convex. A convex function is a function where a line can be drawn between any two points in the domain and the line remains in the domain. For a two-dimensional function shown as a contour or surface plot, this means the function has a bowl shape and the line between two remains above or in the bowl.

Let’s look at a few examples of unimodal functions.

The range is bounded to -5.0 and 5.0 and one global optimal at [0.0, 0.0].

# unimodal test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input r_min, r_max = -5.0, 5.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a surface plot of the function.

The range is bounded to -10.0 and 10.0 and one global optimal at [0.0, 0.0].

# unimodal test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return 0.26 * (x**2 + y**2) - 0.48 * x * y # define range for input r_min, r_max = -10.0, 10.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a surface plot of the function.

The range is bounded to -10.0 and 10.0 and one global optimal at [0.0, 0.0]. This function is known as Easom’s function.

# unimodal test function from numpy import cos from numpy import exp from numpy import pi from numpy import arange from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return -cos(x) * cos(y) * exp(-((x - pi)**2 + (y - pi)**2)) # define range for input r_min, r_max = -10, 10 # sample input range uniformly at 0.01 increments xaxis = arange(r_min, r_max, 0.01) yaxis = arange(r_min, r_max, 0.01) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a surface plot of the function.

A multi-modal function means a function with more than one “*mode*” or optima (e.g. valley).

Multimodal functions are non-convex.

There may be one global optima and one or more local or deceptive optima. Alternately, there may be multiple global optima, i.e. multiple different inputs that result in the same minimal output of the function.

Let’s look at a few examples of multimodal functions.

The range is bounded to -5.0 and 5.0 and one global optimal at [0.0, 0.0]. This function is known as Ackley’s function.

# multimodal test function from numpy import arange from numpy import exp from numpy import sqrt from numpy import cos from numpy import e from numpy import pi from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20 # define range for input r_min, r_max = -5.0, 5.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a surface plot of the function.

The range is bounded to -5.0 and 5.0 and the function as four global optima at [3.0, 2.0], [-2.805118, 3.131312], [-3.779310, -3.283186], [3.584428, -1.848126]. This function is known as Himmelblau’s function.

# multimodal test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return (x**2 + y - 11)**2 + (x + y**2 -7)**2 # define range for input r_min, r_max = -5.0, 5.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a surface plot of the function.

The range is bounded to -10.0 and 10.0 and the function as four global optima at [8.05502, 9.66459], [-8.05502, 9.66459], [8.05502, -9.66459], [-8.05502, -9.66459]. This function is known as Holder’s table function.

# multimodal test function from numpy import arange from numpy import exp from numpy import sqrt from numpy import cos from numpy import sin from numpy import e from numpy import pi from numpy import absolute from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return -absolute(sin(x) * cos(y) * exp(absolute(1 - (sqrt(x**2 + y**2)/pi)))) # define range for input r_min, r_max = -10.0, 10.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a surface plot of the function.

This section provides more resources on the topic if you are looking to go deeper.

- Test functions for optimization, Wikipedia.
- Virtual Library of Simulation Experiments: Test Functions and Datasets
- Test Functions Index
- GEA Toolbox – Examples of Objective Functions

In this tutorial, you discovered standard two-dimensional functions you can use when studying function optimization.

**Are you using any of the above functions?**

Let me know which one in the comments below.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Two-Dimensional (2D) Test Functions for Function Optimization appeared first on Machine Learning Mastery.

]]>The post How to Manually Optimize Machine Learning Model Hyperparameters appeared first on Machine Learning Mastery.

]]>Although the impact of hyperparameters may be understood generally, their specific effect on a dataset and their interactions during learning may not be known. Therefore, it is important to tune the values of algorithm hyperparameters as part of a machine learning project.

It is common to use naive optimization algorithms to tune hyperparameters, such as a grid search and a random search. An alternate approach is to use a stochastic optimization algorithm, like a stochastic hill climbing algorithm.

In this tutorial, you will discover how to manually optimize the hyperparameters of machine learning algorithms.

After completing this tutorial, you will know:

- Stochastic optimization algorithms can be used instead of grid and random search for hyperparameter optimization.
- How to use a stochastic hill climbing algorithm to tune the hyperparameters of the Perceptron algorithm.
- How to manually optimize the hyperparameters of the XGBoost gradient boosting algorithm.

Let’s get started.

This tutorial is divided into three parts; they are:

- Manual Hyperparameter Optimization
- Perceptron Hyperparameter Optimization
- XGBoost Hyperparameter Optimization

Machine learning models have hyperparameters that you must set in order to customize the model to your dataset.

Often, the general effects of hyperparameters on a model are known, but how to best set a hyperparameter and combinations of interacting hyperparameters for a given dataset is challenging.

A better approach is to objectively search different values for model hyperparameters and choose a subset that results in a model that achieves the best performance on a given dataset. This is called hyperparameter optimization, or hyperparameter tuning.

A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search.

**Random Search**. Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.**Grid Search**. Define a search space as a grid of hyperparameter values and evaluate every position in the grid.

Grid search is great for spot-checking combinations that are known to perform well generally. Random search is great for discovery and getting hyperparameter combinations that you would not have guessed intuitively, although it often requires more time to execute.

For more on grid and random search for hyperparameter tuning, see the tutorial:

Grid and random search are primitive optimization algorithms, and it is possible to use any optimization we like to tune the performance of a machine learning algorithm. For example, it is possible to use stochastic optimization algorithms. This might be desirable when good or great performance is required and there are sufficient resources available to tune the model.

Next, let’s look at how we might use a stochastic hill climbing algorithm to tune the performance of the Perceptron algorithm.

The Perceptron algorithm is the simplest type of artificial neural network.

It is a model of a single neuron that can be used for two-class classification problems and provides the foundation for later developing much larger networks.

In this section, we will explore how to manually optimize the hyperparameters of the Perceptron model.

First, let’s define a synthetic binary classification problem that we can use as the focus of optimizing the model.

We can use the make_classification() function to define a binary classification problem with 1,000 rows and five input variables.

The example below creates the dataset and summarizes the shape of the data.

# define a binary classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1) # summarize the shape of the dataset print(X.shape, y.shape)

Running the example prints the shape of the created dataset, confirming our expectations.

(1000, 5) (1000,)

The scikit-learn provides an implementation of the Perceptron model via the Perceptron class.

Before we tune the hyperparameters of the model, we can establish a baseline in performance using the default hyperparameters.

We will evaluate the model using good practices of repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class.

The complete example of evaluating the Perceptron model with default hyperparameters on our synthetic binary classification dataset is listed below.

# perceptron default hyperparameters for binary classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import Perceptron # define dataset X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1) # define model model = Perceptron() # define evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report result print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports evaluates the model and reports the mean and standard deviation of the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model with default hyperparameters achieved a classification accuracy of about 78.5 percent.

We would hope that we can achieve better performance than this with optimized hyperparameters.

Mean Accuracy: 0.786 (0.069)

Next, we can optimize the hyperparameters of the Perceptron model using a stochastic hill climbing algorithm.

There are many hyperparameters that we could optimize, although we will focus on two that perhaps have the most impact on the learning behavior of the model; they are:

- Learning Rate (
*eta0*). - Regularization (
*alpha*).

The learning rate controls the amount the model is updated based on prediction errors and controls the speed of learning. The default value of eta is 1.0. reasonable values are larger than zero (e.g. larger than 1e-8 or 1e-10) and probably less than 1.0

By default, the Perceptron does not use any regularization, but we will enable “*elastic net*” regularization which applies both L1 and L2 regularization during learning. This will encourage the model to seek small model weights and, in turn, often better performance.

We will tune the “*alpha*” hyperparameter that controls the weighting of the regularization, e.g. the amount it impacts the learning. If set to 0.0, it is as though no regularization is being used. Reasonable values are between 0.0 and 1.0.

First, we need to define the objective function for the optimization algorithm. We will evaluate a configuration using mean classification accuracy with repeated stratified k-fold cross-validation. We will seek to maximize accuracy in the configurations.

The *objective()* function below implements this, taking the dataset and a list of config values. The config values (learning rate and regularization weighting) are unpacked, used to configure the model, which is then evaluated, and the mean accuracy is returned.

# objective function def objective(X, y, cfg): # unpack config eta, alpha = cfg # define model model = Perceptron(penalty='elasticnet', alpha=alpha, eta0=eta) # define evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # calculate mean accuracy result = mean(scores) return result

Next, we need a function to take a step in the search space.

The search space is defined by two variables (*eta* and *alpha*). A step in the search space must have some relationship to the previous values and must be bound to sensible values (e.g. between 0 and 1).

We will use a “*step size*” hyperparameter that controls how far the algorithm is allowed to move from the existing configuration. A new configuration will be chosen probabilistically using a Gaussian distribution with the current value as the mean of the distribution and the step size as the standard deviation of the distribution.

We can use the randn() NumPy function to generate random numbers with a Gaussian distribution.

The *step()* function below implements this and will take a step in the search space and generate a new configuration using an existing configuration.

# take a step in the search space def step(cfg, step_size): # unpack the configuration eta, alpha = cfg # step eta new_eta = eta + randn() * step_size # check the bounds of eta if new_eta <= 0.0: new_eta = 1e-8 # step alpha new_alpha = alpha + randn() * step_size # check the bounds of alpha if new_alpha < 0.0: new_alpha = 0.0 # return the new configuration return [new_eta, new_alpha]

Next, we need to implement the stochastic hill climbing algorithm that will call our *objective()* function to evaluate candidate solutions and our *step()* function to take a step in the search space.

The search first generates a random initial solution, in this case with eta and alpha values in the range 0 and 1. The initial solution is then evaluated and is taken as the current best working solution.

... # starting point for the search solution = [rand(), rand()] # evaluate the initial point solution_eval = objective(X, y, solution)

Next, the algorithm iterates for a fixed number of iterations provided as a hyperparameter to the search. Each iteration involves taking a step and evaluating the new candidate solution.

... # take a step candidate = step(solution, step_size) # evaluate candidate point candidate_eval = objective(X, y, candidate)

If the new solution is better than the current working solution, it is taken as the new current working solution.

... # check if we should keep the new point if candidate_eval >= solution_eval: # store the new point solution, solution_eval = candidate, candidate_eval # report progress print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))

At the end of the search, the best solution and its performance are then returned.

Tying this together, the *hillclimbing()* function below implements the stochastic hill climbing algorithm for tuning the Perceptron algorithm, taking the dataset, objective function, number of iterations, and step size as arguments.

# hill climbing local search algorithm def hillclimbing(X, y, objective, n_iter, step_size): # starting point for the search solution = [rand(), rand()] # evaluate the initial point solution_eval = objective(X, y, solution) # run the hill climb for i in range(n_iter): # take a step candidate = step(solution, step_size) # evaluate candidate point candidate_eval = objective(X, y, candidate) # check if we should keep the new point if candidate_eval >= solution_eval: # store the new point solution, solution_eval = candidate, candidate_eval # report progress print('>%d, cfg=%s %.5f' % (i, solution, solution_eval)) return [solution, solution_eval]

We can then call the algorithm and report the results of the search.

In this case, we will run the algorithm for 100 iterations and use a step size of 0.1, chosen after a little trial and error.

... # define the total iterations n_iter = 100 # step size in the search space step_size = 0.1 # perform the hill climbing search cfg, score = hillclimbing(X, y, objective, n_iter, step_size) print('Done!') print('cfg=%s: Mean Accuracy: %f' % (cfg, score))

Tying this together, the complete example of manually tuning the Perceptron algorithm is listed below.

# manually search perceptron hyperparameters for binary classification from numpy import mean from numpy.random import randn from numpy.random import rand from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import Perceptron # objective function def objective(X, y, cfg): # unpack config eta, alpha = cfg # define model model = Perceptron(penalty='elasticnet', alpha=alpha, eta0=eta) # define evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # calculate mean accuracy result = mean(scores) return result # take a step in the search space def step(cfg, step_size): # unpack the configuration eta, alpha = cfg # step eta new_eta = eta + randn() * step_size # check the bounds of eta if new_eta <= 0.0: new_eta = 1e-8 # step alpha new_alpha = alpha + randn() * step_size # check the bounds of alpha if new_alpha < 0.0: new_alpha = 0.0 # return the new configuration return [new_eta, new_alpha] # hill climbing local search algorithm def hillclimbing(X, y, objective, n_iter, step_size): # starting point for the search solution = [rand(), rand()] # evaluate the initial point solution_eval = objective(X, y, solution) # run the hill climb for i in range(n_iter): # take a step candidate = step(solution, step_size) # evaluate candidate point candidate_eval = objective(X, y, candidate) # check if we should keep the new point if candidate_eval >= solution_eval: # store the new point solution, solution_eval = candidate, candidate_eval # report progress print('>%d, cfg=%s %.5f' % (i, solution, solution_eval)) return [solution, solution_eval] # define dataset X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1) # define the total iterations n_iter = 100 # step size in the search space step_size = 0.1 # perform the hill climbing search cfg, score = hillclimbing(X, y, objective, n_iter, step_size) print('Done!') print('cfg=%s: Mean Accuracy: %f' % (cfg, score))

Running the example reports the configuration and result each time an improvement is seen during the search. At the end of the run, the best configuration and result are reported.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best result involved using a learning rate slightly above 1 at 1.004 and a regularization weight of about 0.002 achieving a mean accuracy of about 79.1 percent, better than the default configuration that achieved an accuracy of about 78.5 percent.

**Can you get a better result?**

Let me know in the comments below.

>0, cfg=[0.5827274503894747, 0.260872709578015] 0.70533 >4, cfg=[0.5449820307807399, 0.3017271170801444] 0.70567 >6, cfg=[0.6286475606495414, 0.17499090243915086] 0.71933 >7, cfg=[0.5956196828965779, 0.0] 0.78633 >8, cfg=[0.5878361167354715, 0.0] 0.78633 >10, cfg=[0.6353507984485595, 0.0] 0.78633 >13, cfg=[0.5690530537610675, 0.0] 0.78633 >17, cfg=[0.6650936023999641, 0.0] 0.78633 >22, cfg=[0.9070451625704087, 0.0] 0.78633 >23, cfg=[0.9253366187387938, 0.0] 0.78633 >26, cfg=[0.9966143540220266, 0.0] 0.78633 >31, cfg=[1.0048613895650054, 0.002162219228449132] 0.79133 Done! cfg=[1.0048613895650054, 0.002162219228449132]: Mean Accuracy: 0.791333

Now that we are familiar with how to use a stochastic hill climbing algorithm to tune the hyperparameters of a simple machine learning algorithm, let’s look at tuning a more advanced algorithm, such as XGBoost.

XGBoost is short for Extreme Gradient Boosting and is an efficient implementation of the stochastic gradient boosting machine learning algorithm.

The stochastic gradient boosting algorithm, also called gradient boosting machines or tree boosting, is a powerful machine learning technique that performs well or even best on a wide range of challenging machine learning problems.

First, the XGBoost library must be installed.

You can install it using pip, as follows:

sudo pip install xgboost

Once installed, you can confirm that it was installed successfully and that you are using a modern version by running the following code:

# xgboost import xgboost print("xgboost", xgboost.__version__)

Running the code, you should see the following version number or higher.

xgboost 1.0.1

Although the XGBoost library has its own Python API, we can use XGBoost models with the scikit-learn API via the XGBClassifier wrapper class.

An instance of the model can be instantiated and used just like any other scikit-learn class for model evaluation. For example:

... # define model model = XGBClassifier()

Before we tune the hyperparameters of XGBoost, we can establish a baseline in performance using the default hyperparameters.

We will use the same synthetic binary classification dataset from the previous section and the same test harness of repeated stratified k-fold cross-validation.

The complete example of evaluating the performance of XGBoost with default hyperparameters is listed below.

# xgboost with default hyperparameters for binary classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from xgboost import XGBClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1) # define model model = XGBClassifier() # define evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report result print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model and reports the mean and standard deviation of the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model with default hyperparameters achieved a classification accuracy of about 84.9 percent.

We would hope that we can achieve better performance than this with optimized hyperparameters.

Mean Accuracy: 0.849 (0.040)

Next, we can adapt the stochastic hill climbing optimization algorithm to tune the hyperparameters of the XGBoost model.

There are many hyperparameters that we may want to optimize for the XGBoost model.

For an overview of how to tune the XGBoost model, see the tutorial:

We will focus on four key hyperparameters; they are:

- Learning Rate (
*learning_rate*) - Number of Trees (
*n_estimators*) - Subsample Percentage (
*subsample*) - Tree Depth (
*max_depth*)

The **learning rate** controls the contribution of each tree to the ensemble. Sensible values are less than 1.0 and slightly above 0.0 (e.g. 1e-8).

The **number of trees** controls the size of the ensemble, and often, more trees is better to a point of diminishing returns. Sensible values are between 1 tree and hundreds or thousands of trees.

The **subsample** percentages define the random sample size used to train each tree, defined as a percentage of the size of the original dataset. Values are between a value slightly above 0.0 (e.g. 1e-8) and 1.0

The **tree depth** is the number of levels in each tree. Deeper trees are more specific to the training dataset and perhaps overfit. Shorter trees often generalize better. Sensible values are between 1 and 10 or 20.

First, we must update the *objective()* function to unpack the hyperparameters of the XGBoost model, configure it, and then evaluate the mean classification accuracy.

# objective function def objective(X, y, cfg): # unpack config lrate, n_tree, subsam, depth = cfg # define model model = XGBClassifier(learning_rate=lrate, n_estimators=n_tree, subsample=subsam, max_depth=depth) # define evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # calculate mean accuracy result = mean(scores) return result

Next, we need to define the *step()* function used to take a step in the search space.

Each hyperparameter is quite a different range, therefore, we will define the step size (standard deviation of the distribution) separately for each hyperparameter. We will also define the step sizes in line rather than as arguments to the function, to keep things simple.

The number of trees and the depth are integers, so the stepped values are rounded.

The step sizes chosen are arbitrary, chosen after a little trial and error.

The updated step function is listed below.

# take a step in the search space def step(cfg): # unpack config lrate, n_tree, subsam, depth = cfg # learning rate lrate = lrate + randn() * 0.01 if lrate <= 0.0: lrate = 1e-8 if lrate > 1: lrate = 1.0 # number of trees n_tree = round(n_tree + randn() * 50) if n_tree <= 0.0: n_tree = 1 # subsample percentage subsam = subsam + randn() * 0.1 if subsam <= 0.0: subsam = 1e-8 if subsam > 1: subsam = 1.0 # max tree depth depth = round(depth + randn() * 7) if depth <= 1: depth = 1 # return new config return [lrate, n_tree, subsam, depth]

Finally, the *hillclimbing()* algorithm must be updated to define an initial solution with appropriate values.

In this case, we will define the initial solution with sensible defaults, matching the default hyperparameters, or close to them.

... # starting point for the search solution = step([0.1, 100, 1.0, 7])

Tying this together, the complete example of manually tuning the hyperparameters of the XGBoost algorithm using a stochastic hill climbing algorithm is listed below.

# xgboost manual hyperparameter optimization for binary classification from numpy import mean from numpy.random import randn from numpy.random import rand from numpy.random import randint from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from xgboost import XGBClassifier # objective function def objective(X, y, cfg): # unpack config lrate, n_tree, subsam, depth = cfg # define model model = XGBClassifier(learning_rate=lrate, n_estimators=n_tree, subsample=subsam, max_depth=depth) # define evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # calculate mean accuracy result = mean(scores) return result # take a step in the search space def step(cfg): # unpack config lrate, n_tree, subsam, depth = cfg # learning rate lrate = lrate + randn() * 0.01 if lrate <= 0.0: lrate = 1e-8 if lrate > 1: lrate = 1.0 # number of trees n_tree = round(n_tree + randn() * 50) if n_tree <= 0.0: n_tree = 1 # subsample percentage subsam = subsam + randn() * 0.1 if subsam <= 0.0: subsam = 1e-8 if subsam > 1: subsam = 1.0 # max tree depth depth = round(depth + randn() * 7) if depth <= 1: depth = 1 # return new config return [lrate, n_tree, subsam, depth] # hill climbing local search algorithm def hillclimbing(X, y, objective, n_iter): # starting point for the search solution = step([0.1, 100, 1.0, 7]) # evaluate the initial point solution_eval = objective(X, y, solution) # run the hill climb for i in range(n_iter): # take a step candidate = step(solution) # evaluate candidate point candidate_eval = objective(X, y, candidate) # check if we should keep the new point if candidate_eval >= solution_eval: # store the new point solution, solution_eval = candidate, candidate_eval # report progress print('>%d, cfg=[%s] %.5f' % (i, solution, solution_eval)) return [solution, solution_eval] # define dataset X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1) # define the total iterations n_iter = 200 # perform the hill climbing search cfg, score = hillclimbing(X, y, objective, n_iter) print('Done!') print('cfg=[%s]: Mean Accuracy: %f' % (cfg, score))

Running the example reports the configuration and result each time an improvement is seen during the search. At the end of the run, the best configuration and result are reported.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best result involved using a learning rate of about 0.02, 52 trees, a subsample rate of about 50 percent, and a large depth of 53 levels.

This configuration resulted in a mean accuracy of about 87.3 percent, better than the default configuration that achieved an accuracy of about 84.9 percent.

**Can you get a better result?**

Let me know in the comments below.

>0, cfg=[[0.1058242692126418, 67, 0.9228490731610172, 12]] 0.85933 >1, cfg=[[0.11060813799692253, 51, 0.859353656735739, 13]] 0.86100 >4, cfg=[[0.11890247679234153, 58, 0.7135275461723894, 12]] 0.86167 >5, cfg=[[0.10226257987735601, 61, 0.6086462443373852, 17]] 0.86400 >15, cfg=[[0.11176962034280596, 106, 0.5592742266405146, 13]] 0.86500 >19, cfg=[[0.09493587069112454, 153, 0.5049124222437619, 34]] 0.86533 >23, cfg=[[0.08516531024154426, 88, 0.5895201311518876, 31]] 0.86733 >46, cfg=[[0.10092590898175327, 32, 0.5982811365027455, 30]] 0.86867 >75, cfg=[[0.099469211050998, 20, 0.36372573610040404, 32]] 0.86900 >96, cfg=[[0.09021536590375884, 38, 0.4725379807796971, 20]] 0.86900 >100, cfg=[[0.08979482274655906, 65, 0.3697395430835758, 14]] 0.87000 >110, cfg=[[0.06792737273465625, 89, 0.33827505722318224, 17]] 0.87000 >118, cfg=[[0.05544969684589669, 72, 0.2989721608535262, 23]] 0.87200 >122, cfg=[[0.050102976159097, 128, 0.2043203965148931, 24]] 0.87200 >123, cfg=[[0.031493266763680444, 120, 0.2998819062922256, 30]] 0.87333 >128, cfg=[[0.023324201169625292, 84, 0.4017169945431015, 42]] 0.87333 >140, cfg=[[0.020224220443108752, 52, 0.5088096815056933, 53]] 0.87367 Done! cfg=[[0.020224220443108752, 52, 0.5088096815056933, 53]]: Mean Accuracy: 0.873667

This section provides more resources on the topic if you are looking to go deeper.

- Hyperparameter Optimization With Random Search and Grid Search
- How to Configure the Gradient Boosting Algorithm
- How To Implement The Perceptron Algorithm From Scratch In Python

- sklearn.datasets.make_classification APIs.
- sklearn.metrics.accuracy_score APIs.
- numpy.random.rand API.
- sklearn.linear_model.Perceptron API.

In this tutorial, you discovered how to manually optimize the hyperparameters of machine learning algorithms.

Specifically, you learned:

- Stochastic optimization algorithms can be used instead of grid and random search for hyperparameter optimization.
- How to use a stochastic hill climbing algorithm to tune the hyperparameters of the Perceptron algorithm.
- How to manually optimize the hyperparameters of the XGBoost gradient boosting algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Manually Optimize Machine Learning Model Hyperparameters appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to XGBoost Loss Functions appeared first on Machine Learning Mastery.

]]>An important aspect in configuring XGBoost models is the choice of loss function that is minimized during the training of the model.

The **loss function** must be matched to the predictive modeling problem type, in the same way we must choose appropriate loss functions based on problem types with deep learning neural networks.

In this tutorial, you will discover how to configure loss functions for XGBoost ensemble models.

After completing this tutorial, you will know:

- Specifying loss functions used when training XGBoost ensembles is a critical step, much like neural networks.
- How to configure XGBoost loss functions for binary and multi-class classification tasks.
- How to configure XGBoost loss functions for regression predictive modeling tasks.

Let’s get started.

This tutorial is divided into three parts; they are:

- XGBoost and Loss Functions
- XGBoost Loss for Classification
- XGBoost Loss for Regression

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

XGBoost supports a range of different predictive modeling problems, most notably classification and regression.

XGBoost is trained by minimizing loss of an objective function against a dataset. As such, the choice of loss function is a critical hyperparameter and tied directly to the type of problem being solved, much like deep learning neural networks.

The implementation allows the objective function to be specified via the “*objective*” hyperparameter, and sensible defaults are used that work for most cases.

Nevertheless, there remains some confusion by beginners as to what loss function to use when training XGBoost models.

We will take a closer look at how to configure the loss function for XGBoost in this tutorial.

Before we get started, let’s get setup.

XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.

The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:

sudo pip install xgboost

You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.

# check xgboost version import xgboost print(xgboost.__version__)

Running the script will print your version of the XGBoost library you have installed.

Your version should be the same or higher. If not, you must upgrade your version of the XGBoost library.

1.1.1

It is possible that you may have problems with the latest version of the library. It is not your fault.

Sometimes, the most recent version of the library imposes additional requirements or may be less stable.

If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install to the pip command, as follows:

sudo pip install xgboost==1.0.1

If you see a warning message, you can safely ignore it for now. For example, below is an example of a warning message that you may see and can ignore:

FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.

If you require specific instructions for your development environment, see the tutorial:

The XGBoost library has its own custom API, although we will use the method via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

Both models operate the same way and take the same arguments that influence how the decision trees are created and added to the ensemble.

For more on how to use the XGBoost API with scikit-learn, see the tutorial:

Next, let’s take a closer look at how to configure the loss function for XGBoost on classification problems.

Classification tasks involve predicting a label or probability for each possible class, given an input sample.

There are two main types of classification tasks with mutually exclusive labels: binary classification that has two class labels, and multi-class classification that have more than two class labels.

**Binary Classification**: Classification task with two class labels.**Multi-Class Classification**: Classification task with more than two class labels.

For more on the different types of classification tasks, see the tutorial:

XGBoost provides loss functions for each of these problem types.

It is typical in machine learning to train a model to predict the probability of class membership for probability tasks and if the task requires crisp class labels to post-process the predicted probabilities (e.g. use argmax).

This approach is used when training deep learning neural networks for classification, and is also recommended when using XGBoost for classification.

The loss function used for predicting probabilities for binary classification problems is “*binary:logistic*” and the loss function for predicting class probabilities for multi-class problems is “*multi:softprob*“.

- “
*multi:logistic*“: XGBoost loss function for binary classification. - “
*multi:softprob*“: XGBoost loss function for multi-class classification.

These string values can be specified via the “*objective*” hyperparameter when configuring your XGBClassifier model.

For example, for binary classification:

... # define the model for binary classification model = XGBClassifier(objective='binary:logistic')

And, for multi-class classification:

... # define the model for multi-class classification model = XGBClassifier(objective='binary:softprob')

Importantly, if you do not specify the “*objective*” hyperparameter, the *XGBClassifier* will automatically choose one of these loss functions based on the data provided during training.

We can make this concrete with a worked example.

The example below creates a synthetic binary classification dataset, fits an *XGBClassifier* on the dataset with default hyperparameters, then prints the model objective configuration.

# example of automatically choosing the loss function for binary classification from sklearn.datasets import make_classification from xgboost import XGBClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) # define the model model = XGBClassifier() # fit the model model.fit(X, y) # summarize the model loss function print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model automatically choose a loss function for binary classification.

binary:logistic

Alternately, we can specify the objective and fit the model, confirming the loss function was used.

# example of manually specifying the loss function for binary classification from sklearn.datasets import make_classification from xgboost import XGBClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) # define the model model = XGBClassifier(objective='binary:logistic') # fit the model model.fit(X, y) # summarize the model loss function print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model used to specify a loss function for binary classification.

binary:logistic

Let’s repeat this example on a dataset with more than two classes. In this case, three classes.

The complete example is listed below.

# example of automatically choosing the loss function for multi-class classification from sklearn.datasets import make_classification from xgboost import XGBClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3) # define the model model = XGBClassifier() # fit the model model.fit(X, y) # summarize the model loss function print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model automatically chose a loss function for multi-class classification.

multi:softprob

Alternately, we can manually specify the loss function and confirm it was used to train the model.

# example of manually specifying the loss function for multi-class classification from sklearn.datasets import make_classification from xgboost import XGBClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3) # define the model model = XGBClassifier(objective="multi:softprob") # fit the model model.fit(X, y) # summarize the model loss function print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model used to specify a loss function for multi-class classification.

multi:softprob

Finally, there are other loss functions you can use for classification, including: “*binary:logitraw*” and “*binary:hinge*” for binary classification and “*multi:softmax*” for multi-class classification.

You can see a full list here:

Next, let’s take a look at XGBoost loss functions for regression.

Regression refers to predictive modeling problems where a numerical value is predicted given an input sample.

Although predicting a probability sounds like a regression problem (i.e. a probability is a numerical value), it is generally not considered a regression type predictive modeling problem.

The XGBoost objective function used when predicting numerical values is the “*reg:squarederror*” loss function.

*“reg:squarederror”*: Loss function for regression predictive modeling problems.

This string value can be specified via the “*objective*” hyperparameter when configuring your *XGBRegressor* model.

For example:

... # define the model for regression model = XGBRegressor(objective='reg:squarederror')

Importantly, if you do not specify the “*objective*” hyperparameter, the *XGBRegressor* will automatically choose this objective function for you.

We can make this concrete with a worked example.

The example below creates a synthetic regression dataset, fits an *XGBRegressor* on the dataset, then prints the model objective configuration.

# example of automatically choosing the loss function for regression from sklearn.datasets import make_regression from xgboost import XGBRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7) # define the model model = XGBRegressor() # fit the model model.fit(X, y) # summarize the model loss function print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model automatically choose a loss function for regression.

reg:squarederror

Alternately, we can specify the objective and fit the model, confirming the loss function was used.

# example of manually specifying the loss function for regression from sklearn.datasets import make_regression from xgboost import XGBRegressor # define dataset X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7) # define the model model = XGBRegressor(objective='reg:squarederror') # fit the model model.fit(X, y) # summarize the model loss function print(model.objective)

Running the example fits the model on the dataset and prints the loss function configuration.

We can see the model used the specified a loss function for regression.

reg:squarederror

Finally, there are other loss functions you can use for regression, including: “*reg:squaredlogerror*“, “*reg:logistic*“, “*reg:pseudohubererror*“, “*reg:gamma*“, and “*reg:tweedie*“.

You can see a full list here:

This section provides more resources on the topic if you are looking to go deeper.

- Extreme Gradient Boosting (XGBoost) Ensemble in Python
- Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost
- 4 Types of Classification Tasks in Machine Learning

In this tutorial, you discovered how to configure loss functions for XGBoost ensemble models.

Specifically, you learned:

- Specifying loss functions used when training XGBoost ensembles is a critical step much like neural networks.
- How to configure XGBoost loss functions for binary and multi-class classification tasks.
- How to configure XGBoost loss functions for regression predictive modeling tasks.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to XGBoost Loss Functions appeared first on Machine Learning Mastery.

]]>The post Gradient Descent Optimization With Nadam From Scratch appeared first on Machine Learning Mastery.

]]>A limitation of gradient descent is that the progress of the search can slow down if the gradient becomes flat or large curvature. Momentum can be added to gradient descent that incorporates some inertia to updates. This can be further improved by incorporating the gradient of the projected new position rather than the current position, called Nesterov’s Accelerated Gradient (NAG) or Nesterov momentum.

Another limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like the Adaptive Movement Estimation (Adam) algorithm that uses a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

**Nesterov-accelerated Adaptive Moment Estimation**, or the **Nadam**, is an extension of the Adam algorithm that incorporates Nesterov momentum and can result in better performance of the optimization algorithm.

In this tutorial, you will discover how to develop the gradient descent optimization with Nadam from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Nadam is an extension of the Adam version of gradient descent that incorporates Nesterov momentum.
- How to implement the Nadam optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gradient Descent
- Nadam Optimization Algorithm
- Gradient Descent With Nadam
- Two-Dimensional Test Problem
- Gradient Descent Optimization With Nadam
- Visualization of Nadam Optimization

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the steps size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t) = x(t-1) – step_size * f'(x(t))

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**: Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the Nadam algorithm.

The Nesterov-accelerated Adaptive Moment Estimation, or the **Nadam**, algorithm is an extension to the Adaptive Movement Estimation (Adam) optimization algorithm to add Nesterov’s Accelerated Gradient (NAG) or Nesterov momentum, which is an improved type of momentum.

More broadly, the Nadam algorithm is an extension to the Gradient Descent Optimization algorithm.

The algorithm was described in the 2016 paper by Timothy Dozat titled “Incorporating Nesterov Momentum into Adam.” Although a version of the paper was written up in 2015 as a Stanford project report with the same name.

Momentum adds an exponentially decaying moving average (first moment) of the gradient to the gradient descent algorithm. This has the impact of smoothing out noisy objective functions and improving convergence.

Adam is an extension of gradient descent that adds a first and second moment of the gradient and automatically adapts a learning rate for each parameter that is being optimized. NAG is an extension to momentum where the update is performed using the gradient of the projected update to the parameter rather than the actual current variable value. This has the effect of slowing down the search when the optima is located rather than overshooting, in some situations.

Nadam is an extension to Adam that uses NAG momentum instead of classical momentum.

We show how to modify Adam’s momentum component to take advantage of insights from NAG, and then we present preliminary evidence suggesting that making this substitution improves the speed of convergence and the quality of the learned models.

— Incorporating Nesterov Momentum into Adam, 2016.

Let’s step through each element of the algorithm.

Nadam uses a decaying step size (*alpha*) and first moment (*mu*) hyperparameters that can improve performance. For the case of simplicity, we will ignore this aspect for now and assume constant values.

First, we must maintain the first and second moments of the gradient for each parameter being optimized as part of the search, referred to as *m* and *n* respectively. They are initialized to 0.0 at the start of the search.

- m = 0
- n = 0

The algorithm is executed iteratively over time t starting at *t=1*, and each iteration involves calculating a new set of parameter values *x*, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradient (partial derivatives) are calculated for the current time step.

- g(t) = f'(x(t-1))

Next, the first moment is updated using the gradient and a hyperparameter “*mu*“.

- m(t) = mu * m(t-1) + (1 – mu) * g(t)

Then the second moment is updated using the “*nu*” hyperparameter.

- n(t) = nu * n(t-1) + (1 – nu) * g(t)^2

Next, the first moment is bias-corrected using the Nesterov momentum.

- mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu))

The second moment is then bias-corrected.

Note: bias-correction is an aspect of Adam and counters the fact that the first and second moments are initialized to zero at the start of the search.

- nhat = nu * n(t) / (1 – nu)

Finally, we can calculate the value for the parameter for this iteration.

- x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhat

Where alpha is the step size (learning rate) hyperparameter, *sqrt()* is the square root function, and *eps* (*epsilon*) is a small value like 1e-8 added to avoid a divide by zero error.

To review, there are three hyperparameters for the algorithm; they are:

**alpha**: Initial step size (learning rate), a typical value is 0.002.**mu**: Decay factor for first moment (*beta1*in Adam), a typical value is 0.975.**nu**: Decay factor for second moment (*beta2*in Adam), a typical value is 0.999.

And that’s it.

Next, let’s look at how we might implement the algorithm from scratch in Python.

In this section, we will explore how to implement the gradient descent optimization algorithm with Nadam Momentum.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this function

# objective function def objective(x, y): return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function from numpy import asarray from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # show the plot pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the Nadam optimization algorithm.

We can apply the gradient descent with Nadam to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of *x^2* is *x * 2* in each dimension.

- f(x) = x^2
- f'(x) = x * 2

The derivative() function implements this below.

# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent optimization with Nadam.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

... # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) score = objective(x[0], x[1])

Next, we need to initialize the moment vectors.

... # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm defined by the “*n_iter*” hyperparameter.

... # run iterations of gradient descent for t in range(n_iter): ...

The first step is to calculate the derivative for the current set of parameters.

... # calculate gradient g(t) g = derivative(x[0], x[1])

Next, we need to perform the Nadam update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

... # build a solution one variable at a time for i in range(x.shape[0]): ...

First, we need to calculate the moment vector.

... # m(t) = mu * m(t-1) + (1 - mu) * g(t) m[i] = mu * m[i] + (1.0 - mu) * g[i]

Then the second moment vector.

... # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2 n[i] = nu * n[i] + (1.0 - nu) * g[i]**2

Then the bias-corrected Nesterov momentum.

... # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu)) mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))

The bias-correct second moment.

... # nhat = nu * n(t) / (1 - nu) nhat = nu * n[i] / (1.0 - nu)

And finally updating the parameter.

... # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat

This is then repeated for each parameter that is being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

... # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function named nadam() that takes the names of the objective and derivative functions, as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

# gradient descent algorithm with nadam def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) score = objective(x[0], x[1]) # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(bounds.shape[0]): # m(t) = mu * m(t-1) + (1 - mu) * g(t) m[i] = mu * m[i] + (1.0 - mu) * g[i] # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2 n[i] = nu * n[i] + (1.0 - nu) * g[i]**2 # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu)) mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu)) # nhat = nu * n(t) / (1 - nu) nhat = nu * n[i] / (1.0 - nu) # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return [x, score]

We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 50 iterations with an initial alpha of 0.02, mu of 0.8 and a nu of 0.999, found after a little trial and error.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # steps size alpha = 0.02 # factor for average gradient mu = 0.8 # factor for average squared gradient nu = 0.999 # perform the gradient descent search with nadam best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)

At the end of the run, we will report the best solution found.

... # summarize the result print('Done!') print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of Nadam gradient descent applied to our test problem is listed below.

# gradient descent optimization with nadam for a two-dimensional test function from math import sqrt from numpy import asarray from numpy.random import rand from numpy.random import seed # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with nadam def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) score = objective(x[0], x[1]) # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(bounds.shape[0]): # m(t) = mu * m(t-1) + (1 - mu) * g(t) m[i] = mu * m[i] + (1.0 - mu) * g[i] # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2 n[i] = nu * n[i] + (1.0 - nu) * g[i]**2 # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu)) mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu)) # nhat = nu * n(t) / (1 - nu) nhat = nu * n[i] / (1.0 - nu) # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat # evaluate candidate point score = objective(x[0], x[1]) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return [x, score] # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # steps size alpha = 0.02 # factor for average gradient mu = 0.8 # factor for average squared gradient nu = 0.999 # perform the gradient descent search with nadam best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu) print('Done!') print('f(%s) = %f' % (best, score))

Running the example applies the optimization algorithm with Nadam to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 44 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

... >40 f([ 5.07445337e-05 -3.32910019e-03]) = 0.00001 >41 f([-1.84325171e-05 -3.00939427e-03]) = 0.00001 >42 f([-6.78814472e-05 -2.69839367e-03]) = 0.00001 >43 f([-9.88339249e-05 -2.40042096e-03]) = 0.00001 >44 f([-0.00011368 -0.00211861]) = 0.00000 >45 f([-0.00011547 -0.00185511]) = 0.00000 >46 f([-0.0001075 -0.00161122]) = 0.00000 >47 f([-9.29922627e-05 -1.38760991e-03]) = 0.00000 >48 f([-7.48258406e-05 -1.18436586e-03]) = 0.00000 >49 f([-5.54299505e-05 -1.00116899e-03]) = 0.00000 Done! f([-5.54299505e-05 -1.00116899e-03]) = 0.000001

We can plot the progress of the Nadam search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the nadam() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with nadam def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) score = objective(x[0], x[1]) # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(bounds.shape[0]): # m(t) = mu * m(t-1) + (1 - mu) * g(t) m[i] = mu * m[i] + (1.0 - mu) * g[i] # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2 n[i] = nu * n[i] + (1.0 - nu) * g[i]**2 # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu)) mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu)) # nhat = nu * n(t) / (1 - nu) nhat = nu * n[i] / (1.0 - nu) # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat # evaluate candidate point score = objective(x[0], x[1]) # store solution solutions.append(x.copy()) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # steps size alpha = 0.02 # factor for average gradient mu = 0.8 # factor for average squared gradient nu = 0.999 # perform the gradient descent search with nadam solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)

We can then create a contour plot of the objective function, as before.

... # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

... # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the Nadam optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the nadam search on a contour plot of the test function from math import sqrt from numpy import asarray from numpy import arange from numpy import product from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with nadam def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) score = objective(x[0], x[1]) # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(bounds.shape[0]): # m(t) = mu * m(t-1) + (1 - mu) * g(t) m[i] = mu * m[i] + (1.0 - mu) * g[i] # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2 n[i] = nu * n[i] + (1.0 - nu) * g[i]**2 # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu)) mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu)) # nhat = nu * n(t) / (1 - nu) nhat = nu * n[i] / (1.0 - nu) # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat # evaluate candidate point score = objective(x[0], x[1]) # store solution solutions.append(x.copy()) # report progress print('>%d f(%s) = %.5f' % (t, x, score)) return solutions # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # steps size alpha = 0.02 # factor for average gradient mu = 0.8 # factor for average squared gradient nu = 0.999 # perform the gradient descent search with nadam solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w') # show the plot pyplot.show()

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- Incorporating Nesterov Momentum into Adam, 2016.
- Incorporating Nesterov Momentum into Adam, Stanford Report, 2015.
- A method for solving the convex programming problem with convergence rate O (1/k^2), 1983.
- Adam: A Method for Stochastic Optimization, 2014.
- An Overview Of Gradient Descent Optimization Algorithms, 2016.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- Optimization, Timothy Dozat, GitHub.

In this tutorial, you discovered how to develop the gradient descent optimization with Nadam from scratch.

Specifically, you learned:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Nadam is an extension of the Adam version of gradient descent that incorporates Nesterov momentum.
- How to implement the Nadam optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent Optimization With Nadam From Scratch appeared first on Machine Learning Mastery.

]]>The post Gradient Descent With Nesterov Momentum From Scratch appeared first on Machine Learning Mastery.

]]>A limitation of gradient descent is that it can get stuck in flat areas or bounce around if the objective function returns noisy gradients. Momentum is an approach that accelerates the progress of the search to skim across flat areas and smooth out bouncy gradients.

In some cases, the acceleration of momentum can cause the search to miss or overshoot the minima at the bottom of basins or valleys. **Nesterov momentum** is an extension of momentum that involves calculating the decaying moving average of the gradients of projected positions in the search space rather than the actual positions themselves.

This has the effect of harnessing the accelerating benefits of momentum whilst allowing the search to slow down when approaching the optima and reduce the likelihood of missing or overshooting it.

In this tutorial, you will discover how to develop the Gradient Descent optimization algorithm with Nesterov Momentum from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- The convergence of gradient descent optimization algorithm can be accelerated by extending the algorithm and adding Nesterov Momentum.
- How to implement the Nesterov Momentum optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gradient Descent
- Nesterov Momentum
- Gradient Descent With Nesterov Momentum
- Two-Dimensional Test Problem
- Gradient Descent Optimization With Nesterov Momentum
- Visualization of Nesterov Momentum

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the “gradient.”

**Gradient**: First order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the steps size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t+1) = x(t) – step_size * f'(x(t))

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size (**: Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.*alpha*)

If the step size is too small, the movement in the search space will be small, and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the Nesterov momentum.

Nesterov Momentum is an extension to the gradient descent optimization algorithm.

The approach was described by (and named for) Yurii Nesterov in his 1983 paper titled “A Method For Solving The Convex Programming Problem With Convergence Rate O(1/k^2).”

Ilya Sutskever, et al. are responsible for popularizing the application of Nesterov Momentum in the training of neural networks with stochastic gradient descent described in their 2013 paper “On The Importance Of Initialization And Momentum In Deep Learning.” They referred to the approach as “**Nesterov’s Accelerated Gradient**,” or NAG for short.

Nesterov Momentum is just like more traditional momentum except the update is performed using the partial derivative of the projected update rather than the derivative current variable value.

While NAG is not typically thought of as a type of momentum, it indeed turns out to be closely related to classical momentum, differing only in the precise update of the velocity vector …

— On The Importance Of Initialization And Momentum In Deep Learning, 2013.

Traditional momentum involves maintaining an additional variable that represents the last update performed to the variable, an exponentially decaying moving average of past gradients.

The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.

— Page 296, Deep Learning, 2016.

This last update or last change to the variable is then added to the variable scaled by a “*momentum*” hyperparameter that controls how much of the last change to add, e.g. 0.9 for 90%.

It is easier to think about this update in terms of two steps, e.g calculate the change in the variable using the partial derivative then calculate the new value for the variable.

- change(t+1) = (momentum * change(t)) – (step_size * f'(x(t)))
- x(t+1) = x(t) + change(t+1)

We can think of momentum in terms of a ball rolling downhill that will accelerate and continue to go in the same direction even in the presence of small hills.

Momentum can be interpreted as a ball rolling down a nearly horizontal incline. The ball naturally gathers momentum as gravity causes it to accelerate, just as the gradient causes momentum to accumulate in this descent method.

— Page 75, Algorithms for Optimization, 2019.

A problem with momentum is that acceleration can sometimes cause the search to overshoot the minima at the bottom of a basin or valley floor.

Nesterov Momentum can be thought of as a modification to momentum to overcome this problem of overshooting the minima.

It involves first calculating the projected position of the variable using the change from the last iteration and using the derivative of the projected position in the calculation of the new position for the variable.

Calculating the gradient of the projected position acts like a correction factor for the acceleration that has been accumulated.

With Nesterov momentum the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.

— Page 300, Deep Learning, 2016.

Nesterov Momentum is easy to think about this in terms of the four steps:

- 1. Project the position of the solution.
- 2. Calculate the gradient of the projection.
- 3. Calculate the change in the variable using the partial derivative.
- 4. Update the variable.

Let’s go through these steps in more detail.

First, the projected position of the entire solution is calculated using the change calculated in the last iteration of the algorithm.

- projection(t+1) = x(t) + (momentum * change(t))

We can then calculate the gradient for this new position.

- gradient(t+1) = f'(projection(t+1))

Now we can calculate the new position of each variable using the gradient of the projection, first by calculating the change in each variable.

- change(t+1) = (momentum * change(t)) – (step_size * gradient(t+1))

And finally, calculating the new value for each variable using the calculated change.

- x(t+1) = x(t) + change(t+1)

In the field of convex optimization more generally, Nesterov Momentum is known to improve the rate of convergence of the optimization algorithm (e.g. reduce the number of iterations required to find the solution).

Like momentum, NAG is a first-order optimization method with better convergence rate guarantee than gradient descent in certain situations.

— On The Importance Of Initialization And Momentum In Deep Learning, 2013.

Although the technique is effective in training neural networks, it may not have the same general effect of accelerating convergence.

Unfortunately, in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence.

— Page 300, Deep Learning, 2016.

Now that we are familiar with the Nesterov Momentum algorithm, let’s explore how we might implement it and evaluate its performance.

In this section, we will explore how to implement the gradient descent optimization algorithm with Nesterov Momentum.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

# objective function def objective(x, y): return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection='3d') axis.plot_surface(x, y, results, cmap='jet') # show the plot pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function from numpy import asarray from numpy import arange from numpy import meshgrid from matplotlib import pyplot # objective function def objective(x, y): return x**2.0 + y**2.0 # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # show the plot pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the Nesterov Momentum optimization algorithm.

We can apply the gradient descent with Nesterov Momentum to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of x^2 is x * 2 in each dimension and the *derivative()* function implements this below.

# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent optimization.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

... # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to calculate the projected point from the current position and calculate its derivative.

... # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1])

We can then create the new solution, one variable at a time.

First, the change in the variable is calculated using the partial derivative and learning rate with the momentum from the last change in the variable. This change is stored for the next iteration of the algorithm. Then the change is used to calculate the new value for the variable.

... # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) - step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value)

This is repeated for each variable for the objective function, then repeated for each iteration of the algorithm.

This new solution can then be evaluated using the *objective()* function and the performance of the search can be reported.

... # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

And that’s it.

We can tie all of this together into a function named *nesterov()* that takes the names of the objective function and the derivative function, an array with the bounds of the domain and hyperparameter values for the total number of algorithm iterations, the learning rate, and the momentum, and returns the final solution and its evaluation.

This complete function is listed below.

# gradient descent algorithm with nesterov momentum def nesterov(objective, derivative, bounds, n_iter, step_size, momentum): # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # list of changes made to each variable change = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) - step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval)) return [solution, solution_eval]

**Note**, we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorization implementation with NumPy arrays for better performance.

We can then define our hyperparameters and call the *nesterov()* function to optimize our test objective function.

In this case, we will use 30 iterations of the algorithm with a learning rate of 0.1 and momentum of 0.3. These hyperparameter values were found after a little trial and error.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 30 # define the step size step_size = 0.1 # define momentum momentum = 0.3 # perform the gradient descent search with nesterov momentum best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum) print('Done!') print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with Nesterov Momentum is listed below.

# gradient descent optimization with nesterov momentum for a two-dimensional test function from math import sqrt from numpy import asarray from numpy.random import rand from numpy.random import seed # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with nesterov momentum def nesterov(objective, derivative, bounds, n_iter, step_size, momentum): # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # list of changes made to each variable change = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) - step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval)) return [solution, solution_eval] # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 30 # define the step size step_size = 0.1 # define momentum momentum = 0.3 # perform the gradient descent search with nesterov momentum best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum) print('Done!') print('f(%s) = %f' % (best, score))

Running the example applies the optimization algorithm with Nesterov Momentum to our test problem and reports performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near optimal solution was found after perhaps 15 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

>0 f([-0.13276479 0.35251919]) = 0.14190 >1 f([-0.09824595 0.2608642 ]) = 0.07770 >2 f([-0.07031223 0.18669416]) = 0.03980 >3 f([-0.0495457 0.13155452]) = 0.01976 >4 f([-0.03465259 0.0920101 ]) = 0.00967 >5 f([-0.02414772 0.06411742]) = 0.00469 >6 f([-0.01679701 0.04459969]) = 0.00227 >7 f([-0.01167344 0.0309955 ]) = 0.00110 >8 f([-0.00810909 0.02153139]) = 0.00053 >9 f([-0.00563183 0.01495373]) = 0.00026 >10 f([-0.00391092 0.01038434]) = 0.00012 >11 f([-0.00271572 0.00721082]) = 0.00006 >12 f([-0.00188573 0.00500701]) = 0.00003 >13 f([-0.00130938 0.0034767 ]) = 0.00001 >14 f([-0.00090918 0.00241408]) = 0.00001 >15 f([-0.0006313 0.00167624]) = 0.00000 >16 f([-0.00043835 0.00116391]) = 0.00000 >17 f([-0.00030437 0.00080817]) = 0.00000 >18 f([-0.00021134 0.00056116]) = 0.00000 >19 f([-0.00014675 0.00038964]) = 0.00000 >20 f([-0.00010189 0.00027055]) = 0.00000 >21 f([-7.07505806e-05 1.87858067e-04]) = 0.00000 >22 f([-4.91260884e-05 1.30440372e-04]) = 0.00000 >23 f([-3.41109926e-05 9.05720503e-05]) = 0.00000 >24 f([-2.36851711e-05 6.28892431e-05]) = 0.00000 >25 f([-1.64459397e-05 4.36675208e-05]) = 0.00000 >26 f([-1.14193362e-05 3.03208033e-05]) = 0.00000 >27 f([-7.92908415e-06 2.10534304e-05]) = 0.00000 >28 f([-5.50560682e-06 1.46185748e-05]) = 0.00000 >29 f([-3.82285090e-06 1.01504945e-05]) = 0.00000 Done! f([-3.82285090e-06 1.01504945e-05]) = 0.000000

We can plot the progress of the Nesterov Momentum search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *nesterov()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with nesterov momentum def nesterov(objective, derivative, bounds, n_iter, step_size, momentum): # track all solutions solutions = list() # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # list of changes made to each variable change = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) - step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) # store the new solution solution = asarray(new_solution) solutions.append(solution) # evaluate candidate point solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval)) return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

... # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # define the step size step_size = 0.01 # define momentum momentum = 0.8 # perform the gradient descent search with nesterov momentum solutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)

We can then create a contour plot of the objective function, as before.

... # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

... # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the Nesterov Momentum optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the nesterov momentum search on a contour plot of the test function from math import sqrt from numpy import asarray from numpy import arange from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D # objective function def objective(x, y): return x**2.0 + y**2.0 # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with nesterov momentum def nesterov(objective, derivative, bounds, n_iter, step_size, momentum): # track all solutions solutions = list() # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0]) # list of changes made to each variable change = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) - step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) # store the new solution solution = asarray(new_solution) solutions.append(solution) # evaluate candidate point solution_eval = objective(solution[0], solution[1]) # report progress print('>%d f(%s) = %.5f' % (it, solution, solution_eval)) return solutions # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # define the step size step_size = 0.01 # define momentum momentum = 0.8 # perform the gradient descent search with nesterov momentum solutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap='jet') # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w') # show the plot pyplot.show()

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- A Method For Solving The Convex Programming Problem With Convergence Rate O(1/k^2), 1983.
- On The Importance Of Initialization And Momentum In Deep Learning, 2013.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.

In this tutorial, you discovered how to develop the gradient descent optimization with Nesterov Momentum from scratch.

Specifically, you learned:

- The convergence of gradient descent optimization algorithm can be accelerated by extending the algorithm and adding Nesterov Momentum.
- How to implement the Nesterov Momentum optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent With Nesterov Momentum From Scratch appeared first on Machine Learning Mastery.

]]>The post Develop a Neural Network for Banknote Authentication appeared first on Machine Learning Mastery.

]]>One approach is to first inspect the dataset and develop ideas for what models might work, then explore the learning dynamics of simple models on the dataset, then finally develop and tune a model for the dataset with a robust test harness.

This process can be used to develop effective neural network models for classification and regression predictive modeling problems.

In this tutorial, you will discover how to develop a Multilayer Perceptron neural network model for the banknote binary classification dataset.

After completing this tutorial, you will know:

- How to load and summarize the banknote dataset and use the results to suggest data preparations and model configurations to use.
- How to explore the learning dynamics of simple MLP models on the dataset.
- How to develop robust estimates of model performance, tune model performance and make predictions on new data.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Banknote Classification Dataset
- Neural Network Learning Dynamics
- Robust Model Evaluation
- Final Model and Make Predictions

The first step is to define and explore the dataset.

We will be working with the “*Banknote*” standard binary classification dataset.

The banknote dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

The dataset contains 1,372 rows with 5 numeric variables. It is a classification problem with two classes (binary classification).

Below provides a list of the five variables in the dataset.

- variance of Wavelet Transformed image (continuous).
- skewness of Wavelet Transformed image (continuous).
- kurtosis of Wavelet Transformed image (continuous).
- entropy of image (continuous).
- class (integer).

Below is a sample of the first 5 rows of the dataset

3.6216,8.6661,-2.8073,-0.44699,0 4.5459,8.1674,-2.4586,-1.4621,0 3.866,-2.6383,1.9242,0.10645,0 3.4566,9.5228,-4.0112,-3.5944,0 0.32924,-4.4552,4.5718,-0.9888,0 4.3684,9.6718,-3.9606,-3.1625,0 ...

You can learn more about the dataset here:

- Banknote Dataset (banknote_authentication.csv)
- Banknote Dataset Details (banknote_authentication.names)

We can load the dataset as a pandas DataFrame directly from the URL; for example:

# load the banknote dataset and summarize the shape from pandas import read_csv # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/banknote_authentication.csv' # load the dataset df = read_csv(url, header=None) # summarize shape print(df.shape)

Running the example loads the dataset directly from the URL and reports the shape of the dataset.

In this case, we can confirm that the dataset has 5 variables (4 input and one output) and that the dataset has 1,372 rows of data.

This is not many rows of data for a neural network and suggests that a small network, perhaps with regularization, would be appropriate.

It also suggests that using k-fold cross-validation would be a good idea given that it will give a more reliable estimate of model performance than a train/test split and because a single model will fit in seconds instead of hours or days with the largest datasets.

(1372, 5)

Next, we can learn more about the dataset by looking at summary statistics and a plot of the data.

# show summary statistics and plots of the banknote dataset from pandas import read_csv from matplotlib import pyplot # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/banknote_authentication.csv' # load the dataset df = read_csv(url, header=None) # show summary statistics print(df.describe()) # plot histograms df.hist() pyplot.show()

Running the example first loads the data before and then prints summary statistics for each variable.

We can see that values vary with different means and standard deviations, perhaps some normalization or standardization would be required prior to modeling.

0 1 2 3 4 count 1372.000000 1372.000000 1372.000000 1372.000000 1372.000000 mean 0.433735 1.922353 1.397627 -1.191657 0.444606 std 2.842763 5.869047 4.310030 2.101013 0.497103 min -7.042100 -13.773100 -5.286100 -8.548200 0.000000 25% -1.773000 -1.708200 -1.574975 -2.413450 0.000000 50% 0.496180 2.319650 0.616630 -0.586650 0.000000 75% 2.821475 6.814625 3.179250 0.394810 1.000000 max 6.824800 12.951600 17.927400 2.449500 1.000000

A histogram plot is then created for each variable.

We can see that perhaps the first two variables have a Gaussian-like distribution and the next two input variables may have a skewed Gaussian distribution or an exponential distribution.

We may have some benefit in using a power transform on each variable in order to make the probability distribution less skewed which will likely improve model performance.

Now that we are familiar with the dataset, let’s explore how we might develop a neural network model.

We will develop a Multilayer Perceptron (MLP) model for the dataset using TensorFlow.

We cannot know what model architecture of learning hyperparameters would be good or best for this dataset, so we must experiment and discover what works well.

Given that the dataset is small, a small batch size is probably a good idea, e.g. 16 or 32 rows. Using the Adam version of stochastic gradient descent is a good idea when getting started as it will automatically adapt the learning rate and works well on most datasets.

Before we evaluate models in earnest, it is a good idea to review the learning dynamics and tune the model architecture and learning configuration until we have stable learning dynamics, then look at getting the most out of the model.

We can do this by using a simple train/test split of the data and review plots of the learning curves. This will help us see if we are over-learning or under-learning; then we can adapt the configuration accordingly.

First, we must ensure all input variables are floating-point values and encode the target label as integer values 0 and 1.

... # ensure all data are floating point values X = X.astype('float32') # encode strings to integer y = LabelEncoder().fit_transform(y)

Next, we can split the dataset into input and output variables, then into 67/33 train and test sets.

... # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # split into train and test datasets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

We can define a minimal MLP model. In this case, we will use one hidden layer with 10 nodes and one output layer (chosen arbitrarily). We will use the ReLU activation function in the hidden layer and the “*he_normal*” weight initialization, as together, they are a good practice.

The output of the model is a sigmoid activation for binary classification and we will minimize binary cross-entropy loss.

... # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy')

We will fit the model for 50 training epochs (chosen arbitrarily) with a batch size of 32 because it is a small dataset.

We are fitting the model on raw data, which we think might be a good idea, but it is an important starting point.

... # fit the model history = model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0, validation_data=(X_test,y_test))

At the end of training, we will evaluate the model’s performance on the test dataset and report performance as the classification accuracy.

... # predict test set yhat = model.predict_classes(X_test) # evaluate predictions score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score)

Finally, we will plot learning curves of the cross-entropy loss on the train and test sets during training.

... # plot learning curves pyplot.title('Learning Curves') pyplot.xlabel('Epoch') pyplot.ylabel('Cross Entropy') pyplot.plot(history.history['loss'], label='train') pyplot.plot(history.history['val_loss'], label='val') pyplot.legend() pyplot.show()

Tying this all together, the complete example of evaluating our first MLP on the banknote dataset is listed below.

# fit a simple mlp model on the banknote and review learning curves from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder from sklearn.metrics import accuracy_score from tensorflow.keras import Sequential from tensorflow.keras.layers import Dense from matplotlib import pyplot # load the dataset path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/banknote_authentication.csv' df = read_csv(path, header=None) # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # ensure all data are floating point values X = X.astype('float32') # encode strings to integer y = LabelEncoder().fit_transform(y) # split into train and test datasets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy') # fit the model history = model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0, validation_data=(X_test,y_test)) # predict test set yhat = model.predict_classes(X_test) # evaluate predictions score = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % score) # plot learning curves pyplot.title('Learning Curves') pyplot.xlabel('Epoch') pyplot.ylabel('Cross Entropy') pyplot.plot(history.history['loss'], label='train') pyplot.plot(history.history['val_loss'], label='val') pyplot.legend() pyplot.show()

Running the example first fits the model on the training dataset, then reports the classification accuracy on the test dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved great or perfect accuracy of 100% percent. This might suggest that the prediction problem is easy and/or that neural networks are a good fit for the problem.

Accuracy: 1.000

Line plots of the loss on the train and test sets are then created.

We can see that the model appears to converge well and does not show any signs of overfitting or underfitting.

We did amazingly well on our first try.

Now that we have some idea of the learning dynamics for a simple MLP model on the dataset, we can look at developing a more robust evaluation of model performance on the dataset.

The k-fold cross-validation procedure can provide a more reliable estimate of MLP performance, although it can be very slow.

This is because k models must be fit and evaluated. This is not a problem when the dataset size is small, such as the banknote dataset.

We can use the StratifiedKFold class and enumerate each fold manually, fit the model, evaluate it, and then report the mean of the evaluation scores at the end of the procedure.

... # prepare cross validation kfold = KFold(10) # enumerate splits scores = list() for train_ix, test_ix in kfold.split(X, y): # fit and evaluate the model... ... ... # summarize all scores print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

We can use this framework to develop a reliable estimate of MLP model performance with our base configuration, and even with a range of different data preparations, model architectures, and learning configurations.

It is important that we first developed an understanding of the learning dynamics of the model on the dataset in the previous section before using k-fold cross-validation to estimate the performance. If we started to tune the model directly, we might get good results, but if not, we might have no idea of why, e.g. that the model was over or under fitting.

If we make large changes to the model again, it is a good idea to go back and confirm that the model is converging appropriately.

The complete example of this framework to evaluate the base MLP model from the previous section is listed below.

# k-fold cross-validation of base model for the banknote dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder from sklearn.metrics import accuracy_score from tensorflow.keras import Sequential from tensorflow.keras.layers import Dense from matplotlib import pyplot # load the dataset path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/banknote_authentication.csv' df = read_csv(path, header=None) # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # ensure all data are floating point values X = X.astype('float32') # encode strings to integer y = LabelEncoder().fit_transform(y) # prepare cross validation kfold = StratifiedKFold(10) # enumerate splits scores = list() for train_ix, test_ix in kfold.split(X, y): # split data X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix] # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy') # fit the model model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0) # predict test set yhat = model.predict_classes(X_test) # evaluate predictions score = accuracy_score(y_test, yhat) print('>%.3f' % score) scores.append(score) # summarize all scores print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the model performance each iteration of the evaluation procedure and reports the mean and standard deviation of classification accuracy at the end of the run.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model achieved a mean accuracy of about 99.9 percent.

This confirms our expectation that the base model configuration works very well for this dataset, and indeed the model is a good fit for the problem and perhaps the problem is quite trivial to solve.

This is surprising (to me) because I would have expected some data scaling and perhaps a power transform to be required.

>1.000 >1.000 >1.000 >1.000 >0.993 >1.000 >1.000 >1.000 >1.000 >1.000 Mean Accuracy: 0.999 (0.002)

Next, let’s look at how we might fit a final model and use it to make predictions.

Once we choose a model configuration, we can train a final model on all available data and use it to make predictions on new data.

In this case, we will use the model with dropout and a small batch size as our final model.

We can prepare the data and fit the model as before, although on the entire dataset instead of a training subset of the dataset.

... # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # ensure all data are floating point values X = X.astype('float32') # encode strings to integer le = LabelEncoder() y = le.fit_transform(y) # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy')

We can then use this model to make predictions on new data.

First, we can define a row of new data.

... # define a row of new data row = [3.6216,8.6661,-2.8073,-0.44699]

Note: I took this row from the first row of the dataset and the expected label is a ‘0’.

We can then make a prediction.

... # make prediction yhat = model.predict_classes([row])

Then invert the transform on the prediction, so we can use or interpret the result in the correct label (which is just an integer for this dataset).

... # invert transform to get label for class yhat = le.inverse_transform(yhat)

And in this case, we will simply report the prediction.

... # report prediction print('Predicted: %s' % (yhat[0]))

Tying this all together, the complete example of fitting a final model for the banknote dataset and using it to make a prediction on new data is listed below.

# fit a final model and make predictions on new data for the banknote dataset from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.metrics import accuracy_score from tensorflow.keras import Sequential from tensorflow.keras.layers import Dense from tensorflow.keras.layers import Dropout # load the dataset path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/banknote_authentication.csv' df = read_csv(path, header=None) # split into input and output columns X, y = df.values[:, :-1], df.values[:, -1] # ensure all data are floating point values X = X.astype('float32') # encode strings to integer le = LabelEncoder() y = le.fit_transform(y) # determine the number of input features n_features = X.shape[1] # define model model = Sequential() model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,))) model.add(Dense(1, activation='sigmoid')) # compile the model model.compile(optimizer='adam', loss='binary_crossentropy') # fit the model model.fit(X, y, epochs=50, batch_size=32, verbose=0) # define a row of new data row = [3.6216,8.6661,-2.8073,-0.44699] # make prediction yhat = model.predict_classes([row]) # invert transform to get label for class yhat = le.inverse_transform(yhat) # report prediction print('Predicted: %s' % (yhat[0]))

Running the example fits the model on the entire dataset and makes a prediction for a single row of new data.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model predicted a “0” label for the input row.

Predicted: 0.0

This section provides more resources on the topic if you are looking to go deeper.

- How to Develop a Neural Net for Predicting Disturbances in the Ionosphere
- Best Results for Standard Machine Learning Datasets
- TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras
- A Gentle Introduction to k-fold Cross-Validation

In this tutorial, you discovered how to develop a Multilayer Perceptron neural network model for the banknote binary classification dataset.

Specifically, you learned:

- How to load and summarize the banknote dataset and use the results to suggest data preparations and model configurations to use.
- How to explore the learning dynamics of simple MLP models on the dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Develop a Neural Network for Banknote Authentication appeared first on Machine Learning Mastery.

]]>