Last Updated on April 8, 2023
The “weights” of a neural network are referred to as “parameters” in PyTorch code, and they are fine-tuned by the optimizer during training. In contrast, hyperparameters are the parameters of a neural network that are fixed by design and not tuned by training. Examples are the number of hidden layers and the choice of activation functions. Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are notoriously difficult to configure, and a lot of hyperparameters need to be set. On top of that, individual models can be very slow to train.
In this post, you will discover how to use the grid search capability from the scikit-learn Python machine learning library to tune the hyperparameters of PyTorch deep learning models. After reading this post, you will know:
- How to wrap PyTorch models for use in scikit-learn and how to use grid search
- How to grid search common neural network parameters, such as learning rate, dropout rate, epochs, and number of neurons
- How to define your own hyperparameter tuning experiments on your own projects
Kick-start your project with my book Deep Learning with PyTorch. It provides self-study tutorials with working code.
Let’s get started.

How to Grid Search Hyperparameters for PyTorch Models
Photo by brandon siu. Some rights reserved.
Overview
In this post, you will see how to use the scikit-learn grid search capability with a suite of examples that you can copy-and-paste into your own project as a starting point. Below is a list of the topics we are going to cover:
- How to use PyTorch models in scikit-learn
- How to use grid search in scikit-learn
- How to tune batch size and training epochs
- How to tune optimization algorithms
- How to tune learning rate and momentum
- How to tune network weight initialization
- How to tune activation functions
- How to tune dropout regularization
- How to tune the number of neurons in the hidden layer
How to Use PyTorch Models in scikit-learn
PyTorch models can be used in scikit-learn if wrapped with skorch. This leverages the duck-typing nature of Python: the wrapper gives the PyTorch model an API similar to that of a scikit-learn model, so everything in scikit-learn can work with it. In skorch, there is NeuralNetClassifier for classification neural networks and NeuralNetRegressor for regression neural networks. You may need to run the following command to install the module:

    pip install skorch

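As a rough illustration of the duck-typing idea, the sketch below shows the kind of interface scikit-learn tools look for on an estimator: fit(), predict(), get_params(), and set_params(). The class MajorityClassifier and its tie_break argument are hypothetical and exist only for this illustration; the real skorch wrappers do considerably more.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical estimator that always predicts the most common label."""

    def __init__(self, tie_break=0):
        # constructor arguments are stored unchanged on the instance
        self.tie_break = tie_break

    def get_params(self, deep=True):
        # lets a search tool read the current hyperparameters
        return {"tie_break": self.tie_break}

    def set_params(self, **params):
        # lets a search tool assign new hyperparameters
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        counts = Counter(y)
        top = max(counts.values())
        winners = sorted(label for label, n in counts.items() if n == top)
        # on a tie, prefer tie_break if it is one of the tied labels
        self.label_ = self.tie_break if self.tie_break in winners else winners[0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


clf = MajorityClassifier().set_params(tie_break=1)
clf.fit([[0], [1], [2], [3]], [0, 1, 1, 0])
print(clf.predict([[5], [6]]))
```

Any object that offers these methods can be driven by a tool that changes hyperparameters and refits, regardless of what library it is built on.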
To use these wrappers, you must define your PyTorch model as a class using nn.Module, then pass the class to the module argument when constructing the NeuralNetClassifier class. For example:

    class MyClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            ...

        def forward(self, x):
            ...
            return x

    # create the skorch wrapper
    model = NeuralNetClassifier(
        module=MyClassifier
    )

The constructor for the NeuralNetClassifier class can take default arguments that are passed on to the calls to model.fit() (the way to invoke a training loop in scikit-learn models), such as the number of epochs and the batch size. For example:

    model = NeuralNetClassifier(
        module=MyClassifier,
        max_epochs=150,
        batch_size=10
    )

The constructor for the NeuralNetClassifier class can also take new arguments that are passed to your model class’ constructor, but you have to prefix them with module__ (with two underscores). These arguments may carry a default value in the model’s constructor, but it will be overridden when the wrapper instantiates the model. For example:

    import torch.nn as nn
    from skorch import NeuralNetClassifier

    class SonarClassifier(nn.Module):
        def __init__(self, n_layers=3):
            super().__init__()
            self.layers = []
            self.acts = []
            for i in range(n_layers):
                self.layers.append(nn.Linear(60, 60))
                self.acts.append(nn.ReLU())
                self.add_module(f"layer{i}", self.layers[-1])
                self.add_module(f"act{i}", self.acts[-1])
            self.output = nn.Linear(60, 1)

        def forward(self, x):
            for layer, act in zip(self.layers, self.acts):
                x = act(layer(x))
            x = self.output(x)
            return x

    model = NeuralNetClassifier(
        module=SonarClassifier,
        max_epochs=150,
        batch_size=10,
        module__n_layers=2
    )

You can verify the result by initializing a model and printing it:

    print(model.initialize())

In this example, you should see:

    <class 'skorch.classifier.NeuralNetClassifier'>[initialized](
      module_=SonarClassifier(
        (layer0): Linear(in_features=60, out_features=60, bias=True)
        (act0): ReLU()
        (layer1): Linear(in_features=60, out_features=60, bias=True)
        (act1): ReLU()
        (output): Linear(in_features=60, out_features=1, bias=True)
      ),
    )

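The double-underscore prefix convention can be pictured as a simple split of keyword arguments by prefix. The following is only a sketch of that idea, not skorch's actual implementation; the helper split_prefixed_params is a hypothetical name introduced here.

```python
def split_prefixed_params(params, prefix):
    """Separate keys that start with `prefix + '__'` from the rest."""
    marker = prefix + "__"
    routed, remaining = {}, {}
    for key, value in params.items():
        if key.startswith(marker):
            # strip the prefix so the target receives the bare argument name
            routed[key[len(marker):]] = value
        else:
            remaining[key] = value
    return routed, remaining


wrapper_args = {"max_epochs": 150, "batch_size": 10, "module__n_layers": 2}
module_args, other_args = split_prefixed_params(wrapper_args, "module")
print(module_args)  # arguments destined for the model class constructor
print(other_args)   # arguments kept by the wrapper itself
```

The same picture applies to any other prefix the wrapper understands: the part before the double underscore names the destination, and the part after it names the argument.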
How to Use Grid Search in scikit-learn
Grid search is a model hyperparameter optimization technique. It simply exhausts all combinations of the hyperparameters and finds the one that gives the best score. In scikit-learn, this technique is provided in the GridSearchCV class. When constructing this class, you must provide a dictionary of hyperparameters to evaluate in the param_grid argument. This is a map from the model parameter name to an array of values to try.
By default, the estimator’s own score (accuracy for classifiers) is optimized, but other scores can be specified in the scoring argument of the GridSearchCV constructor. The GridSearchCV process will then construct and evaluate one model for each combination of parameters. Cross-validation is used to evaluate each individual model. Recent versions of scikit-learn default to 5-fold cross-validation, although you can override this by specifying the cv argument to the GridSearchCV constructor; the examples in this post use cv=3.
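To see what "all combinations" means in practice, the sketch below expands a parameter dictionary into the individual candidates using only the standard library. The helper expand_grid is a hypothetical name for illustration; scikit-learn performs the equivalent expansion internally.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of values as a list of dicts."""
    names = sorted(param_grid)
    combos = product(*(param_grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


param_grid = {
    "batch_size": [10, 20, 40, 60, 80, 100],
    "max_epochs": [10, 50, 100],
}
candidates = expand_grid(param_grid)
n_folds = 3
print(len(candidates))            # number of hyperparameter combinations
print(len(candidates) * n_folds)  # number of cross-validation fits
print(candidates[0])
```

The number of fits grows multiplicatively with every hyperparameter added to the grid. By default, GridSearchCV also refits the best configuration once on all the data after the search.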
Below is an example of defining a simple grid search:

    param_grid = {
        'max_epochs': [10, 20, 30]
    }
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
    grid_result = grid.fit(X, Y)

By setting the n_jobs argument in the GridSearchCV constructor to -1, the process will use all cores on your machine. Otherwise, the grid search will run as a single process, which is slower on multi-core CPUs.
Once completed, you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure, and best_params_ describes the combination of parameters that achieved the best results. You can learn more about the GridSearchCV class in the scikit-learn API documentation.
Problem Description
Now that you know how to use PyTorch models with scikit-learn and how to use grid search in scikit-learn, let’s look at a bunch of examples.
All examples will be demonstrated on a small standard machine learning dataset called the Pima Indians onset of diabetes classification dataset. This is a small dataset with all numerical attributes that is easy to work with.
As you proceed through the examples in this post, you will aggregate the best parameters. This is not the best way to grid search because parameters can interact, but it is good for demonstration purposes.
How to Tune Batch Size and Number of Epochs
In this first simple example, you will look at tuning the batch size and number of epochs used when fitting the network.
The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.
The number of epochs is the number of times the entire training dataset is shown to the network during training. Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and Convolutional Neural Networks.
Here you will evaluate a suite of different minibatch sizes from 10 to 100, together with 10, 50, and 100 training epochs.
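The interaction between batch size and epochs is easier to see as a count of weight updates. The sketch below assumes 512 training rows, which is two thirds of a 768-row dataset under 3-fold cross-validation; it ignores any further validation split that the wrapper may hold out internally, so treat the numbers as approximate.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


n_train = 512  # assumed: two thirds of a 768-row dataset under 3-fold CV
for batch_size in [10, 20, 40, 60, 80, 100]:
    for max_epochs in [10, 50, 100]:
        total = updates_per_epoch(n_train, batch_size) * max_epochs
        print(batch_size, max_epochs, total)
```

Smaller batches give many more updates for the same number of epochs, which is one reason the two settings are worth tuning together.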
The full code listing is provided below:

    import random
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from skorch import NeuralNetClassifier
    from sklearn.model_selection import GridSearchCV

    # load the dataset, split into input (X) and output (y) variables
    dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
    X = dataset[:,0:8]
    y = dataset[:,8]
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

    # PyTorch classifier
    class PimaClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(8, 12)
            self.act = nn.ReLU()
            self.output = nn.Linear(12, 1)
            self.prob = nn.Sigmoid()

        def forward(self, x):
            x = self.act(self.layer(x))
            x = self.prob(self.output(x))
            return x

    # create model with skorch
    model = NeuralNetClassifier(
        PimaClassifier,
        criterion=nn.BCELoss,
        optimizer=optim.Adam,
        verbose=False
    )

    # define the grid search parameters
    param_grid = {
        'batch_size': [10, 20, 40, 60, 80, 100],
        'max_epochs': [10, 50, 100]
    }
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
    grid_result = grid.fit(X, y)

    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output:

    Best: 0.714844 using {'batch_size': 10, 'max_epochs': 100}
    0.665365 (0.020505) with: {'batch_size': 10, 'max_epochs': 10}
    0.588542 (0.168055) with: {'batch_size': 10, 'max_epochs': 50}
    0.714844 (0.032369) with: {'batch_size': 10, 'max_epochs': 100}
    0.671875 (0.022326) with: {'batch_size': 20, 'max_epochs': 10}
    0.696615 (0.008027) with: {'batch_size': 20, 'max_epochs': 50}
    0.714844 (0.019918) with: {'batch_size': 20, 'max_epochs': 100}
    0.666667 (0.009744) with: {'batch_size': 40, 'max_epochs': 10}
    0.687500 (0.033603) with: {'batch_size': 40, 'max_epochs': 50}
    0.707031 (0.024910) with: {'batch_size': 40, 'max_epochs': 100}
    0.667969 (0.014616) with: {'batch_size': 60, 'max_epochs': 10}
    0.694010 (0.036966) with: {'batch_size': 60, 'max_epochs': 50}
    0.694010 (0.042473) with: {'batch_size': 60, 'max_epochs': 100}
    0.670573 (0.023939) with: {'batch_size': 80, 'max_epochs': 10}
    0.674479 (0.020752) with: {'batch_size': 80, 'max_epochs': 50}
    0.703125 (0.026107) with: {'batch_size': 80, 'max_epochs': 100}
    0.680990 (0.014382) with: {'batch_size': 100, 'max_epochs': 10}
    0.670573 (0.013279) with: {'batch_size': 100, 'max_epochs': 50}
    0.687500 (0.017758) with: {'batch_size': 100, 'max_epochs': 100}

You can see that a batch size of 10 with 100 epochs is reported as the best result at about 71% accuracy. A batch size of 20 with 100 epochs reached the same mean accuracy with a smaller standard deviation, so you should also take the standard deviation into account.
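One simple way to take the standard deviation into account is to list every configuration whose mean lies within one standard deviation of the best. The sketch below does this for a subset of the rows from the output above; the one-standard-deviation threshold is a common rule of thumb chosen here for illustration, not something the grid search applies itself.

```python
# (batch_size, max_epochs, mean accuracy, standard deviation),
# a subset of rows copied from the output above
rows = [
    (10, 50, 0.588542, 0.168055),
    (10, 100, 0.714844, 0.032369),
    (20, 50, 0.696615, 0.008027),
    (20, 100, 0.714844, 0.019918),
    (40, 100, 0.707031, 0.024910),
    (100, 10, 0.680990, 0.014382),
]

best = max(rows, key=lambda r: r[2])
threshold = best[2] - best[3]  # best mean minus its standard deviation

# configurations whose mean is within one standard deviation of the best
close = [r for r in rows if r[2] >= threshold]
for batch_size, max_epochs, mean, std in sorted(close, key=lambda r: r[3]):
    print(batch_size, max_epochs, mean, std)
```

Configurations that survive this filter are hard to tell apart on this evidence alone, so a cheaper or more stable one may be the more sensible choice.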
How to Tune the Training Optimization Algorithm
All deep learning libraries should offer a variety of optimization algorithms. PyTorch is no exception.
In this example, you will tune the optimization algorithm used to train the network, each with default parameters.
This is an odd example because often, you will choose one approach a priori and instead focus on tuning its parameters on your problem (see the next example).
Here, you will evaluate several of the optimization algorithms available in PyTorch.
The full code listing is provided below:

    import numpy as np
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from skorch import NeuralNetClassifier
    from sklearn.model_selection import GridSearchCV

    # load the dataset, split into input (X) and output (y) variables
    dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
    X = dataset[:,0:8]
    y = dataset[:,8]
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

    # PyTorch classifier
    class PimaClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(8, 12)
            self.act = nn.ReLU()
            self.output = nn.Linear(12, 1)
            self.prob = nn.Sigmoid()

        def forward(self, x):
            x = self.act(self.layer(x))
            x = self.prob(self.output(x))
            return x

    # create model with skorch
    model = NeuralNetClassifier(
        PimaClassifier,
        criterion=nn.BCELoss,
        max_epochs=100,
        batch_size=10,
        verbose=False
    )

    # define the grid search parameters
    param_grid = {
        'optimizer': [optim.SGD, optim.RMSprop, optim.Adagrad, optim.Adadelta,
                      optim.Adam, optim.Adamax, optim.NAdam],
    }
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
    grid_result = grid.fit(X, y)

    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output:

    Best: 0.721354 using {'optimizer': <class 'torch.optim.adamax.Adamax'>}
    0.674479 (0.036828) with: {'optimizer': <class 'torch.optim.sgd.SGD'>}
    0.700521 (0.043303) with: {'optimizer': <class 'torch.optim.rmsprop.RMSprop'>}
    0.682292 (0.027126) with: {'optimizer': <class 'torch.optim.adagrad.Adagrad'>}
    0.572917 (0.051560) with: {'optimizer': <class 'torch.optim.adadelta.Adadelta'>}
    0.714844 (0.030758) with: {'optimizer': <class 'torch.optim.adam.Adam'>}
    0.721354 (0.019225) with: {'optimizer': <class 'torch.optim.adamax.Adamax'>}
    0.709635 (0.024360) with: {'optimizer': <class 'torch.optim.nadam.NAdam'>}

The results suggest that the Adamax optimization algorithm is the best with a score of about 72% accuracy.
It is worth mentioning that GridSearchCV will recreate your model often so that every trial is independent. This is possible because of the NeuralNetClassifier wrapper, which holds the class of your PyTorch model and instantiates it for you upon request.
How to Tune Learning Rate and Momentum
It is common to pre-select an optimization algorithm to train your network and tune its parameters.
By far, the most common optimization algorithm is plain old Stochastic Gradient Descent (SGD) because it is so well understood. In this example, you will look at optimizing the SGD learning rate and momentum parameters.
The learning rate controls how much to update the weight at the end of each batch, and the momentum controls how much to let the previous update influence the current weight update.
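The roles of the two parameters can be seen in a scalar version of the update rule. This is a minimal sketch on the toy function f(w) = w**2, written in plain Python; it follows the common formulation in which the velocity accumulates the raw gradient and the learning rate scales the step, and the function names are hypothetical.

```python
def sgd_momentum_step(weight, velocity, grad, lr, momentum):
    """One SGD update with momentum on a single scalar weight."""
    # blend the previous update direction with the new gradient
    velocity = momentum * velocity + grad
    # the learning rate scales how far the weight moves
    weight = weight - lr * velocity
    return weight, velocity


def minimize(lr, momentum, steps=50):
    """Minimize f(w) = w**2 starting from w = 5.0."""
    weight, velocity = 5.0, 0.0
    for _ in range(steps):
        grad = 2.0 * weight  # derivative of w**2
        weight, velocity = sgd_momentum_step(weight, velocity, grad, lr, momentum)
    return weight


for momentum in [0.0, 0.5, 0.9]:
    print(momentum, abs(minimize(lr=0.01, momentum=momentum)))
```

On this toy problem with a small learning rate, higher momentum gets closer to the minimum in the same number of steps. With a large learning rate the same momentum can overshoot, which is why the two are tuned together.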
You will try a suite of small standard learning rates and momentum values from 0.0 to 0.8 in steps of 0.2, as well as 0.9 (because it can be a popular value in practice). In PyTorch, the way to set the learning rate and momentum is the following:

    # model is an nn.Module instance; the optimizer needs its parameters
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

In the skorch wrapper, you can route the parameters to the optimizer with the prefix optimizer__.
Generally, it is a good idea to also include the number of epochs in an optimization like this as there is a dependency between the amount of learning per batch (learning rate), the number of updates per epoch (batch size), and the number of epochs.
The full code listing is provided below:

    import numpy as np
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from skorch import NeuralNetClassifier
    from sklearn.model_selection import GridSearchCV

    # load the dataset, split into input (X) and output (y) variables
    dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
    X = dataset[:,0:8]
    y = dataset[:,8]
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

    # PyTorch classifier
    class PimaClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(8, 12)
            self.act = nn.ReLU()
            self.output = nn.Linear(12, 1)
            self.prob = nn.Sigmoid()

        def forward(self, x):
            x = self.act(self.layer(x))
            x = self.prob(self.output(x))
            return x

    # create model with skorch
    model = NeuralNetClassifier(
        PimaClassifier,
        criterion=nn.BCELoss,
        optimizer=optim.SGD,
        max_epochs=100,
        batch_size=10,
        verbose=False
    )

    # define the grid search parameters
    param_grid = {
        'optimizer__lr': [0.001, 0.01, 0.1, 0.2, 0.3],
        'optimizer__momentum': [0.0, 0.2, 0.4, 0.6, 0.8, 0.9],
    }
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
    grid_result = grid.fit(X, y)

    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output.

    Best: 0.682292 using {'optimizer__lr': 0.001, 'optimizer__momentum': 0.9}
    0.648438 (0.016877) with: {'optimizer__lr': 0.001, 'optimizer__momentum': 0.0}
    0.671875 (0.017758) with: {'optimizer__lr': 0.001, 'optimizer__momentum': 0.2}
    0.674479 (0.022402) with: {'optimizer__lr': 0.001, 'optimizer__momentum': 0.4}
    0.677083 (0.011201) with: {'optimizer__lr': 0.001, 'optimizer__momentum': 0.6}
    0.679688 (0.027621) with: {'optimizer__lr': 0.001, 'optimizer__momentum': 0.8}
    0.682292 (0.026557) with: {'optimizer__lr': 0.001, 'optimizer__momentum': 0.9}
    0.671875 (0.019918) with: {'optimizer__lr': 0.01, 'optimizer__momentum': 0.0}
    0.648438 (0.024910) with: {'optimizer__lr': 0.01, 'optimizer__momentum': 0.2}
    0.546875 (0.143454) with: {'optimizer__lr': 0.01, 'optimizer__momentum': 0.4}
    0.567708 (0.153668) with: {'optimizer__lr': 0.01, 'optimizer__momentum': 0.6}
    0.552083 (0.141790) with: {'optimizer__lr': 0.01, 'optimizer__momentum': 0.8}
    0.451823 (0.144561) with: {'optimizer__lr': 0.01, 'optimizer__momentum': 0.9}
    0.348958 (0.001841) with: {'optimizer__lr': 0.1, 'optimizer__momentum': 0.0}
    0.450521 (0.142719) with: {'optimizer__lr': 0.1, 'optimizer__momentum': 0.2}
    0.450521 (0.142719) with: {'optimizer__lr': 0.1, 'optimizer__momentum': 0.4}
    0.450521 (0.142719) with: {'optimizer__lr': 0.1, 'optimizer__momentum': 0.6}
    0.348958 (0.001841) with: {'optimizer__lr': 0.1, 'optimizer__momentum': 0.8}
    0.348958 (0.001841) with: {'optimizer__lr': 0.1, 'optimizer__momentum': 0.9}
    0.444010 (0.136265) with: {'optimizer__lr': 0.2, 'optimizer__momentum': 0.0}
    0.450521 (0.142719) with: {'optimizer__lr': 0.2, 'optimizer__momentum': 0.2}
    0.348958 (0.001841) with: {'optimizer__lr': 0.2, 'optimizer__momentum': 0.4}
    0.552083 (0.141790) with: {'optimizer__lr': 0.2, 'optimizer__momentum': 0.6}
    0.549479 (0.142719) with: {'optimizer__lr': 0.2, 'optimizer__momentum': 0.8}
    0.651042 (0.001841) with: {'optimizer__lr': 0.2, 'optimizer__momentum': 0.9}
    0.552083 (0.141790) with: {'optimizer__lr': 0.3, 'optimizer__momentum': 0.0}
    0.348958 (0.001841) with: {'optimizer__lr': 0.3, 'optimizer__momentum': 0.2}
    0.450521 (0.142719) with: {'optimizer__lr': 0.3, 'optimizer__momentum': 0.4}
    0.552083 (0.141790) with: {'optimizer__lr': 0.3, 'optimizer__momentum': 0.6}
    0.450521 (0.142719) with: {'optimizer__lr': 0.3, 'optimizer__momentum': 0.8}
    0.450521 (0.142719) with: {'optimizer__lr': 0.3, 'optimizer__momentum': 0.9}

You can see that, with SGD, the best results were achieved using a learning rate of 0.001 and a momentum of 0.9 with an accuracy of about 68%.
How to Tune Network Weight Initialization
Neural network weight initialization used to be simple: use small random values.
Now there is a suite of different techniques to choose from. You can get a laundry list from the torch.nn.init documentation.
In this example, you will look at tuning the selection of network weight initialization by evaluating several of the available techniques.
You will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. In the example below, you will use a rectifier for the hidden layer and a sigmoid for the output layer because the predictions are binary.
Weight initialization is implicit in PyTorch models. Therefore, you need to write your own logic to initialize the weights after the layer is created but before it is used. Let’s modify the PyTorch model as follows:

    # PyTorch classifier
    class PimaClassifier(nn.Module):
        def __init__(self, weight_init=torch.nn.init.xavier_uniform_):
            super().__init__()
            self.layer = nn.Linear(8, 12)
            self.act = nn.ReLU()
            self.output = nn.Linear(12, 1)
            self.prob = nn.Sigmoid()
            # manually init weights
            weight_init(self.layer.weight)
            weight_init(self.output.weight)

        def forward(self, x):
            x = self.act(self.layer(x))
            x = self.prob(self.output(x))
            return x

An argument weight_init is added to the class PimaClassifier, and it expects one of the initializers from torch.nn.init. In GridSearchCV, you need to use the module__ prefix to make NeuralNetClassifier route the parameter to the model’s class constructor.
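To get a feel for how the initializers differ, the sketch below computes the sampling limits of two uniform schemes for the layer sizes used in this post. It uses the standard Xavier/Glorot and He/Kaiming formulas in plain Python; the gain of sqrt(2) in the Kaiming case is an assumption corresponding to a ReLU-like activation, and the helper names are hypothetical.

```python
import math


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Limit a of the U(-a, a) range used by Xavier/Glorot uniform."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def kaiming_uniform_bound(fan_in, gain=math.sqrt(2.0)):
    """Limit of the U(-b, b) range used by He/Kaiming uniform in fan-in mode."""
    return gain * math.sqrt(3.0 / fan_in)


# layer sizes of the PimaClassifier: 8 inputs -> 12 hidden -> 1 output
for name, fan_in, fan_out in [("layer", 8, 12), ("output", 12, 1)]:
    print(name, xavier_uniform_bound(fan_in, fan_out), kaiming_uniform_bound(fan_in))
```

The Xavier limit depends on both the input and output width of a layer, while the fan-in Kaiming limit depends only on the input width, so the two schemes start the same network from noticeably different weight scales.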
The full code listing is provided below:

    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.init as init
    import torch.optim as optim
    from skorch import NeuralNetClassifier
    from sklearn.model_selection import GridSearchCV

    # load the dataset, split into input (X) and output (y) variables
    dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
    X = dataset[:,0:8]
    y = dataset[:,8]
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

    # PyTorch classifier
    class PimaClassifier(nn.Module):
        def __init__(self, weight_init=init.xavier_uniform_):
            super().__init__()
            self.layer = nn.Linear(8, 12)
            self.act = nn.ReLU()
            self.output = nn.Linear(12, 1)
            self.prob = nn.Sigmoid()
            # manually init weights
            weight_init(self.layer.weight)
            weight_init(self.output.weight)

        def forward(self, x):
            x = self.act(self.layer(x))
            x = self.prob(self.output(x))
            return x

    # create model with skorch
    model = NeuralNetClassifier(
        PimaClassifier,
        criterion=nn.BCELoss,
        optimizer=optim.Adamax,
        max_epochs=100,
        batch_size=10,
        verbose=False
    )

    # define the grid search parameters
    param_grid = {
        'module__weight_init': [init.uniform_, init.normal_, init.zeros_,
                                init.xavier_normal_, init.xavier_uniform_,
                                init.kaiming_normal_, init.kaiming_uniform_]
    }
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
    grid_result = grid.fit(X, y)

    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output.

    Best: 0.697917 using {'module__weight_init': <function kaiming_uniform_ at 0x112020c10>}
    0.348958 (0.001841) with: {'module__weight_init': <function uniform_ at 0x1120204c0>}
    0.602865 (0.061708) with: {'module__weight_init': <function normal_ at 0x112020550>}
    0.652344 (0.003189) with: {'module__weight_init': <function zeros_ at 0x112020820>}
    0.691406 (0.030758) with: {'module__weight_init': <function xavier_normal_ at 0x112020af0>}
    0.592448 (0.171589) with: {'module__weight_init': <function xavier_uniform_ at 0x112020a60>}
    0.563802 (0.152971) with: {'module__weight_init': <function kaiming_normal_ at 0x112020ca0>}
    0.697917 (0.013279) with: {'module__weight_init': <function kaiming_uniform_ at 0x112020c10>}

The best results were achieved with a He-uniform weight initialization scheme achieving a performance of about 70%.
How to Tune the Neuron Activation Function
The activation function controls the nonlinearity of individual neurons and when to fire.
Generally, the rectifier activation function is the most popular. However, it used to be the sigmoid and the tanh functions, and these functions may still be more suitable for different problems.
In this example, you will evaluate some of the activation functions available in PyTorch. You will only use these functions in the hidden layer, as a sigmoid activation function is required in the output for the binary classification problem. Similar to the previous example, this is an argument to the class constructor of the model, and you will use the module__ prefix for the GridSearchCV parameter grid.
Generally, it is a good idea to scale the input data to suit the range of the different activation functions, which you will not do in this case.
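Although the examples in this post skip data preparation, one common approach is to standardize each input column to zero mean and unit variance. The sketch below shows the idea with PyTorch tensors on a few illustrative rows with three columns; the values are placeholders, not a claim about the dataset, and standardization is only one of several possible scaling choices.

```python
import torch

# a few illustrative rows standing in for the input matrix X (assumption)
X = torch.tensor([
    [6.0, 148.0, 72.0],
    [1.0, 85.0, 66.0],
    [8.0, 183.0, 64.0],
    [1.0, 89.0, 66.0],
])

# per-column statistics
mean = X.mean(dim=0, keepdim=True)
std = X.std(dim=0, keepdim=True)

# each column now has zero mean and unit variance
X_scaled = (X - mean) / std

print(X_scaled.mean(dim=0))
print(X_scaled.std(dim=0))
```

When combined with cross-validation, the statistics should ideally be computed on the training portion of each fold only, so that information from the held-out data does not leak into training.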
The full code listing is provided below:

    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.init as init
    import torch.optim as optim
    from skorch import NeuralNetClassifier
    from sklearn.model_selection import GridSearchCV

    # load the dataset, split into input (X) and output (y) variables
    dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
    X = dataset[:,0:8]
    y = dataset[:,8]
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

    # PyTorch classifier
    class PimaClassifier(nn.Module):
        def __init__(self, activation=nn.ReLU):
            super().__init__()
            self.layer = nn.Linear(8, 12)
            self.act = activation()
            self.output = nn.Linear(12, 1)
            self.prob = nn.Sigmoid()
            # manually init weights
            init.kaiming_uniform_(self.layer.weight)
            init.kaiming_uniform_(self.output.weight)

        def forward(self, x):
            x = self.act(self.layer(x))
            x = self.prob(self.output(x))
            return x

    # create model with skorch
    model = NeuralNetClassifier(
        PimaClassifier,
        criterion=nn.BCELoss,
        optimizer=optim.Adamax,
        max_epochs=100,
        batch_size=10,
        verbose=False
    )

    # define the grid search parameters
    param_grid = {
        'module__activation': [nn.Identity, nn.ReLU, nn.ELU, nn.ReLU6,
                               nn.GELU, nn.Softplus, nn.Softsign, nn.Tanh,
                               nn.Sigmoid, nn.Hardsigmoid]
    }
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
    grid_result = grid.fit(X, y)

    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output.

    Best: 0.699219 using {'module__activation': <class 'torch.nn.modules.activation.ReLU'>}
    0.687500 (0.025315) with: {'module__activation': <class 'torch.nn.modules.linear.Identity'>}
    0.699219 (0.011049) with: {'module__activation': <class 'torch.nn.modules.activation.ReLU'>}
    0.674479 (0.035849) with: {'module__activation': <class 'torch.nn.modules.activation.ELU'>}
    0.621094 (0.063549) with: {'module__activation': <class 'torch.nn.modules.activation.ReLU6'>}
    0.674479 (0.017566) with: {'module__activation': <class 'torch.nn.modules.activation.GELU'>}
    0.558594 (0.149189) with: {'module__activation': <class 'torch.nn.modules.activation.Softplus'>}
    0.675781 (0.014616) with: {'module__activation': <class 'torch.nn.modules.activation.Softsign'>}
    0.619792 (0.018688) with: {'module__activation': <class 'torch.nn.modules.activation.Tanh'>}
    0.643229 (0.019225) with: {'module__activation': <class 'torch.nn.modules.activation.Sigmoid'>}
    0.636719 (0.022326) with: {'module__activation': <class 'torch.nn.modules.activation.Hardsigmoid'>}

It shows that the ReLU activation function achieved the best results with an accuracy of about 70%.
How to Tune Dropout Regularization
In this example, you will look at tuning the dropout rate for regularization in an effort to limit overfitting and improve the model’s ability to generalize.
For the best results, dropout is best combined with a weight constraint such as the max norm constraint, which is implemented here in the forward pass function.
This involves fitting both the dropout rate and the weight constraint. You will try dropout rates between 0.0 and 0.9 (1.0 does not make sense) and MaxNorm weight constraint values between 1 and 5.
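The idea behind a max norm constraint can be shown on a single weight vector. The sketch below is the textbook form in plain Python: if the L2 norm exceeds the limit, the vector is scaled back onto it, otherwise it is left alone. The function name max_norm is hypothetical, and the listing in this section applies its own tensor-based variant inside forward().

```python
import math


def max_norm(weights, max_value):
    """Rescale a weight vector so its L2 norm does not exceed max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)  # already within the constraint
    scale = max_value / norm
    return [w * scale for w in weights]


print(max_norm([3.0, 4.0], max_value=2.0))  # norm 5.0 is scaled down to 2.0
print(max_norm([0.3, 0.4], max_value=2.0))  # norm 0.5 is left unchanged
```

Because the constraint only ever shrinks weights, larger limits such as 5 act more loosely than smaller ones such as 1.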
The full code listing is provided below.

    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.init as init
    import torch.optim as optim
    from skorch import NeuralNetClassifier
    from sklearn.model_selection import GridSearchCV

    # load the dataset, split into input (X) and output (y) variables
    dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
    X = dataset[:,0:8]
    y = dataset[:,8]
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

    # PyTorch classifier
    class PimaClassifier(nn.Module):
        def __init__(self, dropout_rate=0.5, weight_constraint=1.0):
            super().__init__()
            self.layer = nn.Linear(8, 12)
            self.act = nn.ReLU()
            self.dropout = nn.Dropout(dropout_rate)
            self.output = nn.Linear(12, 1)
            self.prob = nn.Sigmoid()
            self.weight_constraint = weight_constraint
            # manually init weights
            init.kaiming_uniform_(self.layer.weight)
            init.kaiming_uniform_(self.output.weight)

        def forward(self, x):
            # maxnorm weight before actual forward pass
            with torch.no_grad():
                norm = self.layer.weight.norm(2, dim=0, keepdim=True).clamp(min=self.weight_constraint / 2)
                desired = torch.clamp(norm, max=self.weight_constraint)
                self.layer.weight *= (desired / norm)
            # actual forward pass
            x = self.act(self.layer(x))
            x = self.dropout(x)
            x = self.prob(self.output(x))
            return x

    # create model with skorch
    model = NeuralNetClassifier(
        PimaClassifier,
        criterion=nn.BCELoss,
        optimizer=optim.Adamax,
        max_epochs=100,
        batch_size=10,
        verbose=False
    )

    # define the grid search parameters
    param_grid = {
        'module__weight_constraint': [1.0, 2.0, 3.0, 4.0, 5.0],
        'module__dropout_rate': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    }
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
    grid_result = grid.fit(X, y)

    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 |
Best: 0.701823 using {'module__dropout_rate': 0.1, 'module__weight_constraint': 2.0}
0.669271 (0.015073) with: {'module__dropout_rate': 0.0, 'module__weight_constraint': 1.0}
0.692708 (0.035132) with: {'module__dropout_rate': 0.0, 'module__weight_constraint': 2.0}
0.589844 (0.170180) with: {'module__dropout_rate': 0.0, 'module__weight_constraint': 3.0}
0.561198 (0.151131) with: {'module__dropout_rate': 0.0, 'module__weight_constraint': 4.0}
0.688802 (0.021710) with: {'module__dropout_rate': 0.0, 'module__weight_constraint': 5.0}
0.697917 (0.009744) with: {'module__dropout_rate': 0.1, 'module__weight_constraint': 1.0}
0.701823 (0.016367) with: {'module__dropout_rate': 0.1, 'module__weight_constraint': 2.0}
0.694010 (0.010253) with: {'module__dropout_rate': 0.1, 'module__weight_constraint': 3.0}
0.686198 (0.025976) with: {'module__dropout_rate': 0.1, 'module__weight_constraint': 4.0}
0.679688 (0.026107) with: {'module__dropout_rate': 0.1, 'module__weight_constraint': 5.0}
0.701823 (0.029635) with: {'module__dropout_rate': 0.2, 'module__weight_constraint': 1.0}
0.682292 (0.014731) with: {'module__dropout_rate': 0.2, 'module__weight_constraint': 2.0}
0.701823 (0.009744) with: {'module__dropout_rate': 0.2, 'module__weight_constraint': 3.0}
0.701823 (0.026557) with: {'module__dropout_rate': 0.2, 'module__weight_constraint': 4.0}
0.687500 (0.015947) with: {'module__dropout_rate': 0.2, 'module__weight_constraint': 5.0}
0.686198 (0.006639) with: {'module__dropout_rate': 0.3, 'module__weight_constraint': 1.0}
0.656250 (0.006379) with: {'module__dropout_rate': 0.3, 'module__weight_constraint': 2.0}
0.565104 (0.155608) with: {'module__dropout_rate': 0.3, 'module__weight_constraint': 3.0}
0.700521 (0.028940) with: {'module__dropout_rate': 0.3, 'module__weight_constraint': 4.0}
0.669271 (0.012890) with: {'module__dropout_rate': 0.3, 'module__weight_constraint': 5.0}
0.661458 (0.018688) with: {'module__dropout_rate': 0.4, 'module__weight_constraint': 1.0}
0.669271 (0.017566) with: {'module__dropout_rate': 0.4, 'module__weight_constraint': 2.0}
0.652344 (0.006379) with: {'module__dropout_rate': 0.4, 'module__weight_constraint': 3.0}
0.680990 (0.037783) with: {'module__dropout_rate': 0.4, 'module__weight_constraint': 4.0}
0.692708 (0.042112) with: {'module__dropout_rate': 0.4, 'module__weight_constraint': 5.0}
0.666667 (0.006639) with: {'module__dropout_rate': 0.5, 'module__weight_constraint': 1.0}
0.652344 (0.011500) with: {'module__dropout_rate': 0.5, 'module__weight_constraint': 2.0}
0.662760 (0.007366) with: {'module__dropout_rate': 0.5, 'module__weight_constraint': 3.0}
0.558594 (0.146610) with: {'module__dropout_rate': 0.5, 'module__weight_constraint': 4.0}
0.552083 (0.141826) with: {'module__dropout_rate': 0.5, 'module__weight_constraint': 5.0}
0.548177 (0.141826) with: {'module__dropout_rate': 0.6, 'module__weight_constraint': 1.0}
0.653646 (0.013279) with: {'module__dropout_rate': 0.6, 'module__weight_constraint': 2.0}
0.661458 (0.008027) with: {'module__dropout_rate': 0.6, 'module__weight_constraint': 3.0}
0.553385 (0.142719) with: {'module__dropout_rate': 0.6, 'module__weight_constraint': 4.0}
0.669271 (0.035132) with: {'module__dropout_rate': 0.6, 'module__weight_constraint': 5.0}
0.662760 (0.015733) with: {'module__dropout_rate': 0.7, 'module__weight_constraint': 1.0}
0.636719 (0.024910) with: {'module__dropout_rate': 0.7, 'module__weight_constraint': 2.0}
0.550781 (0.146818) with: {'module__dropout_rate': 0.7, 'module__weight_constraint': 3.0}
0.537760 (0.140094) with: {'module__dropout_rate': 0.7, 'module__weight_constraint': 4.0}
0.542969 (0.138144) with: {'module__dropout_rate': 0.7, 'module__weight_constraint': 5.0}
0.565104 (0.148654) with: {'module__dropout_rate': 0.8, 'module__weight_constraint': 1.0}
0.657552 (0.008027) with: {'module__dropout_rate': 0.8, 'module__weight_constraint': 2.0}
0.428385 (0.111418) with: {'module__dropout_rate': 0.8, 'module__weight_constraint': 3.0}
0.549479 (0.142719) with: {'module__dropout_rate': 0.8, 'module__weight_constraint': 4.0}
0.648438 (0.005524) with: {'module__dropout_rate': 0.8, 'module__weight_constraint': 5.0}
0.540365 (0.136861) with: {'module__dropout_rate': 0.9, 'module__weight_constraint': 1.0}
0.605469 (0.053083) with: {'module__dropout_rate': 0.9, 'module__weight_constraint': 2.0}
0.553385 (0.139948) with: {'module__dropout_rate': 0.9, 'module__weight_constraint': 3.0}
0.549479 (0.142719) with: {'module__dropout_rate': 0.9, 'module__weight_constraint': 4.0}
0.595052 (0.075566) with: {'module__dropout_rate': 0.9, 'module__weight_constraint': 5.0}
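You can see that a dropout rate of 10% and a weight constraint of 2.0 resulted in the best accuracy of about 70%. Note that several other combinations, such as a dropout rate of 20% with a weight constraint of 1.0, 3.0, or 4.0, tied at the same mean score.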
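The weight constraint searched above is a max-norm rescaling of the hidden layer's weights. Here is a minimal standalone sketch of that step, using the same clamp-and-rescale arithmetic as the model's forward pass in the next listing. The helper name `apply_max_norm` and the toy weight matrix are made up for illustration and are not part of the tutorial code.

```python
import torch

def apply_max_norm(weight: torch.Tensor, max_norm: float) -> torch.Tensor:
    """Rescale, in place, the columns of `weight` whose L2 norm exceeds `max_norm`.

    Hypothetical helper for illustration only.
    """
    with torch.no_grad():
        # L2 norm of each column; the lower clamp avoids dividing by tiny norms
        norm = weight.norm(2, dim=0, keepdim=True).clamp(min=max_norm / 2)
        # target norm: unchanged if already within the limit, otherwise max_norm
        desired = torch.clamp(norm, max=max_norm)
        weight *= desired / norm
    return weight

# toy weight matrix (assumed values): column norms are 5.0 and 0.5
w = torch.tensor([[3.0, 0.3],
                  [4.0, 0.4]])
apply_max_norm(w, max_norm=2.0)

col_norms = w.norm(2, dim=0)
print(col_norms)  # the first column is scaled down to the limit, the second is untouched
```

Only columns whose norm is above the constraint are changed, so a larger constraint value means a weaker regularization effect.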
How to Tune the Number of Neurons in the Hidden Layer
The number of neurons in a layer is an important parameter to tune. Generally, the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.
In theory, a single-hidden-layer network that is large enough can approximate any other neural network, a consequence of the universal approximation theorem.
In this example, you will look at tuning the number of neurons in a single hidden layer. You will try the values 1, 5, 10, 15, 20, 25, and 30.
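To get a feel for how capacity grows with the hidden layer size, you can count the trainable parameters for each candidate value. The sketch below assumes the architecture used in this example: 8 inputs, one hidden layer, and 1 output, with bias terms in both layers. The helper `count_parameters` is a hypothetical name used only here.

```python
def count_parameters(n_neurons: int, n_inputs: int = 8) -> int:
    """Trainable parameters of an n_inputs -> n_neurons -> 1 network with biases."""
    hidden = n_inputs * n_neurons + n_neurons  # hidden layer weights + biases
    output = n_neurons * 1 + 1                 # output layer weights + bias
    return hidden + output

for n in [1, 5, 10, 15, 20, 25, 30]:
    print("%2d neurons -> %3d parameters" % (n, count_parameters(n)))
```

The count grows linearly with the number of neurons here, from 11 parameters with a single neuron to 301 with 30.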
A larger network requires more training, so ideally at least the batch size and number of epochs should be optimized together with the number of neurons.
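A joint search could look something like the sketch below. It only enumerates the combinations to show how quickly the grid grows; the candidate values are assumptions chosen for illustration, not recommendations. The keys `batch_size` and `max_epochs` are the same skorch arguments passed to the model in this post.

```python
from itertools import product

# illustrative joint grid (values are assumptions, not tuned recommendations)
param_grid = {
    'module__n_neurons': [10, 20, 30],
    'batch_size': [10, 40, 100],
    'max_epochs': [50, 100],
}
cv = 3  # number of cross-validation folds, as used in this post

# expand the grid into one dict per combination
combinations = [dict(zip(param_grid.keys(), values))
                for values in product(*param_grid.values())]
n_fits = len(combinations) * cv

print("%d combinations, %d cross-validation fits" % (len(combinations), n_fits))
print(combinations[0])
```

The same `param_grid` dictionary could be passed to `GridSearchCV` in place of the single-parameter grid, at the cost of many more model fits.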
The full code listing is provided below.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.optim as optim
from skorch import NeuralNetClassifier
from sklearn.model_selection import GridSearchCV

# load the dataset, split into input (X) and output (y) variables
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:,0:8]
y = dataset[:,8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

class PimaClassifier(nn.Module):
    def __init__(self, n_neurons=12):
        super().__init__()
        self.layer = nn.Linear(8, n_neurons)
        self.act = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
        self.output = nn.Linear(n_neurons, 1)
        self.prob = nn.Sigmoid()
        self.weight_constraint = 2.0
        # manually init weights
        init.kaiming_uniform_(self.layer.weight)
        init.kaiming_uniform_(self.output.weight)

    def forward(self, x):
        # maxnorm weight before actual forward pass
        with torch.no_grad():
            norm = self.layer.weight.norm(2, dim=0, keepdim=True).clamp(min=self.weight_constraint / 2)
            desired = torch.clamp(norm, max=self.weight_constraint)
            self.layer.weight *= (desired / norm)
        # actual forward pass
        x = self.act(self.layer(x))
        x = self.dropout(x)
        x = self.prob(self.output(x))
        return x

# create model with skorch
model = NeuralNetClassifier(
    PimaClassifier,
    criterion=nn.BCELoss,
    optimizer=optim.Adamax,
    max_epochs=100,
    batch_size=10,
    verbose=False
)

# define the grid search parameters
param_grid = {
    'module__n_neurons': [1, 5, 10, 15, 20, 25, 30]
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Running this example produces the following output.
Best: 0.708333 using {'module__n_neurons': 30}
0.654948 (0.003683) with: {'module__n_neurons': 1}
0.666667 (0.023073) with: {'module__n_neurons': 5}
0.694010 (0.014382) with: {'module__n_neurons': 10}
0.682292 (0.014382) with: {'module__n_neurons': 15}
0.707031 (0.028705) with: {'module__n_neurons': 20}
0.703125 (0.030758) with: {'module__n_neurons': 25}
0.708333 (0.015733) with: {'module__n_neurons': 30}
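You can see that the best result was achieved with 30 neurons in the hidden layer, with an accuracy of about 71%. The mean scores for 10, 20, and 25 neurons fall within one standard deviation of that result, so the differences between them may not be meaningful.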
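Because the top scores are so close, it can help to look past the single best entry. The sketch below copies the means and standard deviations from the output above and keeps every configuration whose mean is within one standard deviation of the best. Using the best configuration's standard deviation as the tolerance is a heuristic assumed here, not a rule from scikit-learn. With a fitted search, the same numbers would come from `cv_results_['mean_test_score']` and `cv_results_['std_test_score']`.

```python
# (n_neurons, mean accuracy, standard deviation) copied from the output above
results = [
    (1, 0.654948, 0.003683),
    (5, 0.666667, 0.023073),
    (10, 0.694010, 0.014382),
    (15, 0.682292, 0.014382),
    (20, 0.707031, 0.028705),
    (25, 0.703125, 0.030758),
    (30, 0.708333, 0.015733),
]

# configuration with the highest mean score
best_n, best_mean, best_std = max(results, key=lambda r: r[1])

# heuristic: keep configurations within one standard deviation of the best mean
candidates = [n for n, mean, std in results if mean >= best_mean - best_std]

print("Best:", best_n)
print("Within one std of best:", candidates)
print("Smallest competitive network:", min(candidates))
```

A smaller network among the competitive candidates trains faster, which may matter more than a fraction of a percent in accuracy.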
Tips for Hyperparameter Optimization
This section lists some handy tips to consider when tuning hyperparameters of your neural network.
- $k$-Fold Cross-Validation. The results from the examples in this post show some variance. The examples used 3-fold cross-validation (cv=3), but perhaps $k=5$ or $k=10$ would be more stable. Choose your cross-validation configuration carefully to ensure your results are stable.
- Review the Whole Grid. Do not focus only on the best result. Review the whole grid of results and look for trends that support configuration decisions. Of course, a larger grid has more combinations and takes longer to evaluate.
- Parallelize. Use all your cores if you can. Neural networks are slow to train, and you will often want to try many different parameters. Consider running the search on a cloud platform, such as AWS.
- Use a Sample of Your Dataset. Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of general directions of parameters rather than optimal configurations.
- Start with Coarse Grids. Start with coarse-grained grids and zoom into finer-grained grids once you can narrow the scope.
- Do Not Transfer Results. Results are generally problem specific. Try to avoid favorite configurations on each new problem that you see. It is unlikely that optimal results you discover on one problem will transfer to your next project. Instead look for broader trends like number of layers or relationships between parameters.
- Reproducibility is a Problem. Even if you set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped PyTorch models than is presented in this post.
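As a starting point for the reproducibility tip, the sketch below seeds the Python, NumPy, and PyTorch random number generators and checks that two freshly created layers receive identical initial weights. The helper name `set_seed` and the seed value 42 are arbitrary choices for illustration. This covers only part of the problem: when the grid search runs with `n_jobs=-1`, the worker processes may not inherit a seed set in the main script, so treat this as a sketch under those assumptions rather than a complete solution.

```python
import random

import numpy as np
import torch
import torch.nn as nn

def set_seed(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# two layers created after the same seed get the same initial weights
set_seed(42)
first = nn.Linear(8, 12).weight.detach().clone()
set_seed(42)
second = nn.Linear(8, 12).weight.detach().clone()

same_init = torch.equal(first, second)
print("Identical initial weights:", same_init)
```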
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- skorch documentation
- torch.nn from PyTorch
- GridSearchCV from scikit-learn
Summary
In this post, you discovered how you can tune the hyperparameters of your deep learning networks in Python using PyTorch and scikit-learn.
Specifically, you learned:
- How to wrap PyTorch models for use in scikit-learn and how to use grid search.
- How to grid search a suite of different standard neural network parameters for PyTorch models.
- How to design your own hyperparameter optimization experiments.