# How to Grid Search Hyperparameters for PyTorch Models

Last Updated on March 22, 2023

The “weights” of a neural network are referred to as “parameters” in PyTorch code, and they are fine-tuned by the optimizer during training. In contrast, hyperparameters are the parameters of a neural network that are fixed by design and not tuned by training. Examples are the number of hidden layers and the choice of activation functions. Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are notoriously difficult to configure, and a lot of parameters need to be set. On top of that, individual models can be very slow to train.

In this post, you will discover how to use the grid search capability from the scikit-learn Python machine learning library to tune the hyperparameters of PyTorch deep learning models. After reading this post, you will know:

• How to wrap PyTorch models for use in scikit-learn and how to use grid search
• How to grid search common neural network hyperparameters, such as learning rate, dropout rate, epochs, and number of neurons
• How to define your own hyperparameter tuning experiments on your own projects

Let’s get started.


## Overview

In this post, you will see how to use the scikit-learn grid search capability with a suite of examples that you can copy-and-paste into your own project as a starting point. Below is a list of the topics we are going to cover:

• How to use PyTorch models in scikit-learn
• How to use grid search in scikit-learn
• How to tune batch size and training epochs
• How to tune optimization algorithms
• How to tune learning rate and momentum
• How to tune network weight initialization
• How to tune activation functions
• How to tune dropout regularization
• How to tune the number of neurons in the hidden layer

## How to Use PyTorch Models in scikit-learn

PyTorch models can be used in scikit-learn if wrapped with skorch. This leverages the duck-typing nature of Python to make the PyTorch model provide an API similar to that of a scikit-learn model, so that everything in scikit-learn can work with it. In skorch, there is NeuralNetClassifier for classification neural networks and NeuralNetRegressor for regression neural networks. You may need to install the module first, for example by running pip install skorch.

To use these wrappers, you must define your PyTorch model as a class using nn.Module, then pass the class itself (not an instance) to the module argument when constructing the NeuralNetClassifier class. For example:

The constructor for the NeuralNetClassifier class can take default arguments that are passed on to the calls to model.fit() (the way to invoke a training loop in scikit-learn models), such as the number of epochs and the batch size. For example:

The constructor for the NeuralNetClassifier class can also take new arguments that are passed to your model class’s constructor, but you have to prefix them with module__ (with two underscores). These arguments may carry a default value in the model’s constructor, but the default is overridden when the wrapper instantiates the model. For example:

You can verify the result by initializing a model and printing it:

In this example, you should see:

Kick-start your project with my book Deep Learning with PyTorch. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can translate sentences from one language to another...

## How to Use Grid Search in scikit-learn

Grid search is a model hyperparameter optimization technique. It simply exhausts all combinations of the hyperparameters and finds the one that gives the best score. In scikit-learn, this technique is provided in the GridSearchCV class. When constructing this class, you must provide a dictionary of hyperparameters to evaluate in the param_grid argument. This is a map from each model parameter name to an array of values to try.
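To make "all combinations" concrete, the sketch below expands a small dictionary with scikit-learn's ParameterGrid helper, which enumerates the same combinations a grid search iterates over. The parameter names and values are placeholders.

```python
from sklearn.model_selection import ParameterGrid

# Each key is a parameter name, each value the list of settings to try
param_grid = {
    "max_epochs": [10, 50, 100],
    "batch_size": [10, 20],
}

# Expand the dictionary into every combination (3 x 2 = 6 here)
combos = list(ParameterGrid(param_grid))
for params in combos:
    print(params)

# With k-fold cross-validation, every combination is fitted k times
n_folds = 3
n_fits = len(combos) * n_folds
print(len(combos), "combinations,", n_fits, "cross-validation fits")
```

The count grows multiplicatively with every hyperparameter added, which is why grids are usually kept small.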

By default, accuracy is the score that is optimized, but other scores can be specified in the scoring argument of the GridSearchCV constructor. The GridSearchCV process will then construct and evaluate one model for each combination of parameters. Cross-validation is used to evaluate each individual model. Recent scikit-learn versions default to 5-fold cross-validation (earlier versions used 3-fold), and you can override this by specifying the cv argument to the GridSearchCV constructor.

Below is an example of defining a simple grid search:

By setting the n_jobs argument in the GridSearchCV constructor to -1, the process will use all cores on your machine. Otherwise, the grid search process will run as a single job, which is slower on multi-core CPUs.

Once completed, you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure, and the best_params_ describes the combination of parameters that achieved the best results. You can learn more about the GridSearchCV class in the scikit-learn API documentation.


## Problem Description

Now that you know how to use PyTorch models with scikit-learn and how to use grid search in scikit-learn, let’s look at a bunch of examples.

All examples will be demonstrated on a small standard machine learning dataset called the Pima Indians onset of diabetes classification dataset. This is a small dataset with all numerical attributes that is easy to work with.
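The dataset is commonly distributed as a headerless CSV file with eight numeric input columns followed by a 0/1 class label. A loading helper might look like the sketch below; the function name load_pima, the file name in the comment, and the dtype choices are assumptions, and the two rows are made-up placeholders that only exercise the function.

```python
import io
import numpy as np


def load_pima(source):
    """Load a comma-separated file: 8 numeric inputs, then a 0/1 label."""
    dataset = np.loadtxt(source, delimiter=",")
    X = dataset[:, :8].astype(np.float32)  # PyTorch layers expect float32
    y = dataset[:, 8].astype(np.int64)     # integer class labels
    return X, y


# Typical use (hypothetical file name):
#   X, y = load_pima("pima-indians-diabetes.csv")

# Two made-up rows in the same layout, to show the resulting shapes
sample = io.StringIO(
    "1,100,70,30,0,30.5,0.5,40,1\n"
    "2,90,60,20,0,25.0,0.3,25,0\n"
)
X, y = load_pima(sample)
print(X.shape, X.dtype, y.shape, y.dtype)
```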

As you proceed through the examples in this post, you will aggregate the best parameters. This is not the best way to grid search because parameters can interact, but it is good for demonstration purposes.

## How to Tune Batch Size and Number of Epochs

In this first simple example, you will look at tuning the batch size and number of epochs used when fitting the network.

The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.

The number of epochs is the number of times the entire training dataset is shown to the network during training. Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and Convolutional Neural Networks.

Here you will evaluate a suite of different minibatch sizes from 10 to 100 in steps of 20.

The full code listing is provided below:

Running this example produces the following output:

You can see that the batch size of 10 and 100 epochs achieved the best result of about 71% accuracy (but you should also take into account the accuracy’s standard deviation).

## How to Tune the Training Optimization Algorithm

Every deep learning library should offer a variety of optimization algorithms. PyTorch is no exception.

In this example, you will tune the optimization algorithm used to train the network, each with default parameters.

This is an odd example because often, you will choose one approach a priori and instead focus on tuning its parameters on your problem (see the next example). Here, you will evaluate the suite of optimization algorithms available in PyTorch.

The full code listing is provided below:

Running this example produces the following output:

The results suggest that the Adamax optimization algorithm is the best with a score of about 72% accuracy.

It is worth mentioning that GridSearchCV recreates your model for every trial, so the trials are independent. This is possible because the NeuralNetClassifier wrapper holds the class of your PyTorch model and instantiates it for you upon request.

## How to Tune Learning Rate and Momentum

It is common to pre-select an optimization algorithm to train your network and tune its parameters.

By far, the most common optimization algorithm is plain old Stochastic Gradient Descent (SGD) because it is so well understood. In this example, you will look at optimizing the SGD learning rate and momentum parameters.

The learning rate controls how much to update the weight at the end of each batch, and the momentum controls how much to let the previous update influence the current weight update.

You will try a suite of small standard learning rates and momentum values from 0.2 to 0.8 in steps of 0.2, as well as 0.9 (because it is a popular value in practice). In PyTorch, the way to set the learning rate and momentum is the following:
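In plain PyTorch, both values are keyword arguments of the torch.optim.SGD constructor. The single linear layer below is only a stand-in for a real model, and the values are examples.

```python
import torch.nn as nn
import torch.optim as optim

# Any nn.Module works here; a single layer keeps the sketch short
model = nn.Linear(8, 1)

# lr is the step size; momentum is the weight given to the previous update
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

print(optimizer.param_groups[0]["lr"], optimizer.param_groups[0]["momentum"])
```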

In the skorch wrapper, you can route the parameters to the optimizer with the prefix optimizer__.

Generally, it is a good idea to also include the number of epochs in an optimization like this as there is a dependency between the amount of learning per batch (learning rate), the number of updates per epoch (batch size), and the number of epochs.

The full code listing is provided below:

Running this example produces the following output.

You can see that, with SGD, the best results were achieved using a learning rate of 0.001 and a momentum of 0.9 with an accuracy of about 68%.

## How to Tune Network Weight Initialization

Neural network weight initialization used to be simple: use small random values.

Now there is a suite of different techniques to choose from. You can get a laundry list from the torch.nn.init documentation.

In this example, you will look at tuning the selection of network weight initialization by evaluating all the available techniques.

You will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. In the example below, you will use a rectifier for the hidden layer and a sigmoid for the output layer because the predictions are binary. Weight initialization is implicit in PyTorch models, so you need to write your own logic to initialize the weights after a layer is created but before it is used. Let’s modify the PyTorch model as follows:
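One possible shape for this modification is sketched here. The layer names, layer sizes, and the default initializer are assumptions. The constructor accepts any in-place initializer from torch.nn.init and applies it to each layer's weight matrix right after the layer is created.

```python
import torch
import torch.nn as nn
import torch.nn.init as init


class PimaClassifier(nn.Module):
    def __init__(self, weight_init=init.xavier_uniform_):
        super().__init__()
        self.layer = nn.Linear(8, 12)
        self.act = nn.ReLU()          # rectifier in the hidden layer
        self.output = nn.Linear(12, 1)
        self.prob = nn.Sigmoid()      # sigmoid for the binary output
        # Overwrite PyTorch's implicit initialization with the chosen scheme
        weight_init(self.layer.weight)
        weight_init(self.output.weight)

    def forward(self, x):
        x = self.act(self.layer(x))
        return self.prob(self.output(x))


# Example: He-uniform initialization for both layers
model = PimaClassifier(weight_init=init.kaiming_uniform_)
print(model(torch.zeros(4, 8)).shape)
```

Only the weights are re-initialized in this sketch; the biases keep PyTorch's default initialization.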

An argument weight_init is added to the class PimaClassifier and it expects one of the initializers from torch.nn.init. In GridSearchCV, you need to use the module__ prefix to make NeuralNetClassifier route the parameter to the model’s class constructor.

The full code listing is provided below:

Running this example produces the following output.

The best results were achieved with a He-uniform weight initialization scheme achieving a performance of about 70%.

## How to Tune the Neuron Activation Function

The activation function controls the nonlinearity of individual neurons and when to fire.

Generally, the rectifier activation function is the most popular. However, it used to be the sigmoid and the tanh functions, and these functions may still be more suitable for different problems.

In this example, you will evaluate some of the activation functions available in PyTorch. You will only use these functions in the hidden layer, as a sigmoid activation function is required in the output for the binary classification problem. Similar to the previous example, this is an argument to the class constructor of the model, and you will use the module__ prefix for the GridSearchCV parameter grid.

Generally, it is a good idea to prepare data to the range of the different transfer functions, which you will not do in this case.

The full code listing is provided below:

Running this example produces the following output.

It shows that the ReLU activation function achieved the best results with an accuracy of about 70%.

## How to Tune Dropout Regularization

In this example, you will look at tuning the dropout rate for regularization in an effort to limit overfitting and improve the model’s ability to generalize.

For the best results, dropout is best combined with a weight constraint such as the max norm constraint, which in this example is implemented in the model’s forward pass function.

This involves fitting both the dropout percentage and the weight constraint. We will try dropout percentages between 0.0 and 0.9 (1.0 does not make sense) and MaxNorm weight constraint values between 0 and 5.

The full code listing is provided below.

Running this example produces the following output.

You can see that the dropout rate of 10% and the weight constraint of 2.0 resulted in the best accuracy of about 70%.

## How to Tune the Number of Neurons in the Hidden Layer

The number of neurons in a layer is an important parameter to tune. Generally the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.

Generally, a large enough single layer network can approximate any other neural network, due to the universal approximation theorem.

In this example, you will look at tuning the number of neurons in a single hidden layer. You will try values from 1 to 30 in steps of 5. A larger network requires more training, and at least the batch size and number of epochs should ideally be optimized together with the number of neurons.

The full code listing is provided below.

Running this example produces the following output.

You can see that the best results were achieved with a network with 30 neurons in the hidden layer with an accuracy of about 71%.

## Tips for Hyperparameter Optimization

This section lists some handy tips to consider when tuning hyperparameters of your neural network.

• $k$-Fold Cross-Validation. You can see that the results from the examples in this post show some variance. 3-fold cross-validation was used, but perhaps $k=5$ or $k=10$ would be more stable. Carefully choose your cross-validation configuration to ensure your results are stable.
• Review the Whole Grid. Do not just focus on the best result; review the whole grid of results and look for trends to support configuration decisions. Of course, there will be more combinations and it takes a longer time to evaluate.
• Parallelize. Use all your cores if you can. Neural networks are slow to train and we often want to try a lot of different parameters. Consider running it on a cloud platform, such as AWS.
• Use a Sample of Your Dataset. Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of general directions of parameters rather than optimal configurations.
• Start with Coarse Grids. Start with coarse-grained grids and zoom into finer grained grids once you can narrow the scope.
• Do Not Transfer Results. Results are generally problem specific. Try to avoid favorite configurations on each new problem that you see. It is unlikely that optimal results you discover on one problem will transfer to your next project. Instead look for broader trends like number of layers or relationships between parameters.
• Reproducibility is a Problem. Although we set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped PyTorch models than is presented in this post.
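On the reproducibility point, a partial measure is to seed PyTorch's random number generator as well as NumPy's before each run. The helper name seed_everything is hypothetical, and this sketch only demonstrates that the initial weights become repeatable on the CPU; data shuffling, parallel workers, and GPU kernels can still introduce differences.

```python
import random

import numpy as np
import torch
import torch.nn as nn


def seed_everything(seed):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


# Two layers created after the same seed start from identical weights
seed_everything(42)
first = nn.Linear(8, 12)
seed_everything(42)
second = nn.Linear(8, 12)

print(torch.equal(first.weight, second.weight))
```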