Hyperparameter optimization is a big part of deep learning.

The reason is that neural networks are notoriously difficult to configure and there are a lot of parameters that need to be set. On top of that, individual models can be very slow to train.

In this post you will discover how you can use the grid search capability from the scikit-learn Python machine learning library to tune the hyperparameters of Keras deep learning models.

After reading this post you will know:

- How to wrap Keras models for use in scikit-learn and how to use grid search.
- How to grid search common neural network parameters such as learning rate, dropout rate, epochs and number of neurons.
- How to define your own hyperparameter tuning experiments on your own projects.

Let’s get started.

- **Update Nov/2016**: Fixed minor issue in displaying grid search results in code examples.
- **Update Oct/2016**: Updated examples for Keras 1.1.0, TensorFlow 0.10.0 and scikit-learn v0.18.
- **Update Mar/2017**: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.

## Overview

In this post I want to show you both how you can use the scikit-learn grid search capability and give you a suite of examples that you can copy-and-paste into your own project as a starting point.

Below is a list of the topics we are going to cover:

- How to use Keras models in scikit-learn.
- How to use grid search in scikit-learn.
- How to tune batch size and training epochs.
- How to tune optimization algorithms.
- How to tune learning rate and momentum.
- How to tune network weight initialization.
- How to tune activation functions.
- How to tune dropout regularization.
- How to tune the number of neurons in the hidden layer.

## How to Use Keras Models in scikit-learn

Keras models can be used in scikit-learn by wrapping them with the **KerasClassifier** or **KerasRegressor** class.

To use these wrappers you must define a function that creates and returns your Keras sequential model, then pass this function to the **build_fn** argument when constructing the **KerasClassifier** class.

For example:

```python
def create_model():
    ...
    return model

model = KerasClassifier(build_fn=create_model)
```

The constructor for the **KerasClassifier** class can take default arguments that are passed on to the calls to **model.fit()**, such as the number of epochs and the batch size.

For example:

```python
def create_model():
    ...
    return model

model = KerasClassifier(build_fn=create_model, epochs=10)
```

The constructor for the **KerasClassifier** class can also take new arguments that can be passed to your custom **create_model()** function. These new arguments must also be defined in the signature of your **create_model()** function with default parameters.

For example:

```python
def create_model(dropout_rate=0.0):
    ...
    return model

model = KerasClassifier(build_fn=create_model, dropout_rate=0.2)
```
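To make the routing of arguments a little more concrete, here is a minimal sketch in plain Python of the general idea: keyword arguments that match the signature of the model-building function go to that function, and the rest are treated as training arguments. The helper `split_arguments` and the stand-in `create_model` are hypothetical and exist only for illustration; this is not the code of the Keras wrapper.

```python
import inspect

def create_model(dropout_rate=0.0, neurons=12):
    # Hypothetical stand-in for a function that builds a Keras model;
    # it just records the configuration it was called with.
    return {"dropout_rate": dropout_rate, "neurons": neurons}

def split_arguments(build_fn, **kwargs):
    # Names declared in build_fn's signature are routed to build_fn,
    # everything else is assumed to be meant for training (e.g. epochs).
    accepted = set(inspect.signature(build_fn).parameters)
    build_args = {k: v for k, v in kwargs.items() if k in accepted}
    fit_args = {k: v for k, v in kwargs.items() if k not in accepted}
    return build_args, fit_args

build_args, fit_args = split_arguments(create_model, dropout_rate=0.2, epochs=10)
model = create_model(**build_args)
print(build_args)  # arguments for create_model()
print(fit_args)    # arguments left over for training
```

This also suggests why new arguments need default values in the signature of **create_model()**: the function must still be callable when an argument is not supplied.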

You can learn more about the scikit-learn wrapper in Keras API documentation.

## How to Use Grid Search in scikit-learn

Grid search is a model hyperparameter optimization technique.

In scikit-learn this technique is provided in the **GridSearchCV** class.

When constructing this class you must provide a dictionary of hyperparameters to evaluate in the **param_grid** argument. This is a map of the model parameter name and an array of values to try.

By default, accuracy is the score that is optimized, but other scores can be specified in the **scoring** argument of the **GridSearchCV** constructor.

By default, the grid search will only use one thread. By setting the **n_jobs** argument in the **GridSearchCV** constructor to -1, the process will use all cores on your machine. Depending on your Keras backend, this may interfere with the main neural network training process.

The **GridSearchCV** process will then construct and evaluate one model for each combination of parameters. Cross validation is used to evaluate each individual model. The examples in this post rely on the default of 3-fold cross validation from the scikit-learn version they were written for (more recent releases default to 5-fold); this can be overridden by specifying the **cv** argument to the **GridSearchCV** constructor.
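To get a feel for how quickly the work adds up, the following sketch enumerates the combinations of a small grid with the standard library and counts the model fits needed. The grid values and the fold count are assumptions chosen for illustration; the sketch mirrors the idea of an exhaustive grid, not the internals of **GridSearchCV**.

```python
from itertools import product

# Hypothetical grid: two hyperparameters with a few candidate values each.
param_grid = {"batch_size": [10, 20, 40], "epochs": [10, 50]}
cv_folds = 3  # assumed number of cross-validation folds

# Every combination of values is one candidate configuration.
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[n] for n in names))]

# Each candidate is trained once per fold.
total_fits = len(combinations) * cv_folds

for combo in combinations:
    print(combo)
print("candidates:", len(combinations), "model fits:", total_fits)
```

Adding a third hyperparameter with five values would multiply the number of fits by five, which is why grids are best kept small for slow models.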

Below is an example of defining a simple grid search:

```python
param_grid = dict(epochs=[10, 20, 30])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
```

Once completed, you can access the outcome of the grid search in the result object returned from **grid.fit()**. The **best_score_** member provides access to the best score observed during the optimization procedure and the **best_params_** describes the combination of parameters that achieved the best results.
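The relationship between the per-configuration scores and the best result can be sketched in a few lines. The dictionary below only imitates the layout of the `mean_test_score` and `params` entries of a grid search result, and the numbers in it are made up for illustration.

```python
# Hypothetical results laid out like the 'mean_test_score' and 'params'
# entries of a grid search result; the numbers are made up for illustration.
cv_results = {
    "mean_test_score": [0.65, 0.70, 0.68],
    "params": [{"epochs": 10}, {"epochs": 20}, {"epochs": 30}],
}

# The best configuration is the one with the highest mean score.
best_index = max(range(len(cv_results["params"])),
                 key=lambda i: cv_results["mean_test_score"][i])
best_score = cv_results["mean_test_score"][best_index]
best_params = cv_results["params"][best_index]

print("Best: %f using %s" % (best_score, best_params))
```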

You can learn more about the GridSearchCV class in the scikit-learn API documentation.

## Problem Description

Now that we know how to use Keras models with scikit-learn and how to use grid search in scikit-learn, let’s look at a bunch of examples.

All examples will be demonstrated on a small standard machine learning dataset called the Pima Indians onset of diabetes classification dataset. This is a small dataset with all numerical attributes that is easy to work with.

- Download the dataset and place it in your current working directory with the name **pima-indians-diabetes.csv**.

As we proceed through the examples in this post, we will aggregate the best parameters. This is not the best way to grid search because parameters can interact, but it is good for demonstration purposes.
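The following toy sketch shows how tuning one parameter at a time can miss the best combination when parameters interact. The score table is invented purely for illustration and does not come from any experiment in this post.

```python
from itertools import product

# Hypothetical validation scores for two interacting hyperparameters.
# The values are invented purely to illustrate the effect.
scores = {
    ("sgd", 0.01): 0.66, ("sgd", 0.1): 0.64,
    ("adam", 0.01): 0.65, ("adam", 0.1): 0.72,
}
optimizers = ["sgd", "adam"]
learn_rates = [0.01, 0.1]

# One-at-a-time search: tune the optimizer with a fixed learning rate,
# then tune the learning rate with the chosen optimizer.
default_rate = 0.01
best_opt = max(optimizers, key=lambda o: scores[(o, default_rate)])
best_rate = max(learn_rates, key=lambda r: scores[(best_opt, r)])
sequential_choice = (best_opt, best_rate)

# Joint search: evaluate every combination.
joint_choice = max(product(optimizers, learn_rates), key=lambda c: scores[c])

print("one at a time:", sequential_choice, scores[sequential_choice])
print("joint grid:   ", joint_choice, scores[joint_choice])
```

In this made-up table the one-at-a-time search settles on a weaker configuration because the better optimizer only shines at the larger learning rate.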

### Note on Parallelizing Grid Search

All examples are configured to use parallelism (**n_jobs=-1**).

If you get an error like the one below:

```
INFO (theano.gof.compilelock): Waiting for existing lock by process '55614' (I am process '55613')
INFO (theano.gof.compilelock): To manually release the lock, delete ...
```

Kill the process and change the code so that the grid search does not run in parallel by setting **n_jobs=1**.

## How to Tune Batch Size and Number of Epochs

In this first simple example, we look at tuning the batch size and number of epochs used when fitting the network.

The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.

The number of epochs is the number of times that the entire training dataset is shown to the network during training. Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and Convolutional Neural Networks.
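As a rough illustration of how the two settings combine, the sketch below counts the weight updates performed for a given batch size and number of epochs. The helper `weight_updates` and the training set size of 512 rows are assumptions made for the example.

```python
import math

def weight_updates(n_samples, batch_size, epochs):
    # One weight update per batch; the last batch of an epoch may be smaller.
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

n_samples = 512  # assumed training set size, for illustration only
for batch_size in [10, 20, 40, 60, 80, 100]:
    for epochs in [10, 50, 100]:
        print(batch_size, epochs, weight_updates(n_samples, batch_size, epochs))
```

Small batches with many epochs mean far more updates (and usually more training time) than large batches with few epochs, which is one reason the two are worth tuning together.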

Here we will evaluate a suite of different mini-batch sizes from 10 to 100, each combined with 10, 50 and 100 training epochs.

The full code listing is provided below.

```python
# Use scikit-learn to grid search the batch size and epochs
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, verbose=0)
# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```

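Running this example produces the following output. Note that the number of epochs is reported as nb_epoch, the name this parameter had in older Keras releases.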

```
Best: 0.686198 using {'nb_epoch': 100, 'batch_size': 20}
0.348958 (0.024774) with: {'nb_epoch': 10, 'batch_size': 10}
0.348958 (0.024774) with: {'nb_epoch': 50, 'batch_size': 10}
0.466146 (0.149269) with: {'nb_epoch': 100, 'batch_size': 10}
0.647135 (0.021236) with: {'nb_epoch': 10, 'batch_size': 20}
0.660156 (0.014616) with: {'nb_epoch': 50, 'batch_size': 20}
0.686198 (0.024774) with: {'nb_epoch': 100, 'batch_size': 20}
0.489583 (0.075566) with: {'nb_epoch': 10, 'batch_size': 40}
0.652344 (0.019918) with: {'nb_epoch': 50, 'batch_size': 40}
0.654948 (0.027866) with: {'nb_epoch': 100, 'batch_size': 40}
0.518229 (0.032264) with: {'nb_epoch': 10, 'batch_size': 60}
0.605469 (0.052213) with: {'nb_epoch': 50, 'batch_size': 60}
0.665365 (0.004872) with: {'nb_epoch': 100, 'batch_size': 60}
0.537760 (0.143537) with: {'nb_epoch': 10, 'batch_size': 80}
0.591146 (0.094954) with: {'nb_epoch': 50, 'batch_size': 80}
0.658854 (0.054904) with: {'nb_epoch': 100, 'batch_size': 80}
0.402344 (0.107735) with: {'nb_epoch': 10, 'batch_size': 100}
0.652344 (0.033299) with: {'nb_epoch': 50, 'batch_size': 100}
0.542969 (0.157934) with: {'nb_epoch': 100, 'batch_size': 100}
```

We can see that the batch size of 20 and 100 epochs achieved the best result of about 68% accuracy.

## How to Tune the Training Optimization Algorithm

Keras offers a suite of different state-of-the-art optimization algorithms.

In this example, we tune the optimization algorithm used to train the network, each with default parameters.

This is an odd example, because often you will choose one approach a priori and instead focus on tuning its parameters on your problem (e.g. see the next example).

Here we will evaluate the suite of optimization algorithms supported by the Keras API.

The full code listing is provided below.

```python
# Use scikit-learn to grid search the optimization algorithm
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```

Running this example produces the following output.

```
Best: 0.704427 using {'optimizer': 'Adam'}
0.348958 (0.024774) with: {'optimizer': 'SGD'}
0.348958 (0.024774) with: {'optimizer': 'RMSprop'}
0.471354 (0.156586) with: {'optimizer': 'Adagrad'}
0.669271 (0.029635) with: {'optimizer': 'Adadelta'}
0.704427 (0.031466) with: {'optimizer': 'Adam'}
0.682292 (0.016367) with: {'optimizer': 'Adamax'}
0.703125 (0.003189) with: {'optimizer': 'Nadam'}
```

The results suggest that the Adam optimization algorithm is the best with a score of about 70% accuracy.

## How to Tune Learning Rate and Momentum

It is common to pre-select an optimization algorithm to train your network and tune its parameters.

By far the most common optimization algorithm is plain old Stochastic Gradient Descent (SGD) because it is so well understood. In this example, we will look at optimizing the SGD learning rate and momentum parameters.

Learning rate controls how much to update the weight at the end of each batch and the momentum controls how much to let the previous update influence the current weight update.
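A minimal sketch of the classical momentum update may help make the two parameters concrete. It minimizes a toy one-dimensional loss rather than training a network; the function names and the toy loss are assumptions made for illustration.

```python
def sgd_momentum_step(weight, velocity, gradient, learn_rate, momentum):
    # Classical momentum: blend the previous update with the new gradient step.
    velocity = momentum * velocity - learn_rate * gradient
    return weight + velocity, velocity

def minimize(learn_rate, momentum, steps=50):
    # Toy loss f(w) = w**2 with gradient 2*w, starting from w = 1.0.
    weight, velocity = 1.0, 0.0
    for _ in range(steps):
        weight, velocity = sgd_momentum_step(
            weight, velocity, 2 * weight, learn_rate, momentum)
    return weight

# Distance from the optimum (w = 0) after 50 steps for a few momentum values.
for momentum in [0.0, 0.5, 0.9]:
    print(momentum, abs(minimize(learn_rate=0.01, momentum=momentum)))
```

With a momentum of 0.0 the update reduces to plain gradient descent, where each step only depends on the current gradient and the learning rate.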

We will try a suite of small standard learning rates and momentum values from 0.0 to 0.8 in steps of 0.2, as well as 0.9 (because it can be a popular value in practice).

Generally, it is a good idea to also include the number of epochs in an optimization like this as there is a dependency between the amount of learning per batch (learning rate), the number of updates per epoch (batch size) and the number of epochs.

The full code listing is provided below.

```python
# Use scikit-learn to grid search the learning rate and momentum
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD

# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
param_grid = dict(learn_rate=learn_rate, momentum=momentum)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```

Running this example produces the following output.

```
Best: 0.680990 using {'learn_rate': 0.01, 'momentum': 0.0}
0.348958 (0.024774) with: {'learn_rate': 0.001, 'momentum': 0.0}
0.348958 (0.024774) with: {'learn_rate': 0.001, 'momentum': 0.2}
0.467448 (0.151098) with: {'learn_rate': 0.001, 'momentum': 0.4}
0.662760 (0.012075) with: {'learn_rate': 0.001, 'momentum': 0.6}
0.669271 (0.030647) with: {'learn_rate': 0.001, 'momentum': 0.8}
0.666667 (0.035564) with: {'learn_rate': 0.001, 'momentum': 0.9}
0.680990 (0.024360) with: {'learn_rate': 0.01, 'momentum': 0.0}
0.677083 (0.026557) with: {'learn_rate': 0.01, 'momentum': 0.2}
0.427083 (0.134575) with: {'learn_rate': 0.01, 'momentum': 0.4}
0.427083 (0.134575) with: {'learn_rate': 0.01, 'momentum': 0.6}
0.544271 (0.146518) with: {'learn_rate': 0.01, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.9}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.0}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.2}
0.572917 (0.134575) with: {'learn_rate': 0.1, 'momentum': 0.4}
0.572917 (0.134575) with: {'learn_rate': 0.1, 'momentum': 0.6}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.9}
0.533854 (0.149269) with: {'learn_rate': 0.2, 'momentum': 0.0}
0.427083 (0.134575) with: {'learn_rate': 0.2, 'momentum': 0.2}
0.427083 (0.134575) with: {'learn_rate': 0.2, 'momentum': 0.4}
0.651042 (0.024774) with: {'learn_rate': 0.2, 'momentum': 0.6}
0.651042 (0.024774) with: {'learn_rate': 0.2, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.2, 'momentum': 0.9}
0.455729 (0.146518) with: {'learn_rate': 0.3, 'momentum': 0.0}
0.455729 (0.146518) with: {'learn_rate': 0.3, 'momentum': 0.2}
0.455729 (0.146518) with: {'learn_rate': 0.3, 'momentum': 0.4}
0.348958 (0.024774) with: {'learn_rate': 0.3, 'momentum': 0.6}
0.348958 (0.024774) with: {'learn_rate': 0.3, 'momentum': 0.8}
0.348958 (0.024774) with: {'learn_rate': 0.3, 'momentum': 0.9}
```

We can see that SGD does relatively poorly on this problem; nevertheless, the best results were achieved using a learning rate of 0.01 and a momentum of 0.0, with an accuracy of about 68%.

## How to Tune Network Weight Initialization

Neural network weight initialization used to be simple: use small random values.

Now there is a suite of different techniques to choose from. Keras provides a laundry list.

In this example, we will look at tuning the selection of network weight initialization by evaluating all of the available techniques.

We will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. In the example below we use rectifier for the hidden layer. We use sigmoid for the output layer because the predictions are binary.
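To show what distinguishes some of these schemes, the sketch below computes the sampling range of three uniform initializers from the layer's fan-in and fan-out, using the textbook formulas usually given for them. The helper names are hypothetical, and the layer shape (8 inputs, 12 hidden units) is taken from the examples in this post.

```python
import math
import random

def uniform_limit(scheme, fan_in, fan_out):
    # Half-width of the uniform sampling range for a few common schemes.
    if scheme == "glorot_uniform":
        return math.sqrt(6.0 / (fan_in + fan_out))
    if scheme == "he_uniform":
        return math.sqrt(6.0 / fan_in)
    if scheme == "lecun_uniform":
        return math.sqrt(3.0 / fan_in)
    raise ValueError("unknown scheme: %s" % scheme)

def init_weights(scheme, fan_in, fan_out, rng):
    # Draw a fan_in x fan_out weight matrix from U(-limit, limit).
    limit = uniform_limit(scheme, fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(7)
for scheme in ["glorot_uniform", "he_uniform", "lecun_uniform"]:
    weights = init_weights(scheme, fan_in=8, fan_out=12, rng=rng)
    print(scheme, round(uniform_limit(scheme, 8, 12), 4))
```

The schemes differ mainly in how wide a range they draw from for a given layer size, which is why the best choice can depend on the activation function that follows.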

The full code listing is provided below.

```python
# Use scikit-learn to grid search the weight initialization
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(init_mode='uniform'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer=init_mode, activation='relu'))
    model.add(Dense(1, kernel_initializer=init_mode, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(init_mode=init_mode)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```

Running this example produces the following output.

```
Best: 0.720052 using {'init_mode': 'uniform'}
0.720052 (0.024360) with: {'init_mode': 'uniform'}
0.348958 (0.024774) with: {'init_mode': 'lecun_uniform'}
0.712240 (0.012075) with: {'init_mode': 'normal'}
0.651042 (0.024774) with: {'init_mode': 'zero'}
0.700521 (0.010253) with: {'init_mode': 'glorot_normal'}
0.674479 (0.011201) with: {'init_mode': 'glorot_uniform'}
0.661458 (0.028940) with: {'init_mode': 'he_normal'}
0.678385 (0.004872) with: {'init_mode': 'he_uniform'}
```

We can see that the best results were achieved with a uniform weight initialization scheme achieving a performance of about 72%.

## How to Tune the Neuron Activation Function

The activation function controls the non-linearity of individual neurons and when to fire.

Generally, the rectifier activation function is the most popular, but it used to be the sigmoid and the tanh functions and these functions may still be more suitable for different problems.

In this example, we will evaluate the suite of different activation functions available in Keras. We will only use these functions in the hidden layer, as we require a sigmoid activation function in the output for the binary classification problem.

Generally, it is a good idea to prepare data to the range of the different transfer functions, which we will not do in this case.
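For reference, here is a small sketch with textbook definitions of several common activation functions, evaluated at a few points so their output ranges are visible. The `ACTIVATIONS` dictionary is a name made up for this example and these are not the Keras implementations.

```python
import math

# Textbook definitions of some common activation functions.
ACTIVATIONS = {
    "linear": lambda x: x,
    "relu": lambda x: max(0.0, x),
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
    "softplus": lambda x: math.log1p(math.exp(x)),
    "softsign": lambda x: x / (1.0 + abs(x)),
}

# Evaluate each function at a negative, zero and positive input.
for name, fn in ACTIVATIONS.items():
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```

The sigmoid stays between 0 and 1 while tanh and softsign stay between -1 and 1, which is the kind of range that input scaling is usually matched to.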

The full code listing is provided below.

```python
# Use scikit-learn to grid search the activation function
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(activation='relu'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation=activation))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(activation=activation)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```

Running this example produces the following output.

```
Best: 0.722656 using {'activation': 'linear'}
0.649740 (0.009744) with: {'activation': 'softmax'}
0.720052 (0.032106) with: {'activation': 'softplus'}
0.688802 (0.019225) with: {'activation': 'softsign'}
0.720052 (0.018136) with: {'activation': 'relu'}
0.691406 (0.019401) with: {'activation': 'tanh'}
0.680990 (0.009207) with: {'activation': 'sigmoid'}
0.691406 (0.014616) with: {'activation': 'hard_sigmoid'}
0.722656 (0.003189) with: {'activation': 'linear'}
```

Surprisingly (to me at least), the ‘linear’ activation function achieved the best results with an accuracy of about 72%.

## How to Tune Dropout Regularization

In this example, we will look at tuning the dropout rate for regularization in an effort to limit overfitting and improve the model’s ability to generalize.

To get good results, dropout is best combined with a weight constraint such as the max norm constraint.

This involves fitting both the dropout rate and the weight constraint. We will try dropout rates between 0.0 and 0.9 (1.0 does not make sense) and maxnorm weight constraint values between 1 and 5.
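The two techniques can be sketched in plain Python to show what each parameter controls. The functions below are simplified illustrations written for this purpose (inverted dropout on a list of activations and a max-norm rescale of one neuron's weight vector), not the Keras layers themselves.

```python
import math
import random

def dropout(activations, rate, rng):
    # Inverted dropout: zero each value with probability `rate` and
    # scale the survivors so the expected sum is unchanged.
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def max_norm(weights, max_value):
    # Rescale a neuron's incoming weight vector if its L2 norm exceeds max_value.
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    return [w * max_value / norm for w in weights]

rng = random.Random(7)
print(dropout([1.0] * 10, rate=0.2, rng=rng))
print(max_norm([3.0, 4.0], max_value=4))  # norm 5 is scaled down to 4
print(max_norm([0.3, 0.4], max_value=4))  # norm 0.5 is left unchanged
```

A dropout rate of 1.0 would zero every activation, which is why it is left out of the grid.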

The full code listing is provided below.

```python
# Use scikit-learn to grid search the dropout rate
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm

# Function to create model, required for KerasClassifier
def create_model(dropout_rate=0.0, weight_constraint=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='linear', kernel_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
weight_constraint = [1, 2, 3, 4, 5]
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(dropout_rate=dropout_rate, weight_constraint=weight_constraint)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```

Running this example produces the following output.

```
Best: 0.723958 using {'dropout_rate': 0.2, 'weight_constraint': 4}
0.696615 (0.031948) with: {'dropout_rate': 0.0, 'weight_constraint': 1}
0.696615 (0.031948) with: {'dropout_rate': 0.0, 'weight_constraint': 2}
0.691406 (0.026107) with: {'dropout_rate': 0.0, 'weight_constraint': 3}
0.708333 (0.009744) with: {'dropout_rate': 0.0, 'weight_constraint': 4}
0.708333 (0.009744) with: {'dropout_rate': 0.0, 'weight_constraint': 5}
0.710937 (0.008438) with: {'dropout_rate': 0.1, 'weight_constraint': 1}
0.709635 (0.007366) with: {'dropout_rate': 0.1, 'weight_constraint': 2}
0.709635 (0.007366) with: {'dropout_rate': 0.1, 'weight_constraint': 3}
0.695312 (0.012758) with: {'dropout_rate': 0.1, 'weight_constraint': 4}
0.695312 (0.012758) with: {'dropout_rate': 0.1, 'weight_constraint': 5}
0.701823 (0.017566) with: {'dropout_rate': 0.2, 'weight_constraint': 1}
0.710938 (0.009568) with: {'dropout_rate': 0.2, 'weight_constraint': 2}
0.710938 (0.009568) with: {'dropout_rate': 0.2, 'weight_constraint': 3}
0.723958 (0.027126) with: {'dropout_rate': 0.2, 'weight_constraint': 4}
0.718750 (0.030425) with: {'dropout_rate': 0.2, 'weight_constraint': 5}
0.721354 (0.032734) with: {'dropout_rate': 0.3, 'weight_constraint': 1}
0.707031 (0.036782) with: {'dropout_rate': 0.3, 'weight_constraint': 2}
0.707031 (0.036782) with: {'dropout_rate': 0.3, 'weight_constraint': 3}
0.694010 (0.019225) with: {'dropout_rate': 0.3, 'weight_constraint': 4}
0.709635 (0.006639) with: {'dropout_rate': 0.3, 'weight_constraint': 5}
0.704427 (0.008027) with: {'dropout_rate': 0.4, 'weight_constraint': 1}
0.717448 (0.031304) with: {'dropout_rate': 0.4, 'weight_constraint': 2}
0.718750 (0.030425) with: {'dropout_rate': 0.4, 'weight_constraint': 3}
0.718750 (0.030425) with: {'dropout_rate': 0.4, 'weight_constraint': 4}
0.722656 (0.029232) with: {'dropout_rate': 0.4, 'weight_constraint': 5}
0.720052 (0.028940) with: {'dropout_rate': 0.5, 'weight_constraint': 1}
0.703125 (0.009568) with: {'dropout_rate': 0.5, 'weight_constraint': 2}
0.716146 (0.029635) with: {'dropout_rate': 0.5, 'weight_constraint': 3}
0.709635 (0.008027) with: {'dropout_rate': 0.5, 'weight_constraint': 4}
0.703125 (0.011500) with: {'dropout_rate': 0.5, 'weight_constraint': 5}
0.707031 (0.017758) with: {'dropout_rate': 0.6, 'weight_constraint': 1}
0.701823 (0.018688) with: {'dropout_rate': 0.6, 'weight_constraint': 2}
0.701823 (0.018688) with: {'dropout_rate': 0.6, 'weight_constraint': 3}
0.690104 (0.027498) with: {'dropout_rate': 0.6, 'weight_constraint': 4}
0.695313 (0.022326) with: {'dropout_rate': 0.6, 'weight_constraint': 5}
0.697917 (0.014382) with: {'dropout_rate': 0.7, 'weight_constraint': 1}
0.697917 (0.014382) with: {'dropout_rate': 0.7, 'weight_constraint': 2}
0.687500 (0.008438) with: {'dropout_rate': 0.7, 'weight_constraint': 3}
0.704427 (0.011201) with: {'dropout_rate': 0.7, 'weight_constraint': 4}
0.696615 (0.016367) with: {'dropout_rate': 0.7, 'weight_constraint': 5}
0.680990 (0.025780) with: {'dropout_rate': 0.8, 'weight_constraint': 1}
0.699219 (0.019401) with: {'dropout_rate': 0.8, 'weight_constraint': 2}
0.701823 (0.015733) with: {'dropout_rate': 0.8, 'weight_constraint': 3}
0.684896 (0.023510) with: {'dropout_rate': 0.8, 'weight_constraint': 4}
0.696615 (0.017566) with: {'dropout_rate': 0.8, 'weight_constraint': 5}
0.653646 (0.034104) with: {'dropout_rate': 0.9, 'weight_constraint': 1}
0.677083 (0.012075) with: {'dropout_rate': 0.9, 'weight_constraint': 2}
0.679688 (0.013902) with: {'dropout_rate': 0.9, 'weight_constraint': 3}
0.669271 (0.017566) with: {'dropout_rate': 0.9, 'weight_constraint': 4}
0.669271 (0.012075) with: {'dropout_rate': 0.9, 'weight_constraint': 5}
```

We can see that a dropout rate of 0.2 (20%) and a maxnorm weight constraint of 4 resulted in the best accuracy of about 72%.

## How to Tune the Number of Neurons in the Hidden Layer

The number of neurons in a layer is an important parameter to tune. Generally the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.

Also, generally, a large enough single layer network can approximate any other neural network, at least in theory.

In this example, we will look at tuning the number of neurons in a single hidden layer. We will try 1 neuron and then values from 5 to 30 in steps of 5.

A larger network requires more training and at least the batch size and number of epochs should ideally be optimized with the number of neurons.
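One way to see how the size of the model grows with the hidden layer is to count its trainable parameters. The helper below is a hypothetical sketch for a network with one fully connected hidden layer and one output, as used in this post, with 8 input features.

```python
def parameter_count(n_inputs, n_hidden, n_outputs=1):
    # Each dense layer has one weight per input-output pair plus one bias per output.
    hidden = (n_inputs + 1) * n_hidden
    output = (n_hidden + 1) * n_outputs
    return hidden + output

# 8 input features, as in the dataset used throughout this post.
for neurons in [1, 5, 10, 15, 20, 25, 30]:
    print(neurons, parameter_count(8, neurons))
```

The count grows linearly with the number of hidden neurons here, so the largest network in the grid has many times more weights to fit than the smallest.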

The full code listing is provided below.

```python
# Use scikit-learn to grid search the number of neurons
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm

# Function to create model, required for KerasClassifier
def create_model(neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(neurons, input_dim=8, kernel_initializer='uniform', activation='linear', kernel_constraint=maxnorm(4)))
    model.add(Dropout(0.2))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
neurons = [1, 5, 10, 15, 20, 25, 30]
param_grid = dict(neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```

Running this example produces the following output.

    Best: 0.714844 using {'neurons': 5}
    0.700521 (0.011201) with: {'neurons': 1}
    0.714844 (0.011049) with: {'neurons': 5}
    0.712240 (0.017566) with: {'neurons': 10}
    0.705729 (0.003683) with: {'neurons': 15}
    0.696615 (0.020752) with: {'neurons': 20}
    0.713542 (0.025976) with: {'neurons': 25}
    0.705729 (0.008027) with: {'neurons': 30}

We can see that the best results were achieved with a network with 5 neurons in the hidden layer with an accuracy of about 71%.

## Tips for Hyperparameter Optimization

This section lists some handy tips to consider when tuning hyperparameters of your neural network.

- **k-fold Cross Validation**. You can see that the results from the examples in this post show some variance. A default cross-validation of 3 was used, but perhaps k=5 or k=10 would be more stable. Carefully choose your cross validation configuration to ensure your results are stable.
- **Review the Whole Grid**. Do not just focus on the best result; review the whole grid of results and look for trends to support configuration decisions.
- **Parallelize**. Use all your cores if you can; neural networks are slow to train and we often want to try a lot of different parameters. Consider spinning up a lot of AWS instances.
- **Use a Sample of Your Dataset**. Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of general directions of parameters rather than optimal configurations.
- **Start with Coarse Grids**. Start with coarse-grained grids and zoom into finer grained grids once you can narrow the scope.
- **Do not Transfer Results**. Results are generally problem specific. Try to avoid favorite configurations on each new problem that you see. It is unlikely that optimal results you discover on one problem will transfer to your next project. Instead look for broader trends like number of layers or relationships between parameters.
- **Reproducibility is a Problem**. Although we set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped Keras models than is presented in this post.
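As a rough illustration of the coarse-to-fine idea, the helper below builds a finer integer grid around the best value from a coarse search. `refine_int_grid` is a hypothetical name invented for this sketch and is not part of scikit-learn or Keras; the coarse values are the neuron counts searched earlier in this post.

```python
def refine_int_grid(best, coarse_step, fine_step=1, minimum=1):
    """Integer values within one coarse step of `best`, spaced by `fine_step`."""
    low = max(minimum, best - coarse_step)
    high = best + coarse_step
    return list(range(low, high + 1, fine_step))

coarse = [1, 5, 10, 15, 20, 25, 30]  # grid used in the neurons example
best_coarse = 5                      # best value reported for that grid

# Zoom in around the best coarse value with a spacing of 2.
fine = refine_int_grid(best_coarse, coarse_step=5, fine_step=2)
print(fine)
```

The refined list could then be passed as the next `param_grid`. Given the variance between runs noted in the tips, it may be worth checking that the coarse winner is stable before zooming in.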

## Summary

In this post, you discovered how you can tune the hyperparameters of your deep learning networks in Python using Keras and scikit-learn.

Specifically, you learned:

- How to wrap Keras models for use in scikit-learn and how to use grid search.
- How to grid search a suite of different standard neural network parameters for Keras models.
- How to design your own hyperparameter optimization experiments.

Do you have any experience tuning hyperparameters of large neural networks? Please share your stories below.

Do you have any questions about hyperparameter optimization of neural networks or about this post? Ask your questions in the comments and I will do my best to answer.

As always, excellent post. I’ve been doing some hyper-parameter optimization by hand, but I’ll definitely give Grid Search a try.

Is it possible to set up a different threshold for sigmoid output in Keras? Rather than using 0.5, I was thinking of trying 0.7 or 0.8.

Thanks Yanbo.

I don’t think so, but you could implement your own activation function and do anything you wish.

My question is related to this thread. How do I get the probabilities as the output? I don’t want the class output. I read that for a regression problem no activation function is needed in the output layer. Will a similar implementation give me the probabilities, or will the output fall outside the range 0 to 1?

Hi Shudhan, you can use a sigmoid activation and treat the outputs like probabilities (they will be in the range of 0-1).
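Another option for the threshold question above, offered as a sketch rather than as something taken from the replies, is to leave the network unchanged and apply the cut-off yourself to the predicted probabilities. The probabilities below are made up and stand in for the sigmoid outputs of a trained model; `to_class_labels` is a hypothetical helper name.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs in [0, 1] to 0/1 labels using a cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Made-up probabilities standing in for the sigmoid outputs of a model.
probs = [0.10, 0.55, 0.72, 0.81, 0.69]

default_labels = to_class_labels(probs)      # cut-off 0.5
strict_labels = to_class_labels(probs, 0.7)  # cut-off 0.7

print(default_labels)
print(strict_labels)
```

Raising the threshold trades recall for precision on the positive class, so the choice is usually guided by which kind of error matters more for the problem.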

Sounds awesome! Will this grid search method use the full CPU (which can be 8/16 cores)?

It can if you set n_jobs=-1

Hi,

Great post,

Can I use these tips on CNNs in Keras as well?

Thanks!

They can be a start, but remember it is a good idea to use a repeating structure in a large CNN and you will need to tune the number of filters and pool size.

Hi Jason, First of all great post! I applied this by dividing the data into train and test and used train dataset for grid fit. Plan was to capture best parameters in train and apply them on test to see accuracy. But it seems grid.fit and model.fit applied with same parameters on same dataset (in this case train) give different accuracy results. Any idea why this happens. I can share the code if it helps.

You will see small variation in the performance of a neural net with the same parameters from run to run. This is because of the stochastic nature of the technique and how very hard it is to fix the random number seed successfully in python/numpy/theano.

You will also see small variation due to the data used to train the method.

Generally, you could use all of your data to grid search to try to reduce the second type of variation (slower). You could store results and use statistical significance tests to compare populations of results to see if differences are significant, to sort out the first type of variation.
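A minimal sketch of comparing two populations of results is shown below, using Welch's t statistic computed with the standard library. The accuracy values are invented placeholders, not results from this post, and `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Placeholder accuracies from repeated runs of two configurations.
# These numbers are invented for illustration only.
config_a = [0.70, 0.72, 0.71, 0.69, 0.73]
config_b = [0.66, 0.68, 0.67, 0.65, 0.69]

t_stat = welch_t(config_a, config_b)
print("Welch t statistic: %.3f" % t_stat)
```

The statistic alone is not a p-value. If SciPy is available, `scipy.stats.ttest_ind(config_a, config_b, equal_var=False)` returns both the statistic and the p-value.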

I hope that helps.

Hi, I think this is the best tutorial I have ever found on the web. Thanks for sharing. Is it possible to use these tips on LSTM, BiLSTM and CNN-LSTM models?

Thanks Vinay, I’m glad it’s useful.

Absolutely, you could use these tactics on other algorithm types.

Best place to learn the tuning.. my question – is it good to follow the order you mentioned to tune the parameters? I know the most significant parameters should be tuned first

Thanks. The order is a good start. It is best to focus on areas where you think you will get the biggest improvement first – which is often the structure of the network (layers and neurons).

When I use the categorical_crossentropy loss function and run the grid search with n_jobs greater than 1, it throws the error “cannot pickle object class”, but the same thing works fine with the binary_crossentropy loss. Can you tell me if I am making any mistake in my code:

    def create_model(optimizer='adam'):
        # create model
        model.add(Dense(30, input_dim=59, init='normal', activation='relu'))
        model.add(Dense(15, init='normal', activation='sigmoid'))
        model.add(Dense(3, init='normal', activation='sigmoid'))
        # Compile model
        model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
        return model

    # Create Keras Classifier
    print "--------- Running Grid Search on Keras Classifier for epochs and batch ------"
    clf = model = KerasClassifier(build_fn = create_model, verbose=0)
    param_grid = {"batch_size":range(10, 30, 10), "nb_epoch":range(50, 150, 50)}
    optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=4)
    grid_result = grid.fit(x_train, y_train)
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Strange Satheesh, I have not seen that before.

Let me know if you figure it out.

excellent post, thanks. It’s been very helpful to get me started on hyperparameterisation.

One thing I haven’t been able to do yet is to grid search over parameters that belong not to the NN but to the training set. For example, I can tune the input_dim parameter by creating a function generator which takes care of creating the function that will create the model, like this:

    # fp_subset is a subset of columns of my whole training set.
    create_basic_ANN_model = kt.ANN_model_gen(  # defined elsewhere
        input_dim=len(fp_subset), output_dim=1, layers_num=2, layers_sizes=[len(fp_subset)/5, len(fp_subset)/10, ],
        loss='mean_squared_error', optimizer='adadelta', metrics=['mean_squared_error', 'mean_absolute_error']
    )
    model = KerasRegressor(build_fn=create_basic_ANN_model, verbose=1)
    # define the grid search parameters
    batch_size = [10, 100]
    epochs = [5, 10]
    param_grid = dict(batch_size=batch_size, nb_epoch=epochs)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1, cv=7)
    grid_results = grid.fit(trX, trY)

This works, but only as a for loop over the different fp_subset values, which I must define manually.

I could easily pick the best out of every run, but it would be great if I could fold them all inside one big grid definition and fit, so as to automatically pick the largest.

However, until now I haven’t been able to figure out a way to do that.

If the wrapper function is useful to anyone, I can post a generalised version here.

Good question.

You might just need to use a loop around the whole lot for different projections/views of your training data.

Thanks. I ended up coding my own for loop, saving the results of each grid in a dict, sorting the hash by the performance metrics, and picking the best model.

Now, the next question is: How do I save the model’s architecture and weights to a .json .hdf5 file? I know how to do that for a simple model. But how do I extract the best model out of the gridsearch results?

Well done.

No need. Once you know the parameters, you can use them to train a new standalone model on all of your training data and start making predictions.
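One way to turn the search result into a standalone model, sketched here under assumptions, is to split the best-parameters dictionary into arguments for the model-building function and arguments for training. The dictionary below is shaped like `grid_result.best_params_` but its values are only illustrative, and `split_best_params` is a hypothetical helper.

```python
# Parameter names consumed by the Keras fit() call; everything else is
# assumed to be an argument of the user-defined create_model() function.
FIT_KEYS = {"epochs", "batch_size"}

def split_best_params(best_params):
    """Separate grid-search results into build arguments and fit arguments."""
    build_args = {k: v for k, v in best_params.items() if k not in FIT_KEYS}
    fit_args = {k: v for k, v in best_params.items() if k in FIT_KEYS}
    return build_args, fit_args

# Example dictionary shaped like grid_result.best_params_ (values illustrative).
best_params = {"neurons": 5, "epochs": 100, "batch_size": 10}
build_args, fit_args = split_best_params(best_params)

# With Keras available, the final model would then be trained along the lines of:
#   model = create_model(**build_args)
#   model.fit(X, Y, **fit_args)
print(build_args, fit_args)
```

The resulting Keras model can be saved with the usual `to_json()` and `save_weights()` calls, since it is an ordinary model rather than a wrapped one.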

I may have found a way. How about this?

    best_model = grid_result.best_estimator_.model
    best_model_file_path = 'your_pick_here'
    model2json = best_model.to_json()
    with open(best_model_file_path + '.json', 'w') as json_file:
        json_file.write(model2json)
    best_model.save_weights(best_model_file_path + '.h5')

Hi Jason, I think this is very best deep learning tutorial on the web. Thanks for your work. I have a question is :how to use the heuristic algorithm to optimize Hyperparameters for Deep Learning Models in Python With Keras, these algorithms like: Genetic algorithm, Particle swarm optimization, and Cuckoo algorithm etc. If the idea could be experimented, could you give an example

Thanks for your support volador.

You could search the hyperparameter space using a stochastic optimization algorithm like a genetic algorithm and use the mean performance as the cost function or fitness function. I don’t have a worked example, but it would be relatively easy to set up.
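As a rough starting point, the sketch below uses plain random search, which is the simplest stochastic search and not a genetic algorithm. The search space values are assumptions, and `toy_evaluate` is a stand-in objective invented for this sketch; in practice it would be replaced by a function that returns the mean cross-validated score of a model built from the configuration.

```python
import random

# Search space; the names mirror parameters tuned in this post, the values
# are illustrative assumptions.
SPACE = {
    "neurons": [1, 5, 10, 15, 20, 25, 30],
    "dropout_rate": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "learn_rate": [0.001, 0.01, 0.1, 0.2, 0.3],
}

def sample_config(rng):
    """Draw one random configuration from the search space."""
    return {name: rng.choice(values) for name, values in SPACE.items()}

def random_search(evaluate, n_trials, seed=7):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_evaluate(config):
    """Stand-in for mean cross-validated accuracy; NOT a real model."""
    return -abs(config["neurons"] - 10) - abs(config["dropout_rate"] - 0.2)

best_config, best_score = random_search(toy_evaluate, n_trials=50)
print(best_config, best_score)
```

A genetic algorithm would keep the same `evaluate` interface and replace the independent sampling with selection, crossover and mutation over a population of configurations.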

Hi Jason, very helpful intro into gridsearch for Keras. I have used your guidance in my code, but rather than using the default ‘accuracy’ to be optimized, my model requires a specific evaluation function to be optimized. You hint at this possibility in the introduction, but there is no example of it. I have followed the SciKit-learn documentation, but I fail to come up with the correct syntax.

I have posted my question at StackOverflow, but since it is quite specific, it requires understanding of SciKit-learn in combination with Keras.

Perhaps you can have a look? I think it would nicely extend your tutorial.

http://stackoverflow.com/questions/40572743/scikit-learn-grid-search-own-scoring-object-syntax

Thanks, Jan

Sorry Jan, I have not used a custom scoring function before.

Here is a list of built-in scoring functions:

http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

Here is help on defining your own scoring function:

http://scikit-learn.org/stable/modules/model_evaluation.html#defining-your-scoring-strategy-from-metric-functions

Let me know how you go.
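For readers with the same question, scikit-learn also accepts a plain callable with the signature `scorer(estimator, X, y)` as the `scoring` argument. The sketch below defines such a callable for specificity (true-negative rate). The metric choice and the `AlwaysZero` stand-in estimator are assumptions made so that the example runs without Keras; they are not taken from Jan's question.

```python
def specificity_scorer(estimator, X, y):
    """True-negative rate: correct negatives / all actual negatives."""
    predictions = estimator.predict(X)
    negatives = [(p, t) for p, t in zip(predictions, y) if t == 0]
    if not negatives:
        return 0.0
    true_negatives = sum(1 for p, _ in negatives if p == 0)
    return true_negatives / len(negatives)

class AlwaysZero:
    """Tiny stand-in estimator so the scorer can be exercised without Keras."""
    def predict(self, X):
        return [0 for _ in X]

score = specificity_scorer(AlwaysZero(), [[1], [2], [3]], [0, 1, 0])
print(score)
```

The callable would be passed as `GridSearchCV(estimator=model, param_grid=param_grid, scoring=specificity_scorer)`. With a wrapped Keras model the predictions may come back as a column array and might need flattening first; that detail is untested here.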

Yup, same sources as I referenced in my post at Stackoverflow.

Excellent. Good luck Jan.

Good tutorial again Jason…keep on the good job!

Thanks Anthony.

Hi Jason

First off, thank you for the tutorial. It’s very helpful.

I was also hoping you would assist on how to adapt the keras grid search to stateful lstms as discussed in

http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

I’ve coded the following:

    # create model
    model = KerasRegressor(build_fn=create_model, nb_epoch=1, batch_size=bats,
                           verbose=2, shuffle=False)
    # define the grid search parameters
    h1n = [5, 10]  # number of hidden neurons
    param_grid = dict(h1n=h1n)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)
    for i in range(100):
        grid.fit(trainX, trainY)
        grid.reset_states()

Is grid.reset_states() correct? Or would you suggest creating a callback function to reset states?

Thanks,

Great question.

With stateful LSTMs we must control the resetting of states after each epoch. The sklearn framework does not open this capacity to us – at least it looks that way to me off the cuff.

I think you may have to grid search stateful LSTM params manually with a ton of for loops. Sorry.

If you discover something different, let me know, i.e. there may be a back-door way into the sklearn grid search functionality that lets us inject our own custom epoch handling.

Hi Jason

Thanks a lot for this and all the other great tutorials!

I tried to combine this gridsearch/keras approach with a pipeline. It works if I tune nb_epoch or batch_size, but I get an error if I try to tune the optimizer or something else in the keras building function (I did not forget to include the variable as an argument):

    def keras_model(optimizer='adam'):
        model = Sequential()
        model.add(Dense(80, input_dim=79, init='normal'))
        model.add(Activation('relu'))
        model.add(Dense(1, init='normal'))
        model.add(Activation('linear'))
        model.compile(optimizer=optimizer, loss='mse')
        return model

    kRegressor = KerasRegressor(build_fn=keras_model, nb_epoch=500, batch_size=10, verbose=0)
    estimators = []
    estimators.append(('imputer', preprocessing.Imputer(strategy='mean')))
    estimators.append(('scaler', preprocessing.StandardScaler()))
    estimators.append(('kerasR', kRegressor))
    pipeline = Pipeline(estimators)
    param_grid = dict(kerasR__optimizer=['adam', 'rmsprop'])
    grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')

Do you know this problem?

Thanks, Thomas

Thanks Thomas. I’ve not seen this issue.

I think we’re starting to push the poor Keras sklearn wrapper to the limit.

Maybe the next step is to build out a few functions to do manual grid searching across network configs.

Great resource!

Any thoughts on how to get the “history” objects out of grid search? It could be beneficial to plot the loss and accuracy to see when a model starts to flatten out.

Not sure off the cuff Jimi, perhaps repeat the run standalone for the top performing configuration.

Thanks for the post. Can we optimize the number of hidden layers as well on top of number of neurons in each layers?

Thanks

Yes, it just may be very time consuming depending on the size of the dataset and the number of layers/nodes involved.

Try it on some small datasets from the UCI ML Repo.
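One possible way to search over both depth and width, sketched here as an assumption rather than a tested recipe, is to treat the whole hidden-layer layout as a single grid parameter. The helper below enumerates candidate layouts as tuples of layer widths; `layer_configs` is a hypothetical name and the widths are illustrative.

```python
from itertools import product

def layer_configs(widths, max_layers):
    """All tuples of hidden-layer widths with 1..max_layers layers."""
    configs = []
    for depth in range(1, max_layers + 1):
        configs.extend(product(widths, repeat=depth))
    return configs

# Candidate layouts with one or two hidden layers of 5 or 10 neurons.
hidden_layers = layer_configs(widths=[5, 10], max_layers=2)
print(hidden_layers)

# A create_model(hidden_layers=(5,)) function would then loop over the tuple
# and add one Dense layer per entry (assumed design, not tested with the
# Keras scikit-learn wrapper).
```

The list could then be used as `param_grid = dict(hidden_layers=hidden_layers)`. The number of layouts grows quickly with `max_layers`, which is why the search can become very time consuming.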

Thanks. Would you mind looking at below code?

    def create_model(neurons1=1, neurons2=1):
        # create model
        model = Sequential()
        model.add(Dense(neurons1, input_dim=8))
        model.add(Dense(neurons2))
        model.add(Dense(1, init='uniform', activation='sigmoid'))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    # define the grid search parameters
    neurons1 = [1, 3, 5, 7]
    neurons2 = [0, 1, 2]
    param_grid = dict(neurons1=neurons1, neurons2=neurons2)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
    grid_result = grid.fit(X, Y)

This code runs without error (I excluded certain X, y parts for brevity), but when I run “grid.fit(X, Y)”, it gives an AssertionError.

I’d appreciate it if you could show me where I am wrong.

Update: It worked when I deleted 0 from neurons2. Thanks

Excellent, glad to hear it.

A Dense() with a value of 0 neurons might blow up. Try removing the 0 from your neurons2 array.

A good debug strategy is to cut code back to the minimum, make it work, then add complexity. Here, try searching a grid of 1 and 1 neurons, make it all work, then expand the grid you search.

Let me know how you go.

I keep getting error messages and I tried a big for loops that scan for all possible combinations of layer numbers, neuron numbers, other optimization stuff within defined limits. It is very time consuming code, but I could not figure it out how to adjust layer structure and other optimization parameters in the same code using GridSearch. If you would provide a code for that in your blog one day, that would be much appreciated. Thanks.

I’ll try to find the time.

Hi Jason,

Many thanks for this awesome tutorial !

I’m glad you found it useful Rajneesh.

Hi Jason,

Great tutorial! I’m running into a slight issue. I tried running this on my own variation of the code and got the following error:

TypeError: get_params() got an unexpected keyword argument ‘deep’

I copied and pasted your code using the given data set and got the same error. The code is showing an error on the grid_result = grid.fit(X, Y) line. I looked through the other comments and didn’t see anyone with the same issue. Do you know where this could be coming from?

Thanks for your help!

same issue here,

great tutorial, life saver.

Hi Andy, sorry to hear that.

Is this happening with a specific example or with all of them?

Are you able to check your version of Python/sklearn/keras/tf/theano?

UPDATE:

I can confirm the first example still works fine with Python 2.7, sklearn 0.18.1, Keras 1.2.0 and TensorFlow 0.12.1.

The only differences are I am running Python 3.5 and Keras 1.2.1. The example I ran previously was the grid search for the number of neurons in a layer. But I just ran the first example and got the same error.

Do you think the issue is due to the next version of Python? If so, what should my next steps be?

Thanks for your help and quick response!

It’s a bug in Keras 1.2.1. You can either downgrade to 1.2.0 or get the code from their github (where they already fixed it).

Yes, I have a write up of the problem and available fixes here:

http://stackoverflow.com/questions/41796618/python-keras-cross-val-score-error/41841066#41841066

Thank you so much for your help!

Jason,

Can you use early_stopping to decide n_epoch?

Yes, that is a good method to find a generalized model.
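The stopping rule itself is simple enough to sketch without Keras: stop once the validation loss has failed to improve for a set number of epochs (the patience). The loss values below are invented and `early_stopping_epoch` is a hypothetical helper written for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Invented validation losses, one per epoch.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.59]
stop_at = early_stopping_epoch(losses, patience=2)
print("stop at epoch", stop_at)
```

In Keras this behaviour is provided by the `EarlyStopping` callback, configured with arguments such as `monitor` and `patience`. A larger patience tolerates temporary plateaus, as the last value in the list above suggests.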

Hi Jason,

Really great article. I am a big fan of your blog and your books. Can you please explain your following statement?

“A default cross-validation of 3 was used, but perhaps k=5 or k=10 would be more stable. Carefully choose your cross validation configuration to ensure your results are stable.”

I didn’t see anywhere cross-validation being used.

Hi Jayant,

Grid search uses k-fold cross-validation to evaluate the performance of each combination of parameters on unseen data.
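To make the mechanics concrete, the sketch below splits sample indices into k folds and shows which samples are held out in each round. It is a simplified illustration written for this explanation (`k_fold_indices` is a hypothetical helper): scikit-learn's own splitters add options such as shuffling and stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=9, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each parameter combination is trained k times, once per split, and the score reported for the combination is the mean over the held-out folds.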

Hi Jason,

thanks for this awesome tutorial !

I have two questions: 1. In “model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])”, accuracy is used to evaluate results. But GridSearchCV also has a scoring parameter; if I set “scoring=’f1’”, which one is used to evaluate the results of the grid search? 2. How can I set two evaluation metrics, e.g. ‘accuracy’ and ‘f1’, for evaluating the results of the grid search?

Hi Jing,

You can set the “scoring” argument for GridSearchCV with a string of the performance measure to use, or the name of your own scoring function. You can learn about this argument here:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You can see a full list of supported scoring measures here:

http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

As far as I know you can only grid search using a single measure.

Thank you so much for your help!

I find that no matter what evaluation metric is used in the GridSearchCV “scoring” argument, “metrics” in “model.compile” must be [‘accuracy’], otherwise the program gives “ValueError: The model is not configured to compute accuracy. You should pass ‘metrics=[“accuracy”]’ to the ‘model.compile()’ method.” So, if I set:

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='recall')

then grid_result.best_score_ = 0.72. My question is: is 0.72 the accuracy or the recall? Thank you!

Hi Jing,

When using GridSearchCV with Keras, I would suggest not specifying any metrics when compiling your Keras model.

I would suggest only setting the “scoring” argument on the GridSearchCV. I would expect the metric reported by GridSearchCV to be the one that you specified.

I hope that helps.

Great Blogpost. Love it. You are awesome Jason. I got one question to GridsearchCV. As far as i understand the crossvalidation already takes place in there. That’s why we do not need any kfold anymore.

But with this technique we would have no validation set correct? e.g. with a default value of 3 we would have 2 training sets and one test set.

That means in kfold as well as in GridsearchCV there is no requirement for creating a validation set anymore?

Thanks

Hi Dan,

Yes, GridSearchCV performs cross validation and you must specify the number of folds. You can hold back a validation set to double check the parameters found by the search if you like. This is optional.

Thank you for the quick response Jason. Especially considering the huge amount of questions you get.

I’m here to help, if I can Dan.

What I’m missing in the tutorial is the info, how to get the best params in the model with KERAS. Do I pickup the best parameters and call ‘create_model’ again with those parameters or can I call the GridSearchCV’s ‘predict’ function? (I will try out for myself but for completeness it would be good to have it in the tutorial as well.)

I see, but we don’t know the best parameters, we must search for them.

Hi, Jason. I am getting

    /usr/local/lib/python2.7/dist-packages/keras/wrappers/scikit_learn.py in check_params(self=, params={'batch_size': 10, 'epochs': 10})
    80 legal_params += inspect.getargspec(fn)[0]
    81 legal_params = set(legal_params)
    82
    83 for params_name in params:
    84 if params_name not in legal_params:
    ---> 85 raise ValueError('{} is not a legal parameter'.format(params_name))
    params_name = 'epochs'
    86
    87 def get_params(self, _):
    88 """Gets parameters for this estimator.
    89

    ValueError: epochs is not a legal parameter

It sounds like you need to upgrade to Keras v2.0 or higher.

Nice tutorial. I would like to optimize the number of hidden layers in the model. Can you please guide in this regard, thanks

Thanks Usman.

Consider exploring specific patterns, e.g. small-big-small, etc.

Do you know any way this could be possible using a network with multiple inputs?

http://imgur.com/a/JJ7f1

Hi Jason, great to see posts like this – amazing job!

Just noticed, when you tune the optimisation algorithm SGD performs at 34% accuracy. As no parameters are being passed to the SGD function, I’d assume it takes the default configuration, lr=0.01, momentum=0.0.

Later on, as you look for better configurations for SGD, best result (68%) is found when {‘learn_rate’: 0.01, ‘momentum’: 0.0}.

It seems to me that these two experiments use exactly the same network configuration (including the same SGD parameters), yet their resulting accuracies differ significantly. Do you have any intuition as to why this may be happening?

Hi Daniel, yes great point.

Neural networks are stochastic and give different results when evaluated on the same data.

Ideally, each configuration would be evaluated using the average of multiple (30+) repeats.

This post might help:

http://machinelearningmastery.com/randomness-in-machine-learning/

Hi Jason!

absolutely love your tutorial! But would you mind to give tutorial for how to tune the number of hidden layer?

Thanks

I have an example here:

http://machinelearningmastery.com/exploratory-configuration-multilayer-perceptron-network-time-series-forecasting/

Thank you so much Jason!

I’m glad it helped Pradanuari.

Hello Jason

I tried to use your idea in a similar problem but I am getting error : AttributeError: ‘NoneType’ object has no attribute ‘loss’

it looks like the model does not define loss function?

This is the error I get:

    b\site-packages\keras-2.0.4-py3.5.egg\keras\wrappers\scikit_learn.py in fit(self=, x=memmap([[[ 0., 0., 0., …, 0., 0., 0.],
    …, 0., 0., …, 0., 0., 0.]]], dtype=float32), y=array([[ 0., 0., 0., …, 0., 0., 0.],
    …0.],
    [ 0., 0., 0., …, 0., 1., 0.]]), **kwargs={})
    135 self.model = self.build_fn(
    136 **self.filter_sk_params(self.build_fn.__call__))
    137 else:
    138 self.model = self.build_fn(**self.filter_sk_params(self.build_fn))
    139
    --> 140 loss_name = self.model.loss
    loss_name = undefined
    self.model.loss = undefined
    141 if hasattr(loss_name, '__name__'):
    142 loss_name = loss_name.__name__
    143 if loss_name == 'categorical_crossentropy' and len(y.shape) != 2:
    144 y = to_categorical(y)

    AttributeError: 'NoneType' object has no attribute 'loss'
    ___________________________________________________________________________
    Process finished with exit code 1

Regards

Ibrahim

Does the example in the blog post work on your system?

Ok, I think your code needs to be placed after

if __name__ == ‘__main__’:

to work with multiprocess…

But thanks for the post is great…

Not on Linux and OS X when I tested it, but thanks for the tip.

Hello Jason!

I do the first step – try to tune Batch Size and Number of Epochs and get

print(“Best: %f using %s” % (grid_result.best_score_, grid_result.best_params_))

Best: 0.707031 using {‘epochs’: 100, ‘batch_size’: 40}

After that I do the same and get

print(“Best: %f using %s” % (grid_result.best_score_, grid_result.best_params_))

Best: 0.688802 using {‘epochs’: 100, ‘batch_size’: 20}

And so on

The problem is in the grid_result.best_score_

I expect that in the second step (for example, tuning the optimizer) I will get a better grid_result.best_score_ than in the first step (in the second step I use grid_result.best_params_ from the first step). But that is not what happens.

Tuning all hyperparameters together takes a very long time.

How can I fix this?

Consider tuning different parameters, like network structure or number of input features.

Thanks a lot Jason!

Hello,

I’d like to have your opinion about a problem:

I have two loss function plots, with SGD and Adamax as optimizer with same learning rate.

Loss function of SGD looks like the red one, whereas Adamax’s looks like blue one.

(http://cs231n.github.io/assets/nn3/learningrates.jpeg)

I have better scores with Adamax on validation data. I’m confused about how to proceed, should I choose Adamax and play with learning rates a little more, or go on with SGD and somehow try to improve performance?

Thanks!

Explore both, but focus on the validation score of interest (e.g. accuracy, RMSE, etc.) over loss.

For example, you can get very low loss and get worse accuracy.

Thanks for your response! I experimented with different learning rates and found out a reasonable one, (good for both Adamax and SGD) and now I try to fix learning rate and optimizer and focus on other hyperparameters such as batch-size and number of neurons. Or would be better if I set those first?

Number of neurons will have a big effect along with learning rate.

Batch size will have a smaller effect and could be optimized last.

Thanks for this post!

One question – why not use grid search on all the parameters together, rather than performing several grid searches and finding each parameter separately? Surely the results are not the same…

Great question,

In practice, the datasets are large and it can take a long time and require a lot of RAM.

Hi Jason,

Excellent post!

It seems to me that if you use the entire training set during your cross-validation, then your cross-validation error is going to give you an optimistically biased estimate of your validation error. I think this is because when you train the final model on the entire dataset, the validation set you create to estimate test performance comes out of the training set.

My question is: assuming we have a lot of data, should we use perhaps only 50% of the training data for cross-validation for the hyperparameters, and then use the remaining 50% for fitting the final model (and a portion of that remaining 50% would be used for the validation set)? That way we wouldn’t be using the same data twice. I am assuming in this case that we would also have a separate test set.

Yes, it is a good idea to hold back a test set when tuning.

Thanks for your valuable post. I learned a lot from it.

When I wrote my code for grid search, I encountered a question:

I use fit_generator instead of fit in keras.

Is it possible to use grid search with fit_generator ?

I have some Merge layers in my deep learning model.

Hence, the input of the neural network is not a single matrix.

For example:

Suppose we have 1,000 samples

Input = [Input1,Input2]

Input1 is a 1,000 *3 matrix

Input2 is a 1,000*3*50*50 matrix (image)

When I use the fit in your post, there is a bug….because the input1 and input2 don’t have the same dimension. So I wonder whether the fit_generator can work with grid search ?

Thanks in advance!

Please ignore my previous reply.

I find an answer here: https://github.com/fchollet/keras/issues/6451

Right now, the GridsearchCV using the scikit wrapper for network with multiple inputs is not available.

Hi Jason, thank you for your good tutorial on grid search with Keras. I followed your example with my own dataset and it ran. But when I use an autoencoder structure instead of the sequential structure to grid search the parameters with my own data, it does not run. I don’t know the reason. Could you help me? Are there any differences between grid searching a Sequential model and grid searching a functional Model?

The follows are my codes:

    from keras.models import Sequential
    from keras.layers import Dense, Input
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import GridSearchCV
    import numpy as np
    from keras.optimizers import SGD, Adam, RMSprop, Adagrad
    from keras.regularizers import l1,l2
    from keras.models import Model
    import pandas as pd
    from keras.models import load_model

    np.random.seed(2017)

    def create_model(optimizer='rmsprop'):
        # encoder layers
        encoding_dim = 140
        input_img = Input(shape=(6,))
        encoded = Dense(300, activation='relu', W_regularizer=l1(0.01))(input_img)
        encoded = Dense(300, activation='relu', W_regularizer=l1(0.01))(encoded)
        encoded = Dense(300, activation='relu', W_regularizer=l1(0.01))(encoded)
        encoder_output = Dense(encoding_dim, activation='relu', W_regularizer=l1(0.01))(encoded)
        # decoder layers
        decoded = Dense(300, activation='relu', W_regularizer=l1(0.01))(encoder_output)
        decoded = Dense(300, activation='relu', W_regularizer=l1(0.01))(decoded)
        decoded = Dense(300, activation='relu', W_regularizer=l1(0.01))(decoded)
        decoded = Dense(6, activation='relu', W_regularizer=l1(0.01))(decoded)
        # construct the autoencoder model
        autoencoder = Model(input_img, decoded)
        # construct the encoder model for plotting
        encoder = Model(input_img, encoder_output)
        # Compile model
        autoencoder.compile(optimizer='RMSprop', loss='mean_squared_error', metrics=['accuracy'])
        return autoencoder

I’m surprised, I would not think the network architecture would make a difference.

Sorry, I have no good suggestions other than try to debug the cause of the fault.

The autoencoder.compile call has been modified as follows:

    # Compile model
    autoencoder.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['accuracy'])

Can we do this for functional API as well ?

Perhaps, I have not done this.

Thanks for a great tutorial Jason, appreciated.

n_jobs=-1 didn’t work very well on my Windows 10 machine: it took a very long time and never finished.

https://stackoverflow.com/questions/28005307/gridsearchcv-no-reporting-on-high-verbosity seems to suggest this is (or at least was in 2015) a known problem under Windows so I changed to n_jobs=1, which also allowed me to see throughput using verbose=10.

Thanks for the tip.

Jason —

Given all the parameters it is possible to adjust, is there any recommendation for which should be fixed first before exploring others, or can ALL results for one change when others are changed?

Great question, see this paper:

https://arxiv.org/abs/1206.5533

Thanks Jason, I’ll check it out.

Hi and thank you for the resource.

Am I right in my understanding that this only works on one machine?

Any hints / pointers on how to run this on a cluster? I have found https://goo.gl/Q9Xy7B as a potential avenue using Spark (no Keras though).

Any comment at all? Information on the subject is scarce.

Yes, this example is for a single machine. Sorry, I do not have examples for running on a cluster.

Hi Jason,

I’m a little confused about the definition of the “score” or “accuracy”. How are they computed? I assume they are not simply comparing the predictions with the targets on the training data, otherwise the most overfit model would always be the best (the more neurons the better).

On the other hand, the grid search just trains the model with those combinations of parameters. So what is the difference between setting the parameters manually and checking whether my result is good (with the risk of overfitting), and the grid search producing an accuracy score to determine which combination is best?

Best regards,

The grid search will provide an estimate of the skill of the model with a set of parameters.

Any one configuration in the grid search can be set and evaluated manually.

Neural networks are stochastic and will give different predictions/skill when trained on the same data.

Ideally, if you have the time/compute, the grid search should use repeated k-fold cross validation to provide robust estimates of model skill. More here:

http://machinelearningmastery.com/evaluate-skill-deep-learning-models/
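As a rough sketch of repeated k-fold inside a grid search: scikit-learn's GridSearchCV accepts a splitter object for its cv argument, so RepeatedStratifiedKFold can be passed directly. The dataset and the LogisticRegression stand-in below are assumptions chosen to keep the example fast; a wrapped Keras model would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

# 5 folds repeated 3 times gives 15 fits per parameter combination.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=7)

model = LogisticRegression(max_iter=1000)  # stand-in for the wrapped Keras model
param_grid = {"C": [0.1, 1.0, 10.0]}
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, n_jobs=1)
grid_result = grid.fit(X, y)

# Mean and standard deviation of the score across all 15 held-out folds.
means = grid_result.cv_results_["mean_test_score"]
stds = grid_result.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, grid_result.cv_results_["params"]):
    print("%f (%f) with: %r" % (mean, std, params))
```

The cost grows with the number of repeats, so for slow neural networks it may only be practical on a small grid.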

Does that help?

I’m new to neural networks and a little bit puzzled. Say I have too many neurons, which leads to overfitting (good on the train set, bad on the validation or test set): can grid search detect it by the score?

My guess is yes, because there is a validation set in GridSearchCV. Is that correct?

A larger network can overfit.

The idea is to find a config that does well on the train and validation sets. We require a robust test harness. With enough resources, I’d recommend repeated k-fold cross validation within the grid search.
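One way to see this in practice: GridSearchCV scores each configuration on the held-out folds, and with return_train_score=True it also records the score on the training folds, so the gap between the two can be inspected. The sketch below assumes synthetic data and uses scikit-learn's MLPClassifier as a small stand-in network; the layer sizes are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=7)

# MLPClassifier is a stand-in for a Keras network; sizes are illustrative only.
model = MLPClassifier(max_iter=500, random_state=7)
param_grid = {"hidden_layer_sizes": [(2,), (20,), (200,)]}

# return_train_score=True adds training-fold scores to cv_results_.
grid = GridSearchCV(model, param_grid, cv=3, return_train_score=True, n_jobs=1)
grid_result = grid.fit(X, y)

results = grid_result.cv_results_
for params, train, val in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"]
):
    # A large positive gap suggests the configuration is overfitting.
    print("%r: train=%.3f validation=%.3f gap=%.3f" % (params, train, val, train - val))
```

The configuration reported as best is the one with the highest validation score, not the highest training score.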

One more very useful tutorial, thanks Jason.

One question about GridSearch in my case. I have tried to tune the parameters of my neural network for regression, with 18 inputs of size 800, but GridSearch takes extremely long, seemingly forever, even though I have limited the number of values to search. I saw in your code:

grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)

Normally n_jobs=1; can I increase that number to improve performance?

We often cannot grid search with neural nets because it takes so long!

Consider running on a large computer in the cloud over the weekend.

Hi Jason,

Any idea how to use GridSearchCV if you don’t want cross validation?

GridSearch supports k-fold cross-validation by default. That is what the “CV” is in the name:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

So sklearn has no GridSearch without cross validation?

In any case I found kind of a hack here to get rid of cv:

https://stackoverflow.com/questions/44636370/scikit-learn-gridsearchcv-without-cross-validation-unsupervised-learning

You can configure the CV so that it does a single train/test split, then configure it to repeat.
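A sketch of one way to do this in scikit-learn. Note that an integer cv must be at least 2, so a single split cannot be requested with a plain number; passing a ShuffleSplit splitter with n_splits=1 is an alternative. The dataset, the 80/20 split size and the LogisticRegression stand-in are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

# One random 80/20 train/validation split instead of k folds.
# Raising n_splits repeats the random split that many times.
single_split = ShuffleSplit(n_splits=1, test_size=0.2, random_state=7)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),  # stand-in for the wrapped Keras model
    {"C": [0.1, 1.0, 10.0]},
    cv=single_split,
    n_jobs=1,
)
grid_result = grid.fit(X, y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
```

With a single split the score estimate is noisier than with k-fold, which matters for stochastic models such as neural networks.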

Hello. Thank you for the nice tutorial.

I am trying to combine pipeline and gridsearch.

Inside my Keras model I use kernel_initializer=init_mode.

Then I am trying to assign values to init_mode in the parameter dictionary in order to perform the grid search.

I get the following error: ValueError: init_mode is not a legal parameter

My code is here: https://www.dropbox.com/s/57n777j9w8bxf4t/keras_grid.py?dl=0

Any tip? Thank you
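Without seeing the linked script, two things may be worth checking. First, when the estimator sits inside a Pipeline, grid keys are written as the step name, two underscores, then the parameter name. Second, with the Keras wrapper, the model-building function presumably also needs to accept init_mode as an argument for it to count as a legal parameter. The sketch below only illustrates the first point; the step names "scale" and "clf" are hypothetical and LogisticRegression stands in for the Keras model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

# Hypothetical step names; the last step stands in for the wrapped Keras model.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys are "<step name>__<parameter name>"; a bare "C" is not a Pipeline parameter.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

grid = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=1)
grid_result = grid.fit(X, y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
```

For the case in the question, the equivalent key would be of the form stepname__init_mode, where stepname is whatever the Keras step is called in the pipeline.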

Hi Dr. Brownlee,

When I run this in Spyder IDE nothing happens after grid.fit.

It just appears to do nothing.

Any suggestions as to why?

Consider running from the command line.

The grid search may take a long time.

Hello Dr Brownlee,

I saved your example code into a .py file and ran it. Nothing happens after grid.fit. However, if I run your example code line by line, it works. Do you know why?

It may take a long time. Consider reducing the scope of the search to see if you can get results sooner.

How can I do hyperparameter optimization for MLPRegressor in scikit-learn?

Yes.
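A minimal sketch of what that could look like. MLPRegressor is a native scikit-learn estimator, so it can be handed to GridSearchCV without a wrapper. The synthetic data and the candidate values below are assumptions chosen to keep the run short, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic regression data standing in for a real dataset (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=7)

model = MLPRegressor(max_iter=200, random_state=7)

# Illustrative candidate values only; 2 x 2 = 4 combinations.
param_grid = {
    "hidden_layer_sizes": [(10,), (50,)],
    "learning_rate_init": [0.001, 0.01],
}

# Scikit-learn maximizes scores, so mean squared error is negated.
grid = GridSearchCV(model, param_grid, cv=3, scoring="neg_mean_squared_error", n_jobs=1)
grid_result = grid.fit(X, y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
```

With so few iterations scikit-learn may print convergence warnings; raising max_iter trades run time for a better fit.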

Hi Jason,

I’m unable to apply the grid search to a seq-to-seq LSTM network (a Keras regressor model in the scikit-learn API). When I set the GridSearchCV scoring function to r^2 (or any scoring function for regression problems), model.fit expects a 2-dimensional input, not the 3-dimensional input used in Keras.

Otherwise, if I leave the default scoring function, named “_passthrough_scorer” (I don’t know what it is or what it does), it works, but the best_score doesn’t match the real best parametrization. I’m really confused… I’ll have to write the grid search manually…

I’ve solved it, and I’ll share it in case someone has the same issue: if you set the grid search scoring function to “None”, it uses the scoring metrics of the Keras model.

Sorry for bothering you, but the results of the approach I described are incorrect. I don’t know what to do.

Hi Josep,

Consider writing your own for loop to iterate over params and run a Cross Validation for the params within the loop.

This is how I do it now for large/complex models.
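A rough sketch of such a loop, using scikit-learn's ParameterGrid and KFold only to generate the combinations and the fold indices. The function fit_and_score is hypothetical: here it fits a closed-form ridge regression on flattened sequences so the example runs quickly, and in practice it would build, fit and evaluate the Keras model. The 3D array shape and the alpha values are assumptions as well.

```python
import numpy as np
from sklearn.model_selection import KFold, ParameterGrid


def fit_and_score(params, X_train, y_train, X_val, y_val):
    """Hypothetical stand-in for building, fitting and scoring a Keras model.

    Fits ridge regression on the flattened sequences and returns validation MSE.
    """
    A = X_train.reshape(len(X_train), -1)
    B = X_val.reshape(len(X_val), -1)
    # Closed-form ridge solution: (A^T A + alpha * I)^-1 A^T y
    gram = A.T @ A + params["alpha"] * np.eye(A.shape[1])
    weights = np.linalg.solve(gram, A.T @ y_train)
    return float(np.mean((B @ weights - y_val) ** 2))


rng = np.random.RandomState(7)
X = rng.rand(60, 5, 3)  # (samples, timesteps, features), as used by an LSTM
y = X.sum(axis=(1, 2)) + 0.1 * rng.randn(60)

param_grid = ParameterGrid({"alpha": [0.01, 1.0, 100.0]})
# A fixed random_state means every parameter set is scored on the same folds.
kfold = KFold(n_splits=3, shuffle=True, random_state=7)

results = []
for params in param_grid:
    fold_scores = []
    # Indexing along the first axis keeps the 3D shape of each fold intact.
    for train_idx, val_idx in kfold.split(X):
        score = fit_and_score(params, X[train_idx], y[train_idx], X[val_idx], y[val_idx])
        fold_scores.append(score)
    results.append((np.mean(fold_scores), params))

# Lower MSE is better, so take the minimum mean score.
best_score, best_params = min(results, key=lambda item: item[0])
print("Best MSE: %f using %s" % (best_score, best_params))
```

Because the loop controls the data slicing and the metric itself, it sidesteps any input-shape checks performed by scikit-learn scorers.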

Can I use this grid search without using a Keras model?

For sure!

Hello Jason,

Thanks for such a nice tutorial.

Instead of getting output such as “Best: 0.720052 using {'init_mode': 'uniform'}”, it would be really nice if you could show us how to visualize this result with matplotlib so that it is easier to interpret.

Great suggestion, thanks.
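In the meantime, here is one possible sketch of such a plot, drawing the mean cross-validated score of each configuration as a bar with its standard deviation as an error bar. The data, the LogisticRegression stand-in and the output file name are assumptions; with the tutorial's examples, the labels would come from the tuned parameter (for instance init_mode) instead of C.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use a GUI window
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data and a stand-in model (assumptions).
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3, n_jobs=1)
grid_result = grid.fit(X, y)

# One bar per configuration, labelled by the value of the tuned parameter.
means = grid_result.cv_results_["mean_test_score"]
stds = grid_result.cv_results_["std_test_score"]
labels = [str(params["C"]) for params in grid_result.cv_results_["params"]]

fig, ax = plt.subplots()
ax.bar(labels, means, yerr=stds, capsize=4)
ax.set_xlabel("C")
ax.set_ylabel("Mean cross-validated accuracy")
ax.set_title("Grid search results")

# Hypothetical output location in the system temp directory.
output_path = os.path.join(tempfile.gettempdir(), "grid_search_results.png")
fig.savefig(output_path)
```

For a grid over two parameters, a heatmap of the mean scores would probably be a better fit than a bar chart.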