In PyTorch, the “weights” of a neural network are known as “parameters,” which are fine-tuned during training by an optimizer. In contrast, hyperparameters are fixed characteristics of the model determined before training starts, such as the number of hidden layers and the choice of activation functions. Optimizing hyperparameters is crucial in deep learning due to the inherent complexity and difficulty in configuring neural networks, along with the extensive training times involved.
In this post, you’ll learn how to utilize grid search from the Scikit-learn library to systematically tune hyperparameters for PyTorch deep learning models. After reading, you will understand:
- How to wrap PyTorch models for compatibility with Scikit-learn and implement grid search.
- How to grid search various neural network parameters including learning rate, dropout rate, epochs, and number of neurons.
- How to design hyperparameter tuning experiments tailored to your projects.
Let's get started.
Overview
This article will guide you through using Scikit-learn’s grid search functionality, complete with practical examples you can easily adapt for your projects. Here’s an overview of the topics we will cover:
- Using PyTorch models within Scikit-learn
- Implementing grid search in Scikit-learn
- Tuning batch size and training epochs
- Tuning optimization algorithms
- Adjusting learning rate and momentum
- Tuning network weight initialization
- Optimizing activation functions
- Adjusting dropout regularization
- Tuning the number of neurons in hidden layers
Using PyTorch Models in Scikit-learn
You can integrate PyTorch models with Scikit-learn through the `skorch` library, which wraps a PyTorch model in an API similar to a Scikit-learn estimator. This lets you apply Scikit-learn's tooling, such as grid search, to your PyTorch models seamlessly.
First, install the skorch library if you haven’t already:
```bash
pip install skorch
```
To use these wrappers, define your PyTorch model as a class inheriting from `nn.Module`, then pass the class (not an instance) to the `module` argument when constructing a `NeuralNetClassifier`. Here's an example:
```python
import torch.nn as nn
from skorch import NeuralNetClassifier

class MyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        ...

    def forward(self, x):
        ...
        return x

# Create the skorch wrapper
model = NeuralNetClassifier(
    module=MyClassifier
)
```
The `NeuralNetClassifier` constructor also accepts default training parameters, such as the number of epochs and the batch size, which are applied when `model.fit()` is called (the way a training loop is invoked on Scikit-learn models). For instance:
```python
model = NeuralNetClassifier(
    module=MyClassifier,
    max_epochs=150,
    batch_size=10
)
```
You can also pass arguments to your model's constructor by prefixing them with `module__` (two underscores). For example:
```python
class SonarClassifier(nn.Module):
    def __init__(self, n_layers=3):
        super().__init__()
        self.layers = []
        self.acts = []
        for i in range(n_layers):
            self.layers.append(nn.Linear(60, 60))
            self.acts.append(nn.ReLU())
            # Register each layer so PyTorch tracks its parameters
            self.add_module(f"layer{i}", self.layers[-1])
            self.add_module(f"act{i}", self.acts[-1])
        self.output = nn.Linear(60, 1)

    def forward(self, x):
        for layer, act in zip(self.layers, self.acts):
            x = act(layer(x))
        x = self.output(x)
        return x

model = NeuralNetClassifier(
    module=SonarClassifier,
    max_epochs=150,
    batch_size=10,
    module__n_layers=2
)
```
To confirm the setup, initialize the model and display it:
```python
print(model.initialize())
```
Using Grid Search in Scikit-learn
Grid search is an essential technique for hyperparameter optimization: it systematically evaluates every combination of the specified hyperparameter values to find the best-performing set. The `GridSearchCV` class in Scikit-learn implements this process.
When constructing `GridSearchCV`, provide a dictionary of hyperparameters in the `param_grid` argument, mapping each parameter name to a list of values to try. Note that a skorch wrapper exposes the number of training epochs as `max_epochs`:
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_epochs': [10, 20, 30]
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)
```
Setting `n_jobs=-1` uses all available CPU cores, which speeds up the grid search considerably compared to single-threaded execution.
Once the search completes, you can access its results through `grid_result`. The `best_score_` attribute holds the highest score observed during the optimization, and `best_params_` holds the parameter combination that achieved it.
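For example, once `fit()` has completed, you can print the winning combination; the full examples later in this post use this same pattern:

```python
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
```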
Problem Description
Now that you understand how to use PyTorch models with Scikit-learn and how to implement grid search, let’s explore several examples using the Pima Indians onset of diabetes dataset. This dataset is manageable and entirely numerical.
As we work through these examples, we will tune one group of parameters at a time. This is not necessarily the most efficient way to grid search, because parameters can interact with one another, but it serves illustrative purposes well.
Tuning Batch Size and Number of Epochs
In our initial example, we will tune two key parameters: batch size and number of epochs.
The batch size determines how many samples are presented to the model before updating weights. In contrast, the number of epochs specifies how many times the complete dataset is passed through the network.
For this example, we will evaluate batch sizes of 10, 20, 40, 60, 80, and 100, combined with 10, 50, and 100 training epochs. That is 6 × 3 = 18 parameter combinations, each evaluated with 3-fold cross-validation, for 54 model fits in total:
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from skorch import NeuralNetClassifier
from sklearn.model_selection import GridSearchCV

# Load the dataset, split into input (X) and output (y) variables
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# PyTorch classifier
class PimaClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 12)
        self.act = nn.ReLU()
        self.output = nn.Linear(12, 1)
        self.prob = nn.Sigmoid()

    def forward(self, x):
        x = self.act(self.layer(x))
        x = self.prob(self.output(x))
        return x

# Create model with skorch
model = NeuralNetClassifier(
    PimaClassifier,
    criterion=nn.BCELoss,
    optimizer=optim.Adam,
    verbose=False
)

# Define the grid search parameters
param_grid = {
    'batch_size': [10, 20, 40, 60, 80, 100],
    'max_epochs': [10, 50, 100]
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```
Tuning the Training Optimization Algorithm
Different optimization algorithms can influence the training of your neural network significantly. In this example, we will tune various optimization algorithms available within PyTorch.
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from skorch import NeuralNetClassifier
from sklearn.model_selection import GridSearchCV

# Load the dataset
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# PyTorch classifier
class PimaClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 12)
        self.act = nn.ReLU()
        self.output = nn.Linear(12, 1)
        self.prob = nn.Sigmoid()

    def forward(self, x):
        x = self.act(self.layer(x))
        x = self.prob(self.output(x))
        return x

# Create model with skorch; the optimizer is supplied by the grid search
model = NeuralNetClassifier(
    PimaClassifier,
    criterion=nn.BCELoss,
    max_epochs=100,
    batch_size=10,
    verbose=False
)

# Define the grid search parameters
param_grid = {
    'optimizer': [optim.SGD, optim.RMSprop, optim.Adagrad, optim.Adadelta,
                  optim.Adam, optim.Adamax, optim.NAdam],
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```
Tuning Learning Rate and Momentum
The learning rate determines the step size during optimization, while momentum helps smooth updates by incorporating past gradients. In this example, we will assess the impact of varying the learning rate and momentum while using the SGD optimizer.
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from skorch import NeuralNetClassifier
from sklearn.model_selection import GridSearchCV

# Load the dataset
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# PyTorch classifier
class PimaClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 12)
        self.act = nn.ReLU()
        self.output = nn.Linear(12, 1)
        self.prob = nn.Sigmoid()

    def forward(self, x):
        x = self.act(self.layer(x))
        x = self.prob(self.output(x))
        return x

# Create model with skorch, using SGD so that momentum can be tuned
model = NeuralNetClassifier(
    PimaClassifier,
    criterion=nn.BCELoss,
    optimizer=optim.SGD,
    max_epochs=100,
    batch_size=10,
    verbose=False
)

# Define the grid search parameters
param_grid = {
    'optimizer__lr': [0.001, 0.01, 0.1, 0.2, 0.3],
    'optimizer__momentum': [0.0, 0.2, 0.4, 0.6, 0.8, 0.9],
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```
Tuning Network Weight Initialization
Weight initialization can significantly affect model performance. In this section, we’ll explore various weight initialization techniques.
```python
import torch.nn.init as init

# Modify classifier to accept a weight initialization function
class PimaClassifier(nn.Module):
    def __init__(self, weight_init=init.xavier_uniform_):
        super().__init__()
        self.layer = nn.Linear(8, 12)
        self.act = nn.ReLU()
        self.output = nn.Linear(12, 1)
        self.prob = nn.Sigmoid()
        # Apply the chosen initializer to the weight tensors
        weight_init(self.layer.weight)
        weight_init(self.output.weight)

    def forward(self, x):
        x = self.act(self.layer(x))
        x = self.prob(self.output(x))
        return x

# Recreate the skorch wrapper so the grid search uses the new class
# (training settings follow the earlier examples)
model = NeuralNetClassifier(
    PimaClassifier,
    criterion=nn.BCELoss,
    optimizer=optim.Adam,
    max_epochs=100,
    batch_size=10,
    verbose=False
)

# Grid search to find an effective weight initialization method
param_grid = {
    'module__weight_init': [init.uniform_, init.normal_, init.zeros_,
                            init.xavier_normal_, init.xavier_uniform_,
                            init.kaiming_normal_, init.kaiming_uniform_]
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```
Tuning the Neuron Activation Function
Activation functions introduce non-linearity into the model. In the following example, we will explore various activation functions.
```python
# PyTorch classifier with adjustable activation function
class PimaClassifier(nn.Module):
    def __init__(self, activation=nn.ReLU):
        super().__init__()
        self.layer = nn.Linear(8, 12)
        self.act = activation()
        self.output = nn.Linear(12, 1)
        self.prob = nn.Sigmoid()
        init.kaiming_uniform_(self.layer.weight)
        init.kaiming_uniform_(self.output.weight)

    def forward(self, x):
        x = self.act(self.layer(x))
        x = self.prob(self.output(x))
        return x

# Recreate the skorch wrapper so the grid search uses the new class
model = NeuralNetClassifier(
    PimaClassifier,
    criterion=nn.BCELoss,
    optimizer=optim.Adam,
    max_epochs=100,
    batch_size=10,
    verbose=False
)

# Grid search over various activation functions
param_grid = {
    'module__activation': [nn.Identity, nn.ReLU, nn.ELU, nn.ReLU6,
                           nn.GELU, nn.Softplus, nn.Softsign, nn.Tanh,
                           nn.Sigmoid, nn.Hardsigmoid]
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```
Tuning Dropout Regularization
In this section, we will optimize the dropout rate to reduce overfitting and enhance the model’s generalization:
```python
# PyTorch classifier with adjustable dropout rate
class PimaClassifier(nn.Module):
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.layer = nn.Linear(8, 12)
        self.act = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.output = nn.Linear(12, 1)
        self.prob = nn.Sigmoid()
        init.kaiming_uniform_(self.layer.weight)
        init.kaiming_uniform_(self.output.weight)

    def forward(self, x):
        x = self.act(self.layer(x))
        x = self.dropout(x)
        x = self.prob(self.output(x))
        return x

# Recreate the skorch wrapper so the grid search uses the new class
model = NeuralNetClassifier(
    PimaClassifier,
    criterion=nn.BCELoss,
    optimizer=optim.Adam,
    max_epochs=100,
    batch_size=10,
    verbose=False
)

# Grid search for dropout rates
param_grid = {
    'module__dropout_rate': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```
Tuning the Number of Neurons in the Hidden Layer
The number of neurons in a hidden layer directly impacts the model’s capacity to learn. In this example, we will experiment with how the number of neurons affects performance.
```python
# PyTorch classifier with an adjustable hidden layer size
class PimaClassifier(nn.Module):
    def __init__(self, n_neurons=12):
        super().__init__()
        self.layer = nn.Linear(8, n_neurons)
        self.act = nn.ReLU()
        self.output = nn.Linear(n_neurons, 1)
        self.prob = nn.Sigmoid()
        init.kaiming_uniform_(self.layer.weight)
        init.kaiming_uniform_(self.output.weight)

    def forward(self, x):
        x = self.act(self.layer(x))
        x = self.prob(self.output(x))
        return x

# Recreate the skorch wrapper so the grid search uses the new class
model = NeuralNetClassifier(
    PimaClassifier,
    criterion=nn.BCELoss,
    optimizer=optim.Adam,
    max_epochs=100,
    batch_size=10,
    verbose=False
)

# Grid search for the number of neurons
param_grid = {
    'module__n_neurons': [5, 10, 15, 20, 25, 30]
}
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
```
Tips for Hyperparameter Optimization
Consider these tips during hyperparameter tuning:
- Use Cross-Validation: The examples here use 3-fold cross-validation to keep runtimes short; more folds, such as 5 or 10, give more stable estimates.
- Examine All Results: Don’t just focus on the best results; analyze the entire grid for trends that inform your decisions.
- Parallel Processing: Use all CPU cores to speed up training, especially with multiple parameters. Consider utilizing cloud platforms for extensive computations.
- Work with Dataset Samples: First, train on a subset of your data to get a sense of parameter behaviors before running on the full dataset.
- Start Broad: Begin with coarse grids to identify promising regions, then zoom in for finer searches.
- Don't Transfer Results: Each problem is unique; the hyperparameters that are optimal for one may not be for another.
- Expect Reproducibility Challenges: Due to random initialization, results can vary between runs. Set the random seeds (see the sketch after this list), but still expect some variance.
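As a rough sketch of how several of these tips combine in practice, the snippet below seeds the random number generators, runs a deliberately coarse grid on a subset of the data, and uses 5-fold cross-validation. It assumes the `X`, `y`, and `model` objects from the examples above; the seed, subset size, and grid values are arbitrary illustrations rather than recommendations:

```python
import random

import numpy as np
import torch
from sklearn.model_selection import GridSearchCV

# Seed the random number generators; some run-to-run variance
# may remain even with fixed seeds
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Work with a sample of the data first to get a feel for the parameters
# (X and y as loaded in the earlier examples)
X_sample, y_sample = X[:500], y[:500]

# Start with a coarse grid and more folds than the examples above,
# then zoom in on the promising region with a finer grid
coarse_grid = {
    'optimizer__lr': [0.001, 0.01, 0.1],
    'max_epochs': [50, 100],
}
grid = GridSearchCV(estimator=model, param_grid=coarse_grid, n_jobs=-1, cv=5)
grid_result = grid.fit(X_sample, y_sample)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
```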
Summary
In this post, you learned how to tune hyperparameters for your deep learning networks in Python using PyTorch and Scikit-learn. Specifically, you gained insights on:
- Wrapping PyTorch models for compatibility with Scikit-learn and conducting grid searches.
- Searching through a variety of standard neural network parameters in PyTorch models.
- Designing your own hyperparameter optimization experiments.