Using Activation Functions in Deep Learning Models


In its most basic form, a deep learning model is a stack of perceptron layers connected in sequence. Without activation functions, these layers merely perform matrix multiplications, so the whole stack collapses to a single linear transformation no matter how many layers are included. Activation functions are pivotal because they allow neural networks to approximate a wide range of non-linear functions. PyTorch provides a variety of activation functions, and this post will demonstrate how their choice influences model performance.

What You’ll Learn

  • Common activation functions used in deep learning.
  • The characteristics of these activation functions.
  • How the choice of activation function affects the speed of learning.
  • How the selection of activation functions can mitigate the vanishing gradient problem.

Let’s Get Started

Overview

This article is structured into three parts:

  1. A Toy Model for Binary Classification
  2. The Importance of Nonlinear Functions
  3. The Impact of Activation Functions

A Toy Model for Binary Classification

We’ll begin with a straightforward example of binary classification. Using the make_circles() function from Scikit-learn, we can generate a synthetic dataset featuring two features: the x- and y-coordinates of points. Each point belongs to one of two classes, with a total of 1000 data points depicted as follows:

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim

# Generate two circles for a classification problem
X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y.reshape(-1, 1), dtype=torch.float32)

plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=y)
plt.show()

This dataset is special because it is simple but not linearly separable. The challenge for your neural network is to identify the circular boundary between the classes.

Let’s construct a deep learning model tailored for this binary classification task. We’ll forgo cross-validation for simplicity; be aware that without a held-out set, any overfitting will go unnoticed. Our model will consist of four hidden layers, with the output layer employing a sigmoid activation function to produce a value between 0 and 1. Here’s how to implement it:

class Model(nn.Module):
    def __init__(self, activation=nn.ReLU):
        super().__init__()
        # four hidden layers of width 5, each followed by the chosen activation
        self.layer0 = nn.Linear(2, 5)
        self.act0 = activation()
        self.layer1 = nn.Linear(5, 5)
        self.act1 = activation()
        self.layer2 = nn.Linear(5, 5)
        self.act2 = activation()
        self.layer3 = nn.Linear(5, 5)
        self.act3 = activation()
        # output layer with sigmoid for binary classification
        self.layer4 = nn.Linear(5, 1)
        self.act4 = nn.Sigmoid()

    def forward(self, x):
        x = self.act0(self.layer0(x))
        x = self.act1(self.layer1(x))
        x = self.act2(self.layer2(x))
        x = self.act3(self.layer3(x))
        x = self.act4(self.layer4(x))
        return x

Next, we define the training loop to run for several epochs:

def train_loop(model, X, y, n_epochs=300, batch_size=32):
    loss_fn = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.0001)
    batch_start = torch.arange(0, len(X), batch_size)

    bce_hist = []
    acc_hist = []

    for epoch in range(n_epochs):
        # train on mini-batches
        model.train()
        for start in batch_start:
            X_batch = X[start:start+batch_size]
            y_batch = y[start:start+batch_size]
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # evaluate loss and accuracy on the full dataset at the end of each epoch
        model.eval()
        with torch.no_grad():
            y_pred = model(X)
            bce = float(loss_fn(y_pred, y))
            acc = float((y_pred.round() == y).float().mean())
        bce_hist.append(bce)
        acc_hist.append(acc)

        # report metrics every 10 epochs
        if (epoch + 1) % 10 == 0:
            print("Epoch %d: BCE=%.4f, Accuracy=%.2f%%" % (epoch + 1, bce, acc * 100))

    return bce_hist, acc_hist

After training the model with the ReLU activation function, we plot the training history:

activation = nn.ReLU
model = Model(activation=activation)
bce_hist, acc_hist = train_loop(model, X, y)
plt.plot(bce_hist, label="BCE")
plt.plot(acc_hist, label="Accuracy")
plt.xlabel("Epochs")
plt.ylim(0, 1)
plt.title(str(activation))
plt.legend()
plt.show()

The model performs well, achieving about 97% accuracy after 300 epochs.
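
To see what the network actually learned, one option (not part of the original code) is to evaluate the trained model on a grid of points and plot the resulting decision boundary. The sketch below assumes the model and data defined above:

import numpy as np

# Evaluate the trained model on a dense grid covering the data
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
grid = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
model.eval()
with torch.no_grad():
    zz = model(grid).reshape(xx.shape)

# Shade the two predicted regions and overlay the data points
plt.contourf(xx, yy, zz, levels=[0, 0.5, 1], alpha=0.3)
plt.scatter(X[:,0], X[:,1], c=y.squeeze())
plt.show()

A well-trained model should shade a roughly circular region matching the inner circle of the dataset.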

Comparing Activation Functions

Let’s compare three activation functions (ReLU, sigmoid, and hyperbolic tangent):

x = torch.linspace(-4, 4, 200)
relu = nn.ReLU()(x)
tanh = nn.Tanh()(x)
sigmoid = nn.Sigmoid()(x)

plt.plot(x, sigmoid, label="sigmoid")
plt.plot(x, tanh, label="tanh")
plt.plot(x, relu, label="ReLU")
plt.ylim(-1.5, 2)
plt.legend()
plt.show()

ReLU (Rectified Linear Unit) acts as the identity for positive inputs and outputs zero for negative ones. The sigmoid function maps inputs to the range 0 to 1, while the hyperbolic tangent maps them to the range -1 to 1. In terms of differentiation, ReLU is the simplest: its gradient is 1 for positive inputs and 0 for negative ones. The hyperbolic tangent has steeper gradients than the sigmoid; its derivative peaks at 1 around zero, whereas the sigmoid's derivative never exceeds 0.25.
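
To make the gradient comparison concrete, you can differentiate each activation with autograd and plot the result. This is an illustrative sketch, not part of the original code:

# Differentiate each activation over a range of inputs using autograd
x = torch.linspace(-4, 4, 200, requires_grad=True)

torch.sigmoid(x).sum().backward()
sigmoid_grad = x.grad.clone()
x.grad.zero_()

torch.tanh(x).sum().backward()
tanh_grad = x.grad.clone()
x.grad.zero_()

torch.relu(x).sum().backward()
relu_grad = x.grad.clone()

plt.plot(x.detach(), sigmoid_grad, label="sigmoid gradient")
plt.plot(x.detach(), tanh_grad, label="tanh gradient")
plt.plot(x.detach(), relu_grad, label="ReLU gradient")
plt.legend()
plt.show()

The plot shows the sigmoid gradient peaking at 0.25, the tanh gradient peaking at 1, and the ReLU gradient as a step from 0 to 1.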

Why Nonlinear Functions?

You may wonder why nonlinear activation functions are emphasized. Using multiple linear layers without nonlinearity would effectively revert to a single linear transformation. Nonlinear functions enable the model to express more complex relationships and representations.
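
As a quick illustration (using small, arbitrary layer sizes rather than the model above), two stacked linear layers with no activation in between are exactly equivalent to one linear layer:

# Two linear layers with no activation collapse to a single linear map
lin1 = nn.Linear(2, 5, bias=False)
lin2 = nn.Linear(5, 1, bias=False)

x_sample = torch.randn(4, 2)
with torch.no_grad():
    out_stacked = lin2(lin1(x_sample))        # pass through both layers
    combined = lin2.weight @ lin1.weight      # fold both weights into one 1x2 matrix
    out_single = x_sample @ combined.t()      # a single linear transformation
print(torch.allclose(out_stacked, out_single, atol=1e-6))  # True

No matter how many such layers you stack, the result is still one matrix multiplication, which is why a nonlinearity between layers is essential.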

The Effect of Activation Functions

To illustrate the impact of activation functions, we’ll modify the training function to capture gradient information during training. Each training step will record the mean absolute value of gradients for each layer.

def train_loop(model, X, y, n_epochs=300, batch_size=32):
    ...
    # one list of mean absolute gradients per layer
    layers = [model.layer0, model.layer1, model.layer2, model.layer3, model.layer4]
    grad_hist = [[], [], [], [], []]

    for epoch in range(n_epochs):
        ...
        for start in batch_start:
            ...
            # after loss.backward(), record the mean absolute gradient of each layer's weights
            for n, layer in enumerate(layers):
                mean_grad = float(layer.weight.grad.abs().mean())
                grad_hist[n].append(mean_grad)

    return bce_hist, acc_hist, grad_hist

Visualization

After running this enhanced training loop for different activation functions, you can visualize the changes in accuracy, loss, and layer gradients.
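
Here is a rough sketch of how that comparison might be wired up, assuming the modified train_loop above, which returns the loss, accuracy, and per-layer gradient histories:

for activation in [nn.Sigmoid, nn.Tanh, nn.ReLU]:
    model = Model(activation=activation)
    bce_hist, acc_hist, grad_hist = train_loop(model, X, y)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    # left: loss and accuracy over epochs
    ax1.plot(bce_hist, label="BCE")
    ax1.plot(acc_hist, label="Accuracy")
    ax1.set_xlabel("Epochs")
    ax1.legend()
    # right: mean absolute gradient of each layer over training steps
    for n, grads in enumerate(grad_hist):
        ax2.plot(grads, label="layer%d" % n)
    ax2.set_xlabel("Training steps")
    ax2.legend()
    fig.suptitle(str(activation))
    plt.show()

With the sigmoid activation you would typically see the earliest layers receiving much smaller gradients than the later ones, which is the vanishing gradient problem in action; ReLU keeps the gradients of all layers on a more similar scale.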

Further Exploration with Variants of ReLU

We can also experiment with variations of ReLU, such as ReLU6 and Leaky ReLU:

x = torch.linspace(-8, 8, 200)
relu = nn.ReLU()(x)
relu6 = nn.ReLU6()(x)
leaky = nn.LeakyReLU()(x)

plt.plot(x, relu, label="ReLU")
plt.plot(x, relu6, label="ReLU6")
plt.plot(x, leaky, label="LeakyReLU")
plt.legend()
plt.show()

ReLU6 caps the output at 6.0, while Leaky ReLU applies a small positive slope (0.01 by default in PyTorch) to negative inputs, which keeps gradients flowing where plain ReLU would output exactly zero.
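
Because the Model class above takes the activation class as a constructor argument, trying one of these variants is a one-line change. Here the modified train_loop, which also returns the gradient history, is assumed:

# Train the same architecture with Leaky ReLU in place of plain ReLU
model = Model(activation=nn.LeakyReLU)
bce_hist, acc_hist, grad_hist = train_loop(model, X, y)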

Summary

In this post, you learned how to select activation functions for your PyTorch models, explored common activation functions and their properties, understood the vanishing gradient problem, and saw how the choice of activation function influences model performance.

