Training a PyTorch Model with DataLoader and Dataset

When training a PyTorch deep learning model, you have multiple ways to supply the training data. At its core, a PyTorch model processes tensors as input and produces tensors as output. You can prepare a large tensor containing the entire dataset and manually extract batches during training. However, PyTorch’s DataLoader significantly simplifies this process.
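For comparison, manual batching looks something like the following sketch; the random tensors are stand-ins that mirror the Sonar data's shape (208 samples, 60 features):

import torch

# Stand-in tensors with the Sonar dataset's shape: 208 samples, 60 features
X = torch.randn(208, 60)
y = torch.randint(0, 2, (208, 1)).float()

batch_size = 16
for start in range(0, len(X), batch_size):
    # Slice one batch by hand; shuffling each epoch would need extra bookkeeping
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]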

This article will guide you through using DataLoader and Dataset for efficient data handling in PyTorch. By the end, you’ll understand:

  • How to create and use DataLoader to train PyTorch models.
  • How to leverage the Dataset class to generate data on the fly.

Overview

This guide is divided into the following sections:

  1. What is DataLoader?
  2. Using DataLoader in a Training Loop
  3. Using the Dataset Class

What is DataLoader?

To train a deep learning model, you need a dataset comprising numerous samples. While you can manually slice the dataset to create batches, shuffle the data, and apply augmentations, the DataLoader automates these tasks efficiently.

Here’s an example that demonstrates creating a DataLoader for the Sonar dataset:

import pandas as pd
import torch
from torch.utils.data import DataLoader
from sklearn.preprocessing import LabelEncoder

# Load data
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60].values
y = data.iloc[:, 60].values

# Encode class labels
encoder = LabelEncoder()
y = encoder.fit_transform(y)

# Convert to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# Create DataLoader
loader = DataLoader(list(zip(X, y)), shuffle=True, batch_size=16)

# Fetch a batch
for X_batch, y_batch in loader:
    print(X_batch, y_batch)
    break

In this example:

  • Input and Target: The features (X) and labels (y) are packaged as tensors.
  • DataLoader: Combines the dataset into iterable batches. The shuffle=True argument ensures data is shuffled at the start of each epoch.
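Beyond shuffle and batch_size, DataLoader exposes a few more arguments worth knowing. A brief, illustrative sketch (the values here are arbitrary):

# drop_last discards a final smaller batch; num_workers enables parallel loading
loader = DataLoader(
    list(zip(X, y)),
    shuffle=True,       # reshuffle at the start of every epoch
    batch_size=16,      # number of samples per batch
    drop_last=True,     # skip the last batch if it has fewer than 16 samples
    num_workers=0,      # >0 loads batches in subprocesses
)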

Using DataLoader in a Training Loop

Below is an example of using DataLoader in a complete training loop:

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)

# Create DataLoader for training set
loader = DataLoader(list(zip(X_train, y_train)), shuffle=True, batch_size=16)

# Define the model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Loss and optimizer
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
n_epochs = 200
model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Evaluate model (disable gradient tracking during inference)
model.eval()
with torch.no_grad():
    y_pred = model(X_test)
accuracy = (y_pred.round() == y_test).float().mean()
print(f"Model accuracy: {accuracy.item() * 100:.2f}%")

This approach simplifies handling data batches during training. Evaluation here runs the entire test set through the model in a single forward pass; for larger datasets you could use a DataLoader for evaluation as well, as sketched below.
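A minimal sketch of batched evaluation, assuming the X_test and y_test tensors from above:

# Evaluate in batches; shuffling is unnecessary for evaluation
test_loader = DataLoader(list(zip(X_test, y_test)), shuffle=False, batch_size=16)

model.eval()
correct = total = 0
with torch.no_grad():
    for X_batch, y_batch in test_loader:
        y_pred = model(X_batch)
        correct += (y_pred.round() == y_batch).float().sum().item()
        total += len(y_batch)
print(f"Model accuracy: {correct / total * 100:.2f}%")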


Using the Dataset Class

The Dataset class allows you to create custom data handling logic, such as loading data from disk or applying augmentations on-the-fly. By subclassing Dataset, you define how to access and return individual data samples.

Implementing a Custom Dataset

Here’s an example:

from torch.utils.data import Dataset

class SonarDataset(Dataset):
    def __init__(self, X, y):
        # torch.as_tensor accepts NumPy arrays or existing tensors without copying
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32)

    def __len__(self):
        # Number of samples in the dataset
        return len(self.X)

    def __getitem__(self, idx):
        # Return one (features, label) pair
        return self.X[idx], self.y[idx]

Integrating Dataset with DataLoader

# Create dataset and DataLoader
train_dataset = SonarDataset(X_train, y_train)
loader = DataLoader(train_dataset, shuffle=True, batch_size=16)

# Train model
model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Evaluate model (X_test and y_test are already tensors, so no conversion is needed)
model.eval()
with torch.no_grad():
    y_pred = model(X_test)
accuracy = (y_pred.round() == y_test).float().mean()
print(f"Model accuracy: {accuracy.item() * 100:.2f}%")

By subclassing Dataset, you can add preprocessing, augmentations, or dynamic loading from external sources in the __getitem__ method.
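As an illustration, here is a sketch of a Dataset that applies an optional transform inside __getitem__. The transform argument and the noise augmentation are hypothetical choices for demonstration, reusing the imports from above:

class AugmentedSonarDataset(Dataset):
    def __init__(self, X, y, transform=None):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32)
        self.transform = transform  # a callable applied to each sample, or None

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        features = self.X[idx]
        if self.transform is not None:
            # Applied on the fly, so every epoch can see a slightly different sample
            features = self.transform(features)
        return features, self.y[idx]

# Example: perturb features with small Gaussian noise as a simple augmentation
augmented_dataset = AugmentedSonarDataset(X, y, transform=lambda f: f + 0.01 * torch.randn_like(f))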


Using random_split for Dataset Splitting

Instead of using train_test_split from scikit-learn, you can leverage PyTorch’s random_split to split datasets:

from torch.utils.data import random_split

dataset = SonarDataset(X, y)
# random_split requires integer lengths (newer PyTorch versions also accept fractions such as [0.7, 0.3])
n_train = int(0.7 * len(dataset))
train_dataset, test_dataset = random_split(dataset, [n_train, len(dataset) - n_train])

loader = DataLoader(train_dataset, shuffle=True, batch_size=16)

# Training loop remains the same
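If the split should be reproducible across runs, random_split accepts a seeded generator; a small sketch (the seed 42 is arbitrary):

# Seed the generator so the same split is produced every run
generator = torch.Generator().manual_seed(42)
train_dataset, test_dataset = random_split(
    dataset, [n_train, len(dataset) - n_train], generator=generator
)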

For evaluation, you can use the default_collate function to combine test samples into tensors:

from torch.utils.data.dataloader import default_collate

# Stack every test sample into one (X, y) pair of batch tensors
X_test, y_test = default_collate(list(test_dataset))
model.eval()
with torch.no_grad():
    y_pred = model(X_test)
accuracy = (y_pred.round() == y_test).float().mean()
print(f"Model accuracy: {accuracy.item() * 100:.2f}%")

Summary

In this article, you learned:

  • How DataLoader simplifies batching and shuffling of training data.
  • How to use the Dataset class for custom data handling.
  • How to integrate Dataset and DataLoader into training and evaluation loops.

These tools not only streamline data handling in PyTorch but also make your training pipeline more flexible and efficient.
