Building a Regression Model in PyTorch


PyTorch is a popular library for deep learning, enabling users to tackle various problems including regression and classification tasks. In this post, you will learn how to develop and evaluate neural network models for regression problems using PyTorch.

By the end of this article, you will know:

  • How to load data from Scikit-learn and prepare it for use with PyTorch models.
  • How to create a neural network to solve regression problems in PyTorch.
  • How to enhance model performance using data preparation techniques.

Let’s get started.

Description of the Dataset

For this tutorial, we will use the California housing dataset, which contains information about median house values across different California districts. Each data point corresponds to a census block group, with the target variable being the median house value in USD (scaled by 100,000) as of 1990. The dataset includes eight input features:

  • MedInc: Median income in the block group
  • HouseAge: Median house age in the block group
  • AveRooms: Average number of rooms per household
  • AveBedrms: Average number of bedrooms per household
  • Population: Block group population
  • AveOccup: Average number of household members
  • Latitude: Block group centroid latitude
  • Longitude: Block group centroid longitude

This dataset presents unique challenges because the input features vary widely in scale: the average number of rooms per household is in the single digits, while the block group population is typically in the hundreds or thousands. And while most features are positive, longitude is always negative in California. A quick check after loading the data makes these differences concrete.

You can retrieve the dataset using Scikit-learn’s built-in functionality:

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
print(data.feature_names)

X, y = data.data, data.target
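To see the scale differences for yourself, you can print the range of each feature. This is a quick sanity check that reuses the data and X just loaded:

# Inspect per-feature ranges to see the differing scales
for name, col_min, col_max in zip(data.feature_names, X.min(axis=0), X.max(axis=0)):
    print(f"{name}: min={col_min:.2f}, max={col_max:.2f}")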

Building and Training the Model

Since this is a regression problem, the output variable is continuous. The output layer therefore typically uses no activation function (a linear activation), so the prediction can, in principle, take any real value.

Instead of expecting perfect predictions, focus on how closely your model’s predictions match the actual values. Common loss metrics for regression include Mean Square Error (MSE) and Mean Absolute Error (MAE). Additionally, Root Mean Squared Error (RMSE) can be useful as it expresses errors in the same unit as the target variable.
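As a quick illustration of these metrics, here is a minimal sketch using toy tensors. Note that nn.L1Loss is PyTorch’s implementation of MAE, and RMSE is simply the square root of MSE:

import torch
import torch.nn as nn

y_true = torch.tensor([[1.5], [2.0], [3.5]])
y_hat = torch.tensor([[1.0], [2.5], [3.0]])

mse = nn.MSELoss()(y_hat, y_true)  # mean of squared errors
mae = nn.L1Loss()(y_hat, y_true)   # mean of absolute errors
rmse = torch.sqrt(mse)             # same unit as the target
print(f"MSE={mse:.4f}, MAE={mae:.4f}, RMSE={rmse:.4f}")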

We’ll employ a pyramid structure for our neural network, where the number of neurons decreases as we progress through the layers. Here’s one potential architecture:

import torch.nn as nn

# Define the model
model = nn.Sequential(
    nn.Linear(8, 24),
    nn.ReLU(),
    nn.Linear(24, 12),
    nn.ReLU(),
    nn.Linear(12, 6),
    nn.ReLU(),
    nn.Linear(6, 1)
)
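If you prefer defining models by subclassing nn.Module, the following sketch is functionally equivalent to the Sequential version above (the class name HousingModel is just for illustration):

import torch.nn as nn

class HousingModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 24)
        self.layer2 = nn.Linear(24, 12)
        self.layer3 = nn.Linear(12, 6)
        self.output = nn.Linear(6, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        x = self.relu(self.layer3(x))
        return self.output(x)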

To train this model, define a loss function such as MSE and an optimizer, like Adam:

import torch.optim as optim

# Loss function and optimizer
loss_fn = nn.MSELoss()  # Mean square error
optimizer = optim.Adam(model.parameters(), lr=0.0001)
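Adam usually works well with little tuning, which is why it is used here. If you want a baseline for comparison, plain stochastic gradient descent is the usual alternative (a sketch; the learning rate below is only a starting point you would likely need to tune):

# Alternative: plain SGD (learning rate chosen for illustration)
optimizer = optim.SGD(model.parameters(), lr=0.01)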

Now, prepare to train the model. It’s essential to split the data into training and test sets to evaluate model performance and reduce overfitting risks by monitoring the test MSE.

import copy
import numpy as np
import torch
import tqdm
from sklearn.model_selection import train_test_split

# Train-test split of the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)  # column vector to match the model's (N, 1) output
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1)

# Training parameters
n_epochs = 100   # Number of epochs
batch_size = 10  # Size of each batch
batch_start = torch.arange(0, len(X_train), batch_size)

# Hold the best model
best_mse = np.inf   # Initialize to infinity
best_weights = None
history = []

# Training loop
for epoch in range(n_epochs):
    model.train()
    with tqdm.tqdm(batch_start, unit="batch", mininterval=0, disable=True) as bar:
        bar.set_description(f"Epoch {epoch}")
        for start in bar:
            # Take a batch
            X_batch = X_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]

            # Forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Print progress
            bar.set_postfix(mse=float(loss))

    # Evaluate the model on the test set at the end of each epoch
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
        mse = float(loss_fn(y_pred, y_test))
    history.append(mse)
    if mse < best_mse:
        best_mse = mse
        best_weights = copy.deepcopy(model.state_dict())

# Restore the model to the best set of weights found
model.load_state_dict(best_weights)
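The loop above slices mini-batches manually using the batch_start indices. If you prefer, PyTorch’s DataLoader can handle the batching and reshuffle the data each epoch; a sketch of the equivalent setup:

from torch.utils.data import DataLoader, TensorDataset

# Wrap the tensors in a dataset; DataLoader handles batching and shuffling
train_loader = DataLoader(TensorDataset(X_train, y_train),
                          batch_size=batch_size, shuffle=True)

for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # ...evaluate on the test set as before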

After executing the code, you will have the best model and associated MSE metrics. The MSE is computed as the average of squared differences between the predicted and actual values. RMSE can provide a more interpretable measure of average prediction error.

To visualize the MSE history and report performance metrics:

print("MSE: %.2f" % best_mse)
print("RMSE: %.2f" % np.sqrt(best_mse))
plt.plot(history)
plt.show()

Improving the Model with Preprocessing

With an initial RMSE of around 0.68, there is potential to enhance model accuracy through data preprocessing. The heterogeneity in feature scales can hinder training efficiency. A common solution is to apply a Standard Scaler, which standardizes features by removing the mean and scaling to unit variance:

from sklearn.preprocessing import StandardScaler

# Split first, keeping the raw arrays for scaling
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)

# Fit the scaler on the training data only
scaler = StandardScaler()
scaler.fit(X_train_raw)

X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

# Convert to PyTorch tensors after scaling
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1)

It is critical to fit the scaler on the training data only and then transform both the training and test datasets afterward to prevent data leakage.
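As a shorthand, scikit-learn’s fit_transform combines the fit and transform steps for the training split; the contrast below shows the leakage-safe order:

# Wrong: fitting on the full dataset leaks test-set statistics into training
# scaler.fit(X)

# Right: fit on the training split only, then transform both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # fit() and transform() in one call
X_test = scaler.transform(X_test_raw)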

After scaling, re-run the training loop with the modified datasets. You should observe improved metrics:

# Post-training evaluation
print("MSE: %.2f" % best_mse)
print("RMSE: %.2f" % np.sqrt(best_mse))

Complete Code

Below is the complete code, combining all steps effectively:

import copy
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load data
data = fetch_california_housing()
X, y = data.data, data.target

# Train-test split for evaluation
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)

# Standardizing data
scaler = StandardScaler()
scaler.fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

# Convert to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1)

# Define the model
model = nn.Sequential(
    nn.Linear(8, 24),
    nn.ReLU(),
    nn.Linear(24, 12),
    nn.ReLU(),
    nn.Linear(12, 6),
    nn.ReLU(),
    nn.Linear(6, 1)
)

# Loss function and optimizer
loss_fn = nn.MSELoss()  # Mean square error
optimizer = optim.Adam(model.parameters(), lr=0.0001)

n_epochs = 100   # Number of epochs to run
batch_size = 10  # Size of each batch
batch_start = torch.arange(0, len(X_train), batch_size)

# Hold the best model
best_mse = np.inf   # Initialize to infinity
best_weights = None
history = []

# Training loop
for epoch in range(n_epochs):
    model.train()
    with tqdm.tqdm(batch_start, unit="batch", mininterval=0, disable=True) as bar:
        bar.set_description(f"Epoch {epoch}")
        for start in bar:
            X_batch = X_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]

            # Forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            bar.set_postfix(mse=float(loss))

    # Evaluate the model on the test set at the end of each epoch
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
        mse = float(loss_fn(y_pred, y_test))
    history.append(mse)
    if mse < best_mse:
        best_mse = mse
        best_weights = copy.deepcopy(model.state_dict())

# Restore model with best weights
model.load_state_dict(best_weights)
print("MSE: %.2f" % best_mse)
print("RMSE: %.2f" % np.sqrt(best_mse))
plt.plot(history)
plt.xlabel("Epoch")
plt.ylabel("Test MSE")
plt.show()

# Inference with new data
model.eval()
with torch.no_grad():
    for i in range(5):
        X_sample = X_test_raw[i: i+1]
        X_sample = scaler.transform(X_sample)
        X_sample = torch.tensor(X_sample, dtype=torch.float32)
        y_pred = model(X_sample)
        print(f"{X_test_raw[i]} -> {y_pred[0].numpy()} (expected {y_test[i].numpy()})")

Summary

In this post, you explored the process of building a regression model in PyTorch: loading and preparing the data, creating a neural network, training it against a held-out test set, and improving its accuracy through preprocessing. Standardizing the features alone can significantly improve the model, demonstrating the importance of data preparation in machine learning.

