LightGBM is a highly efficient gradient boosting framework that has gained popularity for its speed and performance on large, complex datasets. Developed by Microsoft, it is particularly effective at handling substantial data volumes, often training faster than traditional gradient boosting implementations.
In this article, we’ll explore the LightGBM framework using the Ames Housing dataset. We will particularly focus on its versatile boosting strategies—Gradient Boosting Decision Tree (GBDT) and Gradient-based One-Side Sampling (GOSS)—highlighting their unique advantages and comparing their performance.
We’ll begin with an overview of LightGBM, followed by practical applications and theoretical insights.
Kick-start your project with my book, Next-Level Data Science, featuring self-study tutorials and working code.
Let’s dive in!

Overview
This article is structured into four main sections:
- Introduction to LightGBM and Initial Setup
- Testing GBDT and GOSS with the Ames Dataset
- Fine-Tuning LightGBM’s Leaf-wise Growth Strategy
- Comparing Feature Importance Between GBDT and GOSS Models
Introduction to LightGBM and Initial Setup
LightGBM (Light Gradient Boosting Machine) is a powerful machine learning framework offering essential components to build, train, and deploy models based on decision tree algorithms, utilizing gradient boosting at its core. It is open-source and can be installed with the following command:
pip install lightgbm
During installation, the LightGBM package and its dependencies will be downloaded.
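A quick way to confirm the installation succeeded, and to note which release you are running (results in this article may vary slightly across versions), is to print the package version:
import lightgbm
print(lightgbm.__version__)  # prints the installed LightGBM version, e.g. something like '4.x.x'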
Although LightGBM, XGBoost, and Gradient Boosting Regressor (GBR) all follow the principle of gradient boosting, LightGBM stands out due to unique features and optional parameters that enhance its performance:
- Exclusive Feature Bundling (EFB): Enabled by default, EFB bundles mutually exclusive sparse features together, reducing the effective number of features and managing data dimensionality with minimal manual effort.
- Gradient-Based One-Side Sampling (GOSS): An optional feature that prioritizes instances with large gradients, which signal significant discrepancies between the model’s predictions and actual values. GOSS retains these crucial data points during sampling, ensuring they always contribute to model training, while instances with smaller gradients (deemed “well-trained”) are less emphasized.
- Leaf-wise Tree Growth: Unlike the level-wise growth used by GBR and, by default, XGBoost, LightGBM grows trees by splitting the leaf that offers the greatest loss reduction, producing asymmetrical trees that can outperform their balanced counterparts.
These unique characteristics position LightGBM as a formidable candidate in the gradient boosting landscape, setting the stage for practical experimentation.
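If you want to see where these features surface in code, the short sketch below is an illustration rather than part of the article's experiments. The boosting_type and num_leaves arguments are standard scikit-learn API parameters of LGBMRegressor; the enable_bundle switch mentioned in the comment is LightGBM's parameter for turning EFB off, which you would normally leave at its default.
import lightgbm as lgb
# Illustrative only: how the three features above map to LGBMRegressor parameters
model = lgb.LGBMRegressor(
    boosting_type='gbdt',  # switch to 'goss' for Gradient-based One-Side Sampling
    num_leaves=31,         # caps leaf-wise growth; the main complexity control (31 is the default)
)
# Exclusive Feature Bundling runs by default; it can be disabled with enable_bundle=False
# if you ever need to rule it out while debugging.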
Testing LightGBM’s GBDT and GOSS on the Ames Dataset
With a clear understanding of LightGBM’s capabilities, we shift focus to practical applications using the Ames Housing dataset. Our goal is to rigorously compare the performance of GBDT and GOSS boosting strategies within the LightGBM framework.
Proper dataset preparation is essential: we load the data and cast categorical columns to the category dtype so LightGBM can apply its native handling of categorical variables. LightGBM also handles missing values natively, which streamlines preprocessing and enhances model robustness.
Here’s how to set up the environment and evaluate model performance:
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
# Load the Ames Housing Dataset
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
# Convert categorical columns to 'category' dtype
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
# Define the GBDT model
gbdt_model = lgb.LGBMRegressor()
gbdt_scores = cross_val_score(gbdt_model, X, y, cv=5)
print(f"Average R² score for default LightGBM (GBDT): {gbdt_scores.mean():.4f}")
# Define the GOSS model
goss_model = lgb.LGBMRegressor(boosting_type='goss')
goss_scores = cross_val_score(goss_model, X, y, cv=5)
print(f"Average R² score for LightGBM with GOSS: {goss_scores.mean():.4f}")
Results:
- Average R² score for default LightGBM (GBDT): 0.9145
- Average R² score for LightGBM with GOSS: 0.9109
Our 5-fold cross-validation results show that the GBDT model achieves a commendable average R² score of 0.9145, whereas the GOSS model scores slightly lower at 0.9109. The small gap may stem from GOSS’s focus on instances with large gradients, which pays off most when mispredictions are concentrated in particular data points. On a relatively uniform dataset like Ames, GOSS’s advantages may not be fully realized.
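If you want to experiment further with GOSS, its sampling behavior is governed by two LightGBM parameters: top_rate, the fraction of large-gradient instances always kept, and other_rate, the fraction of the remaining instances sampled at random. The sketch below is a hedged illustration of varying them; the values are arbitrary choices rather than tuned recommendations, and the results reported above were produced with the defaults.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
# Prepare the data as before
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
# Illustrative GOSS sampling rates (LightGBM defaults are top_rate=0.2, other_rate=0.1)
goss_tuned = lgb.LGBMRegressor(boosting_type='goss', top_rate=0.3, other_rate=0.2)
goss_tuned_scores = cross_val_score(goss_tuned, X, y, cv=5, scoring='r2')
print(f"Average R² score for GOSS with adjusted sampling rates: {goss_tuned_scores.mean():.4f}")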
Fine-Tuning LightGBM’s Leaf-wise Growth Strategy
A notable feature of LightGBM is its leaf-wise tree construction, allowing trees to grow by maximizing loss reductions. This strategy can enhance model performance, but it risks overfitting if not carefully tuned. In this section, we’ll examine how varying the number of leaves affects model performance for GBDT and GOSS.
We’ll conduct experiments to assess different num_leaves settings for both variants:
# Experimenting with Leaf-wise Tree Growth
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
# Load the Ames Housing Dataset
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
# Convert categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
# Testing various leaf sizes
leaf_sizes = [5, 10, 15, 31, 50, 100]
results = {'GBDT': {}, 'GOSS': {}}
# GBDT Model Evaluation
print("Testing 'num_leaves' for GBDT:")
for leaf_size in leaf_sizes:
    model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=leaf_size)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results['GBDT'][leaf_size] = scores.mean()
    print(f"num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}")
# GOSS Model Evaluation
print("\nTesting 'num_leaves' for GOSS:")
for leaf_size in leaf_sizes:
    model = lgb.LGBMRegressor(boosting_type='goss', num_leaves=leaf_size)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results['GOSS'][leaf_size] = scores.mean()
    print(f"num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}")
Results:
Testing ‘num_leaves’ for GBDT:
- num_leaves = 5: Average R² score = 0.9150
- num_leaves = 10: Average R² score = 0.9193
- num_leaves = 15: Average R² score = 0.9158
- num_leaves = 31: Average R² score = 0.9145
- num_leaves = 50: Average R² score = 0.9111
- num_leaves = 100: Average R² score = 0.9101
Testing ‘num_leaves’ for GOSS:
- num_leaves = 5: Average R² score = 0.9151
- num_leaves = 10: Average R² score = 0.9168
- num_leaves = 15: Average R² score = 0.9130
- num_leaves = 31: Average R² score = 0.9109
- num_leaves = 50: Average R² score = 0.9117
- num_leaves = 100: Average R² score = 0.9124
Both models achieved optimal performance at a num_leaves setting of 10, demonstrating that moderate complexity captures the dataset’s patterns without overfitting. For GBDT, increasing the leaf count beyond 10 hindered performance, while GOSS showed slightly more resilience to higher leaf counts.
This experiment illustrates how critical tuning num_leaves is in LightGBM for striking a balance between accuracy and model complexity.
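As a further guard against overfitting, num_leaves can be combined with other complexity controls such as min_child_samples (the minimum number of samples per leaf) and max_depth (a hard cap on depth even under leaf-wise growth). The snippet below is a sketch using assumed, untuned values rather than part of the experiment above:
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
# Prepare the data as before
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
# Illustrative constraints: num_leaves=10 from the experiment, plus assumed values for
# min_child_samples (default 20) and max_depth (default -1, i.e. no limit)
constrained_model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=10,
                                      min_child_samples=30, max_depth=6)
constrained_scores = cross_val_score(constrained_model, X, y, cv=5, scoring='r2')
print(f"Average R² score with constrained leaf-wise growth: {constrained_scores.mean():.4f}")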
Comparing Feature Importance in LightGBM’s GBDT and GOSS Models
Following our analysis of num_leaves, we explore feature importance within the GBDT and GOSS models. Understanding feature importance is key to interpreting model behavior, revealing which features contribute most to predictive power.
Here’s how we analyze feature importance:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import seaborn as sns
# Prepare data
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
# Setting up K-fold cross-validation
kf = KFold(n_splits=5)
gbdt_feature_importances = []
goss_feature_importances = []
# Training and importance extraction
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # GBDT model training
    gbdt_model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=10)
    gbdt_model.fit(X_train, y_train)
    gbdt_feature_importances.append(gbdt_model.feature_importances_)

    # GOSS model training
    goss_model = lgb.LGBMRegressor(boosting_type='goss', num_leaves=10)
    goss_model.fit(X_train, y_train)
    goss_feature_importances.append(goss_model.feature_importances_)
# Average feature importance
avg_gbdt_importance = np.mean(gbdt_feature_importances, axis=0)
avg_goss_importance = np.mean(goss_feature_importances, axis=0)
# Convert to DataFrame
feat_imp_gbdt = pd.DataFrame({'Feature': X.columns, 'Importance': avg_gbdt_importance})
feat_imp_goss = pd.DataFrame({'Feature': X.columns, 'Importance': avg_goss_importance})
# Top 10 features
top_gbdt_features = feat_imp_gbdt.sort_values(by='Importance', ascending=False).head(10)
top_goss_features = feat_imp_goss.sort_values(by='Importance', ascending=False).head(10)
# Plotting feature importance
plt.figure(figsize=(16, 12))
plt.subplot(1, 2, 1)
sns.barplot(data=top_gbdt_features, y='Feature', x='Importance', orient='h', palette='viridis')
plt.title('Top 10 LightGBM GBDT Features', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature', fontsize=16)
plt.subplot(1, 2, 2)
sns.barplot(data=top_goss_features, y='Feature', x='Importance', orient='h', palette='viridis')
plt.title('Top 10 LightGBM GOSS Features', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature', fontsize=16)
plt.tight_layout()
plt.show()
For consistency with our previous experiments, we use the same 5-fold cross-validation setup, this time extracting the feature importances from each fold and averaging them across folds.
The analysis reveals intriguing patterns in feature significance between models. Both GBDT and GOSS prioritize “GrLivArea” and “LotArea,” signifying property size’s critical role in influencing home prices. Additionally, “Neighborhood” ranks high for both, demonstrating location’s impact in real estate.
Divergence begins with the fourth feature; GBDT favors “BsmtFinSF1,” highlighting finished basements’ value, while GOSS places greater emphasis on “OverallQual.”
Our findings indicate that differing feature importance between GBDT and GOSS offers essential insights into how each model interprets feature relevance in housing price prediction.
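Note that feature_importances_ in LightGBM’s scikit-learn API counts split frequency by default. As an optional follow-up, not reflected in the results above, you can request gain-based importance, which weights each feature by the total loss reduction its splits provide. A minimal sketch, fitting on the full dataset purely for illustration:
import pandas as pd
import lightgbm as lgb
# Prepare the data as before
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
# importance_type='gain' reports total loss reduction instead of split counts
gain_model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=10, importance_type='gain')
gain_model.fit(X, y)
gain_importance = pd.Series(gain_model.feature_importances_, index=X.columns)
print(gain_importance.sort_values(ascending=False).head(10))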
Further Reading
- LightGBM Installation Guide
- Pandas User Guide to Categorical Data Type
- Ames Housing Dataset & Data Dictionary
- Ames Dataset Overview
Summary
This article explored LightGBM’s capabilities, emphasizing its distinct features and practical applications using the Ames Housing dataset. By comparing GBDT and GOSS boosting strategies, and analyzing feature importance, we uncovered insights illustrating LightGBM’s efficiency and adaptability to complex datasets.
Key takeaways:
- Comparison of model variants: Insights gained from contrasting GBDT with GOSS reveal how different strategies can be effective based on data characteristics.
- Experimentation with leaf-wise strategy: Adjusting num_leaves influences model performance, and an optimal setting strikes a balance between complexity and accuracy.
- Visualization of feature importance: Understanding and visualizing influential features enhances model interpretation, improves transparency, and builds trust by revealing key impact factors.