Wednesday, April 30, 2025

Understanding SHAP in XAI: Game Theory

What Are SHAP Values in Explainable AI?

SHAP (SHapley Additive exPlanations) values are a cornerstone of Explainable AI (XAI), offering a transparent way to interpret complex model predictions. SHAP leverages game theory to fairly attribute the impact of each feature on a model’s output, making it invaluable for GenAI validation, regulatory compliance, and building trust in AI systems.


The Game Theory Behind SHAP: Shapley Value Origins

The mathematical foundation of SHAP comes from the Shapley value, introduced by Lloyd Shapley in 1953. In cooperative game theory, the Shapley value provides a fair way to distribute the total "payout" among players based on their individual contributions. SHAP adapts this by treating each feature as a "player" and the model prediction as the "payout," distributing credit for the prediction among all features. The resulting attributions inherit four fairness properties from the Shapley value:

  • Efficiency: Total contributions sum to the prediction.
  • Symmetry: Identical features get equal attribution.
  • Dummy: Features with no impact get zero.
  • Additivity: Attributions combine logically across models.
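
For reference, the classical Shapley value assigns feature i the average of its marginal contributions over all subsets of the remaining features (the standard game-theory formulation, written in LaTeX):

\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr]

where N is the set of all features and v(S) is the model's expected output when only the features in S are known.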


How Does SHAP Work? (With Example)

SHAP explains an individual prediction by quantifying how much each feature contributed to moving the model's output from the baseline (average prediction) to the actual prediction for that instance.

Example: Loan Default Prediction
  • Base value: 0.45 (average default risk across all applicants)
  • High debt-to-income ratio: +0.25 (increases risk)
  • Low credit score: +0.15 (increases risk)
  • High income: -0.05 (decreases risk)
  • Total SHAP contribution: +0.35
  • Final model score: 0.45 + 0.35 = 0.80 (80% default probability)

This breakdown makes the model's decision transparent, showing exactly how each feature pushed the prediction higher or lower.
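
This additivity is easy to verify numerically. Below is a minimal sketch using the illustrative numbers from the example above (they are made up for this walkthrough, not output from a real model):

import numpy as np

base_value = 0.45                                # average prediction over all applicants
contributions = np.array([0.25, 0.15, -0.05])    # debt-to-income, credit score, income

prediction = base_value + contributions.sum()
print(prediction)                                # 0.80 -> the model's score for this applicant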

SHAP in Practice: Python Example


import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split

# Load a sample tabular dataset and split it (any tabular data works here)
X, y = shap.datasets.adult()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train your model (example with XGBoost)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Explain predictions with SHAP
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize global feature importance
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Explain a single prediction (matplotlib=True renders the plot outside notebooks)
shap.force_plot(explainer.expected_value, shap_values[0, :], X_test.iloc[0, :], matplotlib=True)

Types of SHAP Explainers

  • TreeExplainer: Tree-based models (XGBoost, LightGBM, CatBoost)
  • DeepExplainer: Neural networks (deep learning)
  • KernelExplainer: Any model (model-agnostic, slower)
  • PermutationExplainer: Exact SHAP for small feature sets
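
For models without a specialized explainer, KernelExplainer works with any prediction function. Below is a minimal sketch, assuming a scikit-learn logistic regression and reusing the X_train/X_test split from the XGBoost example above; the background-sample size of 100 is an arbitrary illustrative choice.

import shap
from sklearn.linear_model import LogisticRegression

# Fit any model that exposes a prediction function (illustrative choice: logistic regression)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A small background sample keeps KernelExplainer tractable
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(model.predict_proba, background)

# Explain a handful of rows; KernelExplainer is slow on large datasets
shap_values = explainer.shap_values(X_test[:10])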

SHAP vs. Other XAI Methods

  • SHAP: game-theoretic fair attribution. Pros: consistent, local and global explanations, direction and magnitude. Cons: computationally intensive.
  • LIME: local surrogate models. Pros: model-agnostic, easy to use. Cons: less consistent, less global insight.
  • Feature Importance: global ranking. Pros: simple, fast. Cons: no direction, no local insight.

Real-World Applications of SHAP

  • Finance: Credit risk, loan approvals
  • Healthcare: Disease risk prediction
  • Customer Analytics: Churn, segmentation
  • Fraud Detection: Transaction analysis

Quick Insights: Chaos Theory, Fibonacci, and Trimmed Mean

Chaos Theory

Chaos theory studies systems that are highly sensitive to initial conditions, leading to seemingly random but deterministic behavior. Famous for the "butterfly effect," chaos theory helps explain unpredictable patterns in weather forecasting, stock market modeling, population dynamics, encryption, and signal processing. Despite the randomness, these systems follow mathematical rules and display hidden patterns, like the unique structure of snowflakes.
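
As a quick illustration of deterministic-but-sensitive behavior, the logistic map is a classic toy example; the parameter and starting values below are chosen purely for illustration:

# Logistic map: x_{n+1} = r * x_n * (1 - x_n), deterministic yet chaotic for r near 4
r = 3.9
x, x_perturbed = 0.5, 0.5 + 1e-9        # two almost identical starting points

for _ in range(50):
    x = r * x * (1 - x)
    x_perturbed = r * x_perturbed * (1 - x_perturbed)

print(abs(x - x_perturbed))             # the 1e-9 difference has grown by many orders of magnitude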

Fibonacci Sequence

The Fibonacci sequence (0, 1, 1, 2, 3, 5, 8, ...) appears in nature (sunflowers, pinecones), art, and finance. Each number is the sum of the two preceding ones, and the ratio between numbers approaches the golden ratio (~1.618), which is often associated with aesthetically pleasing proportions.
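
A short sketch of the recurrence and of the ratio of consecutive terms approaching the golden ratio (purely illustrative):

# Build the sequence from its recurrence: each term is the sum of the previous two
fib = [0, 1]
for _ in range(20):
    fib.append(fib[-1] + fib[-2])

print(fib[:10])              # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(fib[-1] / fib[-2])     # ~1.618, approaching the golden ratio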

Trimmed Mean

The trimmed mean is a robust statistical measure where a fixed percentage of the highest and lowest values are removed before calculating the mean. This approach reduces the impact of outliers and is widely used in sports judging, economic indicators, and data analysis for more reliable averages.
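
A minimal sketch using SciPy's trim_mean; the data values are made up to show the effect of a single outlier:

import numpy as np
from scipy.stats import trim_mean

scores = np.array([1, 4, 5, 5, 6, 6, 7, 30])    # one extreme outlier (30)

print(scores.mean())                 # 8.0  -> pulled upward by the outlier
print(trim_mean(scores, 0.125))      # 5.5  -> lowest and highest values removed before averaging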

Conclusion: Why SHAP and Math Matter in AI

SHAP values bridge advanced mathematics and real-world AI, making black-box models transparent and trustworthy. Understanding foundational concepts like game theory, chaos, and robust statistics empowers data scientists to build better, more explainable AI systems.

As AI continues to shape critical decisions, explainability and mathematical rigor will remain at the heart of responsible, impactful innovation.

Monday, April 21, 2025

How to Handle Categorical Variables in Machine Learning

In machine learning, many algorithms require numerical input. Since categorical variables (e.g., 'red', 'blue', 'green') are non-numeric, we need to transform them before feeding them into models.

1. Why Convert Categorical Variables?

Models like linear regression, logistic regression, SVM, and neural networks require numerical input. Using raw text labels causes errors or misleading results, as the model might treat them as ordinal or continuous.

2. Basic Encoding with pd.get_dummies()

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue']})
pd.get_dummies(df)

This will return (recent pandas versions show True/False booleans instead of 0/1):

   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0

Downsides: High cardinality can create dimensionality issues, and unseen categories in the test set cause problems.
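
A short sketch of the unseen-category problem and one common workaround (reindexing the test dummies to the training columns); the tiny DataFrames are made up for illustration:

import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'red']})
test = pd.DataFrame({'color': ['blue']})    # 'blue' never appears in training

train_enc = pd.get_dummies(train)
# Align test columns to the training layout; the unseen category becomes an all-zero row
test_enc = pd.get_dummies(test).reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)    # color_green=0, color_red=0 for the unseen 'blue' row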

3. Using Scikit-Learn's Encoders

OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 uses sparse_output; older versions used sparse=False
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_encoded = encoder.fit_transform(df[['color']])

handle_unknown='ignore' prevents errors from unseen categories during prediction.

OrdinalEncoder

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(df[['color']])

Warning: This assumes order exists between categories, which can mislead models.
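
When a genuine order does exist (for example, low < medium < high), it can be declared explicitly. A minimal sketch with a hypothetical 'size' column:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['medium', 'low', 'high']})
# Explicit category order maps low -> 0, medium -> 1, high -> 2
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
print(encoder.fit_transform(sizes))    # [[1.], [0.], [2.]]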

4. Advanced Encoders (category_encoders)

Using target encoding, binary encoding, or leave-one-out encoding can improve results:

import category_encoders as ce

# The second argument is the target; a toy binary label stands in for your real y
encoder = ce.TargetEncoder()
X_transformed = encoder.fit_transform(df[['color']], [1, 0, 1])

5. Automatic Categorical Handling

LightGBM

import lightgbm as lgb

# Mark categorical columns with the pandas 'category' dtype - critical step!
X_train['color'] = X_train['color'].astype('category')

train_data = lgb.Dataset(X_train, label=y_train)
params = {'objective': 'binary'}    # minimal illustrative parameter set
booster = lgb.train(params, train_data)

Important: Even though LightGBM internally maps categories to integers, it does not treat them as ordered. It splits categories by groupings, not by numerical size.

XGBoost (v1.3+)

import xgboost as xgb

# Mark categorical columns so XGBoost's native categorical support can use them
X_train['color'] = X_train['color'].astype('category')

model = xgb.XGBClassifier(tree_method='hist', enable_categorical=True)
model.fit(X_train, y_train)

enable_categorical=True is required for native categorical support, together with a histogram-based tree_method such as 'hist'.

CatBoost

CatBoost natively supports categorical variables, including string values, and uses powerful techniques like target encoding internally.
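
A minimal sketch of passing categorical columns to CatBoost; the column name 'color', iteration count, and verbosity are illustrative choices:

from catboost import CatBoostClassifier

# CatBoost accepts raw string columns; just declare which columns are categorical
model = CatBoostClassifier(iterations=200, verbose=0)
model.fit(X_train, y_train, cat_features=['color'])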

6. CatBoost Categorical Handling vs OneHot Comparison

Experiment Setup:

- Dataset: Titanic (with 'Sex' and 'Embarked' as categorical features)
- Models: CatBoost with:
  • Auto categorical handling
  • One-hot encoded inputs

Results:

CatBoost (auto cat handling):  ~83.2% accuracy  
CatBoost (one-hot encoded):   ~78.9% accuracy

Conclusion: CatBoost’s native categorical handling often performs better due to target-based encoding and internal optimizations.

7. Summary Table

  • LightGBM: ✅ native categorical support; set astype('category'); no manual encoding needed
  • XGBoost: ✅ native support (v1.3+); set tree_method='hist', enable_categorical=True; no manual encoding needed
  • CatBoost: ✅ native support (best); pass a cat_features list or raw string columns; no manual encoding needed
  • Scikit-Learn: no native support; requires OneHotEncoder or OrdinalEncoder

✅ Final Takeaways

  • For linear models → Use OneHotEncoder
  • For tree models → Use native handling in LightGBM, XGBoost, or CatBoost
  • For many categories → Try TargetEncoder or CatBoost
  • Avoid OrdinalEncoder unless the variable has a genuine order

About SeanKimLab

Welcome to SeanKimLab

I’m Seunghyun Kim, an AI/ML practitioner and model governance expert with over 19 years of experience in financial services, auto finance, and emerging tech. This blog is my curated space to bridge industry insights with academic curiosity.

This blog serves multiple purposes:

  • Sharing real-world insights in Machine Learning, MLOps, and Data Engineering
  • Documenting research ideas to prepare for Ph.D. programs at top AI institutions
  • Serving as a portfolio hub for global opportunities, especially U.S.-based collaborations

What You'll Find Here

  • End-to-end ML pipeline examples with Airflow, Docker, and Kubernetes
  • Experimental research and thoughts on recent AI/ML papers
  • Practical case studies: credit risk, forecasting, NLP, and fraud detection
  • Deep dives into model explainability (SHAP, LIME) and regulatory alignment (SR 11-7, CECL)

Whether you're an AI researcher, practitioner, or a decision-maker looking for practical, scalable solutions, I hope this blog offers both clarity and inspiration.

Portfolio: github.com/seankim0
Contact: seunghyk@tepper.cmu.edu
Location: Irvine, California | Open to research collaboration
