What Are SHAP Values in Explainable AI?
SHAP (SHapley Additive exPlanations) values are a cornerstone of Explainable AI (XAI), offering a transparent way to interpret complex model predictions. SHAP leverages game theory to fairly attribute the impact of each feature on a model’s output, making it invaluable for GenAI validation, regulatory compliance, and building trust in AI systems.
The Game Theory Behind SHAP: Shapley Value Origins
The mathematical foundation of SHAP comes from the Shapley value, introduced by Lloyd Shapley in 1951. In cooperative game theory, the Shapley value provides a fair way to distribute the total "payout" among players based on their individual contributions. SHAP adapts this by treating each feature as a "player" and the model prediction as the "payout," distributing credit for the prediction among all features. The Shapley value is the unique attribution rule that satisfies four axioms:
- Efficiency: Contributions sum to the difference between the prediction and the baseline (average) prediction.
- Symmetry: Features that contribute identically receive identical attribution.
- Dummy: Features with no effect on the output receive zero attribution.
- Additivity: Attributions for a sum of models equal the sum of each model's attributions.
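Formally, the Shapley value of feature i averages its marginal contribution over every subset S of the feature set N. This is the standard game-theory formula, shown here for reference rather than taken from any particular SHAP implementation:

\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{i\}) - v(S) \right]

Here v(S) denotes the model's expected output when only the features in S are known; SHAP approximates these conditional expectations rather than retraining the model for every subset.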
How Does SHAP Work? (With Example)
SHAP explains an individual prediction by quantifying how much each feature moved the model's output from the baseline (the average prediction) to the actual prediction for that instance. Consider a credit-risk model estimating a loan applicant's probability of default:
- Base value: 0.45 (average default risk across all applicants)
- High debt-to-income ratio: +0.25 (increases risk)
- Low credit score: +0.15 (increases risk)
- High income: -0.05 (decreases risk)
- Total SHAP contribution: +0.35
- Final model score: 0.45 + 0.35 = 0.80 (80% default probability)
This breakdown makes the model's decision transparent, showing exactly how each feature pushed the prediction higher or lower.
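Because the attributions are additive, the example can be checked with a few lines of arithmetic. The snippet below is a minimal sketch using the illustrative numbers from the list above (the feature names and values are hypothetical):

# Minimal check of SHAP additivity using the illustrative numbers above
# (feature names and values are hypothetical, for demonstration only)
base_value = 0.45  # average default risk across all applicants
shap_contributions = {
    "debt_to_income_ratio": +0.25,
    "credit_score": +0.15,
    "income": -0.05,
}
prediction = base_value + sum(shap_contributions.values())
print(prediction)  # 0.80 -- matches the final model score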
SHAP in Practice: Python Example
import xgboost as xgb
import shap

# Train your model (example with XGBoost); X_train, y_train, X_test are
# assumed to come from your own train/test split (pandas DataFrames/Series)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Explain predictions with SHAP's tree-optimized explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize global feature importance (mean absolute SHAP value per feature)
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Explain a single prediction (use .iloc for positional indexing on a DataFrame;
# in a notebook, run shap.initjs() first so the force plot renders)
shap.force_plot(explainer.expected_value, shap_values[0, :], X_test.iloc[0, :])
Types of SHAP Explainers
| Explainer | Best For |
|---|---|
| TreeExplainer | Tree-based models (XGBoost, LightGBM, CatBoost) |
| DeepExplainer | Deep neural networks (TensorFlow, PyTorch) |
| KernelExplainer | Any model (model-agnostic, but slower) |
| PermutationExplainer | Model-agnostic; near-exact SHAP values for small feature sets |
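If the model is not tree-based, KernelExplainer provides the model-agnostic route. The sketch below reuses the model, X_train, and X_test names from the earlier example; the background size of 50 and the 10-row slice are arbitrary choices made here to keep the example fast:

import shap

# KernelExplainer only needs a prediction function and a background dataset.
# Summarizing the background with k-means keeps runtime manageable.
background = shap.kmeans(X_train, 50)
explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = explainer.shap_values(X_test.iloc[:10, :])  # explain a few rows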
SHAP vs. Other XAI Methods
| Method | Approach | Pros | Cons |
|---|---|---|---|
| SHAP | Game theory, fair attribution | Consistent; local and global; direction and magnitude | Computationally intensive |
| LIME | Local surrogate models | Model-agnostic, easy to use | Less consistent, less global insight |
| Feature Importance | Global ranking | Simple, fast | No direction, no local insight |
Real-World Applications of SHAP
- Finance: Credit risk, loan approvals
- Healthcare: Disease risk prediction
- Customer Analytics: Churn, segmentation
- Fraud Detection: Transaction analysis
Quick Insights: Chaos Theory, Fibonacci, and Trimmed Mean
Chaos Theory
Chaos theory studies systems that are highly sensitive to initial conditions, leading to seemingly random but deterministic behavior. Famous for the "butterfly effect," chaos theory helps explain unpredictable patterns in weather forecasting, stock market modeling, population dynamics, encryption, and signal processing. Despite the randomness, these systems follow mathematical rules and display hidden patterns, like the unique structure of snowflakes.
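A quick way to see sensitivity to initial conditions is the logistic map, a classic one-line chaotic system (a throwaway illustration, not part of the SHAP workflow):

# Two logistic-map trajectories that start almost identically soon diverge
r = 3.9                      # parameter in the chaotic regime
x, y = 0.500000, 0.500001    # initial values differing by one part in a million
for _ in range(30):
    x, y = r * x * (1 - x), r * y * (1 - y)
print(abs(x - y))            # the tiny initial gap has grown enormously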
Fibonacci Sequence
The Fibonacci sequence (0, 1, 1, 2, 3, 5, 8, ...) appears in nature (sunflowers, pinecones), art, and finance. Each number is the sum of the two preceding ones, and the ratio between numbers approaches the golden ratio (~1.618), which is often associated with aesthetically pleasing proportions.
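A small snippet makes both claims concrete (purely illustrative):

# Build the Fibonacci sequence and check the ratio of consecutive terms
def fibonacci(n):
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq

seq = fibonacci(15)
print(seq)                # [0, 1, 1, 2, 3, 5, 8, ...]
print(seq[-1] / seq[-2])  # ~1.618, approaching the golden ratio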
Trimmed Mean
The trimmed mean is a robust statistical measure where a fixed percentage of the highest and lowest values are removed before calculating the mean. This approach reduces the impact of outliers and is widely used in sports judging, economic indicators, and data analysis for more reliable averages.
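As a minimal sketch with SciPy (the scores below are made up; the second argument to trim_mean is the fraction cut from each tail):

from scipy import stats

# Ten scores with two obvious outliers (1 and 120)
scores = [1, 48, 49, 50, 51, 52, 53, 49, 50, 120]

# Trim 10% from each end (one value per tail here), then average the rest
print(stats.trim_mean(scores, 0.1))  # 50.25, unaffected by the outliers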
Conclusion: Why SHAP and Math Matter in AI
SHAP values bridge advanced mathematics and real-world AI, making black-box models transparent and trustworthy. Understanding foundational concepts like game theory, chaos, and robust statistics empowers data scientists to build better, more explainable AI systems.
As AI continues to shape critical decisions, explainability and mathematical rigor will remain at the heart of responsible, impactful innovation.