Monday, April 21, 2025

How to Handle Categorical Variables in Machine Learning

In machine learning, many algorithms require numerical input. Since categorical variables (e.g., 'red', 'blue', 'green') are non-numeric, we need to transform them before feeding them into models.

1. Why Convert Categorical Variables?

Models like linear regression, logistic regression, SVM, and neural networks require numerical input. Using raw text labels causes errors or misleading results, as the model might treat them as ordinal or continuous.
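
For instance, a scikit-learn model simply raises an error when given raw strings (a minimal illustration; the exact message depends on the version):

from sklearn.linear_model import LogisticRegression

X = [['red'], ['blue'], ['green']]
y = [1, 0, 1]

# Fails: the estimator tries to cast the strings to floats
LogisticRegression().fit(X, y)  # ValueError: could not convert string to float: 'red'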

2. Basic Encoding with pd.get_dummies()

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue']})
pd.get_dummies(df, dtype=int)  # dtype=int gives 0/1 columns; recent pandas defaults to booleans

This will return:

   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0

Downsides: high-cardinality features blow up the number of dummy columns, and categories that appear only in the test set produce mismatched columns (a workaround is sketched below).
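
A common workaround is to align the test-set dummy columns to the columns seen during training (a minimal sketch):

import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'color': ['red', 'purple']})  # 'purple' was never seen in training

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Reindex drops dummy columns for unseen categories and fills missing ones with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)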

3. Using Scikit-Learn's Encoders

OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # sparse_output replaces the old 'sparse' argument (scikit-learn >= 1.2)
X_encoded = encoder.fit_transform(df[['color']])

handle_unknown='ignore' prevents errors from unseen categories during prediction.
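
For example, a category that never appeared during fitting is simply encoded as an all-zero row:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue']})
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(df[['color']])

# 'purple' was never seen during fit, so it becomes an all-zero row instead of an error
print(encoder.transform(pd.DataFrame({'color': ['purple']})))  # [[0. 0. 0.]]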

OrdinalEncoder

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(df[['color']])

Warning: This assigns an arbitrary integer order to the categories, which can mislead models when no real order exists.
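
When a genuine order does exist (e.g. small < medium < large), pass it explicitly instead of relying on the default alphabetical order (a short sketch):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'large', 'medium']})

# One list of ordered categories per encoded column
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(encoder.fit_transform(sizes))  # [[0.], [2.], [1.]]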

4. Advanced Encoders (category_encoders)

Using target encoding, binary encoding, or leave-one-out encoding can improve results:

import category_encoders as ce

encoder = ce.TargetEncoder()
y = [1, 0, 1]  # example binary target for the three rows
X_transformed = encoder.fit_transform(df[['color']], y)
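
Conceptually, target encoding replaces each category with (a smoothed version of) the mean target value observed for that category. A bare-bones pandas illustration without smoothing:

import pandas as pd

data = pd.DataFrame({'color': ['red', 'green', 'red'], 'target': [1, 0, 1]})

# Each category is mapped to the mean of the target within that category
means = data.groupby('color')['target'].mean()
data['color_encoded'] = data['color'].map(means)
print(data)

Because the encoding uses the target, fit it only on the training folds (e.g. inside a cross-validation pipeline) to avoid target leakage.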

5. Automatic Categorical Handling

LightGBM

import lightgbm as lgb

# Cast the column to pandas 'category' dtype so LightGBM detects it automatically
X_train['color'] = X_train['color'].astype('category')  # Critical step!

train_data = lgb.Dataset(X_train, label=y_train)
booster = lgb.train(params, train_data)

Important: Even though LightGBM internally maps categories to integers, it does not treat them as ordered. It splits categories by groupings, not by numerical size.
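
A self-contained sketch with a small synthetic dataset (the parameter choices are illustrative):

import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'color': pd.Categorical(rng.choice(['red', 'green', 'blue'], size=200)),
    'value': rng.normal(size=200),
})
y = ((X['color'] == 'red') ^ (X['value'] > 0)).astype(int)

# The 'category' dtype column is picked up automatically as a categorical feature
train_data = lgb.Dataset(X, label=y)
booster = lgb.train({'objective': 'binary', 'verbose': -1}, train_data, num_boost_round=20)
print(booster.predict(X)[:5])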

XGBoost (v1.3+)

import xgboost as xgb

# Cast to 'category' dtype first; X_train / y_train are your training features and labels
X_train['color'] = X_train['color'].astype('category')

model = xgb.XGBClassifier(tree_method='hist', enable_categorical=True)
model.fit(X_train, y_train)

enable_categorical=True turns on native categorical support and requires a histogram-based tree_method such as 'hist'.

CatBoost

CatBoost natively supports categorical variables, including string values, and uses powerful techniques like target encoding internally.
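
A minimal sketch (the column names and hyperparameters here are just placeholders):

import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green', 'blue'],
    'size':  [1.0, 2.5, 3.2, 0.7, 2.1, 3.5],
})
y = [1, 0, 1, 1, 0, 0]

# Raw string columns can be passed as-is; just name them in cat_features
model = CatBoostClassifier(iterations=50, verbose=0)
model.fit(df, y, cat_features=['color'])
print(model.predict(df))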

6. CatBoost Categorical Handling vs OneHot Comparison

Experiment Setup:

- Dataset: Titanic (with 'Sex' and 'Embarked' as categorical features)
- Models: CatBoost with:
  • Auto categorical handling
  • One-hot encoded inputs

Results:

CatBoost (auto cat handling):  ~83.2% accuracy  
CatBoost (one-hot encoded):   ~78.9% accuracy

Conclusion: CatBoost’s native categorical handling often performs better due to target-based encoding and internal optimizations.
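
A rough sketch of how such a comparison can be set up (using seaborn's built-in Titanic dataset; the exact accuracy numbers will vary with the chosen columns, seeds, and hyperparameters):

import pandas as pd
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

cols = ['sex', 'embarked', 'pclass', 'age', 'fare', 'survived']
titanic = sns.load_dataset('titanic')[cols].dropna()
X, y = titanic.drop(columns='survived'), titanic['survived']

# Variant 1: let CatBoost handle the string columns natively
native = CatBoostClassifier(iterations=200, verbose=0, cat_features=['sex', 'embarked'])
print('native:', cross_val_score(native, X, y, cv=5).mean())

# Variant 2: one-hot encode first, then train on purely numeric input
X_ohe = pd.get_dummies(X, columns=['sex', 'embarked'], dtype=float)
onehot = CatBoostClassifier(iterations=200, verbose=0)
print('one-hot:', cross_val_score(onehot, X_ohe, y, cv=5).mean())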

7. Summary Table

Model         Auto Categorical Support   Setup                                          Need Encoding?
LightGBM      ✅                         astype('category')                             No
XGBoost       ✅ (v1.3+)                 tree_method='hist', enable_categorical=True    No
CatBoost      ✅ (best)                  cat_features list or raw string columns        No
Scikit-Learn  ❌                         OneHotEncoder, OrdinalEncoder                  Yes

✅ Final Takeaways

  • For linear models → Use OneHotEncoder
  • For tree models → Use native handling in LightGBM, XGBoost, or CatBoost
  • For many categories → Try TargetEncoder or CatBoost
  • Avoid OrdinalEncoder unless the variable has a genuine order
