How to Handle Categorical Variables in Machine Learning
In machine learning, many algorithms require numerical input. Since categorical variables (e.g., 'red', 'blue', 'green') are non-numeric, we need to transform them before feeding them into models.
1. Why Convert Categorical Variables?
Models like linear regression, logistic regression, SVM, and neural networks require numerical input. Feeding them raw text labels either raises an error or, if the labels are naively mapped to numbers, produces misleading results because the model may treat the codes as ordinal or continuous.
2. Basic Encoding with pd.get_dummies()
import pandas as pd
df = pd.DataFrame({'color': ['red', 'green', 'blue']})
pd.get_dummies(df, dtype=int)  # dtype=int gives 0/1 columns (newer pandas defaults to booleans)
This will return:
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
Downsides: high-cardinality columns blow up the number of features, and categories that appear only in the test set produce mismatched dummy columns, as the sketch below shows.
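A minimal sketch of the unseen-category problem (the 'purple' value is made up for illustration):

import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'color': ['red', 'purple']})  # 'purple' never appeared in training

X_train = pd.get_dummies(train, dtype=int)  # columns: color_blue, color_green, color_red
X_test = pd.get_dummies(test, dtype=int)    # columns: color_purple, color_red -> mismatch!

# One common fix: align the test columns with the training columns
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)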
3. Using Scikit-Learn's Encoders
OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # use sparse=False on scikit-learn < 1.2
X_encoded = encoder.fit_transform(df[['color']])
handle_unknown='ignore' prevents errors from unseen categories during prediction.
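For example (a small sketch), a category that was never seen during fit is encoded as an all-zero row instead of raising an error:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(pd.DataFrame({'color': ['red', 'green', 'blue']}))

# 'purple' was not seen during fit -> all zeros, no error
encoder.transform(pd.DataFrame({'color': ['purple']}))
# array([[0., 0., 0.]])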
OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()  # maps each category to an integer code (0, 1, 2, ...)
X_encoded = encoder.fit_transform(df[['color']])
Warning: This implicitly assumes an order between categories, which can mislead models when no real order exists.
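When a genuine order does exist, it can be passed explicitly. A short sketch with a made-up 'size' column:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'large', 'medium']})
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])  # explicit order per feature
encoder.fit_transform(sizes)
# array([[0.], [2.], [1.]])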
4. Advanced Encoders (category_encoders)
Using target encoding, binary encoding, or leave-one-out encoding can improve results:
import category_encoders as ce
encoder = ce.TargetEncoder()
# Target encoding replaces each category with a (smoothed) mean of the target for that category
X_transformed = encoder.fit_transform(df[['color']], [1, 0, 1])
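Because target-based encoders use the label, they should be fit only on training data to avoid leaking the target. A sketch of wrapping one in a scikit-learn pipeline (the toy data is made up):

import pandas as pd
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green', 'blue']})
y = [1, 0, 1, 1, 0, 0]

# The encoder is re-fit on each training fold, so the validation fold's targets never leak into the encoding
pipe = make_pipeline(ce.TargetEncoder(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=3)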
5. Automatic Categorical Handling
LightGBM
import lightgbm as lgb

# Critical step: cast the column to the pandas 'category' dtype so LightGBM treats it as categorical
X_train['color'] = X_train['color'].astype('category')

train_data = lgb.Dataset(X_train, label=y_train)
model = lgb.train(params, train_data)
Important: Even though LightGBM internally maps categories to integers, it does not treat them as ordered. It splits categories by groupings, not by numerical size.
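A minimal end-to-end sketch on made-up data; note that 'color' is never manually encoded:

import pandas as pd
import lightgbm as lgb

X = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green', 'blue'],
                  'price': [1.0, 2.0, 3.0, 1.5, 2.5, 3.5]})
X['color'] = X['color'].astype('category')  # only step needed for the categorical column
y = [1, 0, 1, 1, 0, 0]

train_data = lgb.Dataset(X, label=y)
params = {'objective': 'binary', 'min_data_in_leaf': 1, 'verbose': -1}  # tiny toy dataset
booster = lgb.train(params, train_data, num_boost_round=10)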
XGBoost (v1.3+)
import xgboost as xgb

X_train['color'] = X_train['color'].astype('category')  # XGBoost picks up the pandas 'category' dtype

model = xgb.XGBClassifier(tree_method='hist', enable_categorical=True)
model.fit(X_train, y_train)
tree_method='hist' and enable_categorical=True are mandatory for native categorical support.
CatBoost
CatBoost natively supports categorical variables, including raw string values, and internally applies powerful techniques such as ordered target statistics (a form of target encoding).
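A minimal sketch (toy data made up for illustration): raw string columns are passed as-is, and their names are listed in cat_features:

import pandas as pd
from catboost import CatBoostClassifier

X = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green', 'blue'],
                  'price': [1.0, 2.0, 3.0, 1.5, 2.5, 3.5]})
y = [1, 0, 1, 1, 0, 0]

# No manual encoding: just tell CatBoost which columns are categorical
model = CatBoostClassifier(iterations=50, verbose=0)
model.fit(X, y, cat_features=['color'])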
6. CatBoost Categorical Handling vs OneHot Comparison
Experiment Setup:
- Dataset: Titanic (with 'Sex' and 'Embarked' as categorical features)
- Models: CatBoost with:
  - Auto categorical handling
  - One-hot encoded inputs
Results:
- CatBoost (auto cat handling): ~83.2% accuracy
- CatBoost (one-hot encoded): ~78.9% accuracy
Conclusion: CatBoost’s native categorical handling often performs better due to target-based encoding and internal optimizations.
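A rough sketch of how such a comparison can be set up (the file path, extra feature columns, and imputation below are assumptions, not the original experiment's exact recipe):

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('train.csv')  # Kaggle-style Titanic file (hypothetical path)
X = df[['Pclass', 'Sex', 'Fare', 'Embarked']].fillna({'Embarked': 'S'})
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Variant 1: native categorical handling
cat_model = CatBoostClassifier(verbose=0, random_state=42)
cat_model.fit(X_train, y_train, cat_features=['Sex', 'Embarked'])

# Variant 2: one-hot encoded inputs, with test columns aligned to the training columns
X_train_oh = pd.get_dummies(X_train, dtype=int)
X_test_oh = pd.get_dummies(X_test, dtype=int).reindex(columns=X_train_oh.columns, fill_value=0)
oh_model = CatBoostClassifier(verbose=0, random_state=42)
oh_model.fit(X_train_oh, y_train)

print(accuracy_score(y_test, cat_model.predict(X_test)),
      accuracy_score(y_test, oh_model.predict(X_test_oh)))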
7. Summary Table
| Model | Auto Categorical Support | Setup | Need Encoding? |
|---|---|---|---|
| LightGBM | ✅ | `astype('category')` | ❌ |
| XGBoost | ✅ (v1.3+) | `tree_method='hist'`, `enable_categorical=True` | ❌ |
| CatBoost | ✅ (best) | `cat_features` list or raw string columns | ❌ |
| Scikit-Learn | ❌ | `OneHotEncoder`, `OrdinalEncoder` | ✅ |
✅ Final Takeaways
- For linear models → Use OneHotEncoder
- For tree models → Use native handling in LightGBM, XGBoost, or CatBoost
- For many categories → Try TargetEncoder or CatBoost
- Avoid OrdinalEncoder unless the variable has a genuine order