- Removed irrelevant columns and handled missing values.
- Converted Yes/No categorical columns into `bool` type.
- Reduced memory usage by approximately 80%, making the dataset more efficient for modeling (see the sketch below).
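A minimal sketch of this cleaning step, assuming the data sits in a pandas DataFrame loaded from a CSV; the file name `churn.csv` and the `customerID` column are illustrative assumptions, not details taken from the project.

```python
import pandas as pd

# Assumptions: the file name and the dropped ID column are illustrative only.
df = pd.read_csv("churn.csv")
df = df.drop(columns=["customerID"], errors="ignore")  # example of an irrelevant column
df = df.dropna()                                        # simplest handling of missing values

# Convert pure Yes/No columns (object dtype) to bool, which is far cheaper in memory.
yes_no_cols = [c for c in df.columns if set(df[c].unique()) <= {"Yes", "No"}]
df[yes_no_cols] = df[yes_no_cols] == "Yes"

print(f"Memory after cleaning: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
```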
- X (features): all columns except the target.
- y (target): the `Churn` column.
- Used `train_test_split` to split the dataset into training and testing sets (see the sketch below).
- Applied `stratify=y` to maintain the same distribution of churn labels in both sets.
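A sketch of the split, continuing from the cleaned `df` above; `test_size=0.2` and `random_state=42` are assumed values, not ones confirmed by the notes.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Churn"])   # features: every column except the target
y = df["Churn"]                  # target: the Churn column (bool after cleaning)

# stratify=y keeps the churn/no-churn ratio identical in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```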
- Built a pipeline with a ColumnTransformer and a classifier.
- Preprocessing (sketched below):
  - `OneHotEncoder` for categorical features.
  - `StandardScaler` for numeric features.
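One possible shape of that pipeline; splitting the columns by dtype is an assumption, and the classifier shown is just a placeholder for whichever model is being evaluated.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column lists derived from dtypes; the exact lists used in the project may differ.
categorical_cols = X_train.select_dtypes(include=["object", "bool"]).columns
numeric_cols = X_train.select_dtypes(include=["number"]).columns

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

# The "model" step is swapped out for each candidate classifier.
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(solver="liblinear", class_weight="balanced")),
])
```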
- Evaluated 4 classifiers with 5-fold cross-validation:
  - RandomForestClassifier
  - LogisticRegression (`liblinear` solver)
  - GradientBoostingClassifier
  - Support Vector Machine (SVM)
- Applied `class_weight="balanced"` to all models except GradientBoostingClassifier (which has no such parameter), because the churn classes are imbalanced.
- Optimized for F1 Score, since accuracy alone would be misleading on an imbalanced dataset (see the sketch after this list).
- Result: Logistic Regression achieved the best performance in the initial stage.
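The comparison loop could look like the sketch below, reusing `preprocessor` from the pipeline above; all hyperparameters are library defaults or assumptions, and only the balanced class weights, 5-fold CV, and F1 scoring come from the notes.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

models = {
    "RandomForest": RandomForestClassifier(class_weight="balanced", random_state=42),
    "LogisticRegression": LogisticRegression(solver="liblinear", class_weight="balanced"),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),  # no class_weight parameter
    "SVM": SVC(class_weight="balanced"),
}

for name, model in models.items():
    pipe = Pipeline([("preprocess", preprocessor), ("model", model)])
    # F1 as the selection metric; assumes a boolean target (True = churn).
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
    print(f"{name:20s} mean F1 = {scores.mean():.3f}")
```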
- Reported metrics: Accuracy, Precision, Recall, and F1 Score.
- Visualized results for better interpretability.
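Held-out evaluation might look like this sketch, which fits the baseline winner and prints the four metrics; the confusion-matrix plot is one plausible visualization, not necessarily the one used.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.pipeline import Pipeline

best_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(solver="liblinear", class_weight="balanced")),
])
best_pipe.fit(X_train, y_train)
y_pred = best_pipe.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class

# One possible visualization of the results.
ConfusionMatrixDisplay.from_estimator(best_pipe, X_test, y_test)
```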
- Extracted the top 8 most influential features.
- Removed less relevant features to improve efficiency and reduce noise.
- Re-ran preprocessing and GridSearchCV on the reduced feature set.
- Result: RandomForestClassifier (200 estimators, max depth = 8) delivered the best performance.
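A sketch of this stage under a few assumptions: the feature importances come from a fitted random forest (ranked in the one-hot-encoded space), the parameter grid is illustrative, and the step that actually drops the low-importance columns is omitted for brevity.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Fit a forest once to rank features after preprocessing.
rf_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(class_weight="balanced", random_state=42)),
])
rf_pipe.fit(X_train, y_train)

feature_names = rf_pipe.named_steps["preprocess"].get_feature_names_out()
importances = pd.Series(rf_pipe.named_steps["model"].feature_importances_, index=feature_names)
print(importances.nlargest(8))   # the 8 most influential features

# Re-tune with GridSearchCV; the grid values are assumptions, only the winning
# combination (n_estimators=200, max_depth=8) is reported in the notes.
param_grid = {"model__n_estimators": [100, 200, 300], "model__max_depth": [4, 8, 12]}
search = GridSearchCV(rf_pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)   # in the project this ran on the reduced feature set
print(search.best_params_, search.best_score_)
```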
- GradientBoostingClassifier had the longest average training time.
- Between Logistic Regression and Random Forest:
  - Random Forest required more training time.
  - But achieved a ~5% higher F1 Score.
- Final Takeaway: RandomForestClassifier provided the best trade-off between predictive performance, robustness, and training cost.
- Baseline Best Model: LogisticRegression (`liblinear` solver)
- Final Best Model after Feature Selection: RandomForestClassifier (`n_estimators=200`, `max_depth=8`)