This project detects fraudulent transactions using the Kaggle credit card fraud dataset, which is highly imbalanced: fraud accounts for only ~0.17% of all records. The goal is to build a machine learning pipeline that maximizes recall, catching as many fraud cases as possible.
Rather than oversampling or generating synthetic data, this project uses:

- `class_weight='balanced'` in Logistic Regression
- `scale_pos_weight` in XGBoost

These approaches adjust model focus without introducing noise or overfitting risks.
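As a minimal sketch (following the standard scikit-learn and XGBoost conventions; the label vector here is synthetic, built to mimic the dataset's imbalance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Labels mimicking the dataset's imbalance: 492 frauds in 284,807 transactions.
y = np.zeros(284_807, dtype=int)
y[:492] = 1

# Logistic Regression: weight classes inversely to their frequency.
lr = LogisticRegression(class_weight="balanced", max_iter=1000)

# XGBoost: scale_pos_weight is conventionally set to negatives / positives,
# then passed as XGBClassifier(scale_pos_weight=spw).
spw = (y == 0).sum() / (y == 1).sum()
print(round(spw, 1))  # ≈ 577.9
```

Both settings penalize mistakes on the rare fraud class more heavily, without adding any synthetic rows to the training data.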
| Feature | Description |
|---|---|
| `Hour` | Hour of transaction (0–23) |
| `TimeBucket` | 6-hour time segments (e.g., 0–6 AM) |
| `IsNight` | Binary flag for night hours (22:00–06:00) |
| `TimeSinceLastTx` | Time delta since the previous transaction |
| `TxInPastHour` | Count of transactions in the past hour (rolling) |
| `DayPart` | Categorical: Morning, Afternoon, Evening, Night |

The PCA features `V1`–`V28`, `Time`, and `Amount` were also used as-is.
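A minimal pandas sketch of the first few engineered features (the sample values are made up; the actual notebook derives them from the dataset's `Time` column, given in seconds since the first transaction):

```python
import pandas as pd

# Four hypothetical transactions, timestamped in seconds since the first one.
df = pd.DataFrame({"Time": [0, 36_000, 82_800, 90_000]})

df["Hour"] = (df["Time"] // 3600) % 24                         # hour of day, 0-23
df["TimeBucket"] = df["Hour"] // 6                             # four 6-hour segments
df["IsNight"] = ((df["Hour"] >= 22) | (df["Hour"] < 6)).astype(int)
df["TimeSinceLastTx"] = df["Time"].diff().fillna(0)            # delta to previous tx

print(df["Hour"].tolist())     # [0, 10, 23, 1]
print(df["IsNight"].tolist())  # [1, 0, 1, 1]
```

The rolling `TxInPastHour` count and the `DayPart` label follow the same pattern (e.g. `pd.cut` over `Hour`), and are omitted here for brevity.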
- **EDA & Feature Engineering**
  - Visual patterns in fraud by time
  - Bimodal distribution in transaction frequency
  - Nighttime fraud concentration observed
- **Train/Test Split**
  - Stratified 70/30 split
  - Scaled with `StandardScaler`
- **Model Training**
  - ✅ Logistic Regression (`class_weight='balanced'`)
  - ✅ XGBoost (`scale_pos_weight`)
  - ✅ Isolation Forest (unsupervised)
  - ✅ Ensemble: Logistic Regression + XGBoost
- **Evaluation Metrics**
  - Recall (priority)
  - Precision, F1 Score, ROC AUC
  - Confusion Matrices
  - ROC & Precision-Recall curves
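The evaluation step can be sketched with scikit-learn's metrics (toy labels and scores, not the project's actual predictions):

```python
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

# Toy ground truth, hard predictions, and predicted probabilities.
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7]

rec = recall_score(y_true, y_pred)      # fraction of actual fraud caught
cm  = confusion_matrix(y_true, y_pred)  # rows: actual class, cols: predicted
auc = roc_auc_score(y_true, y_score)    # threshold-free ranking quality

print(rec)  # 0.75
print(auc)  # 0.9375
```

Recall is the headline number here because a missed fraud (false negative) is far costlier than a flagged legitimate transaction.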
| Model | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|
| Logistic Regression | 6.3% | 87.8% | 0.12 | 0.9666 |
| XGBoost | 87.8% | 77.7% | 0.83 | 0.9654 |
| Isolation Forest | 29.7% | 29.7% | 0.30 | 0.6480 |
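The Logistic Regression + XGBoost ensemble uses soft voting (averaged probabilities). A minimal scikit-learn sketch, with `RandomForestClassifier` standing in for `XGBClassifier` so the example runs without the `xgboost` package, and synthetic imbalanced data in place of the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (~5% positives) standing in for creditcard.csv.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
        # In the project this slot is xgboost.XGBClassifier(scale_pos_weight=...).
        ("gb", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",  # average predict_proba outputs across members
)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X)[:, 1]  # fraud probability per transaction
print(proba.shape)  # (2000,)
```

Soft voting lets the high-recall linear model and the high-precision boosted model compensate for each other's weaknesses.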
- Data loading and exploration
- Handling class imbalance (SMOTE, class weights)
- Feature engineering (time-based and PCA features)
- Correlation analysis and visualization
- Model training (Logistic Regression, XGBoost, Isolation Forest)
- Model evaluation (Confusion Matrix, ROC AUC, F1, Recall)
- Threshold tuning using Precision-Recall curve
- Ensemble model with soft voting (LR + XGBoost)
- Interpretability with XGBoost feature importances
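The threshold-tuning step from the list above can be sketched as a recall-first search over the precision-recall curve (illustrative probabilities and a hypothetical recall target, not the notebook's actual values):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy ground truth and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.4, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Among thresholds that keep recall at or above the target, pick the one
# with the best precision. precision[:-1]/recall[:-1] align with `thresholds`.
target_recall = 0.75
ok = recall[:-1] >= target_recall
best = thresholds[ok][np.argmax(precision[:-1][ok])]
print(best)  # 0.7
```

This matches the project's priorities: fix the recall floor first, then claw back as much precision as the scores allow.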
```
credit-card-fraud-detection/
├── data/                              # Input dataset (e.g., creditcard.csv)
├── notebooks/                         # Jupyter notebooks (EDA, modeling)
├── outputs/                           # Visualizations, metrics, exports
├── credit-card-fraud-detection.ipynb
├── README.md                          # Project overview
└── LICENSE                            # License file
```
- Source: Kaggle - Credit Card Fraud Detection
- Type: PCA-transformed numeric features + engineered time-based features
- Imbalance: 492 frauds out of 284,807 transactions (~0.17%)
Gleidy R.
LinkedIn
Feel free to fork, contribute, or share feedback!