A production-grade ML system that detects credit card fraud and monitors model performance degradation over time — built to simulate real-world MLOps workflows.
Run locally with streamlit run app/streamlit_app.py
Watch Demo Video: https://youtu.be/aPIeh-j6ELk
Credit card fraud detection is a critical imbalanced classification problem. With a fraud rate of only 0.17% (492 of 284,807 transactions), plain accuracy is meaningless: a model that predicts "legit" for every transaction reaches 99.83% accuracy while catching zero fraud. This project addresses that by evaluating on PR-AUC, precision, and recall, and by handling the imbalance with class weighting.
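The accuracy figure above can be checked with plain arithmetic. A minimal sketch using only the counts quoted in this README; the variable names are illustrative:

```python
# Counts quoted in this README.
total = 284_807
fraud = 492
legit = total - fraud

# A "model" that labels every transaction as legit gets every legit
# transaction right and every fraud wrong.
fraud_rate = fraud / total
always_legit_accuracy = legit / total
always_legit_recall = 0 / fraud  # no fraud case is ever flagged

print(f"fraud rate: {fraud_rate:.2%}")
print(f"accuracy:   {always_legit_accuracy:.2%}")
print(f"recall:     {always_legit_recall:.0%}")
```

High accuracy and zero recall on the same predictions is why the results below are reported as PR-AUC, precision, and recall.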
| Model | PR-AUC | Precision | Recall |
|---|---|---|---|
| Logistic Regression (baseline) | 0.743 | 0.83 | 0.64 |
| LR + SMOTE | 0.725 | 0.06 | 0.92 |
| XGBoost + Optuna (final) | 0.885 | 0.87 | 0.85 |
| Layer | Tool | Why |
|---|---|---|
| Core ML | XGBoost | Best performance on tabular imbalanced data |
| Tuning | Optuna | Efficient hyperparameter search vs GridSearch |
| Imbalance | scale_pos_weight=577 | Native XGBoost handling, no oversampling artifacts |
| Explainability | SHAP | Feature-level explanations for regulatory compliance |
| Experiment Tracking | MLflow | Reproducible runs, parameter logging |
| Drift Detection | Evidently AI + PSI | Production distribution shift monitoring |
| Dashboard | Streamlit | Interactive 5-page monitoring interface |
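Why PR-AUC over ROC-AUC? ROC-AUC is optimistic under class imbalance: its false positive rate is computed over the very large legitimate class, so thousands of false alarms barely move it. PR-AUC focuses on the minority class (fraud) and better reflects real-world detection performance.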
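The gap between the two metrics is easy to reproduce on synthetic scores. The sketch below is illustrative only: the data is randomly generated (not the project's dataset), the class ratio and score distributions are assumptions, and average precision is used as one common estimate of PR-AUC.

```python
import random

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Mann-Whitney U statistic, normalised to [0, 1].
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / k
    return total / hits

random.seed(42)
# Assumed toy setup: 0.5% positives whose scores sit two standard
# deviations above the negatives.
labels = [0] * 10_000 + [1] * 50
scores = [random.gauss(0.0, 1.0) for _ in range(10_000)]
scores += [random.gauss(2.0, 1.0) for _ in range(50)]

roc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC: {roc:.3f}  average precision: {ap:.3f}")
```

The same scores look strong under ROC-AUC and much weaker under average precision, because the few positives are outnumbered by high-scoring negatives at the top of the ranking.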
Why XGBoost over neural networks? On tabular data with only 30 features, deep learning rarely beats gradient-boosted trees. XGBoost with scale_pos_weight handles imbalance natively and trains in seconds rather than hours.
Why Optuna over GridSearch? Optuna uses TPE (Tree-structured Parzen Estimator) sampling, which uses the results of earlier trials to choose the next candidates. That explores continuous hyperparameter spaces more efficiently than an exhaustive grid.
Why scale_pos_weight over SMOTE? SMOTE increased recall to 92% but collapsed precision to 6% — 94% of fraud alerts became false positives. scale_pos_weight achieved both high precision (87%) and recall (85%) without generating synthetic samples.
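The numbers in this comparison follow from simple formulas. The sketch below derives the negative-to-positive ratio commonly used as a starting point for scale_pos_weight, plus the false-alert share and F1 for each row of the results table. It uses the full-dataset counts quoted in this README; which split the original 577 was computed on is not stated, so treat the match as approximate. F1 is added here for comparison and is not a metric reported by the project.

```python
# Class counts quoted in this README (full dataset).
n_total, n_fraud = 284_807, 492
n_legit = n_total - n_fraud

# Common heuristic for scale_pos_weight: negatives / positives.
class_ratio = n_legit / n_fraud

# (precision, recall) pairs copied from the results table.
results = {
    "LR baseline": (0.83, 0.64),
    "LR + SMOTE": (0.06, 0.92),
    "XGBoost + Optuna": (0.87, 0.85),
}

def false_alert_share(precision):
    """Fraction of raised alerts that are false positives."""
    return 1 - precision

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

summary = {
    name: (false_alert_share(p), f1_score(p, r))
    for name, (p, r) in results.items()
}

print(f"negatives / positives: {class_ratio:.1f}")
for name, (false_share, f1) in summary.items():
    print(f"{name:18s} false alerts: {false_share:.0%}  F1: {f1:.2f}")
```

Read this way, SMOTE's higher recall is paid for almost entirely in false alerts, while the weighted XGBoost model keeps both precision and recall high.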
fraud-detection-monitor/
├── notebooks/
│ ├── 01_eda.ipynb # Exploratory analysis, 3 hypotheses
│ ├── 02_baseline.ipynb # Logistic regression baseline
│ ├── 03_xgboost_mlflow.ipynb # XGBoost + Optuna + MLflow
│ ├── 04_shap_explainability.ipynb # SHAP feature importance
│ ├── 05_drift_detection.ipynb # PSI drift detection
│ └── 06_evidently.ipynb # Evidently AI reports
├── app/
│ └── streamlit_app.py # 5-page monitoring dashboard
├── reports/ # Generated plots and HTML reports
├── requirements.txt
└── README.md
- V14 is the dominant fraud signal (SHAP=2.57) — when V14 drops below -6, fraud probability increases sharply
- Fraud is disproportionate at night: the fraud rate per transaction is 2x higher between 0 and 4 AM, despite lower total volume
- V3 shows catastrophic drift by Week 4 (PSI=1.55), which would trigger an automatic retraining alert in production
- Model remains stable despite drift — XGBoost compensates using V14 and V4 when V3 drifts
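For readers unfamiliar with PSI, it compares a feature's binned distribution in a reference window against a later window. The sketch below is a generic standard-library implementation, not the code from 05_drift_detection.ipynb. The 10 quantile bins, the epsilon floor, the synthetic Gaussian samples, and the 0.1 / 0.25 rule-of-thumb thresholds are all assumptions for illustration.

```python
import math
import random
from bisect import bisect_right

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    ref = sorted(expected)
    # Interior bin edges at the reference quantiles (n_bins - 1 cut points).
    edges = [ref[int(len(ref) * k / n_bins)] for k in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]  # e.g. training window
same = [random.gauss(0.0, 1.0) for _ in range(5000)]       # no drift
shifted = [random.gauss(1.5, 1.0) for _ in range(5000)]    # mean has moved

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)

def drift_level(value):
    """Widely used rule of thumb, assumed here: <0.1 stable, >0.25 significant."""
    if value < 0.1:
        return "stable"
    return "moderate" if value < 0.25 else "significant"

print(f"same distribution:    PSI={psi_same:.3f} ({drift_level(psi_same)})")
print(f"shifted distribution: PSI={psi_shifted:.3f} ({drift_level(psi_shifted)})")
```

Under these conventional thresholds, the PSI of 1.55 reported for V3 sits far above the level usually treated as significant drift.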
git clone https://github.com/Harshi06-code/fraud-detection-monitor.git
cd fraud-detection-monitor
pip install -r requirements.txt
# Add creditcard.csv to data/ folder from Kaggle
streamlit run app/streamlit_app.py
Credit Card Fraud Detection: 284,807 transactions, 492 fraud cases (0.17%)
Harshitha | B.Tech Computer Science | Amrita Vishwa Vidyapeetham