GitHub hosts an enormous amount of user activity, including pull requests, issues, forks, and stars. Monitoring this activity in real time is essential for identifying unusual or malicious behavior, such as bots, misuse, or suspicious spikes in contributions.
This project aims to build a production-grade anomaly detection system to:
- Detect abnormal GitHub user behavior (e.g., excessive PRs, bot-like stars)
- Alert maintainers and admins in real time via Slack or email
- Serve anomaly scores via API and support continuous retraining
- Visualize trends, drift, and recent activity using an interactive dashboard
The anomaly detection system is built using:
- Apache Airflow for orchestration
- Pandas + Scikit-learn (Isolation Forest) for modeling and anomaly detection
- Email & Slack alerting for anomaly spikes and data drift
- FastAPI for real-time inference
- Pytest, Black, Flake8 for testing and linting
- Pre-commit + GitHub Actions for CI/CD and code quality
- Streamlit UI for visualization
- Terraform for infrastructure-as-code provisioning (MLflow)
- AWS S3 for optional cloud-based storage of features, models, and predictions
The full architecture of this GitHub anomaly detection pipeline is illustrated in the diagram below.
A quick guide for evaluators to verify all requirements and navigate the implementation easily.
If you're like me and hate typing out commands... good news!
Just use the Makefile to do all the boring stuff for you:
make help
See full Makefile usage here, from setup to linting, testing, API, Airflow, and Terraform infra!
.
├── dags/                   # Airflow DAGs for data pipeline and retraining
├── data/                   # Input datasets (raw, features, processed)
├── models/                 # Trained ML models (e.g., Isolation Forest)
├── mlruns/                 # MLflow experiment tracking artifacts
├── infra/                  # Terraform IaC for provisioning MLflow container
├── github_pipeline/        # Feature engineering, inference, monitoring scripts
├── tests/                  # Pytest-based unit/integration tests
├── reports/                # Data drift reports (JSON/HTML) from Evidently
├── alerts/                 # Alert log dumps (e.g., triggered drift/anomaly alerts)
├── notebooks/              # Jupyter notebooks for exploration & experimentation
├── assets/                 # Images and architecture diagrams for README
├── .github/workflows/      # GitHub Actions CI/CD pipelines
├── streamlit_app.py        # Realtime dashboard for monitoring
├── serve_model.py          # FastAPI inference service
├── Dockerfile.*            # Dockerfiles for API and Streamlit services
├── docker-compose.yaml     # Compose file to run Airflow and supporting services
├── Makefile                # Task automation: setup, test, Airflow, Terraform, etc.
├── requirements.txt        # Python dependencies for Airflow containers
├── Pipfile / Pipfile.lock  # Python project environment (via Pipenv)
├── .env                    # Environment variables (Slack, Email, Airflow UID, S3 support flag)
└── README.md               # You are here
git clone https://github.com/rajat116/github-anomaly-project.git
cd github-anomaly-project
pipenv install --dev
pipenv shell
pip install -r requirements.txt
Before running Airflow, you must create a .env file in the project root with at least the following content:
AIRFLOW_UID=50000
USE_S3=false
This is required for Docker to set correct permissions inside the Airflow containers.
Set this flag to control where your pipeline reads/writes files:
- USE_S3=false: All files will be stored locally (default, for development and testing)
- USE_S3=true: Files will be written to and read from AWS S3
Required when USE_S3=true
If you enable S3 support, also provide your AWS credentials in the .env:
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
AWS_REGION=us-east-1
S3_BUCKET_NAME=github-anomaly-logs
Tip for Contributors
If you're testing locally or don't have AWS credentials, just keep:
USE_S3=false
This will disable all cloud storage usage and allow you to run the full pipeline locally.
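For illustration, this is roughly how such a toggle can be read inside the pipeline code. The helper below is a sketch only; the function name and file paths are hypothetical, not the project's actual implementation, and reading s3:// paths with pandas assumes s3fs is installed:

```python
import os

import pandas as pd


def resolve_path(relative_path: str) -> str:
    """Return a local path or an S3 URI depending on the USE_S3 flag."""
    if os.getenv("USE_S3", "false").lower() == "true":
        bucket = os.environ["S3_BUCKET_NAME"]
        return f"s3://{bucket}/{relative_path}"
    return relative_path


# Example usage (hypothetical file name); pandas reads both local and s3:// paths.
features = pd.read_parquet(resolve_path("data/features/latest_features.parquet"))
```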
If you'd like to enable alerts, you can also include the following variables:
# Slack Alerts
SLACK_API_TOKEN=xoxb-...
SLACK_CHANNEL=#your-channel
# Email Alerts
EMAIL_SENDER=your_email@example.com
EMAIL_PASSWORD=your_email_app_password
EMAIL_RECEIVER=receiver@example.com
EMAIL_SMTP=smtp.gmail.com
EMAIL_PORT=587
This project uses Apache Airflow to orchestrate a real-time ML pipeline and MLflow to track model training, metrics, and artifacts.
Build & Launch
docker compose build airflow
docker compose up airflow
Once up, access:
- Airflow UI: http://localhost:8080 (Login: airflow / airflow)
- MLflow UI: http://localhost:5000
The project ships three Airflow DAGs (a simplified DAG sketch follows this list):
- daily_github_inference: Download → Feature Engineering → Inference
- daily_monitoring_dag: Drift checks, cleanup, alerting
- retraining_dag: Triggers model training weekly and logs it to MLflow
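For orientation, here is a heavily simplified sketch of how the daily inference DAG might be wired. The placeholder callables stand in for the real logic in github_pipeline/ and dags/, and the schedule and task names are illustrative assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_events():
    pass  # placeholder: download the latest GitHub event data


def build_features():
    pass  # placeholder: aggregate events into actor-wise features


def run_inference():
    pass  # placeholder: score actors with the trained Isolation Forest


with DAG(
    dag_id="daily_github_inference",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    download = PythonOperator(task_id="download", python_callable=download_events)
    features = PythonOperator(task_id="feature_engineering", python_callable=build_features)
    inference = PythonOperator(task_id="inference", python_callable=run_inference)

    download >> features >> inference
```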
Model training is handled by:
github_pipeline/train_model.py
Each run logs the following:
Parameters:
- timestamp: Training batch timestamp
- model_type: Algorithm used (IsolationForest)
- n_estimators: Number of trees
Metrics:
- mean_anomaly_score
- num_anomalies
- num_total
- anomaly_rate
Artifacts:
- isolation_forest.pkl: Trained model
- actor_predictions_.parquet
- MLflow Model Registry entry
All experiments are stored in the mlruns/ volume:
volumes:
- ./mlruns:/opt/airflow/mlruns
You can explore experiment runs and models in the MLflow UI.
The model (Isolation Forest) is trained on actor-wise event features:
python github_pipeline/train_model.py
The latest parquet file is used automatically. Model and scaler are saved to models/.
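A condensed sketch of what a training run along these lines can look like. The feature file pattern, numeric-column selection, and hyperparameters below are assumptions for illustration, not the exact contents of train_model.py:

```python
import glob

import joblib
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Pick the newest feature file (assumed location and naming).
latest = sorted(glob.glob("data/features/*.parquet"))[-1]
features = pd.read_parquet(latest)
X = features.select_dtypes("number")

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = IsolationForest(n_estimators=100, random_state=42)
preds = model.fit_predict(X_scaled)            # -1 = anomaly, 1 = normal
scores = model.decision_function(X_scaled)

with mlflow.start_run():
    mlflow.log_param("model_type", "IsolationForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("mean_anomaly_score", float(scores.mean()))
    mlflow.log_metric("num_anomalies", int((preds == -1).sum()))
    mlflow.log_metric("num_total", int(len(preds)))
    mlflow.log_metric("anomaly_rate", float((preds == -1).mean()))
    mlflow.sklearn.log_model(model, "isolation_forest")

# Persist model and scaler for the inference service (models/ is assumed to exist).
joblib.dump(model, "models/isolation_forest.pkl")
joblib.dump(scaler, "models/scaler.pkl")
```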
docker build -t github-anomaly-inference -f Dockerfile.inference .
docker run -p 8000:8000 github-anomaly-inference
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"features": [12, 0, 1, 0, 4]}'
This project includes automated alerting mechanisms for anomaly spikes and data drift, integrated into the daily_monitoring_dag DAG.
- Anomaly Rate Alert: triggered when the anomaly rate exceeds a threshold (e.g., >10% of actors).
- Drift Detection Alert: triggered when feature distributions change significantly over time.
Alerts are delivered via:
- Email alerts (via smtplib)
- Slack alerts (via Slack Incoming Webhooks)
Set the following environment variables in your Airflow setup:
# .env or Airflow environment
ALERT_EMAIL_FROM=your_email@example.com
ALERT_EMAIL_TO=recipient@example.com
ALERT_EMAIL_PASSWORD=your_email_app_password
ALERT_EMAIL_SMTP=smtp.gmail.com
ALERT_EMAIL_PORT=587
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
Email app passwords are recommended over actual passwords for Gmail or Outlook.
Logic is handled inside:
github_pipeline/monitor.py
alerts/alerting.py
These generate alert messages and send them through email and Slack if thresholds are breached.
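The snippet below is a simplified illustration of threshold-based alerting over a Slack incoming webhook and SMTP; the threshold value and function name are assumptions, but the environment variables match the ones listed above:

```python
import os
import smtplib
from email.message import EmailMessage

import requests

ANOMALY_RATE_THRESHOLD = 0.10  # assumed: alert when more than 10% of actors are anomalous


def send_alerts(anomaly_rate: float) -> None:
    if anomaly_rate <= ANOMALY_RATE_THRESHOLD:
        return
    text = f"GitHub anomaly rate is {anomaly_rate:.1%} (threshold {ANOMALY_RATE_THRESHOLD:.0%})"

    # Slack: post the message to the incoming webhook.
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)

    # Email: send via SMTP with STARTTLS, using an app password.
    msg = EmailMessage()
    msg["Subject"] = "GitHub anomaly alert"
    msg["From"] = os.environ["ALERT_EMAIL_FROM"]
    msg["To"] = os.environ["ALERT_EMAIL_TO"]
    msg.set_content(text)
    with smtplib.SMTP(os.environ["ALERT_EMAIL_SMTP"], int(os.environ["ALERT_EMAIL_PORT"])) as smtp:
        smtp.starttls()
        smtp.login(os.environ["ALERT_EMAIL_FROM"], os.environ["ALERT_EMAIL_PASSWORD"])
        smtp.send_message(msg)
```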
The .github/workflows/ci.yml file runs on push:
- black --check
- flake8 (E501, W503 ignored)
- pytest
- (optional) Docker build
Pre-commit hooks ensure style and linting:
pre-commit install
pre-commit run --all-files
Configured via:
- .pre-commit-config.yaml
- .flake8 (ignore = E501)
This project includes both unit tests and a full integration test to ensure end-to-end pipeline functionality.
Run all tests:
PYTHONPATH=. pytest
Unit tests for:
- Inference API (serve_model.py)
- Feature engineering (feature_engineering.py)
- Model training logic (train_model.py)
Integration test (test_pipeline_integration.py) for:
- End-to-end flow using latest available local data:
- processed data → feature engineering → model inference
These tests are also automatically run via pre-commit and GitHub Actions.
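As a hedged example, a unit test for the inference API could look like the following; it assumes serve_model.py exposes a FastAPI instance named app and mirrors the payload from the curl example above:

```python
from fastapi.testclient import TestClient

from serve_model import app  # assumes serve_model.py exposes `app`

client = TestClient(app)


def test_predict_returns_ok():
    # Payload shape mirrors the curl example in this README.
    response = client.post("/predict", json={"features": [12, 0, 1, 0, 4]})
    assert response.status_code == 200
    assert isinstance(response.json(), dict)
```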
The project includes an optional interactive Streamlit dashboard to visualize:
- Latest anomaly predictions
- Data drift metrics from the Evidently report
- Top actors based on GitHub activity
- Activity summary over the last 48 hours
Make sure you have installed all dependencies via Pipenv, then launch the Streamlit app:
streamlit run streamlit_app.py
Once it starts, open the dashboard in your browser at:
http://localhost:8501
The app will automatically load:
- The latest prediction file from data/features/
- The latest drift report from reports/
Note: If these files do not exist, the dashboard will show a warning or empty state. You can generate them by running the Airflow pipeline or the monitoring scripts manually.
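A minimal sketch of the file-loading pattern described above; the glob pattern and display choices are assumptions about the repository layout, not the actual streamlit_app.py code:

```python
import glob

import pandas as pd
import streamlit as st

st.title("GitHub Anomaly Detection Dashboard")

# Load the newest prediction file, if any (assumed naming under data/features/).
prediction_files = sorted(glob.glob("data/features/*predictions*.parquet"))
if prediction_files:
    predictions = pd.read_parquet(prediction_files[-1])
    st.subheader("Latest anomaly predictions")
    st.dataframe(predictions.head(50))
else:
    st.warning("No prediction files found. Run the Airflow pipeline or monitoring scripts first.")
```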
You can also build and run the dashboard as a container (if desired):
Build the image:
docker build -t github-anomaly-dashboard -f Dockerfile.streamlit .
Run the container:
docker run -p 8501:8501 \
-v $(pwd)/data:/app/data \
-v $(pwd)/reports:/app/reports \
github-anomaly-dashboard
Then open your browser at http://localhost:8501.
This Terraform module provisions a Docker-based MLflow tracking server, matching the setup used in docker-compose.yaml, but on a different port (5050) to avoid conflicts.
- infra/main.tf # Terraform configuration
- README.md # This file
cd infra
terraform init
terraform apply # Confirm with yes when prompted.
MLflow server will be available at:
http://localhost:5050
All artifacts will be stored in your project's mlruns/ directory.
terraform destroy
This removes the MLflow container provisioned by Terraform.
All code follows:
- PEP8 formatting via Black
- Linting with Flake8 + Bugbear
- Pre-commit hook enforcement
This project includes a Makefile that simplifies formatting, testing, building Docker containers, and running Airflow or the FastAPI inference app.
You can run all commands with or without activating the Pipenv shell. For example:
make lint
make install # Install all dependencies via Pipenv (both runtime and dev)
make create-env # Create .env file with AIRFLOW_UID, alert placeholders, and S3 support flag
make clean # Remove all __pycache__ folders and .pyc files
make format # Format code using Black
make lint # Lint code using Flake8
make test # Run tests using Pytest
make check # Run all of the above together
make streamlit # Launch the Streamlit dashboard at http://localhost:8501
make docker-build # Build the Docker image for FastAPI app
make docker-run # Run the Docker container on port 8000
make api-test # Send a test prediction request using curl
After running make docker-run, open another terminal and run make api-test.
make airflow-up # Start Airflow services (scheduler, UI, etc.)
make airflow-down # Stop all Airflow containers
Once up, access:
- Airflow UI: http://localhost:8080 (Login: airflow / airflow)
- MLflow UI: http://localhost:5000
make install-terraform # Install Terraform CLI if not present
make terraform-init # Initialize Terraform config
make terraform-apply # Provision MLflow container (port 5050)
make terraform-destroy # Tear down MLflow container
make terraform-status # Show current infra state
make help # Prints a summary of all available targets and their descriptions.
Built by Rajat Gupta as part of an MLOps portfolio. Inspired by real-time event pipelines and anomaly detection architectures used in production.
Each criterion below links to the relevant section of this README to help evaluators verify the implementation easily.
The project clearly defines the problem of detecting anomalous GitHub activity using real-time machine learning. See here
The project runs in GitHub Codespaces and supports AWS S3 with a USE_S3 toggle. See here
MLflow is fully integrated to track experiments and register models. See here
Uses Apache Airflow with 3 deployed DAGs for inference, monitoring, and retraining. See here
Model is served via FastAPI and fully containerized for deployment. See here
Implements drift detection, anomaly thresholding, and sends alerts via Slack and Email. See here
The project is fully reproducible with clear instructions, dependency locking, and data structure. See here
- Unit tests: Pytest-based unit tests on core components. See here
- Integration test: Full integration test to validate the entire pipeline. See here
- Linter & Code formatter: Uses Black and Flake8 with Makefile targets and pre-commit hooks. See here
- Makefile: Includes targets for install, lint, test, format, build, and airflow. See here
- Pre-commit hooks: Automatically formats and checks code before commits. See here
- CI/CD pipeline: GitHub Actions run tests, lint, and build containers on push. See here