GitHub hosts an enormous amount of user activity, including pull requests, issues, forks, and stars. Monitoring this activity in real time is essential for identifying unusual or malicious behavior, such as bots, misuse, or suspicious spikes in contributions.
This project aims to build a production-grade anomaly detection system to:
- Detect abnormal GitHub user behavior (e.g., excessive PRs, bot-like stars)
- Alert maintainers and admins in real time via Slack or email
- Serve anomaly scores via API and support continuous retraining
- Visualize trends, drift, and recent activity using an interactive dashboard
The anomaly detection system is built using:
- Apache Airflow for orchestration
- Pandas + Scikit-learn (Isolation Forest) for modeling and anomaly detection
- Email & Slack alerting for anomaly spikes and data drift
- FastAPI for real-time inference
- Pytest, Black, Flake8 for testing and linting
- Pre-commit + GitHub Actions for CI/CD and code quality
- Streamlit UI for visualization
- Terraform for infrastructure-as-code provisioning (MLflow)
- AWS S3 for optional cloud-based storage of features, models, and predictions
The full architecture of this GitHub anomaly detection pipeline is illustrated in the diagram below.
A quick guide for evaluators to verify all requirements and navigate the implementation easily.
If you're like me and hate typing out commands... good news!
Just use the Makefile to do all the boring stuff for you:
make help
See full Makefile usage here, from setup to linting, testing, API, Airflow, and Terraform infra!
.
├── dags/                   # Airflow DAGs for data pipeline and retraining
├── data/                   # Input datasets (raw, features, processed)
├── models/                 # Trained ML models (e.g., Isolation Forest)
├── mlruns/                 # MLflow experiment tracking artifacts
├── infra/                  # Terraform IaC for provisioning MLflow container
├── github_pipeline/        # Feature engineering, inference, monitoring scripts
├── tests/                  # Pytest-based unit/integration tests
├── reports/                # Data drift reports (JSON/HTML) from Evidently
├── alerts/                 # Alert log dumps (e.g., triggered drift/anomaly alerts)
├── notebooks/              # Jupyter notebooks for exploration & experimentation
├── assets/                 # Images and architecture diagrams for README
├── .github/workflows/      # GitHub Actions CI/CD pipelines
├── streamlit_app.py        # Realtime dashboard for monitoring
├── serve_model.py          # FastAPI inference service
├── Dockerfile.*            # Dockerfiles for API and Streamlit services
├── docker-compose.yaml     # Compose file to run Airflow and supporting services
├── Makefile                # Task automation: setup, test, Airflow, Terraform, etc.
├── requirements.txt        # Python dependencies for Airflow containers
├── Pipfile / Pipfile.lock  # Python project environment (via Pipenv)
├── .env                    # Environment variables (Slack, Email, Airflow UID, S3 support flag)
└── README.md               # You are here
git clone https://github.com/rajat116/github-anomaly-project.git
cd github-anomaly-project
pipenv install --dev
pipenv shell
pip install -r requirements.txt
Before running Airflow, you must create a .env file in the project root with at least the following content:
AIRFLOW_UID=50000
USE_S3=false
This is required for Docker to set correct permissions inside the Airflow containers.
Set this flag to control where your pipeline reads/writes files:
- USE_S3=false: All files will be stored locally (default, for development and testing)
- USE_S3=true: Files will be written to and read from AWS S3
Required when USE_S3=true
If you enable S3 support, also provide your AWS credentials in the .env:
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
AWS_REGION=us-east-1
S3_BUCKET_NAME=github-anomaly-logs
Tip for Contributors
If you're testing locally or don't have AWS credentials, just keep:
USE_S3=false
This will disable all cloud storage usage and allow you to run the full pipeline locally.
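For illustration, this is roughly how such a toggle can be read inside the pipeline code. The helper below is a sketch only; the function name and file paths are hypothetical, not the project's actual implementation, and reading s3:// paths with pandas assumes s3fs is installed:

```python
import os

import pandas as pd


def resolve_path(relative_path: str) -> str:
    """Return a local path or an S3 URI depending on the USE_S3 flag."""
    if os.getenv("USE_S3", "false").lower() == "true":
        bucket = os.environ["S3_BUCKET_NAME"]
        return f"s3://{bucket}/{relative_path}"
    return relative_path


# Example usage (hypothetical file name); pandas reads both local and s3:// paths.
features = pd.read_parquet(resolve_path("data/features/latest_features.parquet"))
```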
If you'd like to enable alerts, you can also include the following variables:
# Slack Alerts
SLACK_API_TOKEN=xoxb-...
SLACK_CHANNEL=#your-channel
# Email Alerts
EMAIL_SENDER=your_email@example.com
EMAIL_PASSWORD=your_email_app_password
EMAIL_RECEIVER=receiver@example.com
EMAIL_SMTP=smtp.gmail.com
EMAIL_PORT=587
This project uses Apache Airflow to orchestrate a real-time ML pipeline and MLflow to track model training, metrics, and artifacts.
Build & Launch
docker compose build airflow
docker compose up airflow
Once up, access:
- Airflow UI: http://localhost:8080 (Login: airflow / airflow)
- MLflow UI: http://localhost:5000
The project ships three Airflow DAGs (a simplified DAG sketch follows this list):
- daily_github_inference: Download → Feature Engineering → Inference
- daily_monitoring_dag: Drift checks, cleanup, alerting
- retraining_dag: Triggers model training weekly and logs it to MLflow
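For orientation, here is a heavily simplified sketch of how the daily inference DAG might be wired. The placeholder callables stand in for the real logic in github_pipeline/ and dags/, and the schedule and task names are illustrative assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_events():
    pass  # placeholder: download the latest GitHub event data


def build_features():
    pass  # placeholder: aggregate events into actor-wise features


def run_inference():
    pass  # placeholder: score actors with the trained Isolation Forest


with DAG(
    dag_id="daily_github_inference",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    download = PythonOperator(task_id="download", python_callable=download_events)
    features = PythonOperator(task_id="feature_engineering", python_callable=build_features)
    inference = PythonOperator(task_id="inference", python_callable=run_inference)

    download >> features >> inference
```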
Model training is handled by:
github_pipeline/train_model.py
Each run logs the following:
Parameters:
- timestamp: Training batch timestamp
- model_type: Algorithm used (IsolationForest)
- n_estimators: Number of trees
Metrics:
- mean_anomaly_score
- num_anomalies
- num_total
- anomaly_rate
Artifacts:
- isolation_forest.pkl: Trained model
- actor_predictions_.parquet
- MLflow Model Registry entry
All experiments are stored in the mlruns/ volume:
volumes:
- ./mlruns:/opt/airflow/mlruns
You can explore experiment runs and models in the MLflow UI.
The model (Isolation Forest) is trained on actor-wise event features:
python github_pipeline/train_model.py
The latest parquet file is used automatically. Model and scaler are saved to models/.
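A condensed sketch of what a training run along these lines can look like. The feature file pattern, numeric-column selection, and hyperparameters below are assumptions for illustration, not the exact contents of train_model.py:

```python
import glob

import joblib
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Pick the newest feature file (assumed location and naming).
latest = sorted(glob.glob("data/features/*.parquet"))[-1]
features = pd.read_parquet(latest)
X = features.select_dtypes("number")

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = IsolationForest(n_estimators=100, random_state=42)
preds = model.fit_predict(X_scaled)            # -1 = anomaly, 1 = normal
scores = model.decision_function(X_scaled)

with mlflow.start_run():
    mlflow.log_param("model_type", "IsolationForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("mean_anomaly_score", float(scores.mean()))
    mlflow.log_metric("num_anomalies", int((preds == -1).sum()))
    mlflow.log_metric("num_total", int(len(preds)))
    mlflow.log_metric("anomaly_rate", float((preds == -1).mean()))
    mlflow.sklearn.log_model(model, "isolation_forest")

# Persist model and scaler for the inference service (models/ is assumed to exist).
joblib.dump(model, "models/isolation_forest.pkl")
joblib.dump(scaler, "models/scaler.pkl")
```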
docker build -t github-anomaly-inference -f Dockerfile.inference .
docker run -p 8000:8000 github-anomaly-inference
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"features": [12, 0, 1, 0, 4]}'
This project includes automated alerting mechanisms for anomaly spikes and data drift, integrated into the daily_monitoring_dag DAG.
- Anomaly Rate Alert: triggered when the anomaly rate exceeds a threshold (e.g., >10% of actors).
- Drift Detection Alert: triggered when feature distributions change significantly over time.
Alerts are delivered via:
- Email alerts (via smtplib)
- Slack alerts (via Slack Incoming Webhooks)
Set the following environment variables in your Airflow setup:
# .env or Airflow environment
ALERT_EMAIL_FROM=your_email@example.com
ALERT_EMAIL_TO=recipient@example.com
ALERT_EMAIL_PASSWORD=your_email_app_password
ALERT_EMAIL_SMTP=smtp.gmail.com
ALERT_EMAIL_PORT=587
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
Email app passwords are recommended over actual passwords for Gmail or Outlook.
Logic is handled inside:
github_pipeline/monitor.py
alerts/alerting.py
These generate alert messages and send them through email and Slack if thresholds are breached.
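The snippet below is a simplified illustration of threshold-based alerting over a Slack incoming webhook and SMTP; the threshold value and function name are assumptions, but the environment variables match the ones listed above:

```python
import os
import smtplib
from email.message import EmailMessage

import requests

ANOMALY_RATE_THRESHOLD = 0.10  # assumed: alert when more than 10% of actors are anomalous


def send_alerts(anomaly_rate: float) -> None:
    if anomaly_rate <= ANOMALY_RATE_THRESHOLD:
        return
    text = f"GitHub anomaly rate is {anomaly_rate:.1%} (threshold {ANOMALY_RATE_THRESHOLD:.0%})"

    # Slack: post the message to the incoming webhook.
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)

    # Email: send via SMTP with STARTTLS, using an app password.
    msg = EmailMessage()
    msg["Subject"] = "GitHub anomaly alert"
    msg["From"] = os.environ["ALERT_EMAIL_FROM"]
    msg["To"] = os.environ["ALERT_EMAIL_TO"]
    msg.set_content(text)
    with smtplib.SMTP(os.environ["ALERT_EMAIL_SMTP"], int(os.environ["ALERT_EMAIL_PORT"])) as smtp:
        smtp.starttls()
        smtp.login(os.environ["ALERT_EMAIL_FROM"], os.environ["ALERT_EMAIL_PASSWORD"])
        smtp.send_message(msg)
```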
The .github/workflows/ci.yml file runs on push:
- black --check
- flake8 (E501, W503 ignored)
- pytest
- (optional) Docker build
Pre-commit hooks ensure style and linting:
pre-commit install
pre-commit run --all-files
Configured via:
- .pre-commit-config.yaml
- .flake8 (ignore = E501)
This project includes both unit tests and a full integration test to ensure end-to-end pipeline functionality.
Run all tests:
PYTHONPATH=. pytest
Unit tests for:
- Inference API (serve_model.py)
- Feature engineering (feature_engineering.py)
- Model training logic (train_model.py)
Integration test (test_pipeline_integration.py) for:
- End-to-end flow using latest available local data:
- processed data → feature engineering → model inference
These tests are also automatically run via pre-commit and GitHub Actions.
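As a hedged example, a unit test for the inference API could look like the following; it assumes serve_model.py exposes a FastAPI instance named app and mirrors the payload from the curl example above:

```python
from fastapi.testclient import TestClient

from serve_model import app  # assumes serve_model.py exposes `app`

client = TestClient(app)


def test_predict_returns_ok():
    # Payload shape mirrors the curl example in this README.
    response = client.post("/predict", json={"features": [12, 0, 1, 0, 4]})
    assert response.status_code == 200
    assert isinstance(response.json(), dict)
```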
The project includes an optional interactive Streamlit dashboard to visualize:
- Latest anomaly predictions
- Data drift metrics from the Evidently report
- Top actors based on GitHub activity
- Activity summary over the last 48 hours
Make sure you have installed all dependencies via Pipenv, then launch the Streamlit app:
streamlit run streamlit_app.py
Once it starts, open the dashboard in your browser at:
http://localhost:8501
The app will automatically load:
- The latest prediction file from data/features/
- The latest drift report from reports/
Note: If these files do not exist, the dashboard will show a warning or empty state. You can generate them by running the Airflow pipeline or the monitoring scripts manually.
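A minimal sketch of the file-loading pattern described above; the glob pattern and display choices are assumptions about the repository layout, not the actual streamlit_app.py code:

```python
import glob

import pandas as pd
import streamlit as st

st.title("GitHub Anomaly Detection Dashboard")

# Load the newest prediction file, if any (assumed naming under data/features/).
prediction_files = sorted(glob.glob("data/features/*predictions*.parquet"))
if prediction_files:
    predictions = pd.read_parquet(prediction_files[-1])
    st.subheader("Latest anomaly predictions")
    st.dataframe(predictions.head(50))
else:
    st.warning("No prediction files found. Run the Airflow pipeline or monitoring scripts first.")
```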
You can also build and run the dashboard as a container (if desired):
Build the image:
docker build -t github-anomaly-dashboard -f Dockerfile.streamlit .
Run the container:
docker run -p 8501:8501 \
-v $(pwd)/data:/app/data \
-v $(pwd)/reports:/app/reports \
github-anomaly-dashboard
Then open your browser at http://localhost:8501.
This Terraform module provisions a Docker-based MLflow tracking server, matching the setup used in docker-compose.yaml, but on a different port (5050) to avoid conflicts.
- infra/main.tf # Terraform configuration
- README.md # This file
cd infra
terraform init
terraform apply # Confirm with yes when prompted.
MLflow server will be available at:
http://localhost:5050
All artifacts will be stored in your project's mlruns/ directory.
terraform destroy
This removes the MLflow container provisioned by Terraform.
All code follows:
- PEP8 formatting via Black
- Linting with Flake8 + Bugbear
- Pre-commit hook enforcement
This project includes a Makefile that simplifies formatting, testing, building Docker containers, and running Airflow or the FastAPI inference app.
You can run all commands with or without activating the Pipenv shell. For example:
make lint
make install # Install all dependencies via Pipenv (both runtime and dev)
make create-env # Create .env file with AIRFLOW_UID, alert placeholders, and S3 support flag
make clean # Remove all __pycache__ folders and .pyc files
make format # Format code using Black
make lint # Lint code using Flake8
make test # Run tests using Pytest
make check # Run all of the above together
make streamlit # Launch the Streamlit dashboard at http://localhost:8501
make docker-build # Build the Docker image for FastAPI app
make docker-run # Run the Docker container on port 8000
make api-test # Send a test prediction request using curl
After running make docker-run, open another terminal and run make api-test.
make airflow-up # Start Airflow services (scheduler, UI, etc.)
make airflow-down # Stop all Airflow containers
Once up, access:
- Airflow UI: http://localhost:8080 (Login: airflow / airflow)
- MLflow UI: http://localhost:5000
make install-terraform # Install Terraform CLI if not present
make terraform-init # Initialize Terraform config
make terraform-apply # Provision MLflow container (port 5050)
make terraform-destroy # Tear down MLflow container
make terraform-status # Show current infra state
make help # Prints a summary of all available targets and their descriptions.
Built by Rajat Gupta as part of an MLOps portfolio. Inspired by real-time event pipelines and anomaly detection architectures used in production.
Each criterion below links to the relevant section of this README to help evaluators verify the implementation easily.
The project clearly defines the problem of detecting anomalous GitHub activity using real-time machine learning. See here
The project runs in GitHub Codespaces and supports AWS S3 with a USE_S3 toggle. See here
MLflow is fully integrated to track experiments and register models. See here
Uses Apache Airflow with 3 deployed DAGs for inference, monitoring, and retraining. See here
Model is served via FastAPI and fully containerized for deployment. See here
Implements drift detection, anomaly thresholding, and sends alerts via Slack and Email. See here
The project is fully reproducible with clear instructions, dependency locking, and data structure. See here
- Unit tests: Pytest-based unit tests on core components. See here
- Integration test: Full integration test to validate the entire pipeline. See here
- Linter & Code formatter: Uses Black and Flake8 with Makefile targets and pre-commit hooks. See here
- Makefile: Includes targets for install, lint, test, format, build, and airflow. See here
- Pre-commit hooks: Automatically formats and checks code before commits. See here
- CI/CD pipeline: GitHub Actions run tests, lint, and build containers on push. See here