This repository contains the code and configuration for the LLM Legal Document Summarization project, deployed on Chameleon Cloud. Follow the sections below to understand the system lifecycle from data ingestion to production serving, and see links to the specific implementation files.
Target Customer: Legal analysts at corporate law firms who need fast, accurate summaries of incoming legal documents to accelerate review.
- Customer Details:
- Receives >100 documents/day (.pdf, .docx)
- Needs to look up previous judgments and their summaries by keyword search
- Requires summary within minutes of upload
- Ground-truth labels (expert summaries) available after review
Design Influences: Data size, latency requirements, retraining frequency.
- Offline Data: ~10 GB of raw documents (~20K files) from the Zenodo data source, plus ~3.3K files of case data from the Kaggle data source deepcontractor/supreme-court-judgment-prediction
- Model Size: Fine-tuned Llama-2-7B; training uses 2×A100 GPUs
- Deployment Throughput: ~500 inference requests/day (roughly 1 req/min during working hours)
Provisioning and configuration via Terraform and Ansible:
- Terraform:
Terraform configurations, variables, and settings (Day 0)
- Ansible Playbooks:
Ansible notebooks
- Argo CD:
Argo CD notebooks for 3 environments
On Chameleon:
- Object Store:
Structure and contents:
├── production.jsonl
├── test.jsonl
└── train.jsonl
- Block Volume: Notebook with instructions to create and partition the volume, add a file system, access it, and run containers on it
We initially created a 50 GiB block volume, later extended to 100 GiB to store our ONNX model, RAG data, etc.
Structure and contents:
block-persist-project33
├── minio_data
├── mlflow-artifacts
├── ray
├── postgres_data
└── rag_data
    ├── model_rag
    │   ├── index_to_doc.pkl
    │   └── legal-facts.index
    └── rag_chunks
The mlflow-artifacts folder contains all artifacts generated during training and serving.
The ray folder contains Ray Train checkpoints.
The postgres_data folder contains the PostgreSQL data directory.
The rag_data folder contains all data for our RAG pipeline (embedding model sentence-transformers/all-MiniLM-L6-v2): the document chunks, the FAISS vector index, and the index-to-document mapping.
We use the Zenodo Indian & UK Legal Judgments Dataset containing ~20K court cases and corresponding human-written summaries.
- Sources: IN-Abs, UK-Abs, and IN-Ext
- Data Size: ~10 GB total; over 20,000 legal documents and associated summaries, from Zenodo.
- Format: Paired .txt files for full judgments and summaries
{
  "filename": "UKCiv2012.txt",
  "judgement": "The claimant seeks damages following breach of contract. The court heard evidence from both parties. After reviewing the statutory framework and case law precedent, the court finds that the defendant did not fulfill their obligations...",
  "summary": "The defendant breached the contract. The court awarded damages to the claimant.",
  "meta": {
    "doc_words": 2176,
    "sum_words": 132,
    "ratio": 0.06
  }
}
Our target user (a legal analyst at a law firm) regularly deals with such long-form judicial decisions. The Zenodo dataset closely mirrors their real-world workflow:
- They review lengthy judgment documents daily.
- They generate or consume summaries internally for client reporting.
- Our model mimics this process by learning from historic summaries.
Production samples (the 10% production split):
- Contain no ground-truth summaries at inference time.
- In a deployed setting, these represent new, unseen judgments uploaded by users.
- Once reviewed by a human expert, feedback summaries can be used to retrain the model, closing the feedback loop.
Steps handled in data_preprocessing.py:
- Ingestion: Load documents from raw folders.
- Merging: Combine segment-wise summaries if full summary not available.
- Cleaning: Normalize unicode, remove extra whitespace, lowercase.
- Sanity checks: Remove empty/duplicate/missing files.
- Filtering: Retain samples with 50–1500 summary words and acceptable doc:summary ratios.
- Split: 70% train, 20% test, 10% production, written to *.jsonl files (see the sketch after the diagram below).
┌──────────────────────────────┐
│ Raw Zenodo Dataset │
│ (/data/raw/* subfolders) │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Ingestion & File Loading │
│ - Load judgment + summary │
│ - Handle IN-Abs, UK-Abs, │
│ IN-Ext variants │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Merging Segment-wise │
│ - Combine partial summaries │
│ (facts, statute, etc.) │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Cleaning Text │
│ - Unicode normalization │
│ - Lowercasing │
│ - Remove extra whitespace │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Sanity Checks │
│ - Remove empty/missing files │
│ - Check for duplicates │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Statistical Filter │
│ - 50–1500 summary words │
│ - Ratio: 1–50% of doc length │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Split & Dump │
│ - 70% train │
│ - 20% test │
│ - 10% production │
│ → Output as `.jsonl` files │
└──────────────────────────────┘
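A minimal sketch of the filter-and-split stage (the actual logic lives in data_preprocessing.py; `keep` and `split_and_dump` are illustrative names, and the record layout follows the sample JSON above):

```python
import json
import random

def keep(record):
    """Statistical filter: 50-1500 summary words and a summary that is
    1-50% of the document length, per the diagram above."""
    doc_words = len(record["judgement"].split())
    sum_words = len(record["summary"].split())
    ratio = sum_words / max(doc_words, 1)
    return 50 <= sum_words <= 1500 and 0.01 <= ratio <= 0.50

def split_and_dump(records, seed=42):
    """Shuffle, then write 70/20/10 train/test/production JSONL files."""
    random.Random(seed).shuffle(records)
    n = len(records)
    splits = {
        "train.jsonl": records[: int(0.7 * n)],
        "test.jsonl": records[int(0.7 * n): int(0.9 * n)],
        "production.jsonl": records[int(0.9 * n):],
    }
    for name, subset in splits.items():
        with open(name, "w", encoding="utf-8") as f:
            for rec in subset:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# usage: split_and_dump([r for r in all_records if keep(r)])
```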
- Download & Extract
  - Use the Kaggle API to pull deepcontractor/supreme-court-judgment-prediction into /mnt/block/rag_data and unzip it.
- Load & Inspect
  - Read justice.csv with Pandas to verify the row count and columns (name, facts, etc.).
- Clean & Serialize
  - Normalize newlines, strip empty lines, and write each case's facts to rag_txt/{idx}_{safe_name}.txt.
- Chunk Documents
  - Tokenize with sentence-transformers/all-MiniLM-L6-v2 (512-token window, 64-token overlap).
  - Save each piece to rag_chunks/{original}_chunkXXX.txt (see the sketch below).
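An illustrative chunker under those settings (paths follow the layout above; `chunk_file` is a hypothetical helper, and the exact tokenizer options may differ from the project code):

```python
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_file(src: Path, out_dir: Path, window: int = 512, overlap: int = 64):
    """Slide a 512-token window with 64-token overlap over one document."""
    ids = tokenizer.encode(src.read_text(encoding="utf-8"), add_special_tokens=False)
    for i, start in enumerate(range(0, len(ids), window - overlap)):
        piece = tokenizer.decode(ids[start:start + window])
        (out_dir / f"{src.stem}_chunk{i:03d}.txt").write_text(piece, encoding="utf-8")

for txt in sorted(Path("rag_txt").glob("*.txt")):
    chunk_file(txt, Path("rag_chunks"))
```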
- Embed & Index
  - Encode chunks via SentenceTransformer.
  - Build a FAISS L2 index over the vectors.
  - Persist model_rag/legal-facts.index and model_rag/index_to_doc.pkl (sketch below).
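A sketch of this step, assuming the chunks written above are on disk (IndexFlatL2 matches the L2 index named here; batching and dtype handling are illustrative):

```python
import pickle
from pathlib import Path

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_paths = sorted(Path("rag_chunks").glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in chunk_paths]

# Encode all chunks, then build a flat L2 index over the vectors.
vectors = model.encode(texts, convert_to_numpy=True).astype(np.float32)
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Persist the index and the row -> source-chunk mapping.
faiss.write_index(index, "model_rag/legal-facts.index")
with open("model_rag/index_to_doc.pkl", "wb") as f:
    pickle.dump({i: str(p) for i, p in enumerate(chunk_paths)}, f)
```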
- Query-Time Retrieval
  - Embed the user query, FAISS search → top-K chunks.
  - Load the snippets, assemble the prompt, and send it to the fine-tuned Llama-2 for the final summary (sketch below).
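A hedged sketch of query-time retrieval; `top_k` and the snippet-joining format are assumptions, not the project's exact prompt template:

```python
import pickle
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.read_index("model_rag/legal-facts.index")
with open("model_rag/index_to_doc.pkl", "rb") as f:
    index_to_doc = pickle.load(f)

def retrieve(query: str, top_k: int = 5) -> str:
    """Embed the query, search FAISS, and return the assembled context."""
    vec = model.encode([query], convert_to_numpy=True).astype("float32")
    _, ids = index.search(vec, top_k)
    snippets = [Path(index_to_doc[i]).read_text(encoding="utf-8") for i in ids[0]]
    return "\n\n".join(snippets)

# The retrieved context is appended to the user prompt and sent to the
# fine-tuned Llama-2 endpoint for the final summary.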
- We spin up our Ray head and worker nodes (each with 1×A100 GPU) using a small Jupyter notebook: Ray-Train/start_ray
- We use the following notebook to submit our Ray job: Ray-Train/submit_ray
- Training script: Ray-Train/sft_train_llama
- Frameworks: PyTorch Lightning, Ray Train (DDP + fault tolerance), PEFT (LoRA), and MLflow for experiment tracking (see the sketch below)
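A minimal sketch of how such a Ray Train setup typically looks (2 workers with 1 GPU each, per the cluster above); `train_func` stands in for the Lightning fit loop in Ray-Train/sft_train_llama, and the failure-policy values are illustrative:

```python
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config=None):
    """Placeholder for the Lightning fit loop in sft_train_llama."""
    ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # 2 x A100 workers
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),  # fault tolerance
)
result = trainer.fit()
```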
- Checkpointing:
  - We save both the best val_loss checkpoint and the last epoch into ./checkpoints/ via Lightning's ModelCheckpoint(save_top_k=1, save_last=True) callback (sketch below).
  - On worker restarts, Ray supplies the last checkpoint directory and Lightning resumes from checkpoints/last.ckpt.
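A sketch of the callback configuration described above (the dirpath and monitor values mirror this README; mode is an assumed default):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    dirpath="./checkpoints/",
    monitor="val_loss",  # keep the single best checkpoint by val_loss
    mode="min",
    save_top_k=1,
    save_last=True,      # also write checkpoints/last.ckpt for resumption
)

# On restart, Lightning resumes from the last checkpoint, e.g.:
# trainer.fit(model, datamodule=dm, ckpt_path="checkpoints/last.ckpt")
```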
- Logging:
  - Metrics (train/val loss, epochs) are automatically logged to MLflow via the MLFlowLogger.
- Compare runs in mlruns/
- Merged the trained LoRA adapters into the Llama-2-7B base and exported the combined model as an FP16 ONNX file.
- Ran ONNX Runtime with the CPU, CUDA, and TensorRT execution providers, then selected the fastest execution path (see the sketch below).
Code for this step
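A minimal sketch of such a provider selection: time the same session across the available providers and keep the fastest. `model.onnx` and `sample_feed` are placeholders, not the project's actual paths or inputs:

```python
import time
import onnxruntime as ort

def fastest_provider(model_path: str, sample_feed: dict, runs: int = 5):
    """Benchmark each available execution provider and return the fastest."""
    candidates = ["TensorrtExecutionProvider", "CUDAExecutionProvider",
                  "CPUExecutionProvider"]
    timings = {}
    for provider in candidates:
        if provider not in ort.get_available_providers():
            continue
        sess = ort.InferenceSession(model_path, providers=[provider])
        sess.run(None, sample_feed)  # warm-up (builds TensorRT engines, etc.)
        start = time.perf_counter()
        for _ in range(runs):
            sess.run(None, sample_feed)
        timings[provider] = (time.perf_counter() - start) / runs
    return min(timings, key=timings.get), timings

# usage: best, times = fastest_provider("model.onnx", {"input_ids": ids})
```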
- Registered the resulting model in MLflow as a checkpoint, which the FastAPI endpoint then pulls for inference.
Code for the FastAPI service
- Dockerfile:
Dockerfile
- Input: User Prompt appended with RAG output
- Output: summary text
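A hedged sketch of what this input/output contract implies for the /generate endpoint; `run_onnx_summarizer` is a placeholder for the project's ONNX Runtime inference call, and the field names are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str  # user prompt, already appended with the RAG output

class GenerateResponse(BaseModel):
    summary: str

def run_onnx_summarizer(prompt: str) -> str:
    """Placeholder for the ONNX Runtime session call described above."""
    raise NotImplementedError

@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest) -> GenerateResponse:
    return GenerateResponse(summary=run_onnx_summarizer(req.prompt))
```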
- Ran the PyTest script (tests/test_offline_eval.py) to validate end-to-end preprocessing, inference, and summary format on sample inputs.
Monitoring_and_Evaluation/1_Setup_ModelEvalAndMonitoring.ipynb
- Executed the finalized model on the held-out test set to compute ROUGE metrics, then logged all scores to MLflow against the checkpoint registered in Section 9.1.
- Ran a Locust simulation against the /generate endpoint while monitoring throughput, latency, and errors in Grafana's "FastAPI Load Test" dashboard (a minimal user class is sketched below).
Notebook for load testing
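A minimal Locust user for this endpoint; the wait times and sample payload are assumptions, not the project's exact test script:

```python
from locust import HttpUser, between, task

class SummaryUser(HttpUser):
    wait_time = between(1, 5)  # seconds between simulated requests

    @task
    def generate_summary(self):
        self.client.post("/generate", json={
            "prompt": "Summarize: The claimant seeks damages following "
                      "breach of contract...",
        })

# Run: locust -f locustfile.py --host http://<staging-host>:8000
```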
- Evaluation plan:
docs/business_eval.md
- Staging deployment:
Staging deployment workflow
- Monitoring Dashboards: Grafana config
- Closing the feedback loop: LabelStudio
- Prometheus Dashboard: Dashboard
- Grafana Dashboard: Dashboard
- GitHub Actions workflow:
CI git merge test
- Triggers: push to main → tests → build Docker images → deploy to staging
- Flask App: We have a Flask app that takes input from the user, looks it up in the RAG index, appends the retrieved context to the user prompt, and sends the request with the new prompt to our ONNX model through FastAPI, which then returns the summary. The summary is appended to the UI, and the user has the option to download the summary text. Code