This repository contains the code and configuration for the LLM Legal Document Summarization project, deployed on Chameleon Cloud. Follow the sections below to understand the system lifecycle from data ingestion to production serving, and see links to the specific implementation files.
Target Customer: Legal analysts at corporate law firms who need fast, accurate summaries of incoming legal documents to accelerate review.
- Customer Details:
- Receives >100 documents/day (.pdf, .docx)
- Needs to look up previous judgments and their summaries by keyword search
- Requires summary within minutes of upload
- Ground-truth labels (expert summaries) available after review
Design Influences: Data size, latency requirements, retraining frequency.
- Offline Data: ~10 GB of raw documents (~20K files) from the Zenodo data source, plus ~3.3K files of case data from the Kaggle data source deepcontractor/supreme-court-judgment-prediction
- Model Size: Fine-tuned Llama-2-7B; training uses 2×A100 GPUs
- Deployment Throughput: ~500 inference requests/day (roughly 1 req/min during working hours)
Provisioning and configuration via Terraform and Ansible:
- Terraform:
Terraform configurations, variables, and settings (Day 0)
- Ansible Playbooks:
Ansible notebooks
- Argo CD:
Argo CD notebooks for 3 environments
On Chameleon:
- Object Store:
Structure and contents:
├── production.jsonl
├── test.jsonl
└── train.jsonl
- Block Volume: Notebook with instructions to create and partition the volume, add a file system, access it, and run containers on it
We initially created a 50 GiB block volume, later extended to 100 GiB to store our ONNX model, RAG data, etc.
Structure and contents:
block-persist-project33
├── minio_data
├── mlflow-artifacts
├── ray
├── postgres_data
└── rag_data
    ├── model_rag
    │   ├── index_to_doc.pkl
    │   └── legal-facts.index
    └── rag_chunks
The mlflow-artifacts folder contains all artifacts generated during training and serving.
The ray folder contains Ray Train checkpoints.
The postgres_data folder contains the PostgreSQL data directory.
The rag_data folder contains all data for our RAG pipeline (embedding model sentence-transformers/all-MiniLM-L6-v2): the document chunks, the FAISS vector index, and the index-to-document mapping.
We use the Zenodo Indian & UK Legal Judgments Dataset containing ~20K court cases and corresponding human-written summaries.
- Sources: IN-Abs, UK-Abs, and IN-Ext
- Data Size: ~10 GB total; over 20,000 legal documents and associated summaries, from Zenodo.
- Format: Paired .txt files for full judgments and summaries
{
  "filename": "UKCiv2012.txt",
  "judgement": "The claimant seeks damages following breach of contract. The court heard evidence from both parties. After reviewing the statutory framework and case law precedent, the court finds that the defendant did not fulfill their obligations...",
  "summary": "The defendant breached the contract. The court awarded damages to the claimant.",
  "meta": {
    "doc_words": 2176,
    "sum_words": 132,
    "ratio": 0.06
  }
}
Our target user (a legal analyst at a law firm) regularly deals with such long-form judicial decisions. The Zenodo dataset closely mirrors their real-world workflow:
- They review lengthy judgment documents daily.
- They generate or consume summaries internally for client reporting.
- Our model mimics this process by learning from historic summaries.
Production samples (the 10% production split):
- Contain no ground-truth summaries at inference time.
- In a deployed setting, these represent new, unseen judgments uploaded by users.
- Once reviewed by a human expert, feedback summaries can be used to retrain the model, closing the feedback loop.
Steps handled in data_preprocessing.py:
- Ingestion: Load documents from raw folders.
- Merging: Combine segment-wise summaries if full summary not available.
- Cleaning: Normalize unicode, remove extra whitespace, lowercase.
- Sanity checks: Remove empty/duplicate/missing files.
- Filtering: Retain samples with 50–1500 summary words and acceptable doc:summary ratios.
- Split: 70% train, 20% test, 10% production, written to *.jsonl files (see the sketch after the diagram below).
┌──────────────────────────────┐
│ Raw Zenodo Dataset │
│ (/data/raw/* subfolders) │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Ingestion & File Loading │
│ - Load judgment + summary │
│ - Handle IN-Abs, UK-Abs, │
│ IN-Ext variants │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Merging Segment-wise │
│ - Combine partial summaries │
│ (facts, statute, etc.) │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Cleaning Text │
│ - Unicode normalization │
│ - Lowercasing │
│ - Remove extra whitespace │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Sanity Checks │
│ - Remove empty/missing files │
│ - Check for duplicates │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Statistical Filter │
│ - 50–1500 summary words │
│ - Ratio: 1–50% of doc length │
└────────────┬─────────────────┘
│
▼
┌──────────────────────────────┐
│ Split & Dump │
│ - 70% train │
│ - 20% test │
│ - 10% production │
│ → Output as `.jsonl` files │
└──────────────────────────────┘
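A minimal sketch of the filter-and-split stage (the actual logic lives in data_preprocessing.py; `keep` and `split_and_dump` are illustrative names, and the record layout follows the sample JSON above):

```python
import json
import random

def keep(record):
    """Statistical filter: 50-1500 summary words and a summary that is
    1-50% of the document length, per the diagram above."""
    doc_words = len(record["judgement"].split())
    sum_words = len(record["summary"].split())
    ratio = sum_words / max(doc_words, 1)
    return 50 <= sum_words <= 1500 and 0.01 <= ratio <= 0.50

def split_and_dump(records, seed=42):
    """Shuffle, then write 70/20/10 train/test/production JSONL files."""
    random.Random(seed).shuffle(records)
    n = len(records)
    splits = {
        "train.jsonl": records[: int(0.7 * n)],
        "test.jsonl": records[int(0.7 * n): int(0.9 * n)],
        "production.jsonl": records[int(0.9 * n):],
    }
    for name, subset in splits.items():
        with open(name, "w", encoding="utf-8") as f:
            for rec in subset:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# usage: split_and_dump([r for r in all_records if keep(r)])
```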
- Download & Extract
  - Use the Kaggle API to pull deepcontractor/supreme-court-judgment-prediction into /mnt/block/rag_data and unzip it.
- Load & Inspect
  - Read justice.csv with Pandas to verify the row count and columns (name, facts, etc.).
- Clean & Serialize
  - Normalize newlines, strip empty lines, and write each case's facts to rag_txt/{idx}_{safe_name}.txt.
- Chunk Documents
  - Tokenize with sentence-transformers/all-MiniLM-L6-v2 (512-token window, 64-token overlap).
  - Save each piece to rag_chunks/{original}_chunkXXX.txt (see the sketch below).
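An illustrative chunker under those settings (paths follow the layout above; `chunk_file` is a hypothetical helper, and the exact tokenizer options may differ from the project code):

```python
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_file(src: Path, out_dir: Path, window: int = 512, overlap: int = 64):
    """Slide a 512-token window with 64-token overlap over one document."""
    ids = tokenizer.encode(src.read_text(encoding="utf-8"), add_special_tokens=False)
    for i, start in enumerate(range(0, len(ids), window - overlap)):
        piece = tokenizer.decode(ids[start:start + window])
        (out_dir / f"{src.stem}_chunk{i:03d}.txt").write_text(piece, encoding="utf-8")

for txt in sorted(Path("rag_txt").glob("*.txt")):
    chunk_file(txt, Path("rag_chunks"))
```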
- Embed & Index
  - Encode chunks via SentenceTransformer.
  - Build a FAISS L2 index over the vectors.
  - Persist model_rag/legal-facts.index and model_rag/index_to_doc.pkl (sketch below).
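A sketch of this step, assuming the chunks written above are on disk (IndexFlatL2 matches the L2 index named here; batching and dtype handling are illustrative):

```python
import pickle
from pathlib import Path

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_paths = sorted(Path("rag_chunks").glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in chunk_paths]

# Encode all chunks, then build a flat L2 index over the vectors.
vectors = model.encode(texts, convert_to_numpy=True).astype(np.float32)
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Persist the index and the row -> source-chunk mapping.
faiss.write_index(index, "model_rag/legal-facts.index")
with open("model_rag/index_to_doc.pkl", "wb") as f:
    pickle.dump({i: str(p) for i, p in enumerate(chunk_paths)}, f)
```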
- Query-Time Retrieval
  - Embed the user query, FAISS search → top-K chunks.
  - Load the snippets, assemble the prompt, and send it to the fine-tuned Llama-2 for the final summary (sketch below).
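A hedged sketch of query-time retrieval; `top_k` and the snippet-joining format are assumptions, not the project's exact prompt template:

```python
import pickle
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.read_index("model_rag/legal-facts.index")
with open("model_rag/index_to_doc.pkl", "rb") as f:
    index_to_doc = pickle.load(f)

def retrieve(query: str, top_k: int = 5) -> str:
    """Embed the query, search FAISS, and return the assembled context."""
    vec = model.encode([query], convert_to_numpy=True).astype("float32")
    _, ids = index.search(vec, top_k)
    snippets = [Path(index_to_doc[i]).read_text(encoding="utf-8") for i in ids[0]]
    return "\n\n".join(snippets)

# The retrieved context is appended to the user prompt and sent to the
# fine-tuned Llama-2 endpoint for the final summary.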
- We spin up our Ray head and worker nodes (each with 1×A100 GPU) using a small Jupyter notebook: Ray-Train/start_ray
- We use the following notebook to submit our Ray job: Ray-Train/submit_ray
- Training script: Ray-Train/sft_train_llama
- Frameworks: PyTorch Lightning, Ray Train (DDP + fault tolerance), PEFT (LoRA), and MLflow for experiment tracking (see the sketch below)
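A minimal sketch of how such a Ray Train setup typically looks (2 workers with 1 GPU each, per the cluster above); `train_func` stands in for the Lightning fit loop in Ray-Train/sft_train_llama, and the failure-policy values are illustrative:

```python
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config=None):
    """Placeholder for the Lightning fit loop in sft_train_llama."""
    ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # 2 x A100 workers
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),  # fault tolerance
)
result = trainer.fit()
```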
- Checkpointing:
  - We save both the best val_loss checkpoint and the last epoch into ./checkpoints/ via Lightning's ModelCheckpoint(save_top_k=1, save_last=True) callback (sketch below).
  - On worker restarts, Ray supplies the last checkpoint directory and Lightning resumes from checkpoints/last.ckpt.
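A sketch of the callback configuration described above (the dirpath and monitor values mirror this README; mode is an assumed default):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    dirpath="./checkpoints/",
    monitor="val_loss",  # keep the single best checkpoint by val_loss
    mode="min",
    save_top_k=1,
    save_last=True,      # also write checkpoints/last.ckpt for resumption
)

# On restart, Lightning resumes from the last checkpoint, e.g.:
# trainer.fit(model, datamodule=dm, ckpt_path="checkpoints/last.ckpt")
```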
- Logging:
  - Metrics (train/val loss, epochs) are automatically logged to MLflow via the MLFlowLogger.
- Compare runs in mlruns/
- Merged the trained LoRA adapters into the Llama-2-7B base and exported the combined model as an FP16 ONNX file.
- Ran ONNX Runtime with the CPU, CUDA, and TensorRT execution providers, then selected the fastest execution path (see the sketch below).
Code for this step
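A minimal sketch of such a provider selection: time the same session across the available providers and keep the fastest. `model.onnx` and `sample_feed` are placeholders, not the project's actual paths or inputs:

```python
import time
import onnxruntime as ort

def fastest_provider(model_path: str, sample_feed: dict, runs: int = 5):
    """Benchmark each available execution provider and return the fastest."""
    candidates = ["TensorrtExecutionProvider", "CUDAExecutionProvider",
                  "CPUExecutionProvider"]
    timings = {}
    for provider in candidates:
        if provider not in ort.get_available_providers():
            continue
        sess = ort.InferenceSession(model_path, providers=[provider])
        sess.run(None, sample_feed)  # warm-up (builds TensorRT engines, etc.)
        start = time.perf_counter()
        for _ in range(runs):
            sess.run(None, sample_feed)
        timings[provider] = (time.perf_counter() - start) / runs
    return min(timings, key=timings.get), timings

# usage: best, times = fastest_provider("model.onnx", {"input_ids": ids})
```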
- Registered the resulting model in MLflow as a checkpoint, which the FastAPI endpoint then pulls for inference.
Code for the FastAPI service
- Dockerfile:
Dockerfile
- Input: User Prompt appended with RAG output
- Output: summary text
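A hedged sketch of what this input/output contract implies for the /generate endpoint; `run_onnx_summarizer` is a placeholder for the project's ONNX Runtime inference call, and the field names are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str  # user prompt, already appended with the RAG output

class GenerateResponse(BaseModel):
    summary: str

def run_onnx_summarizer(prompt: str) -> str:
    """Placeholder for the ONNX Runtime session call described above."""
    raise NotImplementedError

@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest) -> GenerateResponse:
    return GenerateResponse(summary=run_onnx_summarizer(req.prompt))
```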
- Ran the PyTest script (tests/test_offline_eval.py) to validate end-to-end preprocessing, inference, and summary format on sample inputs.
Monitoring_and_Evaluation/1_Setup_ModelEvalAndMonitoring.ipynb
- Executed the finalized model on the held-out test set to compute ROUGE metrics, then logged all scores to MLflow against the checkpoint registered in Section 9.1.
- Ran a Locust simulation against the /generate endpoint while monitoring throughput, latency, and errors in Grafana's "FastAPI Load Test" dashboard (a minimal user class is sketched below).
Notebook for load testing
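A minimal Locust user for this endpoint; the wait times and sample payload are assumptions, not the project's exact test script:

```python
from locust import HttpUser, between, task

class SummaryUser(HttpUser):
    wait_time = between(1, 5)  # seconds between simulated requests

    @task
    def generate_summary(self):
        self.client.post("/generate", json={
            "prompt": "Summarize: The claimant seeks damages following "
                      "breach of contract...",
        })

# Run: locust -f locustfile.py --host http://<staging-host>:8000
```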
- Evaluation plan:
docs/business_eval.md
- Staging deployment:
Staging deployment workflow
- Monitoring Dashboards: Grafana config
- Closing the feedback loop: LabelStudio
- Prometheus Dashboard: Dashboard
- Grafana Dashboard: Dashboard
- GitHub Actions workflow:
CI git merge test
- Triggers: push to main → tests → build Docker images → deploy to staging
- Flask App: We have a Flask app that takes input from the user, looks it up in the RAG index, appends the retrieved context to the user prompt, and sends the request with the new prompt to our ONNX model through FastAPI, which then returns the summary. The summary is appended to the UI, and the user has the option to download the summary text. Code