
Ranking Large Language Models using the Principle of Least Action! Built during my time at Knit Space, Hubbali, under the guidance of Prof. Prakash Hegade.


KnitSpace LLM Ranker: Automated LLM Testing Harness

KnitSpace is an automated testing harness designed to evaluate and compare the capabilities of various Large Language Models (LLMs) across a diverse set of tasks. It provides a comprehensive framework for researchers and developers to assess LLM performance in areas such as problem-solving, knowledge retrieval, coding proficiency, and safety.

Standard benchmarks for evaluating large language models often suffer from leakage and fail to account for computational efficiency. We propose Efficiency of Language Output (ELO), a new evaluation framework grounded in the Euler–Lagrange formulation of the principle of least action, which ranks models by how accurately and efficiently they solve language tasks. Each task belongs to a small set of abstract, verifiable formats and is instantiated at runtime to ensure novelty and prevent memorization. Model outputs are treated as trajectories through embedding space, formed by concatenating token vectors during generation. We define a single action functional over this trajectory, composed of a kinetic term (based on embedding transitions and model FLOPs) and a potential term (proportional to negative log-likelihood). These components are not evaluated separately, but combined into one scalar score that reflects the effort and confidence required to produce an answer. The final ELO rating is computed as the reciprocal of a weighted sum of action scores across tasks, where weights reflect task importance. While current experiments use GPT-2 embeddings and FLOPs as a public proxy, the framework generalizes to any autoregressive architecture. The full system is open-sourced and installable via pip install ks-llm-ranker, enabling model comparison, task extension, and reproducible evaluation. ELO offers a scalable, theoretically principled alternative to conventional LLM benchmarks.

🔑 Key Features

  • Multi-LLM Support: Integrates with OpenAI, Google, Cohere, Mistral, and more.
  • Diverse Test Suite: Includes mathematical reasoning, coding tasks, knowledge tests (MMLU), long-context, instruction-following, and obfuscation-based tests.
  • Elo Rating System: Scores models using task difficulty and a cognitive cost metric ("S-value") for nuanced benchmarking.
  • Secure Code Execution: Uses Docker containers to safely execute LLM-generated Python/JS code.
  • Text Obfuscation: Tests reasoning under character-mapped distortions.
  • Interactive Review: Launch a web-based viewer for test results.
  • Extensible: Easily add new LLM providers and new types of tests.

⚙️ Total Action-Based Evaluation

We model text generation as a physical system traversing high-dimensional semantic space. Each step contributes kinetic and potential energy, and the Total Action quantifies overall generation effort:

(Figure: LLM physics equations.)
📄 Read the full design doc here (PDF)
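As a rough schematic of the ideas above (the exact coefficients and normalization are in the design doc; the weights λ_K, λ_V and the squared-step kinetic form shown here are illustrative assumptions), the action and rating can be written as:

$$
S \;=\; \sum_{t=1}^{T} \Big( \lambda_K \,\mathrm{FLOPs}\cdot\lVert e_t - e_{t-1}\rVert^{2} \;-\; \lambda_V \log p_\theta(x_t \mid x_{<t}) \Big),
\qquad
\mathrm{ELO} \;=\; \frac{1}{\sum_i w_i\, S_i}
$$

where e_t is the embedding of the t-th generated token, the potential term is the negative log-likelihood, S_i is the action score on task i, and the weights w_i reflect task importance.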


⚙️ Setup

1. Prerequisites

  • Python 3.8+
  • Docker (for coding tasks)
  • Git

2. Installation

git clone https://github.com/C-you-know/Action-Based-LLM-Testing-Harness
cd Action-Based-LLM-Testing-Harness

python -m venv venv
source venv/bin/activate  # (Windows: venv\Scripts\activate)

pip install -r requirements.txt  # Or manually install dependencies

3. API Key Setup

Set the following environment variables based on the providers you wish to use:

export OPENAI_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export COHERE_API_KEY="..."
# Cloudflare-specific
export CLOUDFLARE_API_KEY="..."
export CLOUDFLARE_ACCOUNT_ID="..."
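To confirm the keys are visible before running anything, a quick standalone check like the following can help (a convenience snippet, not part of the harness itself):

import os

keys = ["OPENAI_API_KEY", "GEMINI_API_KEY", "MISTRAL_API_KEY",
        "COHERE_API_KEY", "CLOUDFLARE_API_KEY", "CLOUDFLARE_ACCOUNT_ID"]
for name in keys:
    # Report presence only; never print the key values themselves.
    print(f"{name}: {'set' if os.environ.get(name) else 'missing'}")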

🚀 Running Tests

Run via verify-auto.py

  1. Configure:

    • Choose the model/provider in verify-auto.py
    • Select tests in the test_cases list (an illustrative configuration sketch follows these steps)
  2. Run:

    python verify-auto.py
  3. View:

    • Console logs test stats
    • Web UI opens at http://localhost:8000
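The variable names inside verify-auto.py may differ between versions; conceptually, the edits from step 1 look something like this (illustrative sketch only, with hypothetical provider, model, and test identifiers):

# Near the top of verify-auto.py (names are illustrative; check the script itself)
provider = "openai"            # e.g. "openai", "gemini", "mistral", "cohere", "cloudflare"
model_name = "gpt-4o-mini"     # any model ID exposed by the chosen provider

# Tests to run, drawn from the registered test types
test_cases = [
    "math_reasoning",
    "coding",
    "mmlu_knowledge",
]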

Debug Test Inputs (optional)

Use QA-test.py to inspect generated test data without invoking an LLM:

python QA-test.py

🔌 Extending the Harness

➕ Adding New LLM Providers

  1. Subclass Model in knit_space/models.py (a sketch is shown after these steps)

  2. Implement:

    • _initialize_client()
    • inference(...)
  3. Update:

    • PROVIDER_CLASS_MAP
    • _get_api_key_for_provider() and optionally _list_api_models()
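A minimal sketch of steps 1–3 inside knit_space/models.py, assuming the provider ships a simple SDK client (MyProviderSDK, the environment-variable name, and the dict-style registration below are assumptions; mirror an existing provider class for the real signatures):

# Inside knit_space/models.py (sketch only)
import os

class MyProviderModel(Model):
    def _initialize_client(self):
        # Build and cache the provider's client; MyProviderSDK and the key name are hypothetical.
        self.client = MyProviderSDK(api_key=os.environ["MYPROVIDER_API_KEY"])

    def inference(self, prompt, **kwargs):
        # Send the prompt to the provider and return the generated text.
        response = self.client.generate(prompt=prompt, **kwargs)
        return response.text

# Step 3: make the harness aware of the new provider (assuming a dict-style map).
PROVIDER_CLASS_MAP["myprovider"] = MyProviderModel
# Also extend _get_api_key_for_provider(), and optionally _list_api_models(), to cover it.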

🧪 Adding New Test Types

  1. Create a new file in knit_space/tests/
  2. Subclass AbstractQATest
  3. Implement generate() to yield QAItems
  4. Optionally register it using @register_test() (a sketch follows below)
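A minimal sketch of a new test module, assuming AbstractQATest, QAItem, and register_test can be imported from knit_space.tests and that QAItem takes question/answer fields (check an existing test module for the actual import paths and fields):

# knit_space/tests/my_arithmetic_test.py (sketch; imports and QAItem fields are assumptions)
import random

from knit_space.tests import AbstractQATest, QAItem, register_test

@register_test()
class ArithmeticTest(AbstractQATest):
    def generate(self, count=5):
        # Instantiate fresh problems at runtime so answers cannot be memorized.
        for _ in range(count):
            a, b = random.randint(10, 99), random.randint(10, 99)
            yield QAItem(question=f"What is {a} + {b}?", answer=str(a + b))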

📦 Install as a Package

You can also install this project as a pip package from PyPI:

pip install ks-llm-ranker
