KnitSpace is an automated testing harness designed to evaluate and compare the capabilities of various Large Language Models (LLMs) across a diverse set of tasks. It provides a comprehensive framework for researchers and developers to assess LLM performance in areas such as problem-solving, knowledge retrieval, coding proficiency, and safety.
Standard benchmarks for evaluating large language models often suffer from leakage and fail to account for computational efficiency. We propose Efficiency of Language Output (ELO), a new evaluation framework grounded in the Euler–Lagrange formulation of the principle of least action, which ranks models by how accurately and efficiently they solve language tasks. Each task belongs to a small set of abstract, verifiable formats and is instantiated at runtime to ensure novelty and prevent memorization. Model outputs are treated as trajectories through embedding space, formed by concatenating token vectors during generation. We define a single action functional over this trajectory, composed of a kinetic term (based on embedding transitions and model FLOPs) and a potential term (proportional to negative log-likelihood). These components are not evaluated separately, but combined into one scalar score that reflects the effort and confidence required to produce an answer. The final ELO rating is computed as the reciprocal of a weighted sum of action scores across tasks, where weights reflect task importance. While current experiments use GPT-2 embeddings and FLOPs as a public proxy, the framework generalizes to any autoregressive architecture. The full system is open-sourced and installable via pip install ks-llm-ranker, enabling model comparison, task extension, and reproducible evaluation. ELO offers a scalable, theoretically principled alternative to conventional LLM benchmarks.
- Multi-LLM Support: Integrates with OpenAI, Google, Cohere, Mistral, and more.
- Diverse Test Suite: Includes mathematical reasoning, coding tasks, knowledge tests (MMLU), long-context, instruction-following, and obfuscation-based tests.
- Elo Rating System: Scores models using task difficulty and a cognitive cost metric ("S-value") for nuanced benchmarking.
- Secure Code Execution: Uses Docker containers to safely execute LLM-generated Python/JS code.
- Text Obfuscation: Tests reasoning under character-mapped distortions.
- Interactive Review: Launch a web-based viewer for test results.
- Extensible: Easily add new LLM providers and new types of tests.
We model text generation as a physical system traversing a high-dimensional semantic space. Each step contributes kinetic and potential energy, and the Total Action quantifies the overall generation effort.
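The precise definitions live in the design doc linked below; the form shown here is an illustrative sketch consistent with the abstract, where the kinetic term comes from embedding transitions and model FLOPs, the potential term is proportional to negative log-likelihood, and $\alpha, \beta$ are assumed scaling constants:

$$
S \;=\; \sum_{t=1}^{T}\Big(\alpha\,\lVert \mathbf{e}_t-\mathbf{e}_{t-1}\rVert^{2}\,\mathrm{FLOPs}_t \;-\; \beta\,\log p_\theta(x_t \mid x_{<t})\Big)
$$

Here $\mathbf{e}_t$ is the embedding of token $x_t$. The final rating is the reciprocal of a weighted sum of per-task action scores, with $w_i$ the importance weight of task $i$:

$$
\mathrm{ELO} \;=\; \frac{1}{\sum_i w_i\, S_i}
$$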
📄 Read the full design doc here (PDF)
- Python 3.8+
- Docker (for coding tasks)
- Git
```bash
git clone https://github.com/C-you-know/Action-Based-LLM-Testing-Harness.git KnitSpace-LLM-Ranker
cd KnitSpace-LLM-Ranker
python -m venv venv
source venv/bin/activate  # (Windows: venv\Scripts\activate)
pip install -r requirements.txt  # Or manually install dependencies
```
Set the following environment variables based on the providers you wish to use:
```bash
export OPENAI_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export COHERE_API_KEY="..."

# Cloudflare-specific
export CLOUDFLARE_API_KEY="..."
export CLOUDFLARE_ACCOUNT_ID="..."
```
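Before launching a run, it can help to confirm which provider keys are actually visible to the process. The snippet below is a small illustrative check, not part of the package; the key names match the variables listed above.

```python
# Illustrative helper (not part of the package): report which provider
# keys are present in the environment before running the harness.
import os

PROVIDER_KEYS = {
    "OpenAI": ["OPENAI_API_KEY"],
    "Google (Gemini)": ["GEMINI_API_KEY"],
    "Mistral": ["MISTRAL_API_KEY"],
    "Cohere": ["COHERE_API_KEY"],
    "Cloudflare": ["CLOUDFLARE_API_KEY", "CLOUDFLARE_ACCOUNT_ID"],
}

for provider, keys in PROVIDER_KEYS.items():
    missing = [k for k in keys if not os.environ.get(k)]
    status = "ready" if not missing else f"missing {', '.join(missing)}"
    print(f"{provider}: {status}")
```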
- Configure:
  - Choose the model/provider in `verify-auto.py` (a hypothetical example follows this list)
  - Select the tests to run in the `test_cases` list
- Run:
  - `python verify-auto.py`
- View:
  - Console logs test stats
  - Web UI opens at http://localhost:8000
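The layout of `verify-auto.py` is not reproduced here; the snippet below is only a hypothetical illustration of what the model/provider selection and the `test_cases` list might look like. All variable names and test identifiers shown are assumptions, so adapt them to the shipped script.

```python
# Hypothetical configuration block for verify-auto.py; the real script may
# use different variable names and test identifiers.
PROVIDER = "openai"           # any supported provider: openai, gemini, mistral, ...
MODEL_NAME = "gpt-4o-mini"    # model identifier understood by that provider

# Subset of tests to run; comment entries out to skip them.
test_cases = [
    "math_reasoning",
    "coding",
    "mmlu_knowledge",
    "obfuscation",
]
```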
Use `QA-test.py` to inspect generated test data without invoking an LLM:

```bash
python QA-test.py
```
- Subclass `Model` in `knit_space/models.py` (see the sketch below)
- Implement:
  - `_initialize_client()`
  - `inference(...)`
- Update:
  - `PROVIDER_CLASS_MAP`
  - `_get_api_key_for_provider()`
  - and optionally `_list_api_models()`
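A minimal sketch of a new provider class is shown below, assuming the base class and hooks named above; the real signatures in `knit_space/models.py` may differ, and `MYPROVIDER_API_KEY` plus the endpoint URL are placeholders.

```python
# Sketch only: the actual Model base class in knit_space/models.py defines
# the real hook signatures; this shows the general shape of an integration.
import os
import requests

from knit_space.models import Model  # assumed import path

class MyProviderModel(Model):
    """Hypothetical provider integration."""

    def _initialize_client(self):
        # Store whatever the provider needs for authenticated requests.
        self._api_key = os.environ["MYPROVIDER_API_KEY"]           # placeholder key name
        self._endpoint = "https://api.myprovider.example/v1/chat"  # placeholder endpoint

    def inference(self, prompt, **kwargs):
        # Send the prompt to the provider and return the generated text.
        resp = requests.post(
            self._endpoint,
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={"prompt": prompt, **kwargs},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["text"]
```

After defining the class, map the provider name to it in `PROVIDER_CLASS_MAP` and add its key lookup to `_get_api_key_for_provider()`, as listed above.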
- Create a new file in `knit_space/tests/`
- Subclass `AbstractQATest`
- Implement `generate()` to yield `QAItem`s (see the sketch below)
- Optionally register using `@register_test()`
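As a rough illustration, a minimal test might look like the sketch below. The `QAItem` field names and the `generate()` signature are assumptions; check the existing tests in `knit_space/tests/` for the actual interface.

```python
# Sketch of a new test type; QAItem's actual fields and the decorator's
# arguments may differ from what is shown here.
import random

from knit_space.tests import AbstractQATest, QAItem, register_test  # assumed imports

@register_test()
class TwoDigitAdditionTest(AbstractQATest):
    """Generates fresh two-digit addition problems at runtime."""

    def generate(self, count=10):
        for _ in range(count):
            a, b = random.randint(10, 99), random.randint(10, 99)
            yield QAItem(
                question=f"What is {a} + {b}?",
                answer=str(a + b),
            )
```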
You can also install this project as a pip package (once published):

```bash
pip install ks-llm-ranker
```