KnitSpace is an automated testing harness designed to evaluate and compare the capabilities of various Large Language Models (LLMs) across a diverse set of tasks. It provides a comprehensive framework for researchers and developers to assess LLM performance in areas such as problem-solving, knowledge retrieval, coding proficiency, and safety.
Standard benchmarks for evaluating large language models often suffer from leakage and fail to account for computational efficiency. We propose Efficiency of Language Output (ELO), a new evaluation framework grounded in the Euler–Lagrange formulation of the principle of least action, which ranks models by how accurately and efficiently they solve language tasks. Each task belongs to a small set of abstract, verifiable formats and is instantiated at runtime to ensure novelty and prevent memorization. Model outputs are treated as trajectories through embedding space, formed by concatenating token vectors during generation. We define a single action functional over this trajectory, composed of a kinetic term (based on embedding transitions and model FLOPs) and a potential term (proportional to negative log-likelihood). These components are not evaluated separately, but combined into one scalar score that reflects the effort and confidence required to produce an answer. The final ELO rating is computed as the reciprocal of a weighted sum of action scores across tasks, where weights reflect task importance. While current experiments use GPT-2 embeddings and FLOPs as a public proxy, the framework generalizes to any autoregressive architecture. The full system is open-sourced and installable via pip install ks-llm-ranker, enabling model comparison, task extension, and reproducible evaluation. ELO offers a scalable, theoretically principled alternative to conventional LLM benchmarks.
- Multi-LLM Support: Integrates with OpenAI, Google, Cohere, Mistral, and more.
- Diverse Test Suite: Includes mathematical reasoning, coding tasks, knowledge tests (MMLU), long-context, instruction-following, and obfuscation-based tests.
- Elo Rating System: Scores models using task difficulty and a cognitive cost metric ("S-value") for nuanced benchmarking.
- Secure Code Execution: Uses Docker containers to safely execute LLM-generated Python/JS code.
- Text Obfuscation: Tests reasoning under character-mapped distortions.
- Interactive Review: Launch a web-based viewer for test results.
- Extensible: Easily add new LLM providers and new types of tests.
We model text generation as a physical system traversing a high-dimensional semantic space. Each step contributes kinetic and potential energy, and the Total Action quantifies the overall generation effort.
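The precise definitions live in the design doc linked below; the form shown here is an illustrative sketch consistent with the abstract, where the kinetic term comes from embedding transitions and model FLOPs, the potential term is proportional to negative log-likelihood, and $\alpha, \beta$ are assumed scaling constants:

$$
S \;=\; \sum_{t=1}^{T}\Big(\alpha\,\lVert \mathbf{e}_t-\mathbf{e}_{t-1}\rVert^{2}\,\mathrm{FLOPs}_t \;-\; \beta\,\log p_\theta(x_t \mid x_{<t})\Big)
$$

Here $\mathbf{e}_t$ is the embedding of token $x_t$. The final rating is the reciprocal of a weighted sum of per-task action scores, with $w_i$ the importance weight of task $i$:

$$
\mathrm{ELO} \;=\; \frac{1}{\sum_i w_i\, S_i}
$$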
📄 Read the full design doc here (PDF)
- Python 3.8+
- Docker (for coding tasks)
- Git
```bash
git clone https://github.com/C-you-know/Action-Based-LLM-Testing-Harness.git KnitSpace-LLM-Ranker
cd KnitSpace-LLM-Ranker
python -m venv venv
source venv/bin/activate  # (Windows: venv\Scripts\activate)
pip install -r requirements.txt  # Or manually install dependencies
```
Set the following environment variables based on the providers you wish to use:
```bash
export OPENAI_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export COHERE_API_KEY="..."

# Cloudflare-specific
export CLOUDFLARE_API_KEY="..."
export CLOUDFLARE_ACCOUNT_ID="..."
```
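Before launching a run, it can help to confirm which provider keys are actually visible to the process. The snippet below is a small illustrative check, not part of the package; the key names match the variables listed above.

```python
# Illustrative helper (not part of the package): report which provider
# keys are present in the environment before running the harness.
import os

PROVIDER_KEYS = {
    "OpenAI": ["OPENAI_API_KEY"],
    "Google (Gemini)": ["GEMINI_API_KEY"],
    "Mistral": ["MISTRAL_API_KEY"],
    "Cohere": ["COHERE_API_KEY"],
    "Cloudflare": ["CLOUDFLARE_API_KEY", "CLOUDFLARE_ACCOUNT_ID"],
}

for provider, keys in PROVIDER_KEYS.items():
    missing = [k for k in keys if not os.environ.get(k)]
    status = "ready" if not missing else f"missing {', '.join(missing)}"
    print(f"{provider}: {status}")
```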
- Configure:
  - Choose the model/provider in `verify-auto.py` (a hypothetical example follows this list)
  - Select the tests to run in the `test_cases` list
- Run:
  - `python verify-auto.py`
- View:
  - Console logs test stats
  - Web UI opens at http://localhost:8000
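The layout of `verify-auto.py` is not reproduced here; the snippet below is only a hypothetical illustration of what the model/provider selection and the `test_cases` list might look like. All variable names and test identifiers shown are assumptions, so adapt them to the shipped script.

```python
# Hypothetical configuration block for verify-auto.py; the real script may
# use different variable names and test identifiers.
PROVIDER = "openai"           # any supported provider: openai, gemini, mistral, ...
MODEL_NAME = "gpt-4o-mini"    # model identifier understood by that provider

# Subset of tests to run; comment entries out to skip them.
test_cases = [
    "math_reasoning",
    "coding",
    "mmlu_knowledge",
    "obfuscation",
]
```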
Use `QA-test.py` to inspect generated test data without invoking an LLM:

```bash
python QA-test.py
```
- Subclass `Model` in `knit_space/models.py` (see the sketch below)
- Implement:
  - `_initialize_client()`
  - `inference(...)`
- Update:
  - `PROVIDER_CLASS_MAP`
  - `_get_api_key_for_provider()`
  - and optionally `_list_api_models()`
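A minimal sketch of a new provider class is shown below, assuming the base class and hooks named above; the real signatures in `knit_space/models.py` may differ, and `MYPROVIDER_API_KEY` plus the endpoint URL are placeholders.

```python
# Sketch only: the actual Model base class in knit_space/models.py defines
# the real hook signatures; this shows the general shape of an integration.
import os
import requests

from knit_space.models import Model  # assumed import path

class MyProviderModel(Model):
    """Hypothetical provider integration."""

    def _initialize_client(self):
        # Store whatever the provider needs for authenticated requests.
        self._api_key = os.environ["MYPROVIDER_API_KEY"]           # placeholder key name
        self._endpoint = "https://api.myprovider.example/v1/chat"  # placeholder endpoint

    def inference(self, prompt, **kwargs):
        # Send the prompt to the provider and return the generated text.
        resp = requests.post(
            self._endpoint,
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={"prompt": prompt, **kwargs},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["text"]
```

After defining the class, map the provider name to it in `PROVIDER_CLASS_MAP` and add its key lookup to `_get_api_key_for_provider()`, as listed above.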
- Create a new file in `knit_space/tests/`
- Subclass `AbstractQATest`
- Implement `generate()` to yield `QAItem`s (see the sketch below)
- Optionally register using `@register_test()`
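As a rough illustration, a minimal test might look like the sketch below. The `QAItem` field names and the `generate()` signature are assumptions; check the existing tests in `knit_space/tests/` for the actual interface.

```python
# Sketch of a new test type; QAItem's actual fields and the decorator's
# arguments may differ from what is shown here.
import random

from knit_space.tests import AbstractQATest, QAItem, register_test  # assumed imports

@register_test()
class TwoDigitAdditionTest(AbstractQATest):
    """Generates fresh two-digit addition problems at runtime."""

    def generate(self, count=10):
        for _ in range(count):
            a, b = random.randint(10, 99), random.randint(10, 99)
            yield QAItem(
                question=f"What is {a} + {b}?",
                answer=str(a + b),
            )
```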
You can also install this project as a pip package (once published):

```bash
pip install ks-llm-ranker
```