LLM benchmark for evaluating models on previously asked programming questions
This benchmark evaluates language models on real-world programming tasks the author has encountered. It's designed for developers and researchers who need to assess LLM capabilities beyond standard academic metrics, focusing on practical problem-solving and code generation.
How It Works
The benchmark utilizes a custom dataflow Domain Specific Language (DSL) to chain operations: prompt an LLM, execute the generated code (within a Docker container), and evaluate the output. This approach allows for complex, multi-step evaluations, including using another LLM to judge code output or comparing generated images against reference solutions. The DSL is designed for ease of adding new, realistic tests.
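To make the pipeline concrete, the sketch below shows how a dataflow DSL of this kind can chain stages with an overloaded >> operator: prompt a model, extract the code from its answer, execute it, and check the output. The class names (Prompt, LLMRun, ExtractCode, PythonRun, SubstringEvaluator) and the in-process exec() are illustrative stand-ins, not the benchmark's actual API; the real harness executes generated code inside a Docker container.

```python
# Illustrative sketch only: node and operator names are hypothetical.
# It shows the general idea of a dataflow DSL where ">>" chains stages:
# prompt -> run LLM -> extract code -> execute -> evaluate.

class Node:
    def __rshift__(self, other):            # "a >> b" builds a pipeline
        return Pipeline([self, other])

    def run(self, value):
        raise NotImplementedError


class Pipeline(Node):
    def __init__(self, stages):
        self.stages = stages

    def __rshift__(self, other):             # extend an existing chain
        return Pipeline(self.stages + [other])

    def run(self, value=None):
        for stage in self.stages:             # pass each stage's output on
            value = stage.run(value)
        return value


class Prompt(Node):
    def __init__(self, text):
        self.text = text

    def run(self, _):
        return self.text


class LLMRun(Node):
    def run(self, prompt):
        # Placeholder: call whichever model client you actually use here.
        return "```python\nprint('hello world')\n```"


class ExtractCode(Node):
    def run(self, completion):
        # Pull the first fenced Python block out of the model's answer.
        return completion.split("```python\n")[1].split("```")[0]


class PythonRun(Node):
    def run(self, code):
        # The real benchmark runs code inside Docker; exec() is a stand-in.
        import io, contextlib
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue()


class SubstringEvaluator(Node):
    def __init__(self, needle):
        self.needle = needle

    def run(self, output):
        return self.needle in output


# A test case is just a chained expression; run() walks it left to right.
test = (
    Prompt("Write a Python program that prints 'hello world'.")
    >> LLMRun()
    >> ExtractCode()
    >> PythonRun()
    >> SubstringEvaluator("hello world")
)
print(test.run())   # True if the generated program printed the substring
```

Because each test is a single expression over small reusable stages, adding a new realistic test is mostly a matter of writing the prompt and picking an evaluator; stages such as an LLM-as-judge comparison or an image diff slot in the same way.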
Quick Start & Requirements
Install the dependencies:

pip install -r requirements.txt

and optionally:

pip install -r requirements-extra.txt

Configuration, including API keys, is read from config.json; OpenAI API keys are required for the secondary evaluations. Run the benchmark with:

python main.py --model <model_name> --run-tests --generate-report
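As a minimal sketch of how a harness might pick up keys from config.json, assuming a flat schema with placeholder key names ("openai_api_key") that may not match the real file:

```python
# Hypothetical example: the actual config.json schema may differ.
import json
import os

with open("config.json") as f:
    cfg = json.load(f)

# Prefer an environment variable, fall back to the config file.
openai_key = os.environ.get("OPENAI_API_KEY") or cfg.get("openai_api_key")
if openai_key is None:
    raise SystemExit("An OpenAI key is needed for the secondary (judge) evaluations.")
```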
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The benchmark is explicitly not intended for rigorous academic comparison or determining which model is "better" overall, as prompts are not optimized and test cases may be ambiguous or rely on recent knowledge. Failing a test provides limited insight, whereas passing demonstrates specific, verifiable capabilities.