the-crypt-keeper: AI coding model evaluation framework
Top 54.6% on SourcePulse
This repository provides a framework for evaluating the coding capabilities of AI models through self-assessing interviews. It targets AI researchers and developers seeking to benchmark LLMs on coding tasks, offering a standardized method to measure performance across various models and prompting strategies.
How It Works
The system uses human-written interview questions in YAML format, which are then transformed into model-specific prompts. AI models generate code responses, which are executed within a secure Docker sandbox. Evaluation is automated using predefined checks that assert expected behaviors and outputs, allowing for objective performance measurement.
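To make that flow concrete, here is a minimal sketch of one interview-and-check cycle. The question dictionary, the check format, and the subprocess call standing in for the project's Docker sandbox are illustrative assumptions rather than the framework's actual code.

```python
import subprocess
import tempfile
import os

# Hypothetical interview question mirroring the structure described above:
# a prompt for the model plus checks asserting expected behavior of its answer.
question = {
    "name": "add-two-numbers",
    "prompt": "Write a Python function add(a, b) that returns the sum of a and b.",
    "checks": [
        {"call": "add(2, 3)", "expect": "5"},
        {"call": "add(-1, 1)", "expect": "0"},
    ],
}

# Stand-in for the model's generated answer; in the real framework this comes
# from the LLM after the YAML question is turned into a model-specific prompt.
model_answer = "def add(a, b):\n    return a + b\n"

def run_check(code: str, call: str) -> str:
    """Run the generated code plus one check call in a separate process.

    Simplified stand-in for the project's Docker sandbox: it isolates the run
    in its own Python process and captures the printed result.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + f"\nprint({call})\n")
        path = f.name
    try:
        out = subprocess.run(
            ["python3", path], capture_output=True, text=True, timeout=10
        )
        return out.stdout.strip()
    finally:
        os.unlink(path)

# Run every check and tally the objective pass/fail result.
passed = sum(
    run_check(model_answer, c["call"]) == c["expect"] for c in question["checks"]
)
print(f"{question['name']}: {passed}/{len(question['checks'])} checks passed")
```

In the framework itself the generated code runs inside a Docker container rather than a bare subprocess, keeping untrusted model output isolated from the host.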
Quick Start & Requirements
pip install streamlit==1.23
streamlit run app.py or streamlit run compare-app.py
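The Streamlit apps are presumably used to browse recorded evaluation results, which the project stores as .ndjson (newline-delimited JSON; see Highlighted Details below). A minimal sketch for inspecting such a file directly, assuming a hypothetical path and hypothetical field names ('model', 'passed', 'total'):

```python
import json

# Hypothetical results file and field names; the real files and their schema
# may differ, but .ndjson is simply one JSON object per line.
results_path = "results/eval_example.ndjson"

with open(results_path) as f:
    records = [json.loads(line) for line in f if line.strip()]

for rec in records:
    print(f"{rec['model']}: {rec['passed']}/{rec['total']} checks passed")
```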
Highlighted Details
Interview suites: junior-v2 (Python, JavaScript) and humaneval (Python).
Results are stored in .ndjson format, enabling iterative evaluation and comparison.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats