AI coding model evaluation framework
This repository provides a framework for evaluating the coding capabilities of AI models through self-assessing interviews. It targets AI researchers and developers seeking to benchmark LLMs on coding tasks, offering a standardized method to measure performance across various models and prompting strategies.
How It Works
The system uses human-written interview questions in YAML format, which are then transformed into model-specific prompts. AI models generate code responses, which are executed within a secure Docker sandbox. Evaluation is automated using predefined checks that assert expected behaviors and outputs, allowing for objective performance measurement.
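As a rough sketch of the check-driven evaluation described above, the snippet below tallies how many expected-output assertions a sandbox run satisfies. The check structure ("assert", "eq") and the shape of the sandbox output are assumptions for illustration, not the repository's actual schema.

    # Minimal sketch, assuming the sandbox runner returns a dict mapping the
    # calls it executed to their captured results; field names are illustrative.
    def run_checks(sandbox_output: dict, checks: list[dict]) -> dict:
        """Count how many expected-behavior assertions the generated code satisfied."""
        passed = 0
        for check in checks:
            actual = sandbox_output.get(check["assert"])  # value captured in the sandbox
            if actual == check["eq"]:                     # compare against expected value
                passed += 1
        return {"passed": passed, "total": len(checks)}

    # Example: two checks against the captured output of a generated function f.
    output = {"f(1, 2)": 3, "f(-1, 1)": 0}
    checks = [
        {"assert": "f(1, 2)", "eq": 3},
        {"assert": "f(-1, 1)", "eq": 0},
    ]
    print(run_checks(output, checks))  # -> {'passed': 2, 'total': 2}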
Quick Start & Requirements
pip install streamlit==1.23
streamlit run app.py
or, for the comparison app:
streamlit run compare-app.py
Highlighted Details
Includes interview suites junior-v2 (Python, JavaScript) and humaneval (Python).
Results are stored in .ndjson format, enabling iterative evaluation and comparison.
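Because each result is one JSON object per line, comparing runs is straightforward. The sketch below computes per-model pass rates from a directory of .ndjson files; the directory name and record fields ("model", "passed", "total") are assumptions for illustration, not the repository's actual schema.

    import json
    from collections import defaultdict
    from pathlib import Path

    # Hypothetical sketch: tally pass rates per model across .ndjson result files.
    def pass_rates(result_dir: str) -> dict[str, float]:
        passed, total = defaultdict(int), defaultdict(int)
        for path in Path(result_dir).glob("*.ndjson"):
            with path.open() as fh:
                for line in fh:                    # one JSON record per line
                    record = json.loads(line)
                    passed[record["model"]] += record["passed"]
                    total[record["model"]] += record["total"]
        return {m: passed[m] / total[m] for m in total if total[m]}

    # Example: print a simple leaderboard sorted by pass rate.
    for model, rate in sorted(pass_rates("results").items(), key=lambda x: -x[1]):
        print(f"{model:30s} {rate:.1%}")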
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats