can-ai-code by the-crypt-keeper

AI coding model evaluation framework

Created 2 years ago
597 stars

Top 54.7% on SourcePulse

View on GitHub
Project Summary

This repository provides a framework for evaluating the coding capabilities of AI models through self-evaluating interviews. It targets AI researchers and developers seeking to benchmark LLMs on coding tasks, offering a standardized method to measure performance across various models and prompting strategies.

How It Works

The system uses human-written interview questions in YAML format, which are then transformed into model-specific prompts. AI models generate code responses, which are executed within a secure Docker sandbox. Evaluation is automated using predefined checks that assert expected behaviors and outputs, allowing for objective performance measurement.
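To make that flow concrete, below is a minimal Python sketch of the pipeline. The YAML field names (Signature, Question, Checks), the prompt template, and the model_generate() helper are illustrative assumptions, not the project's actual schema or API:

    import subprocess
    import yaml

    # Illustrative interview question; real suites live in YAML files in the
    # repo, and the field names used here are assumptions.
    question = yaml.safe_load("""
    Signature: "def fizzbuzz(n):"
    Question: "Return 'Fizz' for multiples of 3, 'Buzz' for 5, 'FizzBuzz' for both, else str(n)."
    Checks:
      input: [15]
      expected: "FizzBuzz"
    """)

    def model_generate(prompt: str) -> str:
        # Hypothetical stand-in for a real runtime (LiteLLM, vLLM, ...);
        # returns a canned answer so the sketch runs end to end.
        return (
            "def fizzbuzz(n):\n"
            "    s = ('Fizz' if n % 3 == 0 else '') + ('Buzz' if n % 5 == 0 else '')\n"
            "    return s or str(n)\n"
        )

    # 1. Build a model-specific prompt from the question (template is an assumption).
    prompt = f"{question['Question']}\nStart with: {question['Signature']}"

    # 2. Generate candidate code with the chosen backend.
    code = model_generate(prompt)

    # 3. Run the candidate inside a throwaway Docker container, not on the host.
    harness = code + f"\nprint(fizzbuzz(*{question['Checks']['input']}))"
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", "python:3.11-slim", "python", "-"],
        input=harness, capture_output=True, text=True, timeout=30,
    )

    # 4. Objective check: assert the sandboxed output matches the expectation.
    passed = result.stdout.strip() == question["Checks"]["expected"]
    print("PASS" if passed else "FAIL")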

Quick Start & Requirements

  • Install: pip install streamlit==1.23
  • Run webapps: streamlit run app.py or streamlit run compare-app.py
  • Dependencies: Python 3.x, Docker. CUDA-enabled runtimes (vLLM, ExLlama2, etc.) require specific GPU drivers and libraries.
  • Docs: HF Spaces Leaderboard, HF Spaces Comparisons

Highlighted Details

  • Supports multiple interview suites: junior-v2 (Python, JavaScript) and humaneval (Python).
  • Integrates with various inference runtimes: LiteLLM (API), OobaBooga, Huggingface Inference, Gradio, and numerous CUDA-accelerated backends (GGUF, GPTQ, EXL2, AWQ, FP16 via vLLM, Transformers, etc.).
  • Outputs results in .ndjson format, enabling iterative evaluation and comparison; a short sketch of consuming these results follows this list.
  • Includes Streamlit apps for local exploration of results and model comparisons.
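Because each .ndjson line is one self-contained JSON record, result files can be appended to and re-scanned incrementally. Here is a rough sketch that tallies pass rates per model; the filename and the record fields (model, passed) are assumptions, not the project's actual schema:

    import json
    from collections import Counter

    # Each line of an .ndjson results file is one JSON evaluation record.
    passed = Counter()
    total = Counter()
    with open("results/junior-v2.ndjson") as fh:   # hypothetical filename
        for line in fh:
            rec = json.loads(line)
            total[rec["model"]] += 1               # assumed field names
            passed[rec["model"]] += bool(rec["passed"])

    for model in sorted(total):
        print(f"{model}: {passed[model]}/{total[model]} checks passed")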

Maintenance & Community

  • Active development, with recent evaluations of models such as GPT-4.1, Gemma 3, and Phi-4-mini.
  • Links to external benchmarks and related projects are provided.

Licensing & Compatibility

  • The repository does not declare a license, and it uses and references projects under a variety of licenses. Users should verify compatibility before commercial use.

Limitations & Caveats

  • The "senior" test suite is marked as Work In Progress (WIP).
  • Some CUDA runtimes (AutoGPTQ, CTranslate2) have compatibility issues with the provided CUDA 12 Modal wrapper.
  • Selecting specific models or runtimes within the Modal wrapper requires script modification.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
4 stars in the last 30 days

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

Explore Similar Projects

human-eval by openai

Top 0.4% on SourcePulse
3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago
Updated 8 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

Top 2.3% on SourcePulse
4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago
Updated 18 hours ago