can-ai-code by the-crypt-keeper

AI coding model evaluation framework

Created 2 years ago · 592 stars · Top 55.7% on sourcepulse

Project Summary

This repository provides a framework for evaluating the coding capabilities of AI models through self-evaluating interviews. It targets AI researchers and developers who want to benchmark LLMs on coding tasks, offering a standardized way to measure performance across models and prompting strategies.

How It Works

The system uses human-written interview questions in YAML format, which are then transformed into model-specific prompts. AI models generate code responses, which are executed within a secure Docker sandbox. Evaluation is automated using predefined checks that assert expected behaviors and outputs, allowing for objective performance measurement.
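
To make the YAML question format concrete, here is a minimal sketch of a hypothetical interview entry. The field and check names below are illustrative assumptions, not the repo's exact schema:

    # Hypothetical interview entry in the spirit of the junior-v2 suite;
    # field and check names are illustrative, not the repo's exact schema.
    FactorialExample:
      Signature: "factorial(n)"
      Input: "a non-negative integer n"
      Output: "the factorial of n"
      Description: "Tests whether the model can write basic recursion or iteration."
      Checks:
        base_case:
          assert: "f.call(1)"   # the sandbox calls the generated function
          eq: 1
        small_value:
          assert: "f.call(5)"
          eq: 120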

Quick Start & Requirements

  • Install: pip install streamlit==1.23
  • Run webapps: streamlit run app.py or streamlit run compare-app.py (an end-to-end pipeline sketch follows this list)
  • Dependencies: Python 3.x, Docker. CUDA-enabled runtimes (vLLM, ExLlama2, etc.) require specific GPU drivers and libraries.
  • Docs: HF Spaces Leaderboard, HF Spaces Comparisons
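
Tying these commands to the workflow above, a full run might look like the following sketch. The interview script names here are assumptions based on the prepare → interview → evaluate flow; the actual names and arguments live in the repo README:

    # Hypothetical end-to-end run; script names are assumptions.
    python prepare.py             # expand YAML questions into model-specific prompts
    python interview-litellm.py   # have a model (via LiteLLM) answer each prompt
    python evaluate.py            # execute the answers in the Docker sandbox and score them
    streamlit run compare-app.py  # browse the resulting .ndjson files locally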

Highlighted Details

  • Supports multiple interview suites: junior-v2 (Python, JavaScript) and humaneval (Python).
  • Integrates with various inference runtimes: LiteLLM (API), OobaBooga, Huggingface Inference, Gradio, and numerous CUDA-accelerated backends (GGUF, GPTQ, EXL2, AWQ, FP16 via vLLM, Transformers, etc.).
  • Outputs results in .ndjson format, enabling iterative evaluation and comparison (see the reading sketch after this list).
  • Includes Streamlit apps for local exploration of results and model comparisons.
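
As referenced in the list above, results land in .ndjson, one JSON object per line, which makes quick comparisons easy to script. A minimal reading sketch, assuming hypothetical "model" and "passed" field names and a hypothetical results path:

    # Minimal sketch: tally pass rates per model from an .ndjson results file.
    # The path and the "model"/"passed" field names are assumptions.
    import json
    from collections import Counter

    passed, total = Counter(), Counter()
    with open("results/interview.ndjson") as f:
        for line in f:
            record = json.loads(line)  # one JSON object per line
            total[record["model"]] += 1
            passed[record["model"]] += bool(record.get("passed"))

    for model in total:
        print(f"{model}: {passed[model]}/{total[model]} checks passed")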

Maintenance & Community

  • Active development with recent evaluations of models like GPT 4.1, Gemma 3, and phi-4-mini.
  • Links to external benchmarks and related projects are provided.

Licensing & Compatibility

  • The repository does not declare an explicit license, and it utilizes and references other projects under a variety of licenses. Users should verify license compatibility before commercial use.

Limitations & Caveats

  • The "senior" test suite is marked as Work In Progress (WIP).
  • Some CUDA runtimes (AutoGPTQ, CTranslate2) have compatibility issues with the provided CUDA 12 Modal wrapper.
  • Selecting specific models or runtimes within the Modal wrapper requires script modification.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 17 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Travis Fischer (founder of Agentic).

Explore Similar Projects

  • LiveCodeBench by LiveCodeBench — benchmark for holistic LLM code evaluation. Created 1 year ago, updated 2 weeks ago. 606 stars · Top 0.8% on sourcepulse.