can-ai-code by the-crypt-keeper

AI coding model evaluation framework

Created 2 years ago
597 stars

Top 54.7% on SourcePulse

View on GitHub
Project Summary

This repository provides a framework for evaluating the coding capabilities of AI models through self-evaluating interviews. It targets AI researchers and developers seeking to benchmark LLMs on coding tasks, offering a standardized method to measure performance across various models and prompting strategies.

How It Works

The system uses human-written interview questions in YAML format, which are then transformed into model-specific prompts. AI models generate code responses, which are executed within a secure Docker sandbox. Evaluation is automated using predefined checks that assert expected behaviors and outputs, allowing for objective performance measurement.
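To make that flow concrete, below is a minimal Python sketch of the pipeline. The YAML field names (Signature, Question, Checks), the prompt template, and the model_generate() helper are illustrative assumptions, not the project's actual schema or API:

    import subprocess
    import yaml

    # Illustrative interview question; real suites live in YAML files in the
    # repo, and the field names used here are assumptions.
    question = yaml.safe_load("""
    Signature: "def fizzbuzz(n):"
    Question: "Return 'Fizz' for multiples of 3, 'Buzz' for 5, 'FizzBuzz' for both, else str(n)."
    Checks:
      input: [15]
      expected: "FizzBuzz"
    """)

    def model_generate(prompt: str) -> str:
        # Hypothetical stand-in for a real runtime (LiteLLM, vLLM, ...);
        # returns a canned answer so the sketch runs end to end.
        return (
            "def fizzbuzz(n):\n"
            "    s = ('Fizz' if n % 3 == 0 else '') + ('Buzz' if n % 5 == 0 else '')\n"
            "    return s or str(n)\n"
        )

    # 1. Build a model-specific prompt from the question (template is an assumption).
    prompt = f"{question['Question']}\nStart with: {question['Signature']}"

    # 2. Generate candidate code with the chosen backend.
    code = model_generate(prompt)

    # 3. Run the candidate inside a throwaway Docker container, not on the host.
    harness = code + f"\nprint(fizzbuzz(*{question['Checks']['input']}))"
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", "python:3.11-slim", "python", "-"],
        input=harness, capture_output=True, text=True, timeout=30,
    )

    # 4. Objective check: assert the sandboxed output matches the expectation.
    passed = result.stdout.strip() == question["Checks"]["expected"]
    print("PASS" if passed else "FAIL")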

Quick Start & Requirements

  • Install: pip install streamlit==1.23
  • Run webapps: streamlit run app.py or streamlit run compare-app.py
  • Dependencies: Python 3.x, Docker. CUDA-enabled runtimes (vLLM, ExLlama2, etc.) require specific GPU drivers and libraries.
  • Docs: HF Spaces Leaderboard, HF Spaces Comparisons

Highlighted Details

  • Supports multiple interview suites: junior-v2 (Python, JavaScript) and humaneval (Python).
  • Integrates with various inference runtimes: LiteLLM (API), OobaBooga, Huggingface Inference, Gradio, and numerous CUDA-accelerated backends (GGUF, GPTQ, EXL2, AWQ, FP16 via vLLM, Transformers, etc.).
  • Outputs results in .ndjson format, enabling iterative evaluation and comparison; a short sketch of consuming these results follows this list.
  • Includes Streamlit apps for local exploration of results and model comparisons.
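Because each .ndjson line is one self-contained JSON record, result files can be appended to and re-scanned incrementally. Here is a rough sketch that tallies pass rates per model; the filename and the record fields (model, passed) are assumptions, not the project's actual schema:

    import json
    from collections import Counter

    # Each line of an .ndjson results file is one JSON evaluation record.
    passed = Counter()
    total = Counter()
    with open("results/junior-v2.ndjson") as fh:   # hypothetical filename
        for line in fh:
            rec = json.loads(line)
            total[rec["model"]] += 1               # assumed field names
            passed[rec["model"]] += bool(rec["passed"])

    for model in sorted(total):
        print(f"{model}: {passed[model]}/{total[model]} checks passed")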

Maintenance & Community

  • Active development, with recent evaluations of models such as GPT-4.1, Gemma 3, and Phi-4-mini.
  • Links to external benchmarks and related projects are provided.

Licensing & Compatibility

  • The repository does not declare a license, and it uses and references projects under a variety of licenses. Users should verify compatibility before commercial use.

Limitations & Caveats

  • The "senior" test suite is marked as Work In Progress (WIP).
  • Some CUDA runtimes (AutoGPTQ, CTranslate2) have compatibility issues with the provided CUDA 12 Modal wrapper.
  • Selecting specific models or runtimes within the Modal wrapper requires script modification.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
4 stars in the last 30 days

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

Explore Similar Projects

human-eval by openai

Top 0.4% on SourcePulse
3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago
Updated 8 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

Top 2.3% on SourcePulse
4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago
Updated 18 hours ago