LLM code evaluation framework for rigorous testing
Top 27.6% on sourcepulse
EvalPlus is a comprehensive evaluation framework for code generation by Large Language Models (LLMs). It addresses the need for more rigorous and extensive testing beyond standard benchmarks such as HumanEval and MBPP, offering expanded datasets (HumanEval+, MBPP+) with significantly more test cases and introducing EvalPerf for evaluating code efficiency. The target audience includes LLM developers, researchers, and teams aiming to benchmark and improve their code-generating models.
How It Works
EvalPlus enhances existing benchmarks by providing larger, more diverse test suites (HumanEval+ and MBPP+) to uncover code fragility and improve evaluation rigor. It also introduces EvalPerf, a dataset specifically designed to measure the performance and efficiency of LLM-generated code. The framework supports various LLM backends (HuggingFace, vLLM, OpenAI, Anthropic, Gemini, etc.) and offers safe code execution within Docker containers, ensuring reproducible and secure evaluations.
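As a sketch of a sandboxed run, the command below launches an evaluation inside the project's Docker image so that untrusted generated code never executes on the host. The ganler/evalplus:latest image name and the /app mount point are assumptions based on the project's documented Docker usage; verify both against the current README.

```bash
# Evaluate inside the EvalPlus container; results are written to the
# mounted host directory. Image name and mount point are assumptions
# taken from the project's documented Docker usage.
docker run --rm -v "$(pwd)/evalplus_results:/app" ganler/evalplus:latest \
    evalplus.evaluate --model <model_name> --dataset humaneval --backend vllm
```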
Quick Start & Requirements
Install EvalPlus with a vLLM backend:
pip install --upgrade "evalplus[vllm]"
For performance evaluation with EvalPerf, add the perf extra:
pip install --upgrade "evalplus[perf,vllm]"
EvalPerf additionally requires Linux perf counters to be enabled via sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'.
Run an evaluation with:
evalplus.evaluate --model <model_name> --dataset [humaneval|mbpp|evalperf] --backend <backend_name>
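For example, a complete greedy-decoding run on HumanEval+ with the vLLM backend might look like the sketch below (the model identifier is illustrative; confirm available flags with evalplus.evaluate --help):

```bash
# Generate completions and score them on HumanEval+ (base + extra tests)
# using the vLLM backend. The model name is only an example.
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset humaneval \
                  --backend vllm \
                  --greedy
```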
Maintenance & Community
The project has seen active development with releases like v0.3.1 (Oct 2024) adding EvalPerf and broader backend support. It is used by major LLM teams including Meta Llama, Allen AI, and DeepSeek. Links to papers, leaderboards, and documentation are provided.
Licensing & Compatibility
The README does not explicitly state a license, so licensing terms require further investigation before commercial use or closed-source integration.
Limitations & Caveats
The license is not specified in the README, which may pose a barrier to commercial adoption. The EvalPerf component runs only on *nix systems and requires lowering the kernel's perf_event_paranoid setting. Some dataset upgrades (e.g., MBPP+ v0.2.0) have involved removing broken tasks, so scores are not directly comparable across dataset versions.
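Before running EvalPerf, the permission requirement can be checked and applied as follows (a minimal sketch; /proc/sys/kernel/perf_event_paranoid is the standard Linux sysctl path, and the change resets on reboot):

```bash
# EvalPerf needs perf_event_paranoid at 0 (or lower); inspect the current value.
cat /proc/sys/kernel/perf_event_paranoid

# Lower it for the current boot session (resets on reboot).
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'
```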