evalplus by evalplus

LLM code evaluation framework for rigorous testing

created 2 years ago
1,536 stars

Top 27.6% on sourcepulse

Project Summary

EvalPlus is an evaluation framework for Large Language Models (LLMs) that generate code. It addresses the need for more rigorous testing than standard benchmarks such as HumanEval and MBPP provide, offering expanded datasets (HumanEval+, MBPP+) with significantly more test cases and introducing EvalPerf for evaluating code efficiency. Its target audience includes LLM developers, researchers, and teams aiming to benchmark and improve their code-generating models.

How It Works

EvalPlus augments existing benchmarks with larger, more diverse test suites (HumanEval+ and MBPP+) that expose fragile or incorrect LLM-generated code which still passes the original tests. It also introduces EvalPerf, a dataset designed to measure the performance and efficiency of LLM-generated code. The framework supports a range of LLM backends (HuggingFace, vLLM, OpenAI, Anthropic, Gemini, and others) and offers safe code execution inside Docker containers for reproducible, sandboxed evaluations.
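For the Docker execution mode, a typical flow is to produce a samples.jsonl of model completions and then score it inside the container. The sketch below is illustrative only: the image name (ganler/evalplus) and the mount layout are assumptions that should be confirmed against the project's documentation.

    # Score pre-generated completions inside the EvalPlus container.
    # Image name and mount path are assumptions; check the project docs.
    docker run --rm -v $(pwd):/app ganler/evalplus:latest \
        evalplus.evaluate --dataset humaneval --samples samples.jsonl

Running the evaluator in a container keeps untrusted generated code away from the host, which is why the project highlights Docker-based execution for secure evaluations.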

Quick Start & Requirements

  • Installation: pip install --upgrade "evalplus[vllm]", or pip install --upgrade "evalplus[perf,vllm]" for performance evaluation (EvalPerf).
  • Prerequisites: Python, vLLM (for specific backends), Docker (for safe execution), perf (for EvalPerf, requires sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid').
  • Usage: evalplus.evaluate --model <model_name> --dataset [humaneval|mbpp|evalperf] --backend <backend_name> (see the example invocation after this list).
  • Documentation: EvalPlus Commands, EvalPerf Documentation.
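For example, a greedy HumanEval+ run against a vLLM-hosted model looks roughly like the following; the model name is illustrative and the exact flags should be checked against the EvalPlus Commands documentation.

    evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                      --dataset humaneval \
                      --backend vllm \
                      --greedy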

Highlighted Details

  • HumanEval+ offers 80x more tests than the original HumanEval.
  • MBPP+ offers 35x more tests than the original MBPP.
  • EvalPerf evaluates code efficiency using performance-exercising tasks.
  • Supports a wide range of LLM backends including HuggingFace, vLLM, OpenAI, Anthropic, Gemini, and more.

Maintenance & Community

The project is actively developed; the v0.3.1 release (Oct 2024) added EvalPerf and broader backend support. It is used by major LLM teams including Meta Llama, Allen AI, and DeepSeek. Links to papers, leaderboards, and documentation are provided.

Licensing & Compatibility

The README does not explicitly state a license; verify the repository's license terms before commercial use or closed-source integration.

Limitations & Caveats

The license is not stated in the README, which may pose a barrier to commercial adoption. EvalPerf runs only on *nix systems and requires relaxing the kernel's perf_event_paranoid setting. Some dataset upgrades (e.g., MBPP+ v0.2.0) have removed broken tasks, which can affect score comparability across dataset versions.
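Before an EvalPerf run, the perf prerequisite can be checked directly; a quick sanity check, assuming a Linux host:

    # Should print 0 (or a lower value) after the sudo command from Quick Start
    cat /proc/sys/kernel/perf_event_paranoid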

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 7

Star History

84 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Travis Fischer (founder of Agentic).

LiveCodeBench by LiveCodeBench

  • Benchmark for holistic LLM code evaluation
  • Top 0.8% · 606 stars
  • Created 1 year ago, updated 2 weeks ago