evalplus by evalplus

LLM code evaluation framework for rigorous testing

Created 2 years ago
1,583 stars

Top 26.4% on SourcePulse

View on GitHub
Project Summary

EvalPlus is a comprehensive evaluation framework for code generated by Large Language Models (LLMs). It addresses the need for more rigorous and extensive testing than standard benchmarks such as HumanEval and MBPP provide, offering expanded datasets (HumanEval+, MBPP+) with significantly more test cases and introducing EvalPerf for evaluating code efficiency. The target audience includes LLM developers, researchers, and teams aiming to benchmark and improve their code-generating models.

How It Works

EvalPlus enhances existing benchmarks by providing larger, more diverse test suites (HumanEval+ and MBPP+) to uncover code fragility and improve evaluation rigor. It also introduces EvalPerf, a dataset specifically designed to measure the performance and efficiency of LLM-generated code. The framework supports various LLM backends (HuggingFace, vLLM, OpenAI, Anthropic, Gemini, etc.) and offers safe code execution within Docker containers, ensuring reproducible and secure evaluations.
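
The generation side of this workflow can be sketched in a few lines. The snippet below follows the pattern shown in the EvalPlus README, assuming the evalplus.data helpers get_human_eval_plus and write_jsonl; generate_solution is a placeholder standing in for an actual LLM call, not part of the library.

    # Minimal sketch of the generation side of the EvalPlus workflow.
    # `generate_solution` is a placeholder for a real LLM call; the
    # evalplus.data helpers follow the pattern shown in the project README.
    from evalplus.data import get_human_eval_plus, write_jsonl

    def generate_solution(prompt: str) -> str:
        """Placeholder for a real model call; returns an empty (failing) solution."""
        return ""

    samples = [
        dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
        for task_id, problem in get_human_eval_plus().items()
    ]
    write_jsonl("samples.jsonl", samples)
    # samples.jsonl can then be scored against the HumanEval+ test suite,
    # e.g. with the evalplus.evaluate CLI described below.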

Quick Start & Requirements

  • Installation: pip install --upgrade "evalplus[vllm]" or pip install "evalplus[perf,vllm]" --upgrade for performance evaluation.
  • Prerequisites: Python, vLLM (for specific backends), Docker (for safe execution), perf (for EvalPerf, requires sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid').
  • Usage: evalplus.evaluate --model <model_name> --dataset [humaneval|mbpp|evalperf] --backend <backend_name> (a concrete invocation is sketched after this list).
  • Documentation: EvalPlus Commands, EvalPerf Documentation.
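
For a concrete invocation of the usage pattern above, the sketch below shells out to the evalplus.evaluate command from Python. The model id is a placeholder and the flags mirror the bullet above rather than an exhaustive option list.

    # Concrete invocation of the Usage pattern above, driven from Python.
    # The model id is a placeholder; substitute any model/backend you use.
    import subprocess

    cmd = [
        "evalplus.evaluate",
        "--model", "your-org/your-code-model",  # placeholder model id
        "--dataset", "humaneval",
        "--backend", "vllm",
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if evaluation fails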

Highlighted Details

  • HumanEval+ offers 80x more tests than the original HumanEval (see the sketch after this list).
  • MBPP+ offers 35x more tests than the original MBPP.
  • EvalPerf evaluates code efficiency using performance-exercising tasks.
  • Supports a wide range of LLM backends including HuggingFace, vLLM, OpenAI, Anthropic, Gemini, and more.
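
To get a feel for the expanded test suites behind the first bullet, one quick check is to compare the number of original versus added test inputs per task. This sketch assumes the problem dicts expose base_input and plus_input fields, as in the published HumanEval+ data release.

    # Rough check of how much larger HumanEval+ is than base HumanEval.
    # Assumes each problem dict exposes "base_input" (original tests) and
    # "plus_input" (EvalPlus-added tests), as in the HumanEval+ data release.
    from evalplus.data import get_human_eval_plus

    base_total = plus_total = 0
    for problem in get_human_eval_plus().values():
        base_total += len(problem["base_input"])
        plus_total += len(problem["plus_input"])

    print(f"original test inputs: {base_total}")
    print(f"added test inputs:    {plus_total}")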

Maintenance & Community

The project has seen active development with releases like v0.3.1 (Oct 2024) adding EvalPerf and broader backend support. It is used by major LLM teams including Meta Llama, Allen AI, and DeepSeek. Links to papers, leaderboards, and documentation are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This requires further investigation for commercial use or closed-source integration.

Limitations & Caveats

The license is not specified in the README, which may pose a barrier to commercial adoption. The EvalPerf component is *nix-only and requires relaxing the kernel's perf_event_paranoid setting (see Prerequisites above). Some dataset upgrades (e.g., MBPP+ v0.2.0) have involved removing broken tasks.

Health Check

  • Last Commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 24 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Edward Z. Yang (Research Engineer at Meta; maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini
0.2% · 1k stars
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago · Updated 4 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench
2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago · Updated 18 hours ago