evalplus by evalplus

LLM code evaluation framework for rigorous testing

Created 2 years ago
1,583 stars

Top 26.4% on SourcePulse

View on GitHub
Project Summary

EvalPlus is a comprehensive evaluation framework for code generated by Large Language Models (LLMs). It addresses the need for more rigorous and extensive testing than standard benchmarks such as HumanEval and MBPP provide, offering expanded datasets (HumanEval+, MBPP+) with significantly more test cases and introducing EvalPerf for evaluating code efficiency. The target audience includes LLM developers, researchers, and teams aiming to benchmark and improve their code-generating models.

How It Works

EvalPlus enhances existing benchmarks by providing larger, more diverse test suites (HumanEval+ and MBPP+) to uncover code fragility and improve evaluation rigor. It also introduces EvalPerf, a dataset specifically designed to measure the performance and efficiency of LLM-generated code. The framework supports various LLM backends (HuggingFace, vLLM, OpenAI, Anthropic, Gemini, etc.) and offers safe code execution within Docker containers, ensuring reproducible and secure evaluations.
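
The generation side of this workflow can be sketched in a few lines. The snippet below follows the pattern shown in the EvalPlus README, assuming the evalplus.data helpers get_human_eval_plus and write_jsonl; generate_solution is a placeholder standing in for an actual LLM call, not part of the library.

    # Minimal sketch of the generation side of the EvalPlus workflow.
    # `generate_solution` is a placeholder for a real LLM call; the
    # evalplus.data helpers follow the pattern shown in the project README.
    from evalplus.data import get_human_eval_plus, write_jsonl

    def generate_solution(prompt: str) -> str:
        """Placeholder for a real model call; returns an empty (failing) solution."""
        return ""

    samples = [
        dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
        for task_id, problem in get_human_eval_plus().items()
    ]
    write_jsonl("samples.jsonl", samples)
    # samples.jsonl can then be scored against the HumanEval+ test suite,
    # e.g. with the evalplus.evaluate CLI described below.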

Quick Start & Requirements

  • Installation: pip install --upgrade "evalplus[vllm]" or pip install "evalplus[perf,vllm]" --upgrade for performance evaluation.
  • Prerequisites: Python, vLLM (for specific backends), Docker (for safe execution), perf (for EvalPerf, requires sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid').
  • Usage: evalplus.evaluate --model <model_name> --dataset [humaneval|mbpp|evalperf] --backend <backend_name> (a concrete invocation is sketched after this list).
  • Documentation: EvalPlus Commands, EvalPerf Documentation.
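
For a concrete invocation of the usage pattern above, the sketch below shells out to the evalplus.evaluate command from Python. The model id is a placeholder and the flags mirror the bullet above rather than an exhaustive option list.

    # Concrete invocation of the Usage pattern above, driven from Python.
    # The model id is a placeholder; substitute any model/backend you use.
    import subprocess

    cmd = [
        "evalplus.evaluate",
        "--model", "your-org/your-code-model",  # placeholder model id
        "--dataset", "humaneval",
        "--backend", "vllm",
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if evaluation fails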

Highlighted Details

  • HumanEval+ offers 80x more tests than the original HumanEval (see the sketch after this list).
  • MBPP+ offers 35x more tests than the original MBPP.
  • EvalPerf evaluates code efficiency using performance-exercising tasks.
  • Supports a wide range of LLM backends including HuggingFace, vLLM, OpenAI, Anthropic, Gemini, and more.
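
To get a feel for the expanded test suites behind the first bullet, one quick check is to compare the number of original versus added test inputs per task. This sketch assumes the problem dicts expose base_input and plus_input fields, as in the published HumanEval+ data release.

    # Rough check of how much larger HumanEval+ is than base HumanEval.
    # Assumes each problem dict exposes "base_input" (original tests) and
    # "plus_input" (EvalPlus-added tests), as in the HumanEval+ data release.
    from evalplus.data import get_human_eval_plus

    base_total = plus_total = 0
    for problem in get_human_eval_plus().values():
        base_total += len(problem["base_input"])
        plus_total += len(problem["plus_input"])

    print(f"original test inputs: {base_total}")
    print(f"added test inputs:    {plus_total}")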

Maintenance & Community

The project has seen active development with releases like v0.3.1 (Oct 2024) adding EvalPerf and broader backend support. It is used by major LLM teams including Meta Llama, Allen AI, and DeepSeek. Links to papers, leaderboards, and documentation are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This requires further investigation for commercial use or closed-source integration.

Limitations & Caveats

The license is not specified in the README, which may pose a barrier to commercial adoption. The EvalPerf component is *nix-only and requires relaxing the kernel's perf_event_paranoid setting (see Prerequisites above). Some dataset upgrades (e.g., MBPP+ v0.2.0) have involved removing broken tasks.

Health Check

  • Last Commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 24 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Edward Z. Yang (Research Engineer at Meta; maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini
0.2% · 1k stars
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago · Updated 4 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench
2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago · Updated 18 hours ago