LiveCodeBench by LiveCodeBench

Benchmark for holistic LLM code evaluation

created 1 year ago
606 stars

Top 54.8% on sourcepulse

View on GitHub
Project Summary

LiveCodeBench provides a holistic and contamination-free evaluation framework for Large Language Models (LLMs) focused on coding capabilities. It targets researchers and developers evaluating LLMs, offering continuous updates with new problems from competitive programming platforms and assessing a broader range of skills beyond code generation, such as self-repair and test output prediction.

How It Works

The benchmark continuously collects coding problems from LeetCode, AtCoder, and CodeForces, keeping the evaluation data up to date. It supports multiple scenarios, including code generation, self-repair, code execution, and test output prediction. For code generation, it uses vLLM for efficient inference and can be configured to parallelize across GPUs. Evaluation metrics include pass@1 and pass@5, computed with a checker adapted from the APPS benchmark and improved to handle edge cases.
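
For context, pass@1 and pass@5 are typically computed with the standard unbiased pass@k estimator (Chen et al., 2021). The sketch below illustrates the per-problem calculation; the exact aggregation code inside lcb_runner may differ:

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Probability that at least one of k samples, drawn without replacement
        # from n generations of which c are correct, passes all tests.
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # Example: 10 generations for one problem, 3 pass the checker.
    print(round(pass_at_k(10, 3, 1), 3))  # 0.3
    print(round(pass_at_k(10, 3, 5), 3))  # 0.917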

Quick Start & Requirements

  • Install dependencies using uv and Python 3.11:
    uv venv --python 3.11
    source .venv/bin/activate
    uv pip install -e .
    
  • Run evaluations with:
    python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version {release_version}
    
  • Supports multiple dataset versions (release_v1 through release_v6); a dataset-loading sketch follows this list.
  • Official documentation and leaderboard available at livecodebench.github.io.
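
As a sketch, a dataset release can also be pulled directly with the Hugging Face datasets library; the livecodebench/code_generation_lite dataset name and version_tag argument follow the upstream documentation and may change:

    from datasets import load_dataset

    # Loads the problem set for a given release; trust_remote_code is needed
    # because the dataset ships a loading script.
    lcb = load_dataset(
        "livecodebench/code_generation_lite",
        version_tag="release_v6",
        trust_remote_code=True,
    )
    print(lcb)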

Highlighted Details

  • Continuously updated dataset with problems released between May 2023 and April 2025 (release_v6).
  • Evaluates multiple coding capabilities: code generation, code execution, test output prediction, and self-repair.
  • Designed to mitigate data contamination by tracking problem release dates; a filtering sketch follows this list.
  • Demonstrates that models performing well on HumanEval do not necessarily perform well on LiveCodeBench, highlighting the need for holistic evaluation.
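
A minimal illustration of date-based filtering, using hypothetical problem records with a contest_date field (the field name and records are illustrative, not the exact dataset schema):

    from datetime import date

    # Hypothetical problem records; in practice these come from the dataset.
    problems = [
        {"question_id": "p1", "contest_date": "2023-06-15"},
        {"question_id": "p2", "contest_date": "2024-11-02"},
    ]

    cutoff = date(2024, 9, 1)  # e.g. a model's training-data cutoff
    fresh = [p for p in problems
             if date.fromisoformat(p["contest_date"]) > cutoff]
    print([p["question_id"] for p in fresh])  # only post-cutoff problems remain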

Maintenance & Community

The project is actively maintained, with multiple releases of the dataset. Submissions for the leaderboard are accepted via pull requests. Further details and updates can be found on their website.

Licensing & Compatibility

The repository is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README notes that time limits can cause slight variations in metric computation; users are advised to adjust the num_process_evaluate or timeout flags if significant variations are observed. Known erroneous tests are documented in ERRATA.md.
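
An illustrative invocation with those flags (the flag names come from the README; the values shown are placeholders, not recommended settings):

    python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version release_v6 --timeout 12 --num_process_evaluate 8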

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 14

Star History

  • 162 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Wei-Lin Chiang (cofounder of LMArena).

evalplus by evalplus

  • Top 0.5% on sourcepulse
  • 2k stars
  • LLM code evaluation framework for rigorous testing
  • created 2 years ago
  • updated 4 weeks ago