LiveCodeBench by LiveCodeBench

Benchmark for holistic LLM code evaluation

Created 1 year ago
659 stars

Top 50.8% on SourcePulse

View on GitHub
Project Summary

LiveCodeBench is a holistic, contamination-free framework for evaluating the coding capabilities of Large Language Models (LLMs). Aimed at researchers and developers who evaluate LLMs, it is continuously updated with new problems from competitive programming platforms and assesses skills beyond code generation, such as self-repair and test output prediction.

How It Works

The benchmark continuously collects new coding problems from LeetCode, AtCoder, and CodeForces, keeping the evaluation data current. It supports multiple scenarios, including code generation, code execution, and test output prediction. For code generation it uses vLLM for efficient inference and can be configured to parallelize across GPUs. Reported metrics include pass@1 and pass@5, with correctness judged by a checker adapted from the APPS benchmark and improved to handle edge cases.
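
Pass@k is conventionally estimated with the unbiased estimator popularized by HumanEval; a minimal sketch in Python (illustrative, not LiveCodeBench's own implementation):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: n samples generated per problem, c of them correct."""
        if n - c < k:
            return 1.0  # every size-k draw contains at least one correct sample
        # 1 - C(n-c, k) / C(n, k), computed as a stable running product
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 10 generations for a problem, 3 of which pass all tests
    print(pass_at_k(n=10, c=3, k=1))  # ~0.30
    print(pass_at_k(n=10, c=3, k=5))  # ~0.92, five draws get more chances

Per-problem estimates are then averaged over the benchmark.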

Quick Start & Requirements

  • Install dependencies using uv and Python 3.11:
    uv venv --python 3.11
    source .venv/bin/activate
    uv pip install -e .
    
  • Run evaluations with:
    python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version {release_version}
    
  • Supports multiple dataset versions (release_v1 through release_v6); a dataset-loading sketch follows this list.
  • Official documentation and leaderboard available at livecodebench.github.io.
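
If you only need the problems themselves, the dataset can also be loaded directly; a minimal sketch assuming the Hugging Face dataset id livecodebench/code_generation_lite and its version_tag argument (verify the exact, current names against the official documentation):

    # Sketch: load the code-generation problems for a given release.
    # The dataset id, version_tag, split, and trust_remote_code arguments
    # below are assumptions; check the official docs before relying on them.
    from datasets import load_dataset

    problems = load_dataset(
        "livecodebench/code_generation_lite",
        version_tag="release_v6",
        split="test",
        trust_remote_code=True,
    )
    print(len(problems))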

Highlighted Details

  • Continuously updated dataset with problems released between May 2023 and April 2025 (release_v6).
  • Evaluates multiple coding capabilities: code generation, code execution, test output prediction, and self-repair.
  • Designed to mitigate data contamination by tracking problem release dates (see the filtering sketch after this list).
  • Demonstrates that models performing well on HumanEval do not necessarily perform well on LiveCodeBench, highlighting the need for holistic evaluation.
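
The date-based contamination control amounts to evaluating only on problems released after a model's training-data cutoff; a hypothetical sketch (field names are illustrative, not LiveCodeBench's actual schema):

    from datetime import date

    # Hypothetical records; real LiveCodeBench entries carry a release date.
    problems = [
        {"id": "p1", "platform": "leetcode", "release_date": date(2023, 6, 2)},
        {"id": "p2", "platform": "atcoder", "release_date": date(2024, 11, 17)},
    ]

    def contamination_free(problems, training_cutoff: date):
        """Keep only problems released after the model's training-data cutoff."""
        return [p for p in problems if p["release_date"] > training_cutoff]

    # Evaluate a model whose training data ends in March 2024.
    eval_set = contamination_free(problems, training_cutoff=date(2024, 3, 31))
    print([p["id"] for p in eval_set])  # -> ['p2']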

Maintenance & Community

The project is actively maintained, with multiple dataset releases. Leaderboard submissions are accepted via pull requests. Further details and updates are available on the project website.

Licensing & Compatibility

The repository is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README notes that time limits can cause slight variations in metric computation. Users are advised to adjust the num_process_evaluate or timeout flags if they observe significant variation; an illustrative invocation follows below. Known issues with erroneous tests are documented in ERRATA.md.
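
For example, combining those flags with the Quick Start command (the values shown are arbitrary placeholders, not recommendations):

    python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version {release_version} --num_process_evaluate 12 --timeout 6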

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 1

Star History

30 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

0.2% · 1k stars
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago
Updated 4 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Travis Fischer (Founder of Agentic), and 6 more.

AlphaCodium by Codium-ai

0.1% · 4k stars
Code generation research paper implementation
Created 1 year ago
Updated 9 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

0.4% · 3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago
Updated 8 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago
Updated 18 hours ago