Benchmark for holistic LLM code evaluation
LiveCodeBench provides a holistic and contamination-free evaluation framework for Large Language Models (LLMs) focused on coding capabilities. It targets researchers and developers evaluating LLMs, offering continuous updates with new problems from competitive programming platforms and assessing a broader range of skills beyond code generation, such as self-repair and test output prediction.
How It Works
The benchmark continuously collects coding problems from LeetCode, AtCoder, and CodeForces, ensuring up-to-date evaluation data. It supports multiple scenarios including code generation, code execution, and test output prediction. For code generation, it leverages vLLM for efficient inference and allows configuration for parallelization across GPUs. Evaluation metrics include pass@1 and pass@5, using a modified and improved checker from the APPS benchmark to handle edge cases.
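As a reference for how those metrics are computed, here is a minimal sketch of the standard unbiased pass@k estimator; the function name and the example numbers are illustrative and not taken from the repository.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k
    # samples, drawn from n generations of which c pass all tests,
    # is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 10 generations per problem, 3 passing.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.92

In practice the estimator is averaged over all problems in a release to produce the reported pass@1 and pass@5 scores.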
Quick Start & Requirements
Setup uses uv and Python 3.11:

uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .

Run a code generation evaluation with:

python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version {release_version}

The --release_version flag selects the dataset release (release_v1 to release_v6).
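If only the problems are needed, rather than the full runner, the releases can typically be pulled as a Hugging Face dataset. The sketch below assumes the dataset id livecodebench/code_generation_lite and the version_tag loading argument; verify both against the repository before relying on them.

from datasets import load_dataset

# Assumed dataset id and loading arguments; check the LiveCodeBench
# repository for the authoritative names.
lcb = load_dataset(
    "livecodebench/code_generation_lite",
    version_tag="release_v6",
    trust_remote_code=True,
)
print(len(lcb["test"]), "problems in this release")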
Highlighted Details
Maintenance & Community
The project is actively maintained, with multiple releases of the dataset. Submissions for the leaderboard are accepted via pull requests. Further details and updates can be found on their website.
Licensing & Compatibility
The repository is licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README notes that time limits can cause slight variations in metric computation. Users are advised to adjust the num_process_evaluate or timeout flags if significant performance variations are observed. Known issues with erroneous tests are documented in ERRATA.md.
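For example, a rerun with a longer per-test timeout and fewer evaluation workers might look like the following; the flag values are illustrative, not recommended defaults.

python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version {release_version} --timeout 30 --num_process_evaluate 8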