LiveCodeBench by LiveCodeBench

Benchmark for holistic LLM code evaluation

Created 1 year ago
659 stars

Top 50.8% on SourcePulse

View on GitHub
Project Summary

LiveCodeBench is a holistic, contamination-free framework for evaluating the coding capabilities of Large Language Models (LLMs). Aimed at researchers and developers who evaluate LLMs, it is continuously updated with new problems from competitive programming platforms and assesses skills beyond code generation, such as self-repair and test output prediction.

How It Works

The benchmark continuously collects new coding problems from LeetCode, AtCoder, and CodeForces, keeping the evaluation data current. It supports multiple scenarios, including code generation, code execution, and test output prediction. For code generation it uses vLLM for efficient inference and can be configured to parallelize across GPUs. Reported metrics include pass@1 and pass@5, with correctness judged by a checker adapted from the APPS benchmark and improved to handle edge cases.
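
Pass@k is conventionally estimated with the unbiased estimator popularized by HumanEval; a minimal sketch in Python (illustrative, not LiveCodeBench's own implementation):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: n samples generated per problem, c of them correct."""
        if n - c < k:
            return 1.0  # every size-k draw contains at least one correct sample
        # 1 - C(n-c, k) / C(n, k), computed as a stable running product
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 10 generations for a problem, 3 of which pass all tests
    print(pass_at_k(n=10, c=3, k=1))  # ~0.30
    print(pass_at_k(n=10, c=3, k=5))  # ~0.92, five draws get more chances

Per-problem estimates are then averaged over the benchmark.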

Quick Start & Requirements

  • Install dependencies using uv and Python 3.11:
    uv venv --python 3.11
    source .venv/bin/activate
    uv pip install -e .
    
  • Run evaluations with:
    python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version {release_version}
    
  • Supports multiple dataset versions (release_v1 through release_v6); a dataset-loading sketch follows this list.
  • Official documentation and leaderboard available at livecodebench.github.io.
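
If you only need the problems themselves, the dataset can also be loaded directly; a minimal sketch assuming the Hugging Face dataset id livecodebench/code_generation_lite and its version_tag argument (verify the exact, current names against the official documentation):

    # Sketch: load the code-generation problems for a given release.
    # The dataset id, version_tag, split, and trust_remote_code arguments
    # below are assumptions; check the official docs before relying on them.
    from datasets import load_dataset

    problems = load_dataset(
        "livecodebench/code_generation_lite",
        version_tag="release_v6",
        split="test",
        trust_remote_code=True,
    )
    print(len(problems))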

Highlighted Details

  • Continuously updated dataset with problems released between May 2023 and April 2025 (release_v6).
  • Evaluates multiple coding capabilities: code generation, code execution, test output prediction, and self-repair.
  • Designed to mitigate data contamination by tracking problem release dates (see the filtering sketch after this list).
  • Demonstrates that models performing well on HumanEval do not necessarily perform well on LiveCodeBench, highlighting the need for holistic evaluation.
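
The date-based contamination control amounts to evaluating only on problems released after a model's training-data cutoff; a hypothetical sketch (field names are illustrative, not LiveCodeBench's actual schema):

    from datetime import date

    # Hypothetical records; real LiveCodeBench entries carry a release date.
    problems = [
        {"id": "p1", "platform": "leetcode", "release_date": date(2023, 6, 2)},
        {"id": "p2", "platform": "atcoder", "release_date": date(2024, 11, 17)},
    ]

    def contamination_free(problems, training_cutoff: date):
        """Keep only problems released after the model's training-data cutoff."""
        return [p for p in problems if p["release_date"] > training_cutoff]

    # Evaluate a model whose training data ends in March 2024.
    eval_set = contamination_free(problems, training_cutoff=date(2024, 3, 31))
    print([p["id"] for p in eval_set])  # -> ['p2']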

Maintenance & Community

The project is actively maintained, with multiple dataset releases. Leaderboard submissions are accepted via pull requests. Further details and updates are available on the project website.

Licensing & Compatibility

The repository is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README notes that time limits can cause slight variations in metric computation. Users are advised to adjust the num_process_evaluate or timeout flags if they observe significant variation; an illustrative invocation follows below. Known issues with erroneous tests are documented in ERRATA.md.
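
For example, combining those flags with the Quick Start command (the values shown are arbitrary placeholders, not recommendations):

    python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version {release_version} --num_process_evaluate 12 --timeout 6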

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 1

Star History

30 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

0.2% · 1k stars
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago
Updated 4 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Travis Fischer (Founder of Agentic), and 6 more.

AlphaCodium by Codium-ai

0.1% · 4k stars
Code generation research paper implementation
Created 1 year ago
Updated 9 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

0.4% · 3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago
Updated 8 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago
Updated 18 hours ago