DyCodeEval by SeekingDream

Dynamic benchmarking for code LLMs

Created 8 months ago
256 stars

Top 98.5% on SourcePulse

Project Summary

Summary

DyCodeEval tackles data contamination in Code LLM evaluation by introducing a novel dynamic benchmarking framework. It generates semantically equivalent, diverse, and non-deterministic programming problems at evaluation time, offering a more robust assessment of LLM reasoning capabilities. This framework is crucial for researchers and practitioners seeking to mitigate benchmark overfitting and understand true model performance.

How It Works

The system employs a multi-agent cooperation strategy to dynamically rewrite existing benchmarks. This approach generates new problem variants that preserve original semantics while enhancing diversity and non-determinism. This dynamic generation is key to circumventing data contamination, which can inflate static benchmark scores, thus providing a more accurate measure of an LLM's reasoning and problem-solving skills.
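The core contract of that rewriting step, producing a variant that is semantically equivalent to the seed problem yet randomized across runs, can be sketched in miniature. Note this is a toy illustration only: the actual framework uses cooperating LLM agents to rewrite prompts, while the scenario list, helper names, and template substitution below are invented for the sketch.

```python
import random

# Toy sketch: DyCodeEval's real rewriting is done by cooperating LLM agents;
# here a simple template substitution stands in for the agents' prompt rewrite.
SCENARIOS = ["a shopping cart", "a sensor log", "a playlist"]

def reference_solution(xs):
    """Seed problem's ground-truth solution: sum a list of integers."""
    return sum(xs)

def rewrite_problem(seed_prompt, rng):
    """Return a semantically equivalent, randomized variant of the seed."""
    scenario = rng.choice(SCENARIOS)
    prompt = seed_prompt.replace("a list", f"{scenario}, given as a list,")
    # Fresh inputs on each call make the benchmark non-deterministic; expected
    # outputs come from the seed's reference solution, preserving semantics.
    inputs = [[rng.randint(0, 9) for _ in range(5)] for _ in range(3)]
    return prompt, [(xs, reference_solution(xs)) for xs in inputs]

rng = random.Random(0)
prompt, tests = rewrite_problem("Return the total of a list of integers.", rng)
```

Because expected outputs are always recomputed from the seed's reference solution, any correct solution to the seed problem also passes every generated variant, which is what makes the variants fair to grade.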

Quick Start & Requirements

Install dependencies with pip install -r requirement.txt. Commercial models are invoked through LiteLLM, so users must configure the corresponding API accounts. Dynamic benchmark generation is initiated with python gen_problem.py, specifying the agent and seed-data IDs. Computing the DyPass@K metric is a three-step pipeline: gen_problem.py generates problem variants, gen_code.py collects model completions, and eval_pass_K.py scores them. Pre-generated HumanEval and MBPP datasets are available on Hugging Face and can be loaded with utils.load_unique_dataset.
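The scoring step can be sketched as follows, assuming DyPass@K averages the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), over the dynamically generated problem variants. The function names and the averaging scheme here are assumptions for illustration, not the repository's actual API.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn and c = correct samples."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def dy_pass_at_k(variant_results, k):
    """Hypothetical aggregation: average pass@k over per-variant
    (n_samples, n_correct) counts from the dynamic benchmark."""
    return sum(pass_at_k(n, c, k) for n, c in variant_results) / len(variant_results)

# Three dynamic variants, 10 completions each, with 10 / 3 / 0 correct:
score = dy_pass_at_k([(10, 10), (10, 3), (10, 0)], k=1)
```

Averaging over many non-deterministic variants is what distinguishes this from static pass@k: a model that memorized the seed benchmark scores well on only one term of the average.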

Highlighted Details

  • Official implementation for the ICML 2025 paper “Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination.”
  • Provides pre-generated HumanEval and MBPP datasets on Hugging Face.
  • Supports LLM inference via vLLM and LiteLLM abstractions for open-source and commercial models.
  • Dynamic generation ensures problems are semantically equivalent, diverse, and non-deterministic.

Maintenance & Community

No specific details on maintainers, community channels (e.g., Discord, Slack), or active development signals beyond the upcoming ICML 2025 publication were present in the provided text.

Licensing & Compatibility

The license type and any associated compatibility notes for commercial use or closed-source linking were not explicitly stated in the provided README content.

Limitations & Caveats

Current dynamic benchmark generation relies on commercial LLM APIs; future features aim to support fine-tuning open-source models. DyPass@K evaluation scripts are described as fragmented and slated for simplification. The project's association with an upcoming conference paper suggests it may still be in active development.

Health Check

Last Commit: 2 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 60 stars in the last 30 days

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Pawel Garbacki (Cofounder of Fireworks AI), and 15 more.
