DyCodeEval by SeekingDream

Dynamic benchmarking for code LLMs

Created 8 months ago
256 stars

Top 98.5% on SourcePulse

Project Summary

Summary

DyCodeEval tackles data contamination in Code LLM evaluation by introducing a novel dynamic benchmarking framework. It generates semantically equivalent, diverse, and non-deterministic programming problems at evaluation time, offering a more robust assessment of LLM reasoning capabilities. This framework is crucial for researchers and practitioners seeking to mitigate benchmark overfitting and understand true model performance.

How It Works

The system employs a multi-agent cooperation strategy to dynamically rewrite existing benchmarks. This approach generates new problem variants that preserve original semantics while enhancing diversity and non-determinism. This dynamic generation is key to circumventing data contamination, which can inflate static benchmark scores, thus providing a more accurate measure of an LLM's reasoning and problem-solving skills.
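The core contract of that rewriting step, producing a variant that is semantically equivalent to the seed problem yet randomized across runs, can be sketched in miniature. Note this is a toy illustration only: the actual framework uses cooperating LLM agents to rewrite prompts, while the scenario list, helper names, and template substitution below are invented for the sketch.

```python
import random

# Toy sketch: DyCodeEval's real rewriting is done by cooperating LLM agents;
# here a simple template substitution stands in for the agents' prompt rewrite.
SCENARIOS = ["a shopping cart", "a sensor log", "a playlist"]

def reference_solution(xs):
    """Seed problem's ground-truth solution: sum a list of integers."""
    return sum(xs)

def rewrite_problem(seed_prompt, rng):
    """Return a semantically equivalent, randomized variant of the seed."""
    scenario = rng.choice(SCENARIOS)
    prompt = seed_prompt.replace("a list", f"{scenario}, given as a list,")
    # Fresh inputs on each call make the benchmark non-deterministic; expected
    # outputs come from the seed's reference solution, preserving semantics.
    inputs = [[rng.randint(0, 9) for _ in range(5)] for _ in range(3)]
    return prompt, [(xs, reference_solution(xs)) for xs in inputs]

rng = random.Random(0)
prompt, tests = rewrite_problem("Return the total of a list of integers.", rng)
```

Because expected outputs are always recomputed from the seed's reference solution, any correct solution to the seed problem also passes every generated variant, which is what makes the variants fair to grade.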

Quick Start & Requirements

Install dependencies with pip install -r requirement.txt. Commercial models are invoked through LiteLLM, so users must configure the corresponding API accounts. Dynamic benchmark generation is initiated with python gen_problem.py, specifying the agent and seed-data IDs. Computing the DyPass@K metric is a three-step pipeline: gen_problem.py generates problem variants, gen_code.py collects model completions, and eval_pass_K.py scores them. Pre-generated HumanEval and MBPP datasets are available on Hugging Face and can be loaded with utils.load_unique_dataset.
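The scoring step can be sketched as follows, assuming DyPass@K averages the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), over the dynamically generated problem variants. The function names and the averaging scheme here are assumptions for illustration, not the repository's actual API.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn and c = correct samples."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def dy_pass_at_k(variant_results, k):
    """Hypothetical aggregation: average pass@k over per-variant
    (n_samples, n_correct) counts from the dynamic benchmark."""
    return sum(pass_at_k(n, c, k) for n, c in variant_results) / len(variant_results)

# Three dynamic variants, 10 completions each, with 10 / 3 / 0 correct:
score = dy_pass_at_k([(10, 10), (10, 3), (10, 0)], k=1)
```

Averaging over many non-deterministic variants is what distinguishes this from static pass@k: a model that memorized the seed benchmark scores well on only one term of the average.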

Highlighted Details

  • Official implementation for the ICML 2025 paper “Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination.”
  • Provides pre-generated HumanEval and MBPP datasets on Hugging Face.
  • Supports LLM inference via vLLM and LiteLLM abstractions for open-source and commercial models.
  • Dynamic generation ensures problems are semantically equivalent, diverse, and non-deterministic.

Maintenance & Community

No specific details on maintainers, community channels (e.g., Discord, Slack), or active development signals beyond the upcoming ICML 2025 publication were present in the provided text.

Licensing & Compatibility

The license type and any associated compatibility notes for commercial use or closed-source linking were not explicitly stated in the provided README content.

Limitations & Caveats

Current dynamic benchmark generation relies on commercial LLM APIs; future features aim to support fine-tuning open-source models. DyPass@K evaluation scripts are described as fragmented and slated for simplification. The project's association with an upcoming conference paper suggests it may still be in active development.

Health Check

Last Commit: 2 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 60 stars in the last 30 days

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Pawel Garbacki (Cofounder of Fireworks AI), and 15 more.
