DyCodeEval (SeekingDream): Dynamic benchmarking for code LLMs
Top 98.5% on SourcePulse
Summary
DyCodeEval tackles data contamination in Code LLM evaluation by introducing a novel dynamic benchmarking framework. It generates semantically equivalent, diverse, and non-deterministic programming problems at evaluation time, offering a more robust assessment of LLM reasoning capabilities. This framework is crucial for researchers and practitioners seeking to mitigate benchmark overfitting and understand true model performance.
How It Works
The system employs a multi-agent cooperation strategy to dynamically rewrite existing benchmarks. This approach generates new problem variants that preserve original semantics while enhancing diversity and non-determinism. This dynamic generation is key to circumventing data contamination, which can inflate static benchmark scores, thus providing a more accurate measure of an LLM's reasoning and problem-solving skills.
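The multi-agent rewriting loop described above can be sketched as a proposer/validator pair: one agent rewrites the problem's surface story, and another accepts the variant only if the canonical solution still passes the seed tests, guaranteeing semantic equivalence. This is a minimal illustrative sketch; the names, data layout, and scenario pool are assumptions, not DyCodeEval's actual API (which drives the proposer with a commercial LLM rather than a fixed list).

```python
import random

# Hypothetical seed problem: prompt, canonical solution, and I/O tests.
SEED = {
    "prompt": "Given a list of item prices, return their total cost.",
    "solution": lambda xs: sum(xs),
    "tests": [([1, 2, 3], 6), ([], 0), ([5], 5)],
}

# Stand-in for the proposer agent's output: new scenarios that keep the
# same input/output contract while changing the surface narrative.
SCENARIOS = [
    "Given the daily step counts for a week, return the total steps.",
    "Given the weights of parcels on a truck, return the total load.",
    "Given per-level scores in a game, return the final score.",
]

def propose_variant(seed, rng):
    """'Proposer' agent: swap in a fresh scenario, keep everything else."""
    variant = dict(seed)
    variant["prompt"] = rng.choice(SCENARIOS)
    return variant

def validate(variant):
    """'Validator' agent: accept only if the canonical solution still
    passes every seed test, i.e. the rewrite preserved semantics."""
    return all(variant["solution"](inp) == out for inp, out in variant["tests"])

rng = random.Random(0)  # non-determinism comes from the sampling seed
variant = propose_variant(SEED, rng)
assert validate(variant)
```

Because the validator gates on the original tests, every accepted variant is semantically equivalent to the seed while its prompt text, which is what a contaminated model may have memorized, differs.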
Quick Start & Requirements
Installation uses `pip install -r requirements.txt`. Users must configure commercial LLM accounts via LiteLLM for model invocation. Dynamic benchmark generation is initiated with `python gen_problem.py`, specifying agent and seed-data IDs. Computing the DyPass@K metric is a three-step pipeline: `gen_problem.py`, `gen_code.py`, and `eval_pass_K.py`. Pre-generated HumanEval and MBPP datasets are available on Hugging Face and can be loaded with `utils.load_unique_dataset`.
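The README summary does not define DyPass@K precisely. Assuming it extends the standard unbiased pass@k estimator by averaging over the dynamically generated variants of each seed problem, the final step of the pipeline could be sketched as follows; `dy_pass_at_k` and its `(n, c)` input shape are assumptions about what `eval_pass_K.py` aggregates, not its actual interface.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: the probability that at least
    one of k samples, drawn without replacement from n generations of
    which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def dy_pass_at_k(results, k: int) -> float:
    """Hypothetical DyPass@K: average pass@k over a problem's dynamically
    generated variants. `results` is a list of (n, c) pairs, one per
    variant (an assumed aggregation, for illustration only)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# E.g. three variants, 10 generations each, with 10 / 4 / 0 correct:
score = dy_pass_at_k([(10, 10), (10, 4), (10, 0)], k=1)
```

Averaging across variants is what distinguishes a dynamic score from a static one: a model that has memorized the seed problem scores high on one variant but not on its semantically equivalent rewrites.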
Maintenance & Community
No specific details on maintainers, community channels (e.g., Discord, Slack), or active development signals beyond the upcoming ICML 2025 publication were present in the provided text.
Licensing & Compatibility
The license type and any associated compatibility notes for commercial use or closed-source linking were not explicitly stated in the provided README content.
Limitations & Caveats
Current dynamic benchmark generation relies on commercial LLM APIs; future features aim to support fine-tuning open-source models. DyPass@K evaluation scripts are described as fragmented and slated for simplification. The project's association with an upcoming conference paper suggests it may still be in active development.