Tencent-Hunyuan: Large-scale code generation benchmarks and training data
Top 80.0% on SourcePulse
AutoCodeBench provides an automated workflow for generating large-scale, high-difficulty, multilingual code generation benchmarks and training datasets. It addresses the limitations of prior benchmarks by leveraging LLM-Sandbox interaction, benefiting researchers and engineers who want to evaluate and improve the code generation capabilities of LLMs.
How It Works
The AutoCodeGen workflow employs LLM-Sandbox Interaction, where LLMs dynamically generate test inputs, and a multi-language sandbox provides corresponding outputs. This process creates scalable, high-quality code generation datasets, offering a novel approach to benchmark creation that overcomes the imbalanced language distributions and simplistic difficulty of previous efforts.
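The interaction loop described above can be sketched as follows. All function names here (`build_test_cases`, `llm_generate_inputs`, `sandbox_execute`) are hypothetical stand-ins for illustration, not the repository's actual API:

```python
def build_test_cases(problem, solution, llm_generate_inputs, sandbox_execute):
    """Sketch of the AutoCodeGen idea: an LLM proposes test inputs, and the
    sandbox runs the reference solution on each input to obtain the expected
    output, yielding (input, output) pairs without hand-written tests."""
    cases = []
    for test_input in llm_generate_inputs(problem):
        expected = sandbox_execute(solution, test_input)
        if expected is not None:  # discard inputs the sandbox could not evaluate
            cases.append({"input": test_input, "output": expected})
    return cases

# Toy stand-ins, purely for illustration:
cases = build_test_cases(
    "add two integers",
    lambda pair: pair[0] + pair[1],      # reference solution
    lambda problem: [(1, 2), (3, 4)],    # "LLM" proposing test inputs
    lambda sol, inp: sol(inp),           # "sandbox" executing the solution
)
```

The key design point is that correctness labels come from executing a trusted solution in the sandbox, not from the LLM itself, which is what makes the generated datasets scalable across many languages.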
Quick Start & Requirements
Setup involves pulling and running the hunyuansandbox/multi-language-sandbox:v1 Docker image. Evaluation scripts (Python) are provided to run inference outputs through the sandbox for scoring. Prerequisites include Docker and a Python environment. Links to HuggingFace datasets are available.
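Conceptually, scoring amounts to running each model generation through the sandbox and counting passes. A minimal sketch, assuming a `run_in_sandbox` callable that returns whether the code passed its tests (not the repo's actual script interface):

```python
def sandbox_pass_rate(generations, run_in_sandbox):
    """Fraction of model generations whose code passes its tests in the sandbox.

    generations: list of (code, tests) pairs produced by the model under evaluation.
    run_in_sandbox: callable returning True if `code` passes `tests`.
    """
    if not generations:
        return 0.0
    passed = sum(1 for code, tests in generations if run_in_sandbox(code, tests))
    return passed / len(generations)

# Toy check with a fake sandbox that "passes" only the first snippet:
rate = sandbox_pass_rate(
    [("snippet_a", "tests_a"), ("snippet_b", "tests_b")],
    lambda code, tests: code == "snippet_a",
)
```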
Highlighted Details
Three benchmark tiers are provided: ACB-Full (3,920 problems, 20 languages, high difficulty), ACB-Lite (1,586 problems, refined for consistent solvability), and ACB-Complete (1,000 problems, 3-shot completion style).
Maintenance & Community
The project is developed by the Hunyuan Team, Tencent. While specific community channels are not detailed, the project leverages and references advanced LLMs like DeepSeek-V3-0324 and Qwen2.5 for dataset refinement.
Licensing & Compatibility
The repository is licensed under the terms of its LICENSE file. Specific license type and compatibility for commercial use are not detailed in the provided README snippet.
Limitations & Caveats
No explicit limitations are stated regarding unsupported platforms or known bugs. The tiered benchmark structure (Lite, Complete) suggests a deliberate focus on solvability and difficulty refinement. The sandbox supports "over 30 programming languages," so languages outside that set may not be supported.