AutoCodeBenchmark by Tencent-Hunyuan

Large-scale code generation benchmarks and training data

Created 6 months ago
349 stars

Top 80.0% on SourcePulse

Project Summary

AutoCodeBench provides an automated workflow for generating large-scale, high-difficulty, multilingual code generation benchmarks and training datasets. It addresses limitations of earlier benchmarks by using LLM-Sandbox interaction, and it is aimed at researchers and engineers who want to evaluate and improve the code generation capabilities of LLMs.

How It Works

The AutoCodeGen workflow uses LLM-Sandbox Interaction: an LLM generates test inputs, and a multi-language sandbox executes the code on those inputs to produce the corresponding outputs. This yields scalable, high-quality code generation datasets, and the authors present it as a way to overcome the imbalanced language distributions and limited difficulty of earlier benchmarks.
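The input/output pairing described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the LLM-generated inputs are already available as strings, and it substitutes a local Python subprocess for the multi-language sandbox (no isolation, one language only). The function names run_reference and build_test_cases are hypothetical.

```python
import subprocess
import sys


def run_reference(code: str, stdin: str, timeout: float = 5.0) -> str:
    """Run a Python reference solution in a subprocess and return its stdout.

    Hypothetical stand-in for the multi-language sandbox: a plain local
    subprocess offers no isolation and only runs Python.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"reference solution failed: {result.stderr.strip()}")
    return result.stdout


def build_test_cases(code: str, generated_inputs: list[str]) -> list[dict]:
    """Pair each generated input with the output reported by the executor."""
    cases = []
    for stdin in generated_inputs:
        try:
            output = run_reference(code, stdin)
        except (RuntimeError, subprocess.TimeoutExpired):
            # Drop inputs the reference cannot handle rather than record a bad label.
            continue
        cases.append({"input": stdin, "output": output})
    return cases


if __name__ == "__main__":
    reference = "a, b = map(int, input().split())\nprint(a + b)"
    # In the real workflow these inputs would be produced by an LLM.
    inputs = ["1 2\n", "10 -3\n", "not numbers\n"]
    for case in build_test_cases(reference, inputs):
        print(repr(case["input"]), "->", repr(case["output"]))
```

The key design point is that expected outputs come from execution rather than from the LLM, so an input the reference solution cannot handle (here "not numbers") is discarded instead of producing a wrong label.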

Quick Start & Requirements

Setup involves pulling and running the hunyuansandbox/multi-language-sandbox:v1 Docker image. Python evaluation scripts are provided to send inference outputs through the sandbox for scoring. Prerequisites are Docker and a Python environment. The datasets are linked on Hugging Face.
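Once the sandbox has judged each solution, scoring reduces to aggregating pass/fail results. The sketch below is illustrative only: it assumes one JSON object per line with hypothetical "language" and "passed" fields, which may not match the format the project's evaluation scripts actually produce.

```python
import json
from collections import defaultdict
from typing import Iterable


def load_results(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL records, skipping blank lines (record schema is hypothetical)."""
    return [json.loads(line) for line in lines if line.strip()]


def pass_rate_by_language(results: list[dict]) -> dict[str, float]:
    """Fraction of problems whose solution passed all sandbox tests, per language."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["language"]] += 1
        if r["passed"]:
            passed[r["language"]] += 1
    return {lang: passed[lang] / totals[lang] for lang in sorted(totals)}


def overall_pass_rate(results: list[dict]) -> float:
    """Fraction of all problems passed, pooled across languages."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


if __name__ == "__main__":
    sample = [
        '{"language": "python", "passed": true}',
        '{"language": "python", "passed": false}',
        '{"language": "rust", "passed": true}',
    ]
    results = load_results(sample)
    print(pass_rate_by_language(results))  # {'python': 0.5, 'rust': 1.0}
    print(overall_pass_rate(results))
```

Reporting per-language rates alongside the pooled rate matters for a 20-language benchmark, since a pooled number can hide weak performance on less common languages.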

Highlighted Details

  • AutoCodeBench Series: Offers ACB-Full (3,920 problems, 20 languages, high difficulty), ACB-Lite (1,586 problems, refined for consistent solvability), and ACB-Complete (1,000 problems, 3-shot completion style).
  • AutoCodeInstruct: A multilingual dataset for RL/SFT training, derived from DeepSeek-V3-0324 and filtered by Qwen2.5 models.
  • MultiLanguageSandbox: A robust, secure sandbox supporting compilation and execution for over 30 programming languages.
  • AutoCodeBench-V2: An updated version featuring 1,000 higher-quality problems, iteratively refined using proprietary models and the sandbox.
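The Lite split and the AutoCodeInstruct data are both described as filtered, for solvability and by other models respectively. The README excerpt does not give the exact criteria, so the following is only a generic sketch of count-based filtering: it assumes each problem records which filter models (hypothetical names model_a, model_b, model_c) passed its sandbox tests, and keeps problems solved by a number of models within a chosen range.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Problem:
    problem_id: str
    language: str
    # Hypothetical: filter-model name -> whether its solution passed the sandbox tests.
    solved_by: dict[str, bool] = field(default_factory=dict)


def filter_problems(
    problems: list[Problem], min_solved: int = 1, max_solved: Optional[int] = None
) -> list[Problem]:
    """Keep problems solved by at least min_solved and at most max_solved models."""
    kept = []
    for p in problems:
        n = sum(p.solved_by.values())
        if n < min_solved:
            continue  # no model solves it: possibly unsolvable or mislabeled
        if max_solved is not None and n > max_solved:
            continue  # too many models solve it: likely too easy
        kept.append(p)
    return kept


if __name__ == "__main__":
    problems = [
        Problem("p1", "go", {"model_a": True, "model_b": True, "model_c": True}),
        Problem("p2", "rust", {"model_a": True, "model_b": False, "model_c": False}),
        Problem("p3", "java", {"model_a": False, "model_b": False, "model_c": False}),
    ]
    # Require at least one solver, but reject problems every model solves.
    for p in filter_problems(problems, min_solved=1, max_solved=2):
        print(p.problem_id)
```

A lower bound screens out problems that may be broken, while an upper bound screens out problems too easy to discriminate between models; which bounds the project actually uses is not stated here.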

Maintenance & Community

The project is developed by the Hunyuan Team at Tencent. No specific community channels are listed. The project uses advanced LLMs such as DeepSeek-V3-0324 and Qwen2.5 to generate and filter its datasets.

Licensing & Compatibility

The repository is licensed under the terms of its LICENSE file. The specific license type and its compatibility with commercial use are not stated in the README excerpt.

Limitations & Caveats

No explicit limitations regarding unsupported platforms or known bugs are stated. The tiered benchmarks (Lite, Complete) suggest deliberate calibration of solvability and difficulty. The sandbox supports "over 30 programming languages," so languages outside that set may not be supported.

Health Check
Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 3
Star History: 116 stars in the last 30 days
