Tencent-Hunyuan: Large-scale code generation benchmarks and training data
Top 80.0% on SourcePulse
AutoCodeBench provides an automated workflow for generating large-scale, high-difficulty, multilingual code generation benchmarks and training datasets. It addresses the limitations of prior benchmarks by leveraging LLM-Sandbox interaction, benefiting researchers and engineers who want to evaluate and improve the code generation capabilities of LLMs.
How It Works
The AutoCodeGen workflow employs LLM-Sandbox Interaction, where LLMs dynamically generate test inputs, and a multi-language sandbox provides corresponding outputs. This process creates scalable, high-quality code generation datasets, offering a novel approach to benchmark creation that overcomes the imbalanced language distributions and simplistic difficulty of previous efforts.
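The interaction loop described above can be sketched as follows. All function names here (`build_test_cases`, `llm_generate_inputs`, `sandbox_execute`) are hypothetical stand-ins for illustration, not the repository's actual API:

```python
def build_test_cases(problem, solution, llm_generate_inputs, sandbox_execute):
    """Sketch of the AutoCodeGen idea: an LLM proposes test inputs, and the
    sandbox runs the reference solution on each input to obtain the expected
    output, yielding (input, output) pairs without hand-written tests."""
    cases = []
    for test_input in llm_generate_inputs(problem):
        expected = sandbox_execute(solution, test_input)
        if expected is not None:  # discard inputs the sandbox could not evaluate
            cases.append({"input": test_input, "output": expected})
    return cases

# Toy stand-ins, purely for illustration:
cases = build_test_cases(
    "add two integers",
    lambda pair: pair[0] + pair[1],      # reference solution
    lambda problem: [(1, 2), (3, 4)],    # "LLM" proposing test inputs
    lambda sol, inp: sol(inp),           # "sandbox" executing the solution
)
```

The key design point is that correctness labels come from executing a trusted solution in the sandbox, not from the LLM itself, which is what makes the generated datasets scalable across many languages.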
Quick Start & Requirements
Setup involves pulling and running the hunyuansandbox/multi-language-sandbox:v1 Docker image. Evaluation scripts (Python) are provided to run inference outputs through the sandbox for scoring. Prerequisites include Docker and a Python environment. Links to HuggingFace datasets are available.
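Conceptually, scoring amounts to running each model generation through the sandbox and counting passes. A minimal sketch, assuming a `run_in_sandbox` callable that returns whether the code passed its tests (not the repo's actual script interface):

```python
def sandbox_pass_rate(generations, run_in_sandbox):
    """Fraction of model generations whose code passes its tests in the sandbox.

    generations: list of (code, tests) pairs produced by the model under evaluation.
    run_in_sandbox: callable returning True if `code` passes `tests`.
    """
    if not generations:
        return 0.0
    passed = sum(1 for code, tests in generations if run_in_sandbox(code, tests))
    return passed / len(generations)

# Toy check with a fake sandbox that "passes" only the first snippet:
rate = sandbox_pass_rate(
    [("snippet_a", "tests_a"), ("snippet_b", "tests_b")],
    lambda code, tests: code == "snippet_a",
)
```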
Highlighted Details
Three benchmark tiers are provided: ACB-Full (3,920 problems, 20 languages, high difficulty), ACB-Lite (1,586 problems, refined for consistent solvability), and ACB-Complete (1,000 problems, 3-shot completion style).
Maintenance & Community
The project is developed by the Hunyuan Team, Tencent. While specific community channels are not detailed, the project leverages and references advanced LLMs like DeepSeek-V3-0324 and Qwen2.5 for dataset refinement.
Licensing & Compatibility
The repository is licensed under the terms of its LICENSE file. Specific license type and compatibility for commercial use are not detailed in the provided README snippet.
Limitations & Caveats
No explicit limitations are stated regarding unsupported platforms or known bugs. The tiered benchmark structure (Lite, Complete) suggests a deliberate focus on solvability and difficulty refinement. The sandbox supports "over 30 programming languages," so languages outside that set may not be supported.