Synthetic dataset generation for coding tasks
KodCode is a synthetic dataset generation framework for creating diverse, challenging, and verifiable coding questions with solutions. It targets researchers and developers working on large language models for code, offering a large-scale, open-source dataset suitable for supervised fine-tuning (SFT) and reinforcement learning (RL) tuning.
How It Works
KodCode unifies multiple data sources, including zero-shot generation, human-written assessments, code snippets, and technical documentation, to produce high-quality coding questions. A key feature is its self-verification mechanism, which generates verifiable solutions and corresponding tests (supporting pytest and parallel execution) for each question, ensuring solution correctness and enabling robust model evaluation.
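The self-verification idea above can be sketched as follows. This is a minimal illustration, not KodCode's actual pipeline: the names verify, verify_many, RUNNER, and the sample question are hypothetical. To stay within the standard library, a tiny runner that calls every test_* function stands in for pytest, and a thread pool stands in for whatever parallel execution the project uses.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical stand-in for pytest: import the test module and call each test_* function.
# Any failing assertion raises, so the subprocess exits with a non-zero code.
RUNNER = (
    "import test_solution as t\n"
    "for name in sorted(dir(t)):\n"
    "    if name.startswith('test_'):\n"
    "        getattr(t, name)()\n"
)


def verify(solution_code: str, test_code: str, timeout_s: float = 30.0) -> bool:
    """Return True if every test passes against the candidate solution."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            # Run in a separate process so a crash or hang cannot affect the caller.
            result = subprocess.run(
                [sys.executable, "-c", RUNNER],
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def verify_many(pairs, max_workers: int = 4):
    """Verify (solution, tests) pairs concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: verify(*pair), pairs))


# Hypothetical sample question: one correct and one buggy candidate solution.
GOOD = "def add(a, b):\n    return a + b\n"
BUGGY = "def add(a, b):\n    return a - b\n"
TESTS = (
    "from solution import add\n\n"
    "def test_add():\n"
    "    assert add(2, 3) == 5\n"
)
```

Under these assumptions, a candidate solution is kept only when verify returns True, which is how a question, solution, and test triple could be filtered before entering a dataset. Swapping the runner for a real pytest invocation would be the natural next step where pytest is installed.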
Quick Start & Requirements
Set up the environment with conda or uv. Install dependencies via pip install -r requirements.txt. parallel is also listed among the requirements. Alternatively, use the provided Docker image (zcxu/kodcode-test-environment:python3.10-cuda12.4-v0.1), which requires the NVIDIA Container Toolkit for GPU support.
Highlighted Details
Maintenance & Community
The project is actively maintained, with recent updates including integrated test pipelines and Dockerized execution. Further details on RL training can be found in the code-r1 repository. For questions, contact Zhangchen Xu or open an issue on GitHub.
Licensing & Compatibility
The dataset is licensed under CC BY-NC 4.0. This license permits sharing and adaptation with attribution for non-commercial purposes but prohibits commercial use, which may limit integration into proprietary or commercial software.
Limitations & Caveats
The current license restricts commercial use. The framework supports generating diverse datasets, but the README lists a one-line command to generate KodCode as ongoing work, so the generation process may still require manual configuration.