kodcode by KodCode-AI

Synthetic dataset generation for coding tasks

Created 1 year ago

312 stars

Top 86.7% on SourcePulse

Project Summary

KodCode is a synthetic dataset generation framework for creating diverse, challenging, and verifiable coding questions with solutions. It targets researchers and developers working on large language models for code, offering a large-scale, open-source dataset suitable for supervised fine-tuning (SFT) and reinforcement learning (RL) tuning.

How It Works

KodCode unifies multiple data sources, including zero-shot generation, human-written assessments, code snippets, and technical documentation, to produce high-quality coding questions. A key feature is its self-verification mechanism, which generates verifiable solutions and corresponding tests (supporting pytest and parallel execution) for each question, ensuring solution correctness and enabling robust model evaluation.
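The self-verification idea described above can be sketched roughly as follows. This is a minimal illustration, not KodCode's actual pipeline: the names verify, verify_many, and RUNNER are hypothetical, and a small standard-library runner stands in for pytest so the sketch has no dependencies. Each candidate solution is written to a temporary directory next to its tests, the tests run in a separate process with a timeout, and several candidates are checked in parallel.

```python
import subprocess
import sys
import tempfile
import textwrap
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical stand-in for pytest: appended to each test file, it calls every
# test_* function and exits non-zero if any test fails or none were found.
RUNNER = textwrap.dedent('''
    import sys
    ran = failures = 0
    for name, fn in sorted(globals().copy().items()):
        if name.startswith("test_") and callable(fn):
            ran += 1
            try:
                fn()
            except Exception:
                failures += 1
    sys.exit(1 if failures or ran == 0 else 0)
''')


def verify(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if all test_* functions pass against the given solution."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution_code)
        Path(tmp, "test_solution.py").write_text(test_code + "\n" + RUNNER)
        try:
            result = subprocess.run(
                [sys.executable, "test_solution.py"],
                cwd=tmp,  # tests import the solution as `solution`
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs as failed verification
        return result.returncode == 0


def verify_many(candidates, max_workers: int = 4):
    """Verify (solution, tests) pairs in parallel; each runs in its own process."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: verify(*pair), candidates))


# Toy question: one correct and one incorrect solution for the same tests.
GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"
TESTS = "from solution import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"

if __name__ == "__main__":
    print(verify_many([(GOOD, TESTS), (BAD, TESTS)]))  # prints [True, False]
```

Only question-solution-test triples whose tests pass would be kept in a scheme like this. Running each candidate in a subprocess keeps a crashing or hanging solution from taking down the verifier itself.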

Quick Start & Requirements

Installation: Clone the repository and set up a Python 3.10 environment using either conda or uv. Install dependencies via pip install -r requirements.txt.
Code Execution Environment: For local testing, install parallel. Alternatively, use the provided Docker image (zcxu/kodcode-test-environment:python3.10-cuda12.4-v0.1), which requires the NVIDIA Container Toolkit for GPU support.
Resources: Setup involves environment creation and dependency installation. Generating the dataset may require significant computational resources.
Links: Project Website, Technical Report, GitHub Repo, HF Datasets.

Highlighted Details

Accepted to ACL 2025 and awarded Best Paper at DataWorld @ ICML 2025.
Contains 12 distinct subsets covering various domains and difficulty levels.
Supports conversion between different coding question styles.
Includes KodCode-Lite (10K samples) and KodCode-V1.1 (50K samples with stdin support).

Maintenance & Community

The project is actively maintained, with recent updates including integrated test pipelines and Dockerized execution. Further details on RL training can be found in the code-r1 repository. Questions can be directed to Zhangchen Xu or raised as an issue on GitHub.

Licensing & Compatibility

The dataset is licensed under CC BY-NC 4.0. This license allows sharing and adaptation with attribution for non-commercial purposes but prohibits commercial use, which limits integration into proprietary or commercial software.

Limitations & Caveats

The CC BY-NC 4.0 license rules out commercial use. The README also lists a one-line command for generating KodCode as ongoing work, so the generation process may still require manual configuration.

