kodcode by KodCode-AI

Synthetic dataset generation for coding tasks

Created 9 months ago
296 stars

Project Summary

KodCode is a synthetic dataset generation framework for creating diverse, challenging, and verifiable coding questions with solutions. It targets researchers and developers working on large language models for code, offering a large-scale, open-source dataset suitable for supervised fine-tuning (SFT) and reinforcement learning (RL) tuning.

How It Works

KodCode unifies multiple data sources, including zero-shot generation, human-written assessments, code snippets, and technical documentation, to produce high-quality coding questions. A key feature is its self-verification mechanism, which generates verifiable solutions and corresponding tests (supporting pytest and parallel execution) for each question, ensuring solution correctness and enabling robust model evaluation.
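The self-verification idea can be illustrated with a minimal sketch. It is not KodCode's actual pipeline: the function names passes_tests and verify_batch, the file names solution.py and test_solution.py, and the small runner that calls every test_* function (a stand-in for pytest, so the sketch runs without extra dependencies) are all assumptions made for illustration. A candidate solution is kept only if its tests pass in a separate process, and several candidates are checked in parallel.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical runner: calls every test_* function in test_solution.py.
# This mimics pytest-style discovery; the real project uses pytest itself.
RUNNER = """
import test_solution
for name in dir(test_solution):
    if name.startswith("test_"):
        getattr(test_solution, name)()
"""


def passes_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if every test passes against the candidate solution."""
    with tempfile.TemporaryDirectory() as tmp:
        # Assumed layout: tests import the candidate via `from solution import ...`.
        Path(tmp, "solution.py").write_text(solution_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        Path(tmp, "runner.py").write_text(RUNNER)
        try:
            result = subprocess.run(
                [sys.executable, "runner.py"],
                cwd=tmp,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Treat hanging solutions as failures.
            return False
        # A failed assertion or any other uncaught error gives a non-zero exit code.
        return result.returncode == 0


def verify_batch(candidates: list[str], test_code: str, workers: int = 4) -> list[bool]:
    """Check several candidate solutions in parallel, preserving input order."""
    # Threads are enough here because each one only waits on a subprocess.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda code: passes_tests(code, test_code), candidates))
```

For example, given a test file containing "from solution import add" and a test_add function, verify_batch([good_solution, bad_solution], test_code) returns one boolean per candidate, and only the candidates marked True would be kept as verified.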

Quick Start & Requirements

  • Installation: Clone the repository and set up a Python 3.10 environment using either conda or uv. Install dependencies via pip install -r requirements.txt.
  • Code Execution Environment: For local testing, install the parallel command-line tool. Alternatively, use the provided Docker image (zcxu/kodcode-test-environment:python3.10-cuda12.4-v0.1); GPU access inside the container requires the NVIDIA Container Toolkit.
  • Resources: Setup itself is limited to creating the environment and installing dependencies. Generating the dataset, which involves producing and executing solutions and tests for each question, may require significant computational resources.
  • Links: Project Website, Technical Report, GitHub Repo, HF Datasets.

Highlighted Details

  • Accepted to ACL 2025 and awarded Best Paper at DataWorld @ ICML 2025.
  • Contains 12 distinct subsets covering various domains and difficulty levels.
  • Supports conversion between different coding question styles.
  • Includes KodCode-Lite (10K samples) and KodCode-V1.1 (50K samples with stdin support).

Maintenance & Community

The project is actively maintained; recent updates include integrated test pipelines and Dockerized execution. Further details on RL training can be found in the code-r1 repository. For questions, contact Zhangchen Xu or open an issue on GitHub.

Licensing & Compatibility

The dataset is licensed under CC BY-NC 4.0, which allows sharing and adaptation with attribution for non-commercial purposes only. Commercial use requires separate permission from the authors, which may limit integration into proprietary or commercial software.

Limitations & Caveats

The license rules out commercial use. The README lists a one-line command for generating KodCode as ongoing work, which suggests that running the generation pipeline still requires manual configuration.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 30 days
