kodcode by KodCode-AI

Synthetic dataset generation for coding tasks

created 5 months ago
253 stars

Top 99.3% on SourcePulse

Project Summary

KodCode is a synthetic dataset generation framework for creating diverse, challenging, and verifiable coding questions with solutions. It targets researchers and developers working on large language models for code, offering a large-scale, open-source dataset suitable for supervised fine-tuning (SFT) and reinforcement learning (RL) tuning.

How It Works

KodCode unifies multiple data sources, including zero-shot generation, human-written assessments, code snippets, and technical documentation, to produce high-quality coding questions. A key feature is its self-verification mechanism, which generates verifiable solutions and corresponding tests (supporting pytest and parallel execution) for each question, ensuring solution correctness and enabling robust model evaluation.
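The self-verification idea can be illustrated with a minimal sketch. This is not KodCode's actual pipeline, which the summary says relies on pytest; it only mimics the accept-or-reject decision using the standard library. The function names verify_solution and is_verified, and the sample question, are hypothetical.

```python
from typing import Dict


def verify_solution(solution_src: str, test_src: str) -> Dict[str, bool]:
    """Run every test_* function in test_src against solution_src.

    Returns a mapping of test name -> True if the test passed.
    Assumption: tests are plain functions that use bare assert statements.
    """
    namespace: dict = {}
    # Define the candidate solution first, then the tests, in one shared
    # namespace so the tests can call the solution's functions directly.
    exec(solution_src, namespace)
    exec(test_src, namespace)

    results: Dict[str, bool] = {}
    for name, obj in list(namespace.items()):
        if name.startswith("test_") and callable(obj):
            try:
                obj()
                results[name] = True
            except Exception:
                # A failed assertion or a runtime error both count as failure.
                results[name] = False
    return results


def is_verified(solution_src: str, test_src: str) -> bool:
    """Accept a question/solution pair only if at least one test ran and all passed."""
    results = verify_solution(solution_src, test_src)
    return bool(results) and all(results.values())


# Hypothetical generated solution and tests for a toy question.
SOLUTION = '''
def add(a, b):
    return a + b
'''

TESTS = '''
def test_add_positive():
    assert add(2, 3) == 5

def test_add_negative():
    assert add(-1, -1) == -2
'''

BROKEN_SOLUTION = '''
def add(a, b):
    return a - b
'''
```

Calling is_verified(SOLUTION, TESTS) accepts the pair, while is_verified(BROKEN_SOLUTION, TESTS) rejects it. Running untrusted generated code with exec is unsafe outside a sandbox, which is one reason an isolated environment such as a container is a sensible choice for real runs.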

Quick Start & Requirements

  • Installation: Clone the repository and set up a Python 3.10 environment using either conda or uv. Install dependencies via pip install -r requirements.txt.
  • Code Execution Environment: For local testing, install GNU parallel (the parallel command-line tool). Alternatively, use the provided Docker image (zcxu/kodcode-test-environment:python3.10-cuda12.4-v0.1), which requires the NVIDIA Container Toolkit for GPU support.
  • Resources: Setup involves environment creation and dependency installation. Generating the dataset may require significant computational resources.
  • Links: Project Website, Technical Report, GitHub Repo, HF Datasets.
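To give a feel for what parallel test execution involves, here is a rough standard-library sketch. It is an analogy only: the repository's own scripts and their use of the parallel tool are not shown in this summary, and the names run_snippet and run_all, the worker count, and the timeout are all assumptions made for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_snippet(code: str, timeout_s: float = 10.0) -> bool:
    """Run code in a fresh Python process.

    Returns True only if the process exits with code 0 within the timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hanging snippet counts as a failure.
        return False
    return proc.returncode == 0


def run_all(snippets: List[str], workers: int = 4) -> List[bool]:
    """Run snippets concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_snippet, snippets))
```

For example, run_all(["assert 1 + 1 == 2", "assert 1 + 1 == 3"]) reports a pass for the first snippet and a failure for the second. Each snippet runs in its own process, so a crash or timeout in one does not affect the others.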

Highlighted Details

  • Accepted to ACL 2025 and awarded Best Paper at DataWorld @ ICML 2025.
  • Contains 12 distinct subsets covering various domains and difficulty levels.
  • Supports conversion between different coding question styles.
  • Includes KodCode-Lite (10K samples) and KodCode-V1.1 (50K samples with stdin support).

Maintenance & Community

The project is actively maintained, with recent updates including integrated test pipelines and Dockerized execution. Further details on RL training can be found in the code-r1 repository. To get in touch, contact Zhangchen Xu or raise an issue on GitHub.

Licensing & Compatibility

The dataset is licensed under CC BY-NC 4.0. This license permits sharing and adaptation with attribution for non-commercial purposes but prohibits commercial use, potentially limiting integration into proprietary or commercial software.

Limitations & Caveats

The current license restricts commercial use. While the framework supports generating diverse datasets, the README lists a one-line command to generate KodCode as ongoing work, suggesting the generation process might still require manual configuration.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days
