octopack by bigcode-project

Code LLM instruction tuning research paper

created 2 years ago
472 stars

Top 65.5% on sourcepulse

Project Summary

OctoPack provides a comprehensive framework for instruction tuning code Large Language Models (LLMs), addressing the need for high-quality, instruction-following code generation. It offers curated datasets, model fine-tuning scripts, and an evaluation harness, targeting researchers and developers working on code LLMs.

How It Works

OctoPack leverages large-scale datasets derived from GitHub commits (CommitPack) and filtered for instruction-like quality (CommitPackFT). These datasets are used to fine-tune existing code LLMs like StarCoder and CodeGeeX2, creating models such as OctoCoder and OctoGeeX. The project also introduces HumanEvalPack, an extended evaluation suite for code LLMs across various tasks and languages.
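As a rough illustration of how a commit can serve as instruction data, the sketch below treats the commit message as the instruction, the pre-commit code as the input, and the post-commit code as the target. The record fields (message, old_code, new_code) and the prompt layout are assumptions made for this example, not the project's actual schema or template.

```python
from dataclasses import dataclass


@dataclass
class CommitRecord:
    # Hypothetical field names, chosen for illustration only.
    message: str
    old_code: str
    new_code: str


def commit_to_sample(commit: CommitRecord) -> dict:
    """Build one instruction-tuning sample from a single commit.

    The prompt layout is an assumed format, not the one used by OctoPack.
    """
    prompt = (
        f"{commit.old_code}\n\n"
        f"Instruction: {commit.message.strip()}\n\n"
        "Answer:"
    )
    return {"prompt": prompt, "completion": commit.new_code}


example = CommitRecord(
    message="Fix off-by-one in range\n",
    old_code="for i in range(n + 1): pass",
    new_code="for i in range(n): pass",
)
sample = commit_to_sample(example)
```

Keeping the prompt and completion in separate fields makes it possible to compute the training loss on the completion only, if the fine-tuning setup supports that.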

Quick Start & Requirements

  • Data Creation: Requires BigQuery access for CommitPack, and significant compute resources for scraping GitHub.
  • Evaluation: Uses bigcode-evaluation-harness (install its dependencies via pip install -q -r requirements.txt). Multi-GPU evaluation additionally requires accelerate.
  • Model Fine-tuning: Scripts are provided for StarCoder and CodeGeeX2, requiring specific environments and potentially large datasets.

Highlighted Details

  • CommitPack: 4TB of GitHub commits across 350 programming languages.
  • CommitPackFT: Filtered dataset for high-quality, instruction-like commit messages.
  • HumanEvalPack: Extended evaluation suite covering 3 scenarios across 6 languages.
  • OctoCoder: StarCoder (16B) instruction-tuned on CommitPackFT + OASST.
  • OctoGeeX: CodeGeeX2 (6B) instruction-tuned on CommitPackFT + OASST.
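The idea of filtering commits for instruction-like messages can be sketched with a simple heuristic: keep a commit only if the first line of its message starts with an imperative verb and has a reasonable length. The verb list and the word-count thresholds below are hypothetical values for illustration; the filters actually used to build CommitPackFT may differ.

```python
import re

# Hypothetical allow-list of imperative verbs; the real filter may differ.
IMPERATIVE_VERBS = {"add", "fix", "remove", "update", "rename", "refactor", "use"}


def looks_like_instruction(message: str, min_words: int = 3, max_words: int = 30) -> bool:
    """Return True if the first line of a commit message reads like an instruction."""
    stripped = message.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    words = re.findall(r"[A-Za-z']+", first_line)
    # Reject messages that are too short to be informative or too long to be a command.
    if not (min_words <= len(words) <= max_words):
        return False
    return words[0].lower() in IMPERATIVE_VERBS


messages = [
    "Fix typo in README header",
    "wip",
    "Merge branch main into dev",
]
kept = [m for m in messages if looks_like_instruction(m)]
```

A verb allow-list is deliberately strict: it discards many usable commits, which is acceptable when the raw pool is as large as CommitPack.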

Maintenance & Community

The project is part of the BigCode initiative, with contributions from multiple researchers. Links to relevant resources like videos and datasets are provided.

Licensing & Compatibility

  • Code, CommitPack, CommitPackFT, and HumanEvalPack are MIT licensed.
  • OctoCoder inherits StarCoder's license (BigCode OpenRAIL-M: commercial use permitted, with restrictions on harmful use cases).
  • OctoGeeX inherits CodeGeeX2's license (commercial use permitted after submitting a registration form).
  • Individual data samples retain their original repository licenses, filtered for permissive use.

Limitations & Caveats

Reproducing the CommitPack dataset requires significant BigQuery resources. Fine-tuning and evaluation scripts may require substantial computational power and specific environment configurations. The exact evaluation results can vary based on Python version and batch size.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 90 days