octopack by bigcode-project

Code LLM instruction tuning research paper

Created 2 years ago
471 stars

Top 64.8% on SourcePulse

Project Summary

OctoPack provides curated datasets, model fine-tuning scripts, and an evaluation harness for instruction tuning code Large Language Models (LLMs). It addresses the need for high-quality, instruction-following code generation and targets researchers and developers working on code LLMs.

How It Works

OctoPack builds on two datasets: CommitPack, a large-scale collection of GitHub commits, and CommitPackFT, a subset filtered down to commits whose messages read like instructions. These datasets are used to fine-tune existing code LLMs such as StarCoder and CodeGeeX2, producing OctoCoder and OctoGeeX. The project also introduces HumanEvalPack, an evaluation suite that extends HumanEval to several tasks and languages.
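
To make the data flow concrete, here is a minimal sketch of how a single commit could be turned into an instruction-tuning sample: the commit message acts as the instruction, the code before the commit as context, and the code after it as the target. The `CommitSample` schema, its field names, and the exact `Question:`/`Answer:` layout are assumptions for illustration; the dataset cards and the repository's fine-tuning scripts define the real format.

```python
from dataclasses import dataclass


@dataclass
class CommitSample:
    """One commit: message plus file contents before and after (hypothetical schema)."""
    subject: str
    old_contents: str
    new_contents: str


def to_instruction_pair(sample: CommitSample) -> dict:
    """Format a commit as a prompt/completion pair.

    The layout below is an assumed template, not necessarily the one
    used to train OctoCoder or OctoGeeX.
    """
    prompt = f"Question: {sample.subject}\n"
    # New-file commits have no "before" code, so the context is skipped.
    if sample.old_contents.strip():
        prompt += f"{sample.old_contents}\n"
    prompt += "\nAnswer:"
    return {"prompt": prompt, "completion": " " + sample.new_contents}


example = CommitSample(
    subject="Fix off-by-one error in range",
    old_contents="for i in range(1, n):\n    print(i)",
    new_contents="for i in range(1, n + 1):\n    print(i)",
)
pair = to_instruction_pair(example)
print(pair["prompt"])
print(pair["completion"])
```

During fine-tuning, the loss would typically be computed on the completion only, so the model learns to produce the edited code rather than to repeat the instruction.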

Quick Start & Requirements

  • Data Creation: Requires BigQuery access for CommitPack, and significant compute resources for scraping GitHub.
  • Evaluation: Uses bigcode-evaluation-harness; install its dependencies with pip install -q -r requirements.txt. Distributed training and evaluation additionally require accelerate.
  • Model Fine-tuning: Scripts are provided for StarCoder and CodeGeeX2, requiring specific environments and potentially large datasets.
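
Execution-based evaluation of code LLMs is usually reported as pass@k. As a rough illustration of what such a score means, the sketch below implements the standard unbiased pass@k estimator from per-problem sample counts. The function name and the numbers in the usage lines are illustrative assumptions, not values taken from this project.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 20 generations for one problem, 5 of them correct.
print(pass_at_k(n=20, c=5, k=1))
print(pass_at_k(n=20, c=5, k=10))
```

The benchmark score is the mean of this value over all problems. Because it depends on how many samples are drawn per problem, scores are only comparable when the sampling settings match.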

Highlighted Details

  • CommitPack: 4TB of GitHub commits across 350 programming languages.
  • CommitPackFT: Filtered dataset for high-quality, instruction-like commit messages.
  • HumanEvalPack: Extended evaluation suite covering 3 scenarios (HumanEvalFix, HumanEvalExplain, HumanEvalSynthesize) across 6 languages (Python, JavaScript, Java, Go, C++, Rust).
  • OctoCoder: StarCoder (16B) instruction-tuned on CommitPackFT + OASST.
  • OctoGeeX: CodeGeeX2 (6B) instruction-tuned on CommitPackFT + OASST.
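
The idea behind filtering commits for instruction-like messages can be sketched with a simple heuristic: keep messages that start with an imperative verb, have a reasonable length, and do not depend on external context. The verb list, length bounds, and regular expression below are hypothetical choices for illustration and are not the filters actually used to build CommitPackFT.

```python
import re

# Hypothetical allow-list of imperative verbs; not the project's actual list.
IMPERATIVE_VERBS = {"add", "fix", "update", "remove", "refactor", "rename", "use"}
# Messages that point at issue numbers or URLs need outside context to be understood.
EXTERNAL_REF = re.compile(r"#\d+|https?://")


def looks_like_instruction(message: str, min_words: int = 3, max_words: int = 30) -> bool:
    """Return True if a commit message could plausibly serve as an instruction."""
    words = message.strip().split()
    if not min_words <= len(words) <= max_words:
        return False
    if words[0].lower() not in IMPERATIVE_VERBS:
        return False
    return EXTERNAL_REF.search(message) is None


messages = [
    "Fix typo in README header",
    "Fixes #123",
    "wip",
    "Update docs, see https://example.com for details",
]
for msg in messages:
    print(f"{looks_like_instruction(msg)!s:5}  {msg}")
```

A heuristic like this trades recall for precision: many valid commits are discarded, but the ones that remain are more likely to describe the change in a way a model can follow.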

Maintenance & Community

The project is part of the BigCode initiative, with contributions from multiple researchers. Links to relevant resources like videos and datasets are provided.

Licensing & Compatibility

  • Code, CommitPack, CommitPackFT, and HumanEvalPack are MIT licensed.
  • OctoCoder inherits StarCoder's license (Commercial, with restrictions on harmful use cases).
  • OctoGeeX inherits CodeGeeX2's license (Commercial, requires submission).
  • Individual data samples retain their original repository licenses, filtered for permissive use.

Limitations & Caveats

Reproducing the CommitPack dataset requires significant BigQuery resources. Fine-tuning and evaluation scripts may require substantial computational power and specific environment configurations. The exact evaluation results can vary based on Python version and batch size.
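
Since results can shift with the Python version and batch size, it helps to store those settings next to every score. The sketch below shows one way to do that; the `run_metadata` helper and the keys it writes are assumptions for illustration and are not part of the project's scripts.

```python
import json
import platform
from typing import Optional


def run_metadata(batch_size: int, extra: Optional[dict] = None) -> dict:
    """Collect the settings that are known to influence evaluation results."""
    meta = {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "batch_size": batch_size,
    }
    # Caller-supplied fields (model name, task, seed, ...) are merged in last.
    meta.update(extra or {})
    return meta


# Hypothetical usage: save this dictionary alongside the metrics file.
print(json.dumps(run_metadata(batch_size=5, extra={"model": "example-model"}), indent=2))
```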

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
