octopack by bigcode-project

Code LLM instruction tuning research paper

Created 2 years ago

478 stars

Top 64.0% on SourcePulse

Project Summary

OctoPack is a toolkit for instruction-tuning code Large Language Models (LLMs) so that they follow natural-language instructions when generating code. It provides curated datasets, model fine-tuning scripts, and an evaluation harness, and is aimed at researchers and developers working on code LLMs.

How It Works

OctoPack builds on two datasets: CommitPack, a large-scale collection of GitHub commits, and CommitPackFT, a subset filtered to keep commits whose messages read like instructions. These datasets are used to fine-tune existing code LLMs such as StarCoder and CodeGeeX2, producing OctoCoder and OctoGeeX respectively. The project also introduces HumanEvalPack, an extended evaluation suite for code LLMs across various tasks and languages.
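
The idea of filtering commits for instruction-like messages can be illustrated with a small sketch. The verb list, the word-count bounds, and the function name looks_like_instruction are all hypothetical choices made for this example; they are not the project's actual filters, which are described in the paper and repository.

```python
import re

# Hypothetical allow-list of imperative verbs (an assumption for this sketch).
IMPERATIVE_VERBS = {
    "add", "fix", "remove", "update", "change",
    "rename", "refactor", "use", "make", "implement",
}


def looks_like_instruction(message: str, min_words: int = 3, max_words: int = 40) -> bool:
    """Return True if the first line of a commit message reads like an instruction."""
    lines = message.strip().splitlines()
    if not lines:
        return False
    subject = lines[0].strip()
    words = subject.split()
    # Very short or very long subjects rarely describe a single, clear change.
    if not (min_words <= len(words) <= max_words):
        return False
    # Skip messages that point outside the commit (issue numbers, URLs),
    # since they cannot be followed from the code alone.
    if re.search(r"#\d+|https?://", subject):
        return False
    return words[0].lower() in IMPERATIVE_VERBS


messages = [
    "Add type hints to the parser",
    "Fix crash reported in #123",
    "wip",
]
kept = [m for m in messages if looks_like_instruction(m)]
print(kept)
```

Only the first message survives in this toy run: the second refers to an external issue and the third is too short to act as an instruction.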

Quick Start & Requirements

Data Creation: Requires BigQuery access for CommitPack, and significant compute resources for scraping GitHub.
Evaluation: Uses bigcode-evaluation-harness; install its dependencies with pip install -q -r requirements.txt. The accelerate library is required for distributed training/evaluation.
Model Fine-tuning: Scripts are provided for StarCoder and CodeGeeX2, requiring specific environments and potentially large datasets.
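
To make the fine-tuning input more concrete, here is a minimal sketch of how a single commit could be turned into an instruction-tuning example. The record field names (subject, old_contents, new_contents), the Question/Answer template, and the helper build_example are assumptions made for illustration; the provided fine-tuning scripts may lay out their inputs differently.

```python
def build_example(record: dict) -> dict:
    """Turn one commit record into a prompt/completion pair.

    Assumed (hypothetical) record layout:
      subject       - the commit message subject, used as the instruction
      old_contents  - the file contents before the commit
      new_contents  - the file contents after the commit
    """
    instruction = record["subject"].strip()
    old_code = record["old_contents"].rstrip()
    new_code = record["new_contents"].rstrip()
    # The code before the change is given as context, and the model is
    # trained to produce the code after the change.
    prompt = f"Question: {instruction}\n{old_code}\n\nAnswer:"
    completion = "\n" + new_code
    return {"prompt": prompt, "completion": completion}


record = {
    "subject": " Rename helper ",
    "old_contents": "def f():\n    pass\n",
    "new_contents": "def g():\n    pass\n",
}
example = build_example(record)
print(example["prompt"] + example["completion"])
```

Keeping the prompt and completion separate makes it straightforward to compute the training loss on the completion only, if the training setup supports that.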

Highlighted Details

CommitPack: 4TB of GitHub commits across 350 programming languages.
CommitPackFT: Filtered dataset for high-quality, instruction-like commit messages.
HumanEvalPack: Extended evaluation suite covering 3 scenarios across 6 languages.
OctoCoder: StarCoder (16B) instruction-tuned on CommitPackFT + OASST.
OctoGeeX: CodeGeeX2 (6B) instruction-tuned on CommitPackFT + OASST.
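
The HumanEvalPack coverage can be pictured as a grid of scenario and language pairs. The sketch below enumerates that grid using the three scenarios (fixing, explaining, and synthesizing code) and the six languages; the label strings are illustrative only and should not be taken as the task identifiers used by the evaluation harness.

```python
from itertools import product

# Scenario and language labels are illustrative, not harness task names.
SCENARIOS = ("fix", "explain", "synthesize")
LANGUAGES = ("python", "javascript", "java", "go", "cpp", "rust")


def evaluation_grid() -> list[str]:
    """Return one label per (scenario, language) combination."""
    return [f"{scenario}-{language}" for scenario, language in product(SCENARIOS, LANGUAGES)]


grid = evaluation_grid()
print(len(grid))  # 3 scenarios x 6 languages = 18 combinations
```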

Maintenance & Community

The project is part of the BigCode initiative, with contributions from multiple researchers. Links to relevant resources like videos and datasets are provided.

Licensing & Compatibility

Code, CommitPack, CommitPackFT, and HumanEvalPack are MIT licensed.
OctoCoder inherits StarCoder's license (BigCode OpenRAIL-M: commercial use permitted, with restrictions on harmful use cases).
OctoGeeX inherits CodeGeeX2's license (commercial use requires submitting a form).
Individual data samples retain the licenses of their source repositories; the datasets were filtered to include only permissively licensed code.

Limitations & Caveats

Reproducing the CommitPack dataset requires significant BigQuery resources. Fine-tuning and evaluation scripts may require substantial computational power and specific environment configurations. The exact evaluation results can vary based on Python version and batch size.
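
Because results can shift with the Python version and batch size, it may help to store those settings next to every score. The following is a minimal sketch using only the standard library; the function names run_metadata and results_comparable and the dictionary layout are hypothetical and not part of the project.

```python
import json
import platform


def run_metadata(batch_size: int, model_name: str) -> dict:
    """Collect the settings that can influence evaluation results."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "batch_size": batch_size,
        "model": model_name,
    }


def results_comparable(a: dict, b: dict) -> bool:
    """Two runs are treated as comparable only if the Python version and batch size match."""
    return (
        a["python_version"] == b["python_version"]
        and a["batch_size"] == b["batch_size"]
    )


meta = run_metadata(batch_size=5, model_name="octocoder")
print(json.dumps(meta, indent=2))
```

Saving this dictionary alongside each results file makes it easier to tell whether a difference between two runs reflects the model or only the environment.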

Health Check

Last Commit

11 months ago

Responsiveness

Inactive

Star History

1 star in the last 30 days