OctoPack: research code and data for instruction tuning code LLMs
OctoPack provides a comprehensive framework for instruction tuning code Large Language Models (LLMs), addressing the need for high-quality, instruction-following code generation. It offers curated datasets, model fine-tuning scripts, and an evaluation harness, targeting researchers and developers working on code LLMs.
How It Works
OctoPack leverages large-scale datasets derived from GitHub commits (CommitPack) and filtered for instruction-like quality (CommitPackFT). These datasets are used to fine-tune existing code LLMs like StarCoder and CodeGeeX2, creating models such as OctoCoder and OctoGeeX. The project also introduces HumanEvalPack, an extended evaluation suite for code LLMs across various tasks and languages.
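For orientation, here is a minimal sketch of loading these datasets with the Hugging Face datasets library. The Hub IDs, config names, and field names below (bigcode/commitpackft, bigcode/humanevalpack, the python subset) are assumptions based on BigCode's usual naming conventions, not taken from this repository's docs.

```python
from datasets import load_dataset

# CommitPackFT: instruction-style commits, organized per language.
# (Hub ID, config, and field names are assumed; verify on the Hub.)
commitpackft = load_dataset("bigcode/commitpackft", "python", split="train")
sample = commitpackft[0]
print(sample["message"])       # commit message, serving as the instruction
print(sample["new_contents"])  # post-commit file contents, serving as the target

# HumanEvalPack: HumanEval extended to synthesis, repair, and
# explanation tasks across several languages.
humanevalpack = load_dataset("bigcode/humanevalpack", "python", split="test")
print(humanevalpack[0]["prompt"])
```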
Quick Start & Requirements
Evaluation uses the bigcode-evaluation-harness (install dependencies via pip install -q -r requirements.txt). The accelerate library is required for distributed training and evaluation.
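As a quick smoke test, the sketch below runs inference with transformers; it assumes the fine-tuned OctoCoder checkpoint is published on the Hugging Face Hub as bigcode/octocoder and uses a Question/Answer prompt format (both assumptions, check the model card).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub ID assumed; device_map="auto" is what pulls in the accelerate dependency.
checkpoint = "bigcode/octocoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Question/Answer prompt format assumed for the instruction-tuned model.
prompt = "Question: Write a Python function that reverses a string.\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```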
Highlighted Details
Maintenance & Community
The project is part of the BigCode initiative, with contributions from multiple researchers. Links to relevant resources like videos and datasets are provided.
Licensing & Compatibility
Limitations & Caveats
Reproducing the CommitPack dataset requires significant BigQuery resources. The fine-tuning and evaluation scripts may require substantial compute and specific environment configurations, and exact evaluation results can vary with Python version and batch size.