starcoder2 by bigcode-project

Code generation model family (3B, 7B, 15B) for code completion

created 1 year ago
1,947 stars

Top 23.0% on sourcepulse

View on GitHub
Project Summary

StarCoder2 is a family of large language models designed for code generation, supporting over 600 programming languages. It targets developers and researchers seeking advanced code completion and generation capabilities. The models offer improved code understanding and generation accuracy over the original StarCoder, owing to larger training data and architectural enhancements.

How It Works

StarCoder2 models use Grouped Query Attention with a 16,384-token context window and 4,096-token sliding-window attention. This architecture lets the models process longer code sequences and capture longer-range dependencies, producing more coherent and contextually relevant completions. The models are trained on over 3 trillion tokens of code and natural-language data, giving them broad coverage of programming languages and software development patterns.
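As a concrete illustration, the snippet below is a minimal sketch of loading a StarCoder2 checkpoint with transformers and generating a completion. The bigcode/starcoder2-3b model ID and the prompt are illustrative assumptions rather than details taken from this summary.

```python
# Minimal sketch: code completion with a StarCoder2 checkpoint via transformers.
# Assumes the bigcode/starcoder2-3b checkpoint ID; 7B/15B follow the same pattern.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# StarCoder2 is a base (completion) model: give it a code prefix, not an instruction.
prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```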

Quick Start & Requirements

  • Installation: pip install -r requirements.txt and pip install git+https://github.com/huggingface/transformers.git. Requires a Hugging Face Hub token (export HF_TOKEN=xxx).
  • Prerequisites: Python, transformers library (from source), PyTorch (with CUDA 12.1 support recommended for fine-tuning).
  • Resources: StarCoder2-15B in full precision requires ~32GB of VRAM. Quantized versions (8-bit, 4-bit) reduce the memory footprint to roughly 17GB and 9GB respectively (see the quantized-loading sketch after this list).
  • Links: Models & Datasets, Paper, Text-generation-inference.
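To put the memory figures above in context, here is a hedged sketch of loading StarCoder2-15B in 4-bit with bitsandbytes through transformers; the checkpoint ID and quantization settings are assumptions, not commands documented by the repository.

```python
# Sketch: 4-bit loading of StarCoder2-15B with bitsandbytes to fit the ~9GB
# footprint mentioned above. Requires bitsandbytes and accelerate to be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

checkpoint = "bigcode/starcoder2-15b"  # assumed checkpoint ID
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on available devices
)
```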

Highlighted Details

  • Available in 3B, 7B, and 15B parameter sizes.
  • Trained on The Stack v2 dataset, covering 600+ programming languages.
  • Supports a 16K context window with sliding-window attention.
  • Fine-tuning examples are provided using PEFT, bitsandbytes, and TRL for efficient adaptation (see the LoRA sketch after this list).
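To illustrate that fine-tuning path, the following is a minimal QLoRA-style sketch using PEFT and bitsandbytes. The checkpoint ID, LoRA hyperparameters, and target module names are assumptions and not the repository's official recipe (which also uses TRL for the training loop).

```python
# Sketch: prepare a StarCoder2 checkpoint for LoRA fine-tuning with PEFT and
# bitsandbytes (QLoRA-style). Hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

checkpoint = "bigcode/starcoder2-3b"  # assumed checkpoint ID
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    # Assumed attention projection names; adjust to the model's actual modules.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```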

Maintenance & Community

The project is part of the BigCode initiative, a collaboration involving Hugging Face and ServiceNow. Further resources and community discussions can be found via Hugging Face and related GitHub repositories.

Licensing & Compatibility

The models are released under the BigCode OpenRAIL-M license. This license permits commercial use but includes specific use-case restrictions to prevent misuse.

Limitations & Caveats

StarCoder2 models are base models intended for code completion and may not perform well on instruction-following tasks without further fine-tuning. The README notes that some transformers pull requests may still need to be merged for full compatibility.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 58 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jiayi Pan (author of SWE-Gym; AI researcher at UC Berkeley).

DeepSeek-Coder-V2 by deepseek-ai

0.4%, 6k stars
Open-source code language model comparable to GPT-4 Turbo
created 1 year ago, updated 10 months ago