starcoder by bigcode-project

Code LM for code generation and instruction fine-tuning

created 2 years ago
7,440 stars

Top 7.1% on sourcepulse

Project Summary

StarCoder is a large language model trained on a diverse dataset of source code and natural language, designed for code generation and completion tasks. It targets developers and researchers seeking to leverage advanced AI for software development assistance.

How It Works

StarCoder uses a transformer-based architecture trained on more than 80 programming languages plus natural-language text from sources such as GitHub issues and Jupyter notebooks. This broad training enables it to generate code, complete functions, and fill in the middle of code sequences. The project provides tools for both inference and fine-tuning on custom datasets.
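For a concrete picture, here is a minimal generation sketch using the Hugging Face transformers pipeline. The checkpoint name bigcode/starcoder is the published model ID on the Hub; the prompt and decoding settings are illustrative only.

  from transformers import pipeline

  # bigcode/starcoder is gated: run `huggingface-cli login` and accept
  # the model agreement on Hugging Face before downloading.
  generator = pipeline(
      "text-generation",
      model="bigcode/starcoder",
      device_map="auto",   # place weights on the available GPU(s)
      torch_dtype="auto",  # use the checkpoint's native dtype
  )

  prompt = "def fibonacci(n):"
  out = generator(prompt, max_new_tokens=64, do_sample=False)
  print(out[0]["generated_text"])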

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: Python, Hugging Face Hub login (huggingface-cli login), CUDA (for GPU usage), bitsandbytes, wandb.
  • Inference: Requires ~30GB of VRAM in FP16/BF16, or under 20GB with 8-bit quantization (a loading sketch follows this list).
  • Resources: Official documentation and a Hugging Face playground are available.
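A rough sketch of the 8-bit loading path from the inference bullet above, assuming a recent transformers release with bitsandbytes installed; BitsAndBytesConfig is the standard transformers quantization hook, not an API specific to this repo.

  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  checkpoint = "bigcode/starcoder"
  tok = AutoTokenizer.from_pretrained(checkpoint)

  # 8-bit weights keep the model under ~20GB of VRAM, vs ~30GB in FP16/BF16.
  model = AutoModelForCausalLM.from_pretrained(
      checkpoint,
      quantization_config=BitsAndBytesConfig(load_in_8bit=True),
      device_map="auto",
  )

  inputs = tok("def hello_world():", return_tensors="pt").to(model.device)
  print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))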

Highlighted Details

  • Supports code generation and completion via Hugging Face transformers pipeline.
  • Includes scripts for fine-tuning on custom datasets (e.g., Stack Exchange) using PEFT and bitsandbytes; a LoRA sketch follows this list.
  • Offers a Docker image for streamlined inference deployment.
  • Provides a C++ implementation (starcoder.cpp) using ggml for broader hardware compatibility.
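The fine-tuning bullet pairs PEFT's LoRA adapters with 8-bit base weights. A minimal configuration sketch follows; the hyperparameters are illustrative, and the target_modules names are an assumption based on StarCoder's GPT-BigCode layer naming rather than values taken from the repo's scripts.

  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  # Load the frozen base model in 8-bit so fine-tuning fits on a single GPU.
  model = AutoModelForCausalLM.from_pretrained(
      "bigcode/starcoder",
      quantization_config=BitsAndBytesConfig(load_in_8bit=True),
      device_map="auto",
  )
  model = prepare_model_for_kbit_training(model)

  # Attach small trainable LoRA adapters; only these weights are updated.
  lora = LoraConfig(
      r=16, lora_alpha=32, lora_dropout=0.05,
      target_modules=["c_attn", "c_proj"],  # assumed GPT-BigCode module names
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()
  # ...then train as usual, e.g. with transformers.Trainer on your dataset.

Because the base model stays frozen, only the small adapter weights are saved, which is what makes fine-tuning feasible within the memory budget described above.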

Maintenance & Community

The project is part of the BigCode initiative, a collaboration focused on responsible AI development for code. Further community engagement details are not explicitly listed in the README.

Licensing & Compatibility

The model requires accepting an agreement on Hugging Face before use; the weights are distributed under the BigCode OpenRAIL-M license. Licensing for the repository code itself is not detailed in the README.

Limitations & Caveats

Inference requires significant GPU memory, though 8-bit quantization reduces the footprint to under 20GB. Fine-tuning setup involves multiple dependencies (PEFT, bitsandbytes, wandb) and configuration steps. Performance on niche or underrepresented programming languages is likely weaker than on well-represented ones.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 53 stars in the last 90 days

Explore Similar Projects

Starred by Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), Eugene Yan (AI Scientist at AWS), and 2 more.

starcoder.cpp by bigcode-project

Top 0.2% · 456 stars · created 2 years ago · updated 1 year ago
C++ example for StarCoder inference
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley).

DeepSeek-Coder-V2 by deepseek-ai

Top 0.4% · 6k stars · created 1 year ago · updated 10 months ago
Open-source code language model comparable to GPT4-Turbo