granite-code-models by ibm-granite

Code LLM for code generation, explanation, fixing, and translation

created 1 year ago · 1,220 stars · Top 32.9% on sourcepulse

View on GitHub
Project Summary

The Granite Code Models are a family of open foundation models designed for a wide range of code intelligence tasks, including generation, explanation, and bug fixing. Targeting developers and researchers, these models offer state-of-the-art performance on diverse coding benchmarks while adhering to IBM's AI Ethics principles for enterprise-grade trustworthiness.

How It Works

Granite models are decoder-only LLMs trained on 3-4 trillion tokens of code spanning 116 programming languages, supplemented with natural-language and mathematical-reasoning data. Training proceeds in two phases: an initial code-only pretraining phase, followed by a second phase on a mix of code and general language data. Instruction-tuned variants are further fine-tuned on curated datasets that include code commits, math problems, and code-specific instruction sets. The models use Byte Pair Encoding (BPE) with the StarCoder tokenizer.
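As a small, self-contained illustration of that tokenizer (the model ID is the one used in Quick Start below; the snippet itself is illustrative and not taken from the project docs), the BPE tokenizer can be loaded and inspected directly:

  # Illustrative only: inspect the StarCoder-based BPE tokenizer shipped with a Granite Code checkpoint.
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base-2k")
  tokens = tokenizer.tokenize("def add(a, b):\n    return a + b")
  print(tokens)          # subword pieces produced by the BPE vocabulary
  print(len(tokenizer))  # total vocabulary size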

Quick Start & Requirements

  • Inference: Use the transformers library, e.g. from transformers import AutoModelForCausalLM, AutoTokenizer; model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3b-code-base-2k", device_map="cuda"); tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base-2k") (a fuller runnable sketch follows this list).
  • Prerequisites: Python, transformers, PyTorch. GPU recommended for inference.
  • Finetuning: Requires dolomite-engine (clone repo, modify configs, run scripts/finetune.sh).
  • Model Access: Models available on HuggingFace (e.g., ibm-granite/granite-3b-code-base-2k).
  • Docs: HuggingFace Collection
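
For reference, here is a minimal end-to-end inference sketch assembled from the bullets above. The model ID comes from the README; the prompt, device handling, and generation settings are illustrative assumptions rather than project defaults.

  # Minimal inference sketch for a Granite Code base model (illustrative assumptions noted above).
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "ibm-granite/granite-3b-code-base-2k"
  device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU recommended; CPU works for small tests

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
  model.eval()

  prompt = "def fibonacci(n):"
  inputs = tokenizer(prompt, return_tensors="pt").to(device)

  with torch.no_grad():
      outputs = model.generate(**inputs, max_new_tokens=64)

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running the same sketch against an instruction-tuned checkpoint would typically require formatting the prompt with the model's chat template rather than passing raw code.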

Highlighted Details

  • Available in 3B, 8B, 20B, and 34B parameter sizes.
  • Trained on license-permissible data, adhering to IBM's AI Ethics principles.
  • Outperforms other open-source models like Mistral-7B and Llama-3-8B on coding tasks.
  • Data preprocessing includes aggressive deduplication, PII redaction, and malware scanning.

Maintenance & Community

  • Models are hosted on HuggingFace.
  • Feedback and discussions are welcomed via HuggingFace community tabs or the project's GitHub discussions page.

Licensing & Compatibility

  • Distributed under the Apache 2.0 license, permitting research and commercial use.

Limitations & Caveats

  • The README does not specify hardware requirements for training or fine-tuning beyond recommending a GPU for inference. Fine-tuning instructions point to an external repository (dolomite-engine), which may bring its own dependencies and setup complexity.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 18 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

Explore Similar Projects

recurrent-pretraining by seal-rg

0.1% · 806 stars
Pretraining code for depth-recurrent language model research
created 5 months ago, updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley).

DeepSeek-Coder-V2 by deepseek-ai

0.4% · 6k stars
Open-source code language model comparable to GPT4-Turbo
created 1 year ago, updated 10 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Travis Fischer (Founder of Agentic), and 6 more.

codellama by meta-llama

0.1% · 16k stars
Inference code for CodeLlama models
created 1 year ago, updated 11 months ago