granite-code-models by ibm-granite

Code LLM for code generation, explanation, fixing, and translation

created 1 year ago · 1,220 stars · Top 32.9% on sourcepulse

View on GitHub
Project Summary

The Granite Code Models are a family of open foundation models designed for a wide range of code intelligence tasks, including generation, explanation, and bug fixing. Targeting developers and researchers, these models offer state-of-the-art performance on diverse coding benchmarks while adhering to IBM's AI Ethics principles for enterprise-grade trustworthiness.

How It Works

Granite models are decoder-only LLMs trained on 3-4 trillion tokens of code spanning 116 programming languages, supplemented with natural-language and mathematical-reasoning data. Training proceeds in two phases: an initial code-only pretraining phase, followed by a second phase on a mix of code and general language data. Instruction-tuned variants are further fine-tuned on curated datasets that include code commits, math problems, and code-specific instruction sets. The models use Byte Pair Encoding (BPE) with the StarCoder tokenizer.
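As a small, self-contained illustration of that tokenizer (the model ID is the one used in Quick Start below; the snippet itself is illustrative and not taken from the project docs), the BPE tokenizer can be loaded and inspected directly:

  # Illustrative only: inspect the StarCoder-based BPE tokenizer shipped with a Granite Code checkpoint.
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base-2k")
  tokens = tokenizer.tokenize("def add(a, b):\n    return a + b")
  print(tokens)          # subword pieces produced by the BPE vocabulary
  print(len(tokenizer))  # total vocabulary size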

Quick Start & Requirements

  • Inference: Use the transformers library, e.g. from transformers import AutoModelForCausalLM, AutoTokenizer; model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3b-code-base-2k", device_map="cuda"); tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base-2k") (a fuller runnable sketch follows this list).
  • Prerequisites: Python, transformers, PyTorch. GPU recommended for inference.
  • Finetuning: Requires dolomite-engine (clone repo, modify configs, run scripts/finetune.sh).
  • Model Access: Models available on HuggingFace (e.g., ibm-granite/granite-3b-code-base-2k).
  • Docs: HuggingFace Collection
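
For reference, here is a minimal end-to-end inference sketch assembled from the bullets above. The model ID comes from the README; the prompt, device handling, and generation settings are illustrative assumptions rather than project defaults.

  # Minimal inference sketch for a Granite Code base model (illustrative assumptions noted above).
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "ibm-granite/granite-3b-code-base-2k"
  device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU recommended; CPU works for small tests

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
  model.eval()

  prompt = "def fibonacci(n):"
  inputs = tokenizer(prompt, return_tensors="pt").to(device)

  with torch.no_grad():
      outputs = model.generate(**inputs, max_new_tokens=64)

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running the same sketch against an instruction-tuned checkpoint would typically require formatting the prompt with the model's chat template rather than passing raw code.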

Highlighted Details

  • Available in 3B, 8B, 20B, and 34B parameter sizes.
  • Trained on license-permissible data, adhering to IBM's AI Ethics principles.
  • Outperforms other open-source models like Mistral-7B and Llama-3-8B on coding tasks.
  • Data preprocessing includes aggressive deduplication, PII redaction, and malware scanning.

Maintenance & Community

  • Models are hosted on HuggingFace.
  • Feedback and discussions are welcomed via HuggingFace community tabs or the project's GitHub discussions page.

Licensing & Compatibility

  • Distributed under the Apache 2.0 license, permitting research and commercial use.

Limitations & Caveats

  • The README does not specify hardware requirements for training or fine-tuning beyond recommending a GPU for inference. Fine-tuning instructions point to an external repository (dolomite-engine), which may bring its own dependencies and setup complexity.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 18 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

Explore Similar Projects

recurrent-pretraining by seal-rg

0.1% · 806 stars
Pretraining code for depth-recurrent language model research
created 5 months ago, updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley).

DeepSeek-Coder-V2 by deepseek-ai

0.4% · 6k stars
Open-source code language model comparable to GPT4-Turbo
created 1 year ago, updated 10 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Travis Fischer (Founder of Agentic), and 6 more.

codellama by meta-llama

0.1% · 16k stars
Inference code for CodeLlama models
created 1 year ago, updated 11 months ago