granite-code-models by ibm-granite

Code LLM for code generation, explanation, fixing, and translation

Created 1 year ago
1,226 stars

Top 32.1% on SourcePulse

View on GitHub
Project Summary

The Granite Code Models are a family of open foundation models designed for a wide range of code intelligence tasks, including generation, explanation, and bug fixing. Targeting developers and researchers, these models offer state-of-the-art performance on diverse coding benchmarks while adhering to IBM's AI Ethics principles for enterprise-grade trustworthiness.

How It Works

Granite models are decoder-only LLMs trained on 3-4 trillion tokens of code data across 116 programming languages, supplemented with natural language and mathematical reasoning datasets. Training occurs in two phases: initial code-only pretraining, followed by a mix of code and general language data. Instruction-tuned variants are further fine-tuned on curated datasets including code commits, math problems, and various code-specific instruction sets. The models utilize Byte Pair Encoding (BPE) with the StarCoder tokenizer.
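
As a quick illustration of the tokenizer, the sketch below loads it via transformers and splits a small snippet into subword tokens. It assumes the ibm-granite/granite-3b-code-base-2k checkpoint on HuggingFace; any Granite checkpoint should behave the same way:

    from transformers import AutoTokenizer

    # Granite's tokenizer is a BPE tokenizer based on StarCoder's.
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base-2k")

    # Inspect how a code snippet is split into subword tokens.
    tokens = tokenizer.tokenize("def add(a, b):\n    return a + b")
    print(tokens)
    print(tokenizer.convert_tokens_to_ids(tokens))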

Quick Start & Requirements

  • Inference: use the transformers library with AutoModelForCausalLM and AutoTokenizer to load a checkpoint such as ibm-granite/granite-3b-code-base-2k; a minimal generation sketch follows this list.
  • Prerequisites: Python, transformers, PyTorch. GPU recommended for inference.
  • Finetuning: Requires dolomite-engine (clone repo, modify configs, run scripts/finetune.sh).
  • Model Access: Models available on HuggingFace (e.g., ibm-granite/granite-3b-code-base-2k).
  • Docs: HuggingFace Collection
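
A minimal inference sketch, assuming torch, transformers, and accelerate are installed; the prompt and generation settings are illustrative, not prescribed by the project:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "ibm-granite/granite-3b-code-base-2k"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the tokenizer and model weights from HuggingFace.
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
    model.eval()

    # Base models do plain completion, so prompt with a function stub.
    inputs = tokenizer("def generate():", return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))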

Highlighted Details

  • Available in 3B, 8B, 20B, and 34B parameter sizes.
  • Trained on license-permissible data, adhering to IBM's AI Ethics principles.
  • Outperforms other open-source models like Mistral-7B and Llama-3-8B on coding tasks.
  • Data preprocessing includes aggressive deduplication, PII redaction, and malware scanning; a toy deduplication sketch follows this list.
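
To make the deduplication step concrete, here is a toy sketch of exact deduplication by content hashing. This is an illustrative assumption, not the project's actual pipeline, which also applies fuzzy near-duplicate filtering:

    import hashlib

    def exact_dedup(files):
        """Keep one copy of each byte-identical source file.

        Toy illustration only: the real pipeline also performs fuzzy
        near-duplicate detection, PII redaction, and malware scanning.
        """
        seen, unique = set(), []
        for path, text in files:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append((path, text))
        return unique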

Maintenance & Community

  • Models are hosted on HuggingFace.
  • Feedback and discussions are welcomed via HuggingFace community tabs or the project's GitHub discussions page.

Licensing & Compatibility

  • Distributed under the Apache 2.0 license, permitting research and commercial use.

Limitations & Caveats

  • The README does not specify hardware requirements for training or fine-tuning beyond recommending a GPU for inference. Fine-tuning instructions point to an external repository (dolomite-engine), which may bring its own dependencies and setup complexity.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

DeepSeek-Coder-V2 by deepseek-ai

Open-source code language model comparable to GPT4-Turbo. 6k stars; Top 0.3% on SourcePulse. Created 1 year ago, updated 11 months ago. Starred by Didier Lopes (Founder of OpenBB), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

codellama by meta-llama

Inference code for CodeLlama models. 16k stars; Top 0.0% on SourcePulse. Created 2 years ago, updated 1 year ago. Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 15 more.