granite-code-models by ibm-granite

Code LLM for code generation, explanation, fixing, and translation

Created 1 year ago
1,243 stars

Top 31.6% on SourcePulse

Project Summary

The Granite Code Models are a family of open foundation models designed for a wide range of code intelligence tasks, including generation, explanation, and bug fixing. Targeting developers and researchers, these models offer state-of-the-art performance on diverse coding benchmarks while adhering to IBM's AI Ethics principles for enterprise-grade trustworthiness.

How It Works

Granite models are decoder-only LLMs trained on 3-4 trillion tokens of code data across 116 programming languages, supplemented with natural language and mathematical reasoning datasets. Training occurs in two phases: initial code-only pretraining, followed by a mix of code and general language data. Instruction-tuned variants are further fine-tuned on curated datasets including code commits, math problems, and various code-specific instruction sets. The models utilize Byte Pair Encoding (BPE) with the StarCoder tokenizer.
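
To make the BPE tokenization concrete, here is a minimal sketch that loads the tokenizer shipped with one of the published checkpoints (model ID taken from the Quick Start section below) and prints the sub-word split of a short code snippet; the exact tokens produced depend on the vocabulary:

    from transformers import AutoTokenizer

    # Load the StarCoder-style BPE tokenizer bundled with the checkpoint
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base-2k")

    # BPE splits source code into sub-word units rather than whole words
    print(tokenizer.tokenize("def fibonacci(n):"))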

Quick Start & Requirements

  • Inference: use the transformers library; a fuller generation sketch follows this list.

      from transformers import AutoModelForCausalLM, AutoTokenizer

      model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3b-code-base-2k", device_map="cuda")
      tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3b-code-base-2k")
  • Prerequisites: Python, transformers, PyTorch. GPU recommended for inference.
  • Fine-tuning: requires the external dolomite-engine repository (clone it, modify the configs, and run scripts/finetune.sh).
  • Model Access: Models available on HuggingFace (e.g., ibm-granite/granite-3b-code-base-2k).
  • Docs: HuggingFace Collection
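
Putting the pieces together, a minimal end-to-end generation sketch is shown below; the prompt, the max_new_tokens value, and the CPU fallback are illustrative choices rather than anything the README prescribes:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ibm-granite/granite-3b-code-base-2k"
    device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU recommended

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
    model.eval()

    prompt = "def generate_fibonacci(n):"  # base models do plain code completion
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)

    print(tokenizer.decode(output[0], skip_special_tokens=True))

For chat-style prompts, the instruction-tuned variants described under How It Works are the intended fit.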

Highlighted Details

  • Available in 3B, 8B, 20B, and 34B parameter sizes.
  • Trained on license-permissible data, adhering to IBM's AI Ethics principles.
  • Outperforms other open-source models like Mistral-7B and Llama-3-8B on coding tasks.
  • Data preprocessing includes aggressive deduplication, PII redaction, and malware scanning.

Maintenance & Community

  • Models are hosted on HuggingFace.
  • Feedback and discussions are welcomed via HuggingFace community tabs or the project's GitHub discussions page.

Licensing & Compatibility

  • Distributed under the Apache 2.0 license, permitting research and commercial use.

Limitations & Caveats

  • The README does not specify hardware requirements for training or fine-tuning beyond recommending a GPU for inference.
  • Fine-tuning instructions point to an external repository (dolomite-engine), which may have its own dependencies and setup complexity.
Health Check

  • Last Commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
1 star in the last 30 days

Explore Similar Projects

Starred by Didier Lopes (Founder of OpenBB), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

DeepSeek-Coder-V2 by deepseek-ai

0.3% · 6k stars
Open-source code language model comparable to GPT4-Turbo
Created 1 year ago · Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 15 more.

codellama by meta-llama

0.0% · 16k stars
Inference code for CodeLlama models
Created 2 years ago · Updated 1 year ago