Consistency_LLM by hao-ai-lab

Parallel decoder for efficient LLM inference

Created 1 year ago · 404 stars · Top 71.9% on SourcePulse

View on GitHub
Project Summary

Consistency Large Language Models (CLLMs) accelerate LLM inference by decoding multiple tokens in parallel. The models are fine-tuned so that Jacobi decoding maps an $n$-token sequence to the same output as traditional auto-regressive decoding in far fewer steps, significantly reducing latency for researchers and developers seeking faster LLM deployments.

How It Works

CLLMs are trained with a specific objective: to map any randomly initialized $n$-token sequence to the same output that auto-regressive decoding would produce, in as few steps as possible. At inference time, Jacobi decoding refines all $n$ positions in parallel on each iteration until the sequence stops changing; since no draft model or architectural modification is needed, integration and maintenance stay simple.
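
A minimal, illustrative Python sketch of greedy Jacobi decoding follows, assuming a HuggingFace-style causal LM whose forward pass returns logits; this is a conceptual sketch, not the repository's actual implementation:

    import torch

    @torch.no_grad()
    def jacobi_decode(model, prefix_ids, n_tokens, init_id=0, max_iters=None):
        """Greedy Jacobi decoding sketch: refine an n-token guess in
        parallel until it is a fixed point of the model, i.e. it matches
        what greedy auto-regressive decoding would produce."""
        # Initial guess for the n future tokens. CLLMs are trained so that
        # any initialization converges to the fixed point in few steps.
        guess = torch.full((1, n_tokens), init_id, dtype=torch.long,
                           device=prefix_ids.device)
        # Greedy Jacobi decoding needs at most n iterations, since each
        # pass fixes at least the first not-yet-correct token.
        max_iters = max_iters or n_tokens
        for _ in range(max_iters):
            ids = torch.cat([prefix_ids, guess], dim=-1)
            logits = model(ids).logits
            # Update every guess position in parallel from the logits of
            # the token immediately preceding it.
            new_guess = logits[:, prefix_ids.shape[-1] - 1 : -1, :].argmax(dim=-1)
            if torch.equal(new_guess, guess):  # fixed point reached
                break
            guess = new_guess
        return guess

In the worst case this takes $n$ iterations, matching auto-regressive decoding token for token; the CLLM training objective is what makes convergence happen in far fewer steps.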

Quick Start & Requirements

  • Install: Clone the repository, activate a Python 3.10 conda environment, and run pip install -r requirements.txt followed by pip install flash-attn==2.4.1 (see the command sketch after this list).
  • Prerequisites: Python 3.10, CUDA 11.8+ (for flash-attn); task-specific datasets may need to be downloaded separately.
  • Resources: Training requires significant computational resources; inference speedups are demonstrated up to $3.4\times$.
  • Links: Paper, Blog, FastChat Integration.
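
Assuming the repository URL implied by the project name above and an illustrative environment name (cllm), the install steps amount to roughly:

    git clone https://github.com/hao-ai-lab/Consistency_LLM.git
    cd Consistency_LLM
    conda create -n cllm python=3.10
    conda activate cllm
    pip install -r requirements.txt
    pip install flash-attn==2.4.1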

Highlighted Details

  • Achieves $2.4\times$ to $3.4\times$ generation speed improvements across various tasks.
  • Seamless integration with existing LLMs without architectural changes.
  • Compatible with other inference techniques like Lookahead Decoding for further speedups.
  • Model checkpoints for 7B models fine-tuned on ShareGPT, GSM8K, Spider, and Code-Search-Net are available on the Huggingface Hub (a loading sketch follows this list).
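
As a sketch of how a checkpoint could be loaded with the standard transformers API (the model ID below is hypothetical, for illustration only; check the Huggingface Hub for the actual checkpoint names):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical checkpoint ID for illustration; see the Huggingface Hub
    # for the actual CLLM checkpoint names.
    model_id = "cllm/consistency-llm-7b-sharegpt48k"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )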

Maintenance & Community

  • CLLMs have been integrated into FastChat.
  • Model checkpoints and paper are publicly available.
  • Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • The repository does not explicitly state a license. Model weights are available on Huggingface, typically under their respective base model licenses.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify the license for the code or model weights, which may impact commercial adoption. Training CLLMs requires collecting or generating Jacobi trajectories, adding a data preparation step.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 3 more.

Explore Similar Projects

prompt-lookup-decoding by apoorvumang
0.2% · 566 stars
Decoding method for faster LLM generation
Created 1 year ago · Updated 1 year ago
Starred by Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech) and Clément Renault (Cofounder of Meilisearch).

lm.rs by samuel-vitorino
0% · 1k stars
Minimal LLM inference in Rust
Created 1 year ago · Updated 10 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab
0.2% · 1k stars
Parallel decoding algorithm for faster LLM inference
Created 1 year ago · Updated 6 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab
10.6% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago · Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Lei Zhang (Director Engineering AI at AMD), and 23 more.

gpt-fast by meta-pytorch
0.2% · 6k stars
PyTorch text generation for efficient transformer inference
Created 1 year ago · Updated 3 weeks ago