Parallel decoder for efficient LLM inference
Consistency Large Language Models (CLLMs) are a family of models fine-tuned to accelerate LLM inference by decoding multiple tokens in parallel. They rely on Jacobi decoding, which treats generation as a fixed-point iteration: an $n$-token sequence is refined in parallel until it matches the output of auto-regressive decoding, reaching that fixed point in far fewer steps than generating one token at a time. This significantly reduces latency for researchers and developers seeking faster LLM deployments.
How It Works
CLLMs are trained with a consistency objective: transform any randomly initialized $n$-token sequence into the same output that auto-regressive decoding would produce, in as few steps as possible. At inference time this makes Jacobi decoding, a parallel fixed-point iteration over the whole sequence, converge quickly while requiring no draft model and no architectural modifications, which simplifies integration and maintenance. A minimal sketch of the iteration follows.
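As a rough illustration, the sketch below runs Jacobi decoding against a Hugging Face-style causal LM. The function name `jacobi_decode`, the iteration cap, and the greedy update rule are assumptions for illustration, not the repository's actual implementation.

```python
import torch

def jacobi_decode(model, prefix_ids, n_tokens, max_iters=64):
    """Hypothetical sketch of Jacobi decoding: iterate a parallel
    greedy update on an n-token guess until it reaches the fixed
    point that greedy auto-regressive decoding would produce."""
    # Random initial guess for the n future tokens.
    guess = torch.randint(
        0, model.config.vocab_size, (1, n_tokens), device=prefix_ids.device
    )
    for _ in range(max_iters):
        input_ids = torch.cat([prefix_ids, guess], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # The logits at position i predict the token at position i + 1,
        # so the guess window is updated from the preceding positions.
        new_guess = logits[:, prefix_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # fixed point reached
            break
        guess = new_guess
    return guess
```

Each iteration costs one forward pass over the full sequence, so the speedup comes from the trained model converging in far fewer iterations than $n$.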
Quick Start & Requirements
```bash
pip install -r requirements.txt
pip install flash-attn==2.4.1
```
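Once installed, a trained CLLM checkpoint should load like any causal LM. The sketch below uses the standard transformers API with a placeholder model path; the repository's own scripts may drive Jacobi decoding through their own entry points.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "path/to/cllm-checkpoint" is a placeholder; substitute the
# released CLLM weights you intend to use.
tokenizer = AutoTokenizer.from_pretrained("path/to/cllm-checkpoint")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/cllm-checkpoint", torch_dtype="auto", device_map="auto"
)

prompt = "Explain Jacobi decoding in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```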
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify the license for the code or model weights, which may impact commercial adoption. Training CLLMs requires collecting or generating Jacobi trajectories, adding a data preparation step.
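For intuition, collecting a Jacobi trajectory amounts to running the same fixed-point iteration as above and recording every intermediate guess: the endpoint is the auto-regressive answer, and the intermediate states become training inputs for the consistency objective. The sketch below is a hypothetical variant of the earlier `jacobi_decode`, not the repository's data pipeline.

```python
import torch

def collect_jacobi_trajectory(model, prefix_ids, n_tokens, max_iters=64):
    """Hypothetical sketch: record each intermediate guess from the
    Jacobi iteration. Consistency training maps every recorded state
    to the final fixed point (the auto-regressive output)."""
    guess = torch.randint(
        0, model.config.vocab_size, (1, n_tokens), device=prefix_ids.device
    )
    trajectory = [guess]
    for _ in range(max_iters):
        input_ids = torch.cat([prefix_ids, guess], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        new_guess = logits[:, prefix_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):
            break
        guess = new_guess
        trajectory.append(guess)
    return trajectory  # trajectory[-1] is the fixed point
```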