LookaheadDecoding by hao-ai-lab

Parallel decoding algorithm for faster LLM inference

Created 1 year ago
1,295 stars

Top 30.7% on SourcePulse

View on GitHub
Project Summary

This repository introduces Lookahead Decoding, a novel parallel inference algorithm for Large Language Models (LLMs) that significantly accelerates generation without requiring a draft model or data store. It targets researchers and engineers seeking to reduce LLM inference latency, offering speedups of 1.5x to 2.3x.

How It Works

Lookahead Decoding builds on Jacobi decoding, which treats LLM inference as solving a nonlinear system so that multiple future tokens can be predicted simultaneously. It makes this approach practical by caching and verifying n-grams generated along Jacobi iteration trajectories. The algorithm runs two parallel branches: a lookahead branch generates candidate n-grams within a fixed window (window size W, n-gram size N), and a verification branch selects cached n-grams by string matching and validates them with the LLM's forward pass. Both branches are fused into a single attention mask so they share one forward pass per step.
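
A minimal toy sketch of this guess-and-verify loop may make the two branches concrete. It is not the repository's implementation (see decoding.py for that): a deterministic toy function stands in for the LLM's greedy forward pass, n-grams are cached from adjacent window slots rather than the full 2D Jacobi trajectory, and the two branches run as separate calls instead of sharing one attention mask. All names here (toy_next_token, lookahead_generate) are illustrative.

    from collections import defaultdict

    W, N = 5, 3   # window size W and n-gram size N, as named above

    def toy_next_token(context):
        # Stand-in for an LLM's greedy argmax on one forward pass.
        return (sum(context) * 31 + len(context) + 7) % 50

    def lookahead_generate(prompt, num_tokens):
        out = list(prompt)
        ngram_pool = defaultdict(set)                            # first token -> cached n-grams
        window = [(out[-1] + i) % 50 for i in range(1, W + 1)]   # arbitrary initial guesses

        while len(out) < len(prompt) + num_tokens:
            # Lookahead branch: one Jacobi-style refinement of the guess window.
            # Slot i is re-predicted from the accepted output plus the guesses to its left.
            new_window = [toy_next_token(out + window[:i]) for i in range(W)]
            for i in range(W - N + 1):                           # cache n-grams from the trajectory
                gram = tuple(new_window[i:i + N])
                ngram_pool[gram[0]].add(gram)
            window = new_window

            # Verification branch: the model's true next token is always accepted;
            # cached n-grams that start with it are checked and, if they reproduce
            # greedy decoding exactly, accepted as a block (the source of the speedup).
            next_tok = toy_next_token(out)
            accepted = [next_tok]
            for gram in ngram_pool.get(next_tok, ()):
                cand = list(gram)
                if all(toy_next_token(out + cand[:j]) == cand[j] for j in range(1, len(cand))):
                    if len(cand) > len(accepted):
                        accepted = cand
                # Rejected grams stay cached and may match later.
            out.extend(accepted)

        return out[len(prompt):len(prompt) + num_tokens]

    print(lookahead_generate([1, 2, 3], 20))

Because every accepted token is verified against greedy decoding, the output matches ordinary greedy generation; the speedup comes from accepting several verified tokens in a single step.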

Quick Start & Requirements

  • Install via pip: pip install lade
  • Install from source: git clone https://github.com/hao-ai-lab/LookaheadDecoding.git && cd LookaheadDecoding && pip install -r requirements.txt && pip install -e .
  • Dependencies: Python, PyTorch. FlashAttention v2.3.3 is recommended for optimal performance.
  • Demo: USE_LADE=1 LOAD_LADE=1 python minimal.py (see the integration sketch after this list)
  • Docs: Paper, Blog
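
The three-line integration mentioned under Highlighted Details looks roughly like the sketch below. The call names (lade.augment_all, lade.config_lade) and the LEVEL, WINDOW_SIZE, and GUESS_SET_SIZE parameters follow the project README, but treat the exact values, the environment-variable handling, and the example model ID as assumptions to verify against the current docs.

    # Hedged sketch: enabling lookahead decoding around a standard transformers
    # generation script. Check the exact API and parameter values against the
    # project README; the model ID below is only an example (LLaMA-family models
    # are the ones currently supported).
    import os
    os.environ["USE_LADE"] = "1"     # same flags the demo sets on the command line
    os.environ["LOAD_LADE"] = "1"

    import lade
    lade.augment_all()                                               # patch supported models
    lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"                       # example model; assumption
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tok("Explain lookahead decoding in one paragraph.", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)               # greedy decoding path
    print(tok.decode(out[0], skip_special_tokens=True))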

Highlighted Details

  • Achieves 1.5x-2.3x latency reduction on various LLMs and datasets.
  • Eliminates sequential dependency without draft models or data stores.
  • Supports FlashAttention for further performance gains.
  • Integrates into existing code with minimal changes (3 LoCs; see the Quick Start sketch above).

Maintenance & Community

  • The accompanying paper was published at ICML 2024.
  • Core implementation is in decoding.py, with model adaptations in models/.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Currently supports LLaMA models only.
  • FlashAttention installation may require specific CUDA/PyTorch versions.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Cody Yu (Coauthor of vLLM; MTS at OpenAI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

Consistency_LLM by hao-ai-lab
0% · 405 stars
Parallel decoder for efficient LLM inference
Created 1 year ago · Updated 11 months ago

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (Cofounder of Lightning AI), and 1 more.

ArcticInference by snowflakedb
1.4% · 292 stars
vLLM plugin for high-throughput, low-latency LLM and embedding inference
Created 7 months ago · Updated 18 hours ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab
0.5% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago · Updated 3 weeks ago