LookaheadDecoding by hao-ai-lab

Parallel decoding algorithm for faster LLM inference

Created 1 year ago
1,295 stars

Top 30.7% on SourcePulse

View on GitHub
Project Summary

This repository introduces Lookahead Decoding, a novel parallel inference algorithm for Large Language Models (LLMs) that significantly accelerates generation without requiring a draft model or data store. It targets researchers and engineers seeking to reduce LLM inference latency, offering speedups of 1.5x to 2.3x.

How It Works

Lookahead Decoding builds on Jacobi decoding, which treats LLM inference as solving a nonlinear system so that multiple future tokens can be predicted simultaneously. It makes this approach practical by caching and verifying n-grams generated along Jacobi iteration trajectories. The algorithm runs two parallel branches: a lookahead branch generates candidate n-grams within a fixed window (window size W, n-gram size N), and a verification branch selects cached n-grams by string matching and validates them with the LLM's forward pass. Both branches are fused into a single attention mask so they share one forward pass per step.
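
A minimal toy sketch of this guess-and-verify loop may make the two branches concrete. It is not the repository's implementation (see decoding.py for that): a deterministic toy function stands in for the LLM's greedy forward pass, n-grams are cached from adjacent window slots rather than the full 2D Jacobi trajectory, and the two branches run as separate calls instead of sharing one attention mask. All names here (toy_next_token, lookahead_generate) are illustrative.

    from collections import defaultdict

    W, N = 5, 3   # window size W and n-gram size N, as named above

    def toy_next_token(context):
        # Stand-in for an LLM's greedy argmax on one forward pass.
        return (sum(context) * 31 + len(context) + 7) % 50

    def lookahead_generate(prompt, num_tokens):
        out = list(prompt)
        ngram_pool = defaultdict(set)                            # first token -> cached n-grams
        window = [(out[-1] + i) % 50 for i in range(1, W + 1)]   # arbitrary initial guesses

        while len(out) < len(prompt) + num_tokens:
            # Lookahead branch: one Jacobi-style refinement of the guess window.
            # Slot i is re-predicted from the accepted output plus the guesses to its left.
            new_window = [toy_next_token(out + window[:i]) for i in range(W)]
            for i in range(W - N + 1):                           # cache n-grams from the trajectory
                gram = tuple(new_window[i:i + N])
                ngram_pool[gram[0]].add(gram)
            window = new_window

            # Verification branch: the model's true next token is always accepted;
            # cached n-grams that start with it are checked and, if they reproduce
            # greedy decoding exactly, accepted as a block (the source of the speedup).
            next_tok = toy_next_token(out)
            accepted = [next_tok]
            for gram in ngram_pool.get(next_tok, ()):
                cand = list(gram)
                if all(toy_next_token(out + cand[:j]) == cand[j] for j in range(1, len(cand))):
                    if len(cand) > len(accepted):
                        accepted = cand
                # Rejected grams stay cached and may match later.
            out.extend(accepted)

        return out[len(prompt):len(prompt) + num_tokens]

    print(lookahead_generate([1, 2, 3], 20))

Because every accepted token is verified against greedy decoding, the output matches ordinary greedy generation; the speedup comes from accepting several verified tokens in a single step.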

Quick Start & Requirements

  • Install via pip: pip install lade
  • Install from source: git clone https://github.com/hao-ai-lab/LookaheadDecoding.git && cd LookaheadDecoding && pip install -r requirements.txt && pip install -e .
  • Dependencies: Python, PyTorch. FlashAttention v2.3.3 is recommended for optimal performance.
  • Demo: USE_LADE=1 LOAD_LADE=1 python minimal.py (see the integration sketch after this list)
  • Docs: Paper, Blog
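
The three-line integration mentioned under Highlighted Details looks roughly like the sketch below. The call names (lade.augment_all, lade.config_lade) and the LEVEL, WINDOW_SIZE, and GUESS_SET_SIZE parameters follow the project README, but treat the exact values, the environment-variable handling, and the example model ID as assumptions to verify against the current docs.

    # Hedged sketch: enabling lookahead decoding around a standard transformers
    # generation script. Check the exact API and parameter values against the
    # project README; the model ID below is only an example (LLaMA-family models
    # are the ones currently supported).
    import os
    os.environ["USE_LADE"] = "1"     # same flags the demo sets on the command line
    os.environ["LOAD_LADE"] = "1"

    import lade
    lade.augment_all()                                               # patch supported models
    lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"                       # example model; assumption
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tok("Explain lookahead decoding in one paragraph.", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)               # greedy decoding path
    print(tok.decode(out[0], skip_special_tokens=True))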

Highlighted Details

  • Achieves 1.5x-2.3x latency reduction on various LLMs and datasets.
  • Eliminates sequential dependency without draft models or data stores.
  • Supports FlashAttention for further performance gains.
  • Integrates into existing code with minimal changes (3 LoCs; see the Quick Start sketch above).

Maintenance & Community

  • The accompanying paper was published at ICML 2024.
  • Core implementation is in decoding.py, with model adaptations in models/.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Currently supports LLaMA models only.
  • FlashAttention installation may require specific CUDA/PyTorch versions.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Cody Yu (Coauthor of vLLM; MTS at OpenAI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

Consistency_LLM by hao-ai-lab
0% · 405 stars
Parallel decoder for efficient LLM inference
Created 1 year ago · Updated 11 months ago

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (Cofounder of Lightning AI), and 1 more.

ArcticInference by snowflakedb
1.4% · 292 stars
vLLM plugin for high-throughput, low-latency LLM and embedding inference
Created 7 months ago · Updated 18 hours ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab
0.5% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago · Updated 3 weeks ago