Sequoia by Infini-AI-Lab

Tree-based speculative decoding algorithm (research paper)

created 1 year ago
351 stars

Top 80.4% on sourcepulse

Project Summary

Sequoia implements a scalable and robust tree-based speculative decoding algorithm designed to accelerate large language model (LLM) inference. It targets researchers and engineers seeking to reduce LLM inference latency and improve throughput, particularly for demanding inference workloads.

How It Works

Sequoia employs a tree-based speculative decoding approach: a smaller "draft" model generates multiple candidate tokens, organized as a tree of possible continuations, and a larger "target" model then verifies them. The tree structure, defined by "growmaps," allows efficient exploration of alternative token sequences and tunes the trade-off between draft-model speed and how many drafted tokens the target model accepts. The speedup comes from the fact that a single forward pass of the expensive target model can verify, and therefore commit, several cheap draft tokens at once.
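
To make the mechanism concrete, here is a minimal sketch of one draft-then-verify step for the simplest case: a chain-shaped "tree" (width 1) with greedy acceptance. This is illustrative only and not code from the repository; Sequoia generalizes the idea to whole trees of candidates with stochastic verification. The names `speculative_step`, `draft_model`, and `target_model` are hypothetical stand-ins for any callable that maps a token sequence to per-position logits.

```python
import torch

def speculative_step(draft_model, target_model, prefix, k=4):
    """One chain-shaped draft-then-verify step (greedy acceptance).

    draft_model / target_model: hypothetical callables mapping a 1-D
    LongTensor of token ids to logits of shape (seq_len, vocab_size).
    prefix: 1-D LongTensor of tokens generated so far.
    """
    # 1) The cheap draft model proposes k tokens autoregressively.
    seq = prefix
    draft_tokens = []
    for _ in range(k):
        next_tok = torch.argmax(draft_model(seq)[-1]).view(1)
        draft_tokens.append(next_tok)
        seq = torch.cat([seq, next_tok])

    # 2) The expensive target model scores prefix + drafts in ONE pass.
    target_logits = target_model(seq)      # (len(prefix) + k, vocab)

    # 3) Accept drafted tokens while they match the target's own argmax.
    #    target_logits[i] predicts the token at position i + 1.
    accepted = []
    base = len(prefix) - 1
    for i, tok in enumerate(draft_tokens):
        if torch.argmax(target_logits[base + i]) == tok:
            accepted.append(tok)
        else:
            break

    # 4) Whether or not every draft survives, the target's logits at the
    #    first unverified position yield one guaranteed "bonus" token.
    bonus = torch.argmax(target_logits[base + len(accepted)]).view(1)
    return torch.cat([prefix, *accepted, bonus])

# Toy usage: a lookup table of random logits serves as both draft and
# target model, so every drafted token is accepted.
vocab = 50
torch.manual_seed(0)
table = torch.randn(vocab, vocab)
model = lambda seq: table[seq]             # logits for every position
print(speculative_step(model, model, torch.tensor([1, 2, 3])))
```

In Sequoia's tree setting, the draft model proposes several alternatives per position, the target model scores all tree nodes in one batched pass, and the longest accepted root-to-leaf path is kept; the growmap specifies the tree's size and shape.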

Quick Start & Requirements

  • Install dependencies:
    pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
    pip install transformers==4.36.2 accelerate==0.26.1 datasets==2.16.1
    pip install einops protobuf sentencepiece typing-extensions
  • Requires CUDA 12.1.
  • Evaluation scripts (testbed.py, testbed_greedy.py, etc.) require specific growmaps and model paths; see the sketch after this list.
  • See UMbreLLa for updated models and features.
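
For orientation, an evaluation run might look like the line below. Only the script name and the --model, --target, and --M flags are mentioned elsewhere on this page; the placeholder paths (and any growmap argument the script expects) must be filled in from the repository's README:

    python testbed.py --model <path-to-draft-model> --target <path-to-target-model> --M 256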

Highlighted Details

  • Supports Llama3, Qwen, and DeepSeek models.
  • Includes AWQ quantization support.
  • Provides Gradio, API, and CLI chatbot interfaces.
  • Offers scripts for generating acceptance rate vectors and growmaps.
  • Benchmarking scripts are available for L40 and A100 GPUs.

Maintenance & Community

The project is associated with Infini-AI-Lab. Further community engagement details (Discord, Slack, roadmap) are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Currently, only Llama-family models are supported for the --model and --target arguments in the evaluation scripts. Support for other open-source models, multi-round dialogue, INT4/8 quantization, and multi-GPU inference is listed as future work (TODOs). The maximum sequence length in the provided experiments is 256; longer sequences require raising --M.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Consistency_LLM by hao-ai-lab

0% · 397 stars
Parallel decoder for efficient LLM inference
created 1 year ago
updated 8 months ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

HALOs by ContextualAI

0.2% · 873 stars
Library for aligning LLMs using human-aware loss functions
created 1 year ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab

0.1% · 1k stars
Parallel decoding algorithm for faster LLM inference
created 1 year ago
updated 4 months ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding

0.2% · 3k stars
Framework for accelerating LLM generation using multiple decoding heads
created 1 year ago
updated 1 year ago