Sequoia by Infini-AI-Lab

Tree-based speculative decoding algorithm (research paper)

Created 1 year ago
358 stars

Top 78.0% on SourcePulse

Project Summary

Sequoia implements a scalable and robust tree-based speculative decoding algorithm designed to accelerate large language model (LLM) inference. It targets researchers and engineers seeking to reduce LLM inference latency, particularly for demanding inference workloads.

How It Works

Sequoia employs tree-based speculative decoding: a smaller "draft" model generates multiple candidate tokens, arranged as a tree of possible continuations, and the larger "target" model then verifies those candidates. The tree structure, defined by "growmaps," allows efficient exploration of potential token sequences and balances the trade-off between draft-model speed and target-model accuracy. Because one target verification can accept several drafted tokens at once, the method amortizes the cost of the expensive target model and increases decoding speed.
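
To make the draft-then-verify loop concrete, below is a minimal, self-contained Python sketch of tree-style speculation. It is illustrative only, not Sequoia's implementation: draft_logits and target_logits are toy stand-ins for real models, the tree shape is a fixed (branching, depth) grid rather than a growmap, and verification uses greedy argmax acceptance instead of the rejection-sampling rule that keeps the target distribution exact.

    # Toy sketch of tree-based speculative decoding -- NOT Sequoia's code.
    import math
    import random

    VOCAB = list(range(16))   # tiny toy vocabulary
    BRANCHING, DEPTH = 2, 3   # fixed tree shape (stand-in for a growmap)

    def draft_logits(prefix):
        # Cheap stand-in for the small draft model.
        return [math.exp(-0.30 * t + 0.01 * sum(prefix[-2:])) for t in VOCAB]

    def target_logits(prefix):
        # Stand-in for the large target model (slightly different bias).
        return [math.exp(-0.25 * t + 0.02 * sum(prefix[-3:])) for t in VOCAB]

    def sample(logits):
        z = sum(logits)
        return random.choices(VOCAB, weights=[w / z for w in logits])[0]

    def grow_tree(prefix):
        # Draft phase: expand a token tree level by level. Keys are paths
        # of child indices; values are the token sequence along that path.
        tree, frontier = {(): list(prefix)}, [()]
        for _ in range(DEPTH):
            nxt = []
            for path in frontier:
                seq = tree[path]
                for b in range(BRANCHING):
                    tree[path + (b,)] = seq + [sample(draft_logits(seq))]
                    nxt.append(path + (b,))
            frontier = nxt
        return tree

    def verify(tree, prefix):
        # Verify phase: walk from the root, keeping a child whenever its
        # token matches the target's argmax. A real system scores the whole
        # tree in ONE batched target forward pass via a tree attention
        # mask; the stand-in is called per node here only for clarity.
        path, seq = (), list(prefix)
        while True:
            logits = target_logits(seq)
            want = max(VOCAB, key=lambda t: logits[t])
            hit = next((b for b in range(BRANCHING)
                        if tree.get(path + (b,), [None])[-1] == want), None)
            if hit is None:
                return seq + [want]   # target pass still yields one token
            path, seq = path + (hit,), tree[path + (hit,)]

    random.seed(0)
    prefix = [1, 2, 3]
    print(verify(grow_tree(prefix), prefix))

The deeper the accepted path, the more tokens each expensive target pass confirms, which is where the speedup comes from.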

Quick Start & Requirements

  • Install dependencies:
      pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
      pip install transformers==4.36.2 accelerate==0.26.1 datasets==2.16.1
      pip install einops protobuf sentencepiece typing-extensions
  • Requires CUDA 12.1.
  • Evaluation scripts (testbed.py, testbed_greedy.py, etc.) require specific growmaps and model paths; see the example invocation after this list.
  • See UMbreLLa for updated models and features.
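
A representative run of testbed.py might look like the sketch below. Only the --model, --target, and --M flags are confirmed by this summary; the --growmap flag name and all paths are illustrative assumptions, so consult the repository README for the exact arguments.

    # Hypothetical invocation; flag names and paths are assumptions.
    python testbed.py --model <draft-model-path> --target <target-model-path> \
        --growmap <growmap-path> --M 256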

Highlighted Details

  • Supports Llama3, Qwen, and DeepSeek models.
  • Includes AWQ quantization support.
  • Provides Gradio, API, and CLI chatbot interfaces.
  • Offers scripts for generating acceptance rate vectors and growmaps.
  • Benchmarking scripts are available for L40 and A100 GPUs.

Maintenance & Community

The project is associated with Infini-AI-Lab. Further community engagement details (Discord, Slack, roadmap) are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Currently, only Llama-family models are supported for the --model and --target arguments in the evaluation scripts. Support for other open-source models, multi-round dialogue, INT4/8 quantization, and multi-GPU inference is listed as future work. The default maximum sequence length for experiments is 256; longer sequences require adjusting --M.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days

Explore Similar Projects

Consistency_LLM by hao-ai-lab
Parallel decoder for efficient LLM inference
  • 404 stars (0.3%)
  • Created 1 year ago; updated 10 months ago
  • Starred by Cody Yu (Coauthor of vLLM; MTS at OpenAI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

EAGLE by SafeAILab
Speculative decoding research paper for faster LLM inference
  • 2k stars (10.6%)
  • Created 1 year ago; updated 1 week ago
  • Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.