Sequoia by Infini-AI-Lab

Tree-based speculative decoding algorithm (research paper)

Created 1 year ago
358 stars

Top 78.0% on SourcePulse

Project Summary

Sequoia implements a scalable and robust tree-based speculative decoding algorithm designed to accelerate large language model (LLM) inference. It targets researchers and engineers seeking to reduce LLM inference latency, particularly for demanding inference workloads.

How It Works

Sequoia employs tree-based speculative decoding: a smaller "draft" model generates multiple candidate tokens, arranged as a tree of possible continuations, and the larger "target" model then verifies those candidates. The tree structure, defined by "growmaps," allows efficient exploration of potential token sequences and balances the trade-off between draft-model speed and target-model accuracy. Because one target verification can accept several drafted tokens at once, the method amortizes the cost of the expensive target model and increases decoding speed.
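
To make the draft-then-verify loop concrete, below is a minimal, self-contained Python sketch of tree-style speculation. It is illustrative only, not Sequoia's implementation: draft_logits and target_logits are toy stand-ins for real models, the tree shape is a fixed (branching, depth) grid rather than a growmap, and verification uses greedy argmax acceptance instead of the rejection-sampling rule that keeps the target distribution exact.

    # Toy sketch of tree-based speculative decoding -- NOT Sequoia's code.
    import math
    import random

    VOCAB = list(range(16))   # tiny toy vocabulary
    BRANCHING, DEPTH = 2, 3   # fixed tree shape (stand-in for a growmap)

    def draft_logits(prefix):
        # Cheap stand-in for the small draft model.
        return [math.exp(-0.30 * t + 0.01 * sum(prefix[-2:])) for t in VOCAB]

    def target_logits(prefix):
        # Stand-in for the large target model (slightly different bias).
        return [math.exp(-0.25 * t + 0.02 * sum(prefix[-3:])) for t in VOCAB]

    def sample(logits):
        z = sum(logits)
        return random.choices(VOCAB, weights=[w / z for w in logits])[0]

    def grow_tree(prefix):
        # Draft phase: expand a token tree level by level. Keys are paths
        # of child indices; values are the token sequence along that path.
        tree, frontier = {(): list(prefix)}, [()]
        for _ in range(DEPTH):
            nxt = []
            for path in frontier:
                seq = tree[path]
                for b in range(BRANCHING):
                    tree[path + (b,)] = seq + [sample(draft_logits(seq))]
                    nxt.append(path + (b,))
            frontier = nxt
        return tree

    def verify(tree, prefix):
        # Verify phase: walk from the root, keeping a child whenever its
        # token matches the target's argmax. A real system scores the whole
        # tree in ONE batched target forward pass via a tree attention
        # mask; the stand-in is called per node here only for clarity.
        path, seq = (), list(prefix)
        while True:
            logits = target_logits(seq)
            want = max(VOCAB, key=lambda t: logits[t])
            hit = next((b for b in range(BRANCHING)
                        if tree.get(path + (b,), [None])[-1] == want), None)
            if hit is None:
                return seq + [want]   # target pass still yields one token
            path, seq = path + (hit,), tree[path + (hit,)]

    random.seed(0)
    prefix = [1, 2, 3]
    print(verify(grow_tree(prefix), prefix))

The deeper the accepted path, the more tokens each expensive target pass confirms, which is where the speedup comes from.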

Quick Start & Requirements

  • Install dependencies:
      pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
      pip install transformers==4.36.2 accelerate==0.26.1 datasets==2.16.1
      pip install einops protobuf sentencepiece typing-extensions
  • Requires CUDA 12.1.
  • Evaluation scripts (testbed.py, testbed_greedy.py, etc.) require specific growmaps and model paths; see the example invocation after this list.
  • See UMbreLLa for updated models and features.
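
A representative run of testbed.py might look like the sketch below. Only the --model, --target, and --M flags are confirmed by this summary; the --growmap flag name and all paths are illustrative assumptions, so consult the repository README for the exact arguments.

    # Hypothetical invocation; flag names and paths are assumptions.
    python testbed.py --model <draft-model-path> --target <target-model-path> \
        --growmap <growmap-path> --M 256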

Highlighted Details

  • Supports Llama3, Qwen, and DeepSeek models.
  • Includes AWQ quantization support.
  • Provides Gradio, API, and CLI chatbot interfaces.
  • Offers scripts for generating acceptance rate vectors and growmaps.
  • Benchmarking scripts are available for L40 and A100 GPUs.

Maintenance & Community

The project is associated with Infini-AI-Lab. Further community engagement details (Discord, Slack, roadmap) are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Currently, only Llama-family models are supported for the --model and --target arguments in the evaluation scripts. Support for other open-source models, multi-round dialogue, INT4/8 quantization, and multi-GPU inference is listed as future work. The default maximum sequence length for experiments is 256; longer sequences require adjusting --M.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days

Explore Similar Projects

Consistency_LLM by hao-ai-lab
Parallel decoder for efficient LLM inference
  • 404 stars (0.3%)
  • Created 1 year ago; updated 10 months ago
  • Starred by Cody Yu (Coauthor of vLLM; MTS at OpenAI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

EAGLE by SafeAILab
Speculative decoding research paper for faster LLM inference
  • 2k stars (10.6%)
  • Created 1 year ago; updated 1 week ago
  • Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.