ssd by tanishqkumar

Lightweight LLM inference engine for accelerated decoding

Created 1 week ago


779 stars

Top 45.0% on SourcePulse

Project Summary

This project introduces Speculative Speculative Decoding (SSD), an LLM inference algorithm that drafts and verifies tokens in parallel across distinct hardware. By overlapping the two phases instead of alternating them, it significantly outperforms traditional sequential speculative decoding in high-throughput LLM deployments.

How It Works

SSD extends speculative decoding (SD) by running the small draft model and the large verifier model concurrently on separate hardware. In standard SD, drafting and verification alternate sequentially; in SSD, the draft model anticipates the verifier's likely outcome and keeps speculating while verification is still in flight. This overlap hides drafting latency, and confirmed speculations can be returned immediately, boosting inference speed.
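The sequential baseline that SSD parallelizes can be sketched with a toy draft-then-verify loop. Everything below is illustrative (deterministic stand-in "models", no tensors) and does not use the repo's API; in SSD, the two phases would run concurrently on separate hardware rather than back-to-back.

```python
# Toy sketch of standard (sequential) speculative decoding.
# draft_next / target_next are stand-ins for the small and large models.

def draft_next(tokens):
    # Cheap draft model: guesses the next token as last + 1.
    return tokens[-1] + 1

def target_next(tokens):
    # Expensive target model: agrees with the draft except when the
    # context length is a multiple of 4, where it diverges.
    nxt = tokens[-1] + 1
    return nxt if len(tokens) % 4 != 0 else nxt + 10

def speculative_decode(prompt, k=4, steps=3):
    """Sequential SD: draft k tokens, then verify them with the target.
    SSD overlaps these two phases instead of running them in series."""
    tokens = list(prompt)
    for _ in range(steps):
        # Phase 1: draft k candidate tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Phase 2: verify; accept the longest agreeing prefix, then
        # take the target's own token at the first mismatch.
        accepted, ctx = [], list(tokens)
        for t in draft:
            expected = target_next(ctx)
            if t == expected:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens

out = speculative_decode([0], k=4, steps=2)
```

Note that the output matches what the target model would produce alone; speculation changes latency, not the result.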

Quick Start & Requirements

  • Installation: Requires uv (install via curl -LsSf https://astral.sh/uv/install.sh | sh). Clone the repo, sync dependencies (uv sync, then uv sync --extra scripts), and activate the environment (source .venv/bin/activate).
  • Prerequisites: Python 3.11+, CUDA >= 12.8. Tested on H100s.
  • Environment Variables: SSD_HF_CACHE, SSD_DATASET_DIR, SSD_CUDA_ARCH, HF_DATASETS_CACHE must be set.
  • Model/Dataset Download: Scripts scripts/download_from_hf.py and scripts/get_data_from_hf.py are provided.
  • Links: Repo: https://github.com/tanishqkumar/ssd. Paper: https://arxiv.org/abs/2603.03251.
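Taken together, the documented setup steps look roughly like this. The paths and the SSD_CUDA_ARCH value are placeholders, and the script invocations assume no required arguments; check the repo before running.

```shell
# Install uv, then clone the repo and set up the environment.
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/tanishqkumar/ssd && cd ssd
uv sync && uv sync --extra scripts
source .venv/bin/activate

# Required environment variables (paths below are placeholders).
export SSD_HF_CACHE=/path/to/hf_cache
export SSD_DATASET_DIR=/path/to/datasets
export SSD_CUDA_ARCH=90            # value format is an assumption; check the repo
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache

# Fetch models and datasets with the provided scripts.
python scripts/download_from_hf.py
python scripts/get_data_from_hf.py
```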

Highlighted Details

  • Achieves up to 2x faster inference than leading baselines.
  • Supports Qwen3 and Llama3 model families.
  • Integrates optimizations: tensor parallelism, PagedAttention, CUDA graphs, torch.compile, and prefix caching.
  • Provides optimized autoregressive and standard speculative decoding baselines.
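Of the listed optimizations, prefix caching is the easiest to illustrate: requests that share a prompt prefix reuse its cached KV state instead of recomputing it. A minimal dictionary-based sketch follows; it is illustrative only (real engines cache per-block KV tensors, and the repo's implementation will differ):

```python
# Toy prefix cache: maps a token prefix to a stand-in "KV state", so a
# new request reuses the longest cached prefix and only computes the
# remaining suffix.

class PrefixCache:
    def __init__(self):
        self._cache = {}   # tuple of prefix tokens -> stand-in KV state
        self.computed = 0  # counts tokens that had to be (re)computed

    def run(self, tokens):
        # Find the longest cached prefix of this request.
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in self._cache:
                hit = i
                break
        # "Compute" only the uncached suffix, caching each new prefix.
        for i in range(hit, len(tokens)):
            self.computed += 1
            self._cache[tuple(tokens[:i + 1])] = i + 1  # stand-in KV
        return len(tokens)

cache = PrefixCache()
cache.run([1, 2, 3, 4])      # cold start: computes 4 tokens
cache.run([1, 2, 3, 4, 5])   # shares the 4-token prefix: computes 1 more
```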

Maintenance & Community

  • Roadmap includes draft data parallelism, OpenAI-compatible inference, and new models (GPT-OSS, Kimi-K2.5).
  • Contributions are welcomed. No specific community links or sponsorships mentioned.

Licensing & Compatibility

  • License type is not specified in the README.
  • Commercial use compatibility is unknown due to the missing license.

Limitations & Caveats

  • The authors describe it as a reference implementation rather than a production serving engine.
  • Requires specific hardware (H100s tested) and CUDA versions (>= 12.8).
  • Large models incur significant load/warmup/compilation times.
  • Performance varies by dataset.
  • The unspecified license is a critical adoption blocker.
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 2
  • Star History: 783 stars in the last 9 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

  • 1k stars
  • Parallel decoding algorithm for faster LLM inference
  • Created 2 years ago; updated 1 year ago
  • Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Yaowei Zheng (author of LLaMA-Factory).

ZhiLight by zhihu

  • 905 stars
  • LLM inference engine for Llama and variants, optimized for PCIe GPUs
  • Created 1 year ago; updated 2 days ago