ssd by tanishqkumar

Lightweight LLM inference engine for accelerated decoding

Created 1 week ago


779 stars

Top 45.0% on SourcePulse

Project Summary

This project introduces Speculative Speculative Decoding (SSD), an LLM inference algorithm that drafts and verifies tokens in parallel across distinct hardware. By overlapping the two phases instead of alternating them, it significantly outperforms traditional sequential speculative decoding in high-throughput LLM deployments.

How It Works

SSD extends speculative decoding (SD) by running the small draft model and the large verifier model concurrently on separate hardware. In standard SD, drafting and verification alternate sequentially; in SSD, the draft model anticipates the verifier's likely outcome and keeps speculating while verification is still in flight. This overlap hides drafting latency, and confirmed speculations can be returned immediately, boosting inference speed.
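The sequential baseline that SSD parallelizes can be sketched with a toy draft-then-verify loop. Everything below is illustrative (deterministic stand-in "models", no tensors) and does not use the repo's API; in SSD, the two phases would run concurrently on separate hardware rather than back-to-back.

```python
# Toy sketch of standard (sequential) speculative decoding.
# draft_next / target_next are stand-ins for the small and large models.

def draft_next(tokens):
    # Cheap draft model: guesses the next token as last + 1.
    return tokens[-1] + 1

def target_next(tokens):
    # Expensive target model: agrees with the draft except when the
    # context length is a multiple of 4, where it diverges.
    nxt = tokens[-1] + 1
    return nxt if len(tokens) % 4 != 0 else nxt + 10

def speculative_decode(prompt, k=4, steps=3):
    """Sequential SD: draft k tokens, then verify them with the target.
    SSD overlaps these two phases instead of running them in series."""
    tokens = list(prompt)
    for _ in range(steps):
        # Phase 1: draft k candidate tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Phase 2: verify; accept the longest agreeing prefix, then
        # take the target's own token at the first mismatch.
        accepted, ctx = [], list(tokens)
        for t in draft:
            expected = target_next(ctx)
            if t == expected:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens

out = speculative_decode([0], k=4, steps=2)
```

Note that the output matches what the target model would produce alone; speculation changes latency, not the result.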

Quick Start & Requirements

  • Installation: Requires uv (install via curl -LsSf https://astral.sh/uv/install.sh | sh). Clone the repo, sync dependencies (uv sync, then uv sync --extra scripts), and activate the environment (source .venv/bin/activate).
  • Prerequisites: Python 3.11+, CUDA >= 12.8. Tested on H100s.
  • Environment Variables: SSD_HF_CACHE, SSD_DATASET_DIR, SSD_CUDA_ARCH, HF_DATASETS_CACHE must be set.
  • Model/Dataset Download: Scripts scripts/download_from_hf.py and scripts/get_data_from_hf.py are provided.
  • Links: Repo: https://github.com/tanishqkumar/ssd. Paper: https://arxiv.org/abs/2603.03251.
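Taken together, the documented setup steps look roughly like this. The paths and the SSD_CUDA_ARCH value are placeholders, and the script invocations assume no required arguments; check the repo before running.

```shell
# Install uv, then clone the repo and set up the environment.
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/tanishqkumar/ssd && cd ssd
uv sync && uv sync --extra scripts
source .venv/bin/activate

# Required environment variables (paths below are placeholders).
export SSD_HF_CACHE=/path/to/hf_cache
export SSD_DATASET_DIR=/path/to/datasets
export SSD_CUDA_ARCH=90            # value format is an assumption; check the repo
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache

# Fetch models and datasets with the provided scripts.
python scripts/download_from_hf.py
python scripts/get_data_from_hf.py
```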

Highlighted Details

  • Achieves up to 2x faster inference than leading baselines.
  • Supports Qwen3 and Llama3 model families.
  • Integrates optimizations: tensor parallelism, PagedAttention, CUDA graphs, torch.compile, and prefix caching.
  • Provides optimized autoregressive and standard speculative decoding baselines.
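Of the listed optimizations, prefix caching is the easiest to illustrate: requests that share a prompt prefix reuse its cached KV state instead of recomputing it. A minimal dictionary-based sketch follows; it is illustrative only (real engines cache per-block KV tensors, and the repo's implementation will differ):

```python
# Toy prefix cache: maps a token prefix to a stand-in "KV state", so a
# new request reuses the longest cached prefix and only computes the
# remaining suffix.

class PrefixCache:
    def __init__(self):
        self._cache = {}   # tuple of prefix tokens -> stand-in KV state
        self.computed = 0  # counts tokens that had to be (re)computed

    def run(self, tokens):
        # Find the longest cached prefix of this request.
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in self._cache:
                hit = i
                break
        # "Compute" only the uncached suffix, caching each new prefix.
        for i in range(hit, len(tokens)):
            self.computed += 1
            self._cache[tuple(tokens[:i + 1])] = i + 1  # stand-in KV
        return len(tokens)

cache = PrefixCache()
cache.run([1, 2, 3, 4])      # cold start: computes 4 tokens
cache.run([1, 2, 3, 4, 5])   # shares the 4-token prefix: computes 1 more
```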

Maintenance & Community

  • Roadmap includes draft data parallelism, OpenAI-compatible inference, and new models (GPT-OSS, Kimi-K2.5).
  • Contributions are welcomed. No specific community links or sponsorships mentioned.

Licensing & Compatibility

  • License type is not specified in the README.
  • Commercial use compatibility is unknown due to the missing license.

Limitations & Caveats

  • The authors describe it as a reference implementation rather than a production serving engine.
  • Requires specific hardware (H100s tested) and CUDA versions (>= 12.8).
  • Large models incur significant load/warmup/compilation times.
  • Performance varies by dataset.
  • The unspecified license is a critical adoption blocker.
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 2
  • Star History: 783 stars in the last 9 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

  • 1k stars
  • Parallel decoding algorithm for faster LLM inference
  • Created 2 years ago; updated 1 year ago
  • Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Yaowei Zheng (author of LLaMA-Factory).

ZhiLight by zhihu

  • 905 stars
  • LLM inference engine for Llama and variants, optimized for PCIe GPUs
  • Created 1 year ago; updated 2 days ago