speculative-decoding by lucidrains

Speculative decoding explorations

Created 2 years ago
285 stars

Top 91.9% on SourcePulse

Project Summary

This repository explores speculative decoding techniques to accelerate text-to-semantic decoders, particularly for applications like Spear-TTS. It targets researchers and engineers seeking to improve inference speed for large language models.

How It Works

The project implements and experiments with several speculative decoding strategies, including early exit schemes and a "prophet transformer" approach. These methods speed up generation by having a smaller, faster "draft" model (or a cheap early-exit head of the target model itself) propose a short run of tokens, which the larger, more accurate target model then verifies in a single batched forward pass, cutting the number of expensive sequential passes.
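The draft-and-verify loop described above can be sketched in a few lines of pure Python. This is an illustrative toy, not the repository's implementation: the names (`draft_next`, `target_next`, `speculative_step`) and the integer "models" are invented for the example, and real systems sample from probability distributions rather than comparing greedy tokens.

```python
# Minimal greedy speculative decoding sketch (illustrative only).
# A cheap "draft" model proposes k tokens; the "target" model verifies
# them and keeps the longest agreeing prefix, then adds one token of its own.

def draft_next(seq):
    # Toy draft model: predicts (last token + 1) mod 10.
    return (seq[-1] + 1) % 10

def target_next(seq):
    # Toy target model: agrees with the draft except after token 4,
    # where it predicts 7 instead. Disagreements stop acceptance.
    return 7 if seq[-1] == 4 else (seq[-1] + 1) % 10

def speculative_step(seq, k=4):
    # 1. Draft k tokens autoregressively with the cheap model.
    drafted, cur = [], list(seq)
    for _ in range(k):
        t = draft_next(cur)
        drafted.append(t)
        cur.append(t)
    # 2. Verify: the target checks each drafted position. In a real system
    #    this is ONE batched forward pass, not k sequential ones -- that is
    #    where the speedup comes from.
    accepted, cur = [], list(seq)
    for t in drafted:
        if target_next(cur) != t:
            break
        accepted.append(t)
        cur.append(t)
    # 3. The target's own prediction at the first mismatch (or past the
    #    accepted run) comes for free, so every step gains >= 1 token.
    accepted.append(target_next(cur))
    return seq + accepted

seq = [0]
while len(seq) < 10:
    seq = speculative_step(seq)
```

Each call to `speculative_step` emits between one token (immediate disagreement) and `k + 1` tokens (full agreement), while the target model runs only one verification pass per step.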

Quick Start & Requirements

  • Installation: pip install ... (specific command not provided in README)
  • Dependencies: PyTorch, CUDA (implied for performance)
  • Resources: Requires significant computational resources for training and experimentation.
  • Links: No direct quick-start or demo links provided.

Highlighted Details

  • Explores early exit schemes and a novel "prophet transformer" for speculative decoding.
  • Investigates batched speculative decoding for improved efficiency.
  • Aims to optimize performance and reduce indexing overhead in batched decoding.
  • Benchmarking and comparison charts are planned.
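The early exit scheme mentioned above can be sketched as follows: instead of a separate draft network, a shallow prefix of the target model's own layer stack drafts tokens, and the full-depth forward pass verifies them. This is a hypothetical toy (the layer functions and names are invented), meant only to show why the draft is cheap and why the full pass can disagree with it.

```python
# Hypothetical early-exit drafting sketch (not the repository's code).
# The "model" is a stack of toy layers; drafting exits halfway through.

LAYERS = [
    lambda h: h + 1,
    lambda h: h * 3,
    lambda h: h + 2,
    lambda h: h * 2,
]

def predict(token, depth):
    # Run the first `depth` layers, then map the hidden state to a token id.
    h = token
    for layer in LAYERS[:depth]:
        h = layer(h)
    return h % 10

def draft(token):
    # Cheap: exit after half the layers.
    return predict(token, depth=len(LAYERS) // 2)

def verify(token):
    # Accurate: run the full stack.
    return predict(token, depth=len(LAYERS))
```

Drafting costs half the compute here; when `draft` and `verify` disagree (e.g. for input `2`, the shallow exit yields `9` while the full pass yields `2`), verification rejects the drafted token and the target's prediction is used instead, exactly as in the draft-model variant.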

Maintenance & Community

  • Sponsored by StabilityAI and Hugging Face.
  • The author, lucidrains (Phil Wang), is known for open-source implementations of AI research.
  • No explicit community links (Discord, Slack) are mentioned.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Assumed to be compatible with PyTorch-based ecosystems.

Limitations & Caveats

The project is described as "explorations," and some functionalities like batched speculative decoding are noted as requiring significant work to become usable. Performance optimization is an ongoing effort.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Cody Yu (Coauthor of vLLM; MTS at OpenAI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

Consistency_LLM by hao-ai-lab

  • Parallel decoder for efficient LLM inference
  • 0.3% · 404 stars
  • Created 1 year ago · Updated 10 months ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

  • Speculative decoding research paper for faster LLM inference
  • 10.6% · 2k stars
  • Created 1 year ago · Updated 1 week ago