dflash by z-lab

Ultra-fast speculative decoding for LLMs

Created 1 week ago · New!
275 stars · Top 94.1% on SourcePulse

View on GitHub
Project Summary

DFlash introduces "Block Diffusion" for "Flash Speculative Decoding," accelerating large language model inference. It enables efficient, high-quality parallel drafting, benefiting researchers and developers who need faster LLM generation.

How It Works

The core innovation is a lightweight block diffusion model used as the drafter in speculative decoding. Instead of proposing tokens one at a time, the drafter proposes whole blocks of tokens in parallel, treating generation as a denoising process over token sequences; the target model then verifies each proposed block. This parallel drafting significantly speeds up generation relative to standard speculative decoding, which relies on a smaller autoregressive draft model, as the sketch below illustrates.
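
As a rough illustration of the draft-then-verify loop described above, here is a minimal sketch of one block-wise speculative step. It is not the DFlash implementation: both model functions are toy stand-ins, greedy verification is only one possible acceptance rule, and names such as draft_block are hypothetical.

    import random

    VOCAB_SIZE = 100
    BLOCK_SIZE = 16

    def draft_block(context, block_size=BLOCK_SIZE):
        # Stand-in for the block diffusion drafter: proposes a whole
        # block of tokens at once rather than one token at a time.
        return [random.randrange(VOCAB_SIZE) for _ in range(block_size)]

    def target_next_token(context):
        # Stand-in for the target model's greedy next-token choice.
        return (sum(context) * 2654435761 + len(context)) % VOCAB_SIZE

    def speculative_step(context):
        # One draft-then-verify step; returns the tokens accepted this step.
        proposal = draft_block(context)
        accepted = []
        for token in proposal:
            expected = target_next_token(context + accepted)
            if token != expected:
                accepted.append(expected)  # first mismatch: keep target token, stop
                break
            accepted.append(token)         # match: drafted token verified "for free"
        else:
            # Entire block accepted; the target still yields one bonus token.
            accepted.append(target_next_token(context + accepted))
        return accepted

    if __name__ == "__main__":
        context = [1, 2, 3]
        step = speculative_step(context)
        print(f"accepted {len(step)} token(s) this step: {step}")

The more drafted tokens the target accepts per step (the "acceptance length"), the fewer sequential target-model calls are needed, which is where the speedup comes from.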

Quick Start & Requirements

  • Primary install / run command: requires Python 3.11 via Conda; clone the repository, then install dependencies with pip install -r requirements.txt and pip install flash-attn --no-build-isolation (see the command sketch after this list).
  • Non-default prerequisites: a CUDA-enabled GPU (recommended for the example usage) and the flash-attn library.
  • Estimated setup time or resource footprint: the example usage runs on a single GPU.
  • Links: the GitHub repository. A blog post is mentioned, and the full paper is "Coming Soon."
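
A plausible end-to-end setup assembled from the commands named above, shown as a sketch rather than official instructions; the repository URL is an assumption inferred from the z-lab organization name.

    # Assumed setup flow; the clone URL is inferred, not confirmed.
    conda create -n dflash python=3.11 -y
    conda activate dflash
    git clone https://github.com/z-lab/dflash.git
    cd dflash
    pip install -r requirements.txt
    pip install flash-attn --no-build-isolation   # needs a CUDA toolchain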

Highlighted Details

  • Achieves "Ultra-Fast Speculative Decoding" via a novel "Block Diffusion" technique.
  • Provides scripts (run_benchmark.sh) to reproduce the reported speedup and acceptance-length metrics.
  • Benchmarks were conducted on NVIDIA B200 GPUs.
  • Includes a Python example that loads the DFlash draft model (z-lab/Qwen3-8B-DFlash-b16) alongside the target model (Qwen3-8B) for speculative decoding; a hedged sketch of this loading step follows below.
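
The README's own example is not reproduced in this summary, so the following is a hedged sketch of what the loading step typically looks like with Hugging Face transformers. The Qwen/Qwen3-8B target id is an assumption, and DFlash's actual generation entry point may differ.

    # Hedged sketch only: model loading via transformers; not the DFlash API.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    target = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3-8B",               # target model (HF id assumed)
        torch_dtype="auto",
        device_map="auto",
    )
    draft = AutoModelForCausalLM.from_pretrained(
        "z-lab/Qwen3-8B-DFlash-b16",   # DFlash block diffusion draft model
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,        # custom architecture; review before running
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")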

Maintenance & Community

The project is associated with a forthcoming research paper and a blog post. No specific community channels (e.g., Discord, Slack) or detailed contributor information are provided in the README.

Licensing & Compatibility

The repository's README does not specify a software license. This omission requires clarification for adoption decisions, especially concerning commercial use or integration into proprietary systems.

Limitations & Caveats

The project is research-oriented, and the paper is still "Coming Soon," so interfaces and results may change. The example usage targets a single GPU. Loading the draft model requires trust_remote_code=True for its custom architecture, which warrants a careful security review. The flash-attn install uses --no-build-isolation, which can occasionally cause build issues.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 288 stars in the last 7 days

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

Explore Similar Projects

EAGLE by SafeAILab

Top 0.9% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 2 years ago · Updated 3 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Lei Zhang (Director of Engineering, AI at AMD), and 23 more.

gpt-fast by meta-pytorch

Top 0.1% · 6k stars
PyTorch text generation for efficient transformer inference
Created 2 years ago · Updated 4 months ago