dflash by z-lab

Ultra-fast speculative decoding for LLMs

Created 1 week ago · New!
275 stars · Top 94.1% on SourcePulse

View on GitHub
Project Summary

DFlash introduces "Block Diffusion" for "Flash Speculative Decoding," accelerating large language model inference. It enables efficient, high-quality parallel drafting, benefiting researchers and developers who need faster LLM generation.

How It Works

The core innovation is a lightweight block diffusion model used as the drafter in speculative decoding. Instead of proposing tokens one at a time, the drafter proposes whole blocks of tokens in parallel, treating generation as a denoising process over token sequences; the target model then verifies each proposed block. This parallel drafting significantly speeds up generation relative to standard speculative decoding, which relies on a smaller autoregressive draft model, as the sketch below illustrates.
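
As a rough illustration of the draft-then-verify loop described above, here is a minimal sketch of one block-wise speculative step. It is not the DFlash implementation: both model functions are toy stand-ins, greedy verification is only one possible acceptance rule, and names such as draft_block are hypothetical.

    import random

    VOCAB_SIZE = 100
    BLOCK_SIZE = 16

    def draft_block(context, block_size=BLOCK_SIZE):
        # Stand-in for the block diffusion drafter: proposes a whole
        # block of tokens at once rather than one token at a time.
        return [random.randrange(VOCAB_SIZE) for _ in range(block_size)]

    def target_next_token(context):
        # Stand-in for the target model's greedy next-token choice.
        return (sum(context) * 2654435761 + len(context)) % VOCAB_SIZE

    def speculative_step(context):
        # One draft-then-verify step; returns the tokens accepted this step.
        proposal = draft_block(context)
        accepted = []
        for token in proposal:
            expected = target_next_token(context + accepted)
            if token != expected:
                accepted.append(expected)  # first mismatch: keep target token, stop
                break
            accepted.append(token)         # match: drafted token verified "for free"
        else:
            # Entire block accepted; the target still yields one bonus token.
            accepted.append(target_next_token(context + accepted))
        return accepted

    if __name__ == "__main__":
        context = [1, 2, 3]
        step = speculative_step(context)
        print(f"accepted {len(step)} token(s) this step: {step}")

The more drafted tokens the target accepts per step (the "acceptance length"), the fewer sequential target-model calls are needed, which is where the speedup comes from.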

Quick Start & Requirements

  • Primary install / run command: requires Python 3.11 via Conda; clone the repository, then install dependencies with pip install -r requirements.txt and pip install flash-attn --no-build-isolation (see the command sketch after this list).
  • Non-default prerequisites: a CUDA-enabled GPU (recommended for the example usage) and the flash-attn library.
  • Estimated setup time or resource footprint: the example usage runs on a single GPU.
  • Links: the GitHub repository. A blog post is mentioned, and the full paper is "Coming Soon."
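
A plausible end-to-end setup assembled from the commands named above, shown as a sketch rather than official instructions; the repository URL is an assumption inferred from the z-lab organization name.

    # Assumed setup flow; the clone URL is inferred, not confirmed.
    conda create -n dflash python=3.11 -y
    conda activate dflash
    git clone https://github.com/z-lab/dflash.git
    cd dflash
    pip install -r requirements.txt
    pip install flash-attn --no-build-isolation   # needs a CUDA toolchain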

Highlighted Details

  • Achieves "Ultra-Fast Speculative Decoding" via a novel "Block Diffusion" technique.
  • Provides scripts (run_benchmark.sh) to reproduce the reported speedup and acceptance-length metrics.
  • Benchmarks were conducted on NVIDIA B200 GPUs.
  • Includes a Python example that loads the DFlash draft model (z-lab/Qwen3-8B-DFlash-b16) alongside the target model (Qwen3-8B) for speculative decoding; a hedged sketch of this loading step follows below.
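
The README's own example is not reproduced in this summary, so the following is a hedged sketch of what the loading step typically looks like with Hugging Face transformers. The Qwen/Qwen3-8B target id is an assumption, and DFlash's actual generation entry point may differ.

    # Hedged sketch only: model loading via transformers; not the DFlash API.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    target = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3-8B",               # target model (HF id assumed)
        torch_dtype="auto",
        device_map="auto",
    )
    draft = AutoModelForCausalLM.from_pretrained(
        "z-lab/Qwen3-8B-DFlash-b16",   # DFlash block diffusion draft model
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,        # custom architecture; review before running
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")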

Maintenance & Community

The project is associated with a forthcoming research paper and a blog post. No specific community channels (e.g., Discord, Slack) or detailed contributor information are provided in the README.

Licensing & Compatibility

The repository's README does not specify a software license. This omission requires clarification for adoption decisions, especially concerning commercial use or integration into proprietary systems.

Limitations & Caveats

The project is research-oriented, and the paper is still "Coming Soon," so interfaces and results may change. The example usage targets a single GPU. Loading the draft model requires trust_remote_code=True for its custom architecture, which warrants a careful security review. The flash-attn install uses --no-build-isolation, which can occasionally cause build issues.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 288 stars in the last 7 days

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

Explore Similar Projects

EAGLE by SafeAILab

Top 0.9% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 2 years ago · Updated 3 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Lei Zhang (Director of Engineering, AI at AMD), and 23 more.

gpt-fast by meta-pytorch

Top 0.1% · 6k stars
PyTorch text generation for efficient transformer inference
Created 2 years ago · Updated 4 months ago