z-lab / DFlash — Ultra-fast speculative decoding for LLMs
New!
Top 94.1% on SourcePulse
DFlash introduces "Block Diffusion" for "Flash Speculative Decoding," accelerating large language model inference. It enables efficient, high-quality parallel drafting, benefiting researchers and developers who need faster LLM generation.
How It Works
The core innovation is a lightweight block diffusion model designed for speculative decoding. Instead of drafting tokens one at a time, the model proposes multiple tokens simultaneously in blocks, significantly speeding up generation compared to traditional sequential decoding. Generation is treated as a denoising process over token sequences, which can make token prediction more efficient than the standard approach of running a smaller autoregressive draft model.
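To make the accept/verify dynamic concrete, here is a minimal toy sketch of a generic speculative-decoding step, not DFlash's actual API: the function names, the deterministic toy models, and the token arithmetic are all illustrative assumptions. A draft model proposes a block of k tokens in one parallel pass, and the target model checks them left to right, keeping the longest agreeing prefix plus one corrected token on the first mismatch.

```python
def draft_block(prefix, k):
    # Hypothetical block-parallel draft: guesses the next k tokens at once.
    # This toy draft is correct for the first 3 positions, then wrong.
    return [(prefix[-1] + i + 1 if i < 3 else -1) for i in range(k)]

def target_next(prefix):
    # Hypothetical target model: deterministic "ground truth" next token.
    return prefix[-1] + 1

def speculative_step(prefix, k):
    proposal = draft_block(prefix, k)
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # Reject the rest of the block; the target's own token is still gained.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# One step from prefix [7] with a block of 5: three drafted tokens are
# accepted, plus the target's correction — four tokens from one target pass.
print(speculative_step([7], k=5))  # → [8, 9, 10, 11]
```

The "acceptance length" metric mentioned below corresponds to how many drafted tokens survive this verification per step; the more the draft agrees with the target, the larger the speedup.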
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt, then install the required flash-attn library with pip install flash-attn --no-build-isolation.
Highlighted Details
Benchmark scripts (run_benchmark.sh) to reproduce reported speedup and acceptance-length metrics.
Pretrained draft models (z-lab/Qwen3-8B-DFlash-b16) and target models (Qwen3-8B) for speculative decoding.
Maintenance & Community
The project is associated with a forthcoming research paper and a blog post. No specific community channels (e.g., Discord, Slack) or detailed contributor information are provided in the README.
Licensing & Compatibility
The repository's README does not specify a software license. This omission requires clarification for adoption decisions, especially concerning commercial use or integration into proprietary systems.
Limitations & Caveats
The project is research-oriented, with its paper listed as "Coming Soon," so interfaces may still change. Example usage recommends running on a single GPU. Loading the custom model architectures requires trust_remote_code=True, which warrants a careful security review. The flash-attn installation uses --no-build-isolation, which can occasionally cause build issues.