Diffusion LLM inference acceleration framework
Fast-dLLM is an open-source framework designed to accelerate the inference of diffusion-based Large Language Models (LLMs), specifically targeting models like Dream and LLaDA. It offers significant speedups for LLM inference without requiring model retraining, making it valuable for researchers and developers working with these computationally intensive models.
How It Works
Fast-dLLM employs two primary techniques for acceleration. First, it introduces a block-wise Key-Value (KV) Cache mechanism that reuses attention KV activations across decoding steps within blocks, reducing redundant computations. A "DualCache" extension further optimizes this by caching masked suffix tokens. Second, it implements a confidence-aware parallel decoding strategy, where only tokens exceeding a confidence threshold are unmasked in parallel at each step, balancing efficiency and output quality.
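As a rough illustration of how these two ideas compose, the sketch below (PyTorch) decodes a single block by prefilling a KV cache for the prefix once and then repeatedly unmasking, in parallel, every position whose top-token confidence clears a threshold. It is a minimal sketch under assumed interfaces: model.prefill, model.forward_block, the mask_id token, and the default threshold are illustrative placeholders, not the project's actual API.

import torch

@torch.no_grad()
def decode_block(model, tokens, block_start, block_end, mask_id,
                 threshold=0.9, max_steps=32):
    """Confidence-aware parallel decoding of one block with a prefix KV cache.

    Hypothetical interface: `model.prefill` returns cached KV activations for
    the prefix; `model.forward_block` scores the block reusing that cache.
    """
    # Block-wise KV cache: compute prefix activations once, reuse every step.
    past_kv = model.prefill(tokens[:, :block_start])
    for _ in range(max_steps):
        masked = tokens[:, block_start:block_end] == mask_id
        if not masked.any():                      # block fully decoded
            break
        logits = model.forward_block(tokens[:, block_start:block_end], past_kv=past_kv)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)   # per-token confidence + argmax
        accept = masked & (conf >= threshold)
        if not accept.any():
            # Always unmask at least the single most confident masked token
            # so decoding makes progress even when nothing clears the threshold.
            best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax(dim=-1)
            accept = torch.zeros_like(masked)
            accept[torch.arange(tokens.size(0)), best] = True
        block = tokens[:, block_start:block_end]
        block[accept] = pred[accept]              # unmask accepted tokens in parallel
    return tokens

The threshold trades speed for quality: a lower value unmasks more tokens per step (fewer model calls) at some risk to output quality, which mirrors the balance described above.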
Quick Start & Requirements
Install the dependencies, then launch an interactive chat session with the LLaDA model:
pip install -r requirements.txt
python llada/chat.py --gen_length 128 --steps 128 --block_size 32
Dependencies are listed in requirements.txt. No specific hardware such as a GPU is explicitly mandated in the README, but the performance gains are typically observed on GPUs.
Highlighted Details
Maintenance & Community
The project is associated with NVlabs and appears to be actively updated, with recent news entries in July and August 2025. No specific community channels (Discord/Slack) are listed.
Licensing & Compatibility
Licensed under the Apache License 2.0. This license is permissive and generally compatible with commercial and closed-source use.
Limitations & Caveats
The README focuses on performance gains and implementation details. Limitations regarding model compatibility beyond Dream and LLaDA, or performance across diverse hardware configurations, are not documented. The project is based on recent research, so interfaces and results may continue to evolve.