Diffusion LLM inference acceleration framework
Fast-dLLM is an open-source framework designed to accelerate the inference of diffusion-based Large Language Models (LLMs), specifically targeting models like Dream and LLaDA. It offers significant speedups for LLM inference without requiring model retraining, making it valuable for researchers and developers working with these computationally intensive models.
How It Works
Fast-dLLM employs two primary techniques for acceleration. First, it introduces a block-wise Key-Value (KV) Cache mechanism that reuses attention KV activations across decoding steps within blocks, reducing redundant computations. A "DualCache" extension further optimizes this by caching masked suffix tokens. Second, it implements a confidence-aware parallel decoding strategy, where only tokens exceeding a confidence threshold are unmasked in parallel at each step, balancing efficiency and output quality.
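As a rough illustration of how these two ideas compose, the sketch below (PyTorch) decodes a single block by prefilling a KV cache for the prefix once and then repeatedly unmasking, in parallel, every position whose top-token confidence clears a threshold. It is a minimal sketch under assumed interfaces: model.prefill, model.forward_block, the mask_id token, and the default threshold are illustrative placeholders, not the project's actual API.

import torch

@torch.no_grad()
def decode_block(model, tokens, block_start, block_end, mask_id,
                 threshold=0.9, max_steps=32):
    """Confidence-aware parallel decoding of one block with a prefix KV cache.

    Hypothetical interface: `model.prefill` returns cached KV activations for
    the prefix; `model.forward_block` scores the block reusing that cache.
    """
    # Block-wise KV cache: compute prefix activations once, reuse every step.
    past_kv = model.prefill(tokens[:, :block_start])
    for _ in range(max_steps):
        masked = tokens[:, block_start:block_end] == mask_id
        if not masked.any():                      # block fully decoded
            break
        logits = model.forward_block(tokens[:, block_start:block_end], past_kv=past_kv)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)   # per-token confidence + argmax
        accept = masked & (conf >= threshold)
        if not accept.any():
            # Always unmask at least the single most confident masked token
            # so decoding makes progress even when nothing clears the threshold.
            best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax(dim=-1)
            accept = torch.zeros_like(masked)
            accept[torch.arange(tokens.size(0)), best] = True
        block = tokens[:, block_start:block_end]
        block[accept] = pred[accept]              # unmask accepted tokens in parallel
    return tokens

The threshold trades speed for quality: a lower value unmasks more tokens per step (fewer model calls) at some risk to output quality, which mirrors the balance described above.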
Quick Start & Requirements
Install the dependencies, then launch an interactive chat session with the LLaDA model:
pip install -r requirements.txt
python llada/chat.py --gen_length 128 --steps 128 --block_size 32
Dependencies are listed in requirements.txt. No specific hardware such as a GPU is explicitly mandated in the README, but the performance gains are typically observed on GPUs.
Highlighted Details
Maintenance & Community
The project is associated with NVlabs and appears to be actively updated, with recent news entries in July and August 2025. No specific community channels (Discord/Slack) are listed.
Licensing & Compatibility
Licensed under the Apache License 2.0. This license is permissive and generally compatible with commercial and closed-source use.
Limitations & Caveats
The README focuses on performance gains and implementation details. Limitations regarding model compatibility beyond Dream and LLaDA, or performance across diverse hardware configurations, are not documented. The project is based on recent research, so interfaces and results may continue to evolve.