Fast-dLLM by NVlabs

Diffusion LLM inference acceleration framework

Created 4 months ago
563 stars

Top 57.1% on SourcePulse

View on GitHub
Project Summary

Fast-dLLM is an open-source framework designed to accelerate the inference of diffusion-based Large Language Models (LLMs), specifically targeting models like Dream and LLaDA. It offers significant speedups for LLM inference without requiring model retraining, making it valuable for researchers and developers working with these computationally intensive models.

How It Works

Fast-dLLM employs two primary techniques for acceleration. First, it introduces a block-wise Key-Value (KV) Cache that reuses attention KV activations across decoding steps within a block, eliminating redundant computation; a "DualCache" extension goes further by also caching the KV activations of the masked suffix tokens, not just the prefix. Second, it implements a confidence-aware parallel decoding strategy: at each step, only the tokens whose prediction confidence exceeds a threshold are unmasked in parallel, balancing efficiency against output quality.
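Below is a minimal sketch of the confidence-aware parallel decoding step for a single block, not the repository's actual implementation. It assumes a hypothetical HuggingFace-style model whose forward pass returns per-position logits; the mask token id and confidence threshold are placeholders, and the real Fast-dLLM code additionally threads the block-wise KV Cache (and DualCache) through each forward pass instead of recomputing attention from scratch.

  import torch

  MASK_ID = 126336        # placeholder mask token id; LLaDA defines its own value
  CONF_THRESHOLD = 0.9    # assumed confidence threshold, not a value taken from the README

  def decode_block(model, x, start, end, max_steps=32):
      # Illustrative confidence-aware parallel decoding for one block of token ids `x`.
      # The real implementation also reuses cached prefix (and, with DualCache, suffix)
      # KV activations inside the forward pass instead of recomputing them.
      for _ in range(max_steps):
          block = x[:, start:end]                      # view into the current block
          masked = block == MASK_ID
          if not masked.any():
              break                                    # every position already committed
          with torch.no_grad():
              logits = model(x).logits[:, start:end]   # assumed HF-style model output
          conf, pred = torch.softmax(logits, -1).max(-1)
          accept = masked & (conf >= CONF_THRESHOLD)   # unmask only confident positions
          if not accept.any():                         # guarantee progress each step
              best = torch.where(masked, conf, conf.new_full(conf.shape, -1.0)).argmax(-1)
              accept[torch.arange(x.size(0)), best] = True
          block[accept] = pred[accept]                 # commit accepted tokens in parallel
      return x

Lowering the threshold unmasks more tokens per step (fewer model calls) at the cost of output quality, which is the efficiency/quality trade-off described above.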

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Usage example (LLaDA): python llada/chat.py --gen_length 128 --steps 128 --block_size 32
  • Prerequisites: Python and standard ML libraries (as listed in requirements.txt). The README does not explicitly require specific hardware such as GPUs, but the reported performance gains are typically observed on GPUs.
  • Demo: https://fast-dllm.hanlab.ai/

Highlighted Details

  • Achieves 2x to 3.6x speedup with KV Cache alone.
  • Combined KV Cache and parallel decoding yield up to 8.1x speedup on GSM8K (256 tokens) and 4.0x on HumanEval (512 tokens).
  • Supports interactive chat and model evaluation.
  • Integrated into LLaDA-V, reportedly reducing latency from 60s to 6s.

Maintenance & Community

The project is associated with NVlabs and appears to be actively updated, with recent news entries in July and August 2025. No specific community channels (Discord/Slack) are listed.

Licensing & Compatibility

Licensed under the Apache License 2.0. This license is permissive and generally compatible with commercial and closed-source use.

Limitations & Caveats

The README focuses on performance gains and implementation details; it does not discuss compatibility with models beyond Dream and LLaDA, nor performance across diverse hardware configurations. Since the project is based on recent research, the implementation is likely to keep evolving.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 6
  • Star History: 117 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

0.2% · 1k stars
Parallel decoding algorithm for faster LLM inference
Created 1 year ago · Updated 7 months ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

0.8% · 6k stars
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago · Updated 1 day ago