Fast-dLLM by NVlabs

Diffusion LLM inference acceleration framework

created 2 months ago
350 stars

Top 79.3% on SourcePulse

Project Summary

Fast-dLLM is an open-source framework designed to accelerate the inference of diffusion-based Large Language Models (LLMs), specifically targeting models like Dream and LLaDA. It offers significant speedups for LLM inference without requiring model retraining, making it valuable for researchers and developers working with these computationally intensive models.

How It Works

Fast-dLLM employs two primary techniques for acceleration. First, it introduces a block-wise Key-Value (KV) Cache mechanism that reuses attention KV activations across decoding steps within blocks, reducing redundant computations. A "DualCache" extension further optimizes this by caching masked suffix tokens. Second, it implements a confidence-aware parallel decoding strategy, where only tokens exceeding a confidence threshold are unmasked in parallel at each step, balancing efficiency and output quality.
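
To make the second technique concrete, here is a minimal sketch of one confidence-aware unmasking step in PyTorch. It is illustrative only: the function name, the greedy per-position scoring, the 0.9 threshold, and the fall-back rule that always unmasks at least one position are assumptions, not the repository's actual implementation, which runs inside the diffusion sampler together with the block-wise KV cache.

    import torch

    def parallel_unmask_step(logits, tokens, mask_id, confidence_threshold=0.9):
        # logits: (seq_len, vocab) model outputs; tokens: (seq_len,) with
        # masked positions holding mask_id. Illustrative sketch only.
        probs = torch.softmax(logits, dim=-1)
        confidence, candidates = probs.max(dim=-1)   # greedy token per position
        masked = tokens == mask_id                   # only masked slots eligible
        accept = masked & (confidence >= confidence_threshold)
        if masked.any() and not accept.any():
            # Assumed progress guarantee: unmask at least the single most
            # confident masked position even if all fall below the threshold.
            best = torch.where(masked, confidence,
                               torch.full_like(confidence, -1.0)).argmax()
            accept[best] = True
        return torch.where(accept, candidates, tokens)

Positions that stay masked are revisited in later steps, so the threshold trades step count against the risk of committing to low-confidence tokens.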

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Usage example (LLaDA): python llada/chat.py --gen_length 128 --steps 128 --block_size 32
  • Prerequisites: Python and the standard ML stack listed in requirements.txt. The README does not explicitly mandate specific hardware such as GPUs, but the reported speedups are typically obtained on them. See the sketch after this list for how the example's flags relate.
  • Demo: https://fast-dllm.hanlab.ai/
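
As a rough guide to how the example's flags interact, the arithmetic below assumes the step budget is split evenly across blocks, consistent with the block-wise decoding described above; the exact scheduling is defined by llada/chat.py itself.

    # Interpretation of the example flags (assumption, not the project's code):
    gen_length, steps, block_size = 128, 128, 32

    num_blocks = gen_length // block_size    # 4 blocks, decoded left to right
    steps_per_block = steps // num_blocks    # 32 denoising steps per block
    print(num_blocks, steps_per_block)       # -> 4 32

Parallel decoding can finish a block in fewer steps than this budget, which is part of where the reported speedups come from.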

Highlighted Details

  • Achieves 2x to 3.6x speedup with KV Cache alone.
  • Combined KV Cache and parallel decoding yield up to 8.1x speedup on GSM8K (256 tokens) and 4.0x on HumanEval (512 tokens).
  • Supports interactive chat and model evaluation.
  • Integrated into LLaDA-V, reportedly reducing latency from 60s to 6s.

Maintenance & Community

The project is associated with NVlabs and appears to be actively updated, with recent news entries in July and August 2025. No specific community channels (Discord/Slack) are listed.

Licensing & Compatibility

Licensed under the Apache License 2.0. This license is permissive and generally compatible with commercial and closed-source use.

Limitations & Caveats

The README focuses on performance gains and implementation details. It does not address compatibility with models beyond Dream and LLaDA, nor performance across diverse hardware configurations. As the project tracks recent research, further development and refinement are likely.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 52 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 42 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

created 2 years ago, updated 19 hours ago
55k stars

Top 1.4% on SourcePulse