Discrete-Diffusion-Forcing by SJTU-DENG-Lab

Enabling dLLMs for faster-than-AR inference

Created 9 months ago
253 stars

Top 99.3% on SourcePulse

Summary

This project introduces Discrete Diffusion Forcing (D2F), a training and inference paradigm designed to overcome the speed limitations of discrete diffusion language models (dLLMs). According to the project, D2F is the first approach that lets dLLMs exceed autoregressive (AR) inference speed, a throughput gain aimed at researchers and engineers who need high-performance generation from large language models.

How It Works

D2F uses block-wise causal attention: tokens attend bidirectionally within a block and causally across blocks. Because earlier blocks never attend to later ones, the design is compatible with standard KV caching, which removes much of the redundant computation. The model is trained via asymmetric distillation, in which a student dLLM restricted to a limited, causal context learns to mimic a more powerful teacher dLLM. At inference time, pipelined decoding refines multiple blocks in parallel to raise throughput and GPU utilization.
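
The attention pattern described above can be illustrated with a minimal sketch. This is not the repository's implementation: the function names block_causal_mask and render, and the fixed block_size parameter, are hypothetical and chosen only for illustration. The sketch builds a boolean mask in which a query position may attend to every key position in its own block or in any earlier block.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Illustrative assumption: positions are grouped into consecutive blocks of
    `block_size` tokens. Attention is bidirectional inside a block and causal
    between blocks, so position i sees every position whose block index is
    less than or equal to its own.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw the mask as text: 'x' marks an allowed query/key pair, '.' a masked one."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    # Six tokens in blocks of two: each pair of rows shares the same visibility.
    print(render(block_causal_mask(6, 2)))
    # xx....
    # xx....
    # xxxx..
    # xxxx..
    # xxxxxx
    # xxxxxx
```

Note that the rows for the first two blocks are identical whether the sequence has four tokens or six. Appending a new block does not change what earlier positions can see, which is the property that lets keys and values for finished blocks be cached and reused.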

Quick Start & Requirements

  • Primary install: Clone the repository, set up the environment using uv sync or Conda (conda create -n d2f python=3.10; conda activate d2f), and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.10 (recommended via Conda). vLLM integration is preliminary; full support is pending.
  • Resources: Links to the paper, blog post, online demo, Discord, and WeChat are provided in the README.

Highlighted Details

  • Achieves up to 2.5x speedup over leading AR models like LLaMA3-8B.
  • Delivers over 50x acceleration compared to vanilla dLLM baselines.
  • Maintains comparable generation quality on standard reasoning and coding benchmarks.
  • Demonstrates significant gains in tokens per second (TPS): up to 52.9x on LLaDA-Instruct-8B and up to 10.1x on Dream-Base-7B.
  • Preliminary integration with vLLM shows potential for multiplicative speedups (e.g., 6.5x with Dream-Base).

Maintenance & Community

  • Recent news includes the release of the training pipeline (Aug 20, 2025) and inference code (Aug 8, 2025).
  • Community channels include Discord and WeChat.
  • Future work focuses on implementing fused dLLM-specific decoding kernels for vLLM, distributed inference, and CUDA graph capturing.

Licensing & Compatibility

  • License: Not specified in the README.
  • Compatibility: No explicit notes on commercial use or closed-source linking are provided.

Limitations & Caveats

The vLLM integration is a preliminary proof of concept and currently shows a score drop that the authors are working to address. Further optimization is planned, including specialized CUDA kernels and distributed inference. The README does not state a license, which may affect adoption decisions.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days