Discrete-Diffusion-Forcing by SJTU-DENG-Lab

Enabling dLLMs for faster-than-AR inference

Created 11 months ago

258 stars

Top 98.0% on SourcePulse

Project Summary

This project introduces Discrete Diffusion Forcing (D2F), a training and inference paradigm designed to overcome the slow decoding of diffusion large language models (dLLMs). According to the authors, D2F is the first method that lets dLLMs generate text faster than autoregressive (AR) models of similar size, which makes it relevant to researchers and engineers who need high-throughput generation.

How It Works

D2F uses a hybrid AR-diffusion design built on block-wise causal attention: tokens attend bidirectionally within their own block and causally to earlier blocks. Because a finished block never depends on later blocks, its keys and values can be stored in a standard KV cache instead of being recomputed at every step. The model is trained by asymmetric distillation, in which a student dLLM that sees only this limited, causal context learns to mimic a stronger teacher dLLM. Inference uses pipelined decoding, which refines several blocks in parallel to keep the GPU busy.
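
The block-wise causal attention pattern described above can be illustrated with a small mask builder. This is a minimal sketch in plain Python, not code from the D2F repository: the function names block_causal_mask and render, and the choice of seq_len=6 and block_size=2, are assumptions made for illustration only.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Hypothetical helper for illustration: a position sees every position in
    its own block (bidirectional) and in all earlier blocks (causal).
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [k // block_size <= q // block_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed attention and '.' for masked."""
    return "\n".join("".join("x" if allowed else "." for allowed in row) for row in mask)


mask = block_causal_mask(seq_len=6, block_size=2)
print(render(mask))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx

# Appending a new block leaves the rows of earlier blocks unchanged, and those
# rows never look at the new positions. This is the property that lets the
# keys and values of finished blocks be cached.
longer = block_causal_mask(seq_len=8, block_size=2)
assert all(longer[q][:6] == mask[q] for q in range(6))
assert all(not any(longer[q][6:]) for q in range(6))
```

In a real model the same pattern would typically be applied as an additive or boolean attention mask inside the transformer; the sketch only shows which positions are visible to which.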

Quick Start & Requirements

Primary install: Clone the repository, set up the environment using uv sync or Conda (conda create -n d2f python=3.10; conda activate d2f), and install dependencies (pip install -r requirements.txt).
Prerequisites: Python 3.10 (recommended via Conda). vLLM integration is preliminary; full support is pending.
Resources: Links to the paper, blog post, online demo, Discord, and WeChat are provided in the README.

Highlighted Details

Achieves up to 2.5x speedup over leading AR models like LLaMA3-8B.
Delivers over 50x acceleration compared to vanilla dLLM baselines.
Maintains comparable generation quality on standard reasoning and coding benchmarks.
Reports throughput gains, measured in tokens per second (TPS), of up to 52.9x on LLaDA-Instruct-8B and up to 10.1x on Dream-Base-7B.
Preliminary integration with vLLM shows potential for multiplicative speedups (e.g., 6.5x with Dream-Base).

Maintenance & Community

Recent news includes the release of the training pipeline (Aug 20, 2025) and inference code (Aug 8, 2025).
Community channels include Discord and WeChat.
Future work focuses on implementing fused dLLM-specific decoding kernels for vLLM, distributed inference, and CUDA graph capturing.

Licensing & Compatibility

License: Not specified in the README.
Compatibility: No explicit notes on commercial use or closed-source linking are provided.

Limitations & Caveats

The vLLM integration is a preliminary proof of concept and currently shows a drop in evaluation scores, which the authors say they are working to address. Further optimization, including specialized CUDA kernels and distributed inference, is planned. The README does not state a license, which may affect adoption decisions.
