dInfer by inclusionAI

Diffusion language model inference optimized for speed and efficiency

Created 2 months ago
328 stars

Top 83.1% on SourcePulse

View on GitHub
Project Summary

dInfer is an efficient and extensible inference framework for diffusion language models (dLLMs), targeting researchers and engineers seeking to optimize dLLM deployment. It modularizes inference into four key components—model, diffusion iteration manager, decoder, and KV-cache manager—enabling flexible algorithm combinations and supporting batched inference for improved throughput.
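Below is a structural sketch (not dInfer's actual code) of how the four-component split described above might look; every class name and method signature here is an illustrative assumption.

# Illustrative skeleton of the four inference components; names and
# signatures are assumptions, not dInfer's API.
from dataclasses import dataclass

class Model:
    """Wraps the dLLM forward pass: partially masked sequence in, token logits out."""
    def forward(self, tokens, kv_cache):
        ...

class IterationManager:
    """Controls how many denoising iterations run and when to stop early."""
    def __init__(self, max_steps: int = 64):
        self.max_steps = max_steps

class Decoder:
    """Chooses which masked positions to commit at each iteration."""
    def select(self, logits, masked_positions):
        ...

class KVCacheManager:
    """Stores per-layer K/V tensors and decides which entries to refresh."""
    def refresh(self, hidden_states, active_block):
        ...

@dataclass
class InferencePipeline:
    model: Model
    iterations: IterationManager
    decoder: Decoder
    cache: KVCacheManager

Because the pieces are independent, a new decoding or caching algorithm can be swapped in without touching the rest of the pipeline, which is the flexibility the summary refers to.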

How It Works

dInfer employs a modular architecture with distinct components for model loading, diffusion iteration management, decoding strategies, and KV-cache handling. Algorithmic innovations include soft diffusion iteration for smoother denoising, hierarchical and credit decoding for committing multiple tokens in parallel, and a vicinity refresh strategy for KV-cache management. On the systems side, it adds Tensor Parallelism (TP) and Expert Parallelism (EP) for GPU utilization, dynamic batching, PyTorch compilation, NVIDIA CUDA Graphs for efficient kernel execution, and loop unrolling to minimize CUDA stream launch latency across diffusion iterations.
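As a concrete toy illustration of parallel decoding in a diffusion LLM, the loop below commits every masked position whose prediction confidence clears a threshold at each iteration; the threshold, step count, and the random stand-in "model" are assumptions, not dInfer's hierarchical/credit algorithms.

# Toy denoising loop: commit all high-confidence masked positions per step.
import torch

def toy_denoise(logits_fn, seq_len, mask_id, steps=8, threshold=0.9):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for _ in range(steps):
        masked = tokens == mask_id
        if not masked.any():
            break                                          # fully decoded
        probs = torch.softmax(logits_fn(tokens), dim=-1)   # [seq_len, vocab]
        conf, pred = probs.max(dim=-1)
        commit = masked & (conf >= threshold)              # parallel commits
        if not commit.any():                               # always make progress:
            commit = torch.zeros_like(masked)              # fall back to the single
            best = torch.where(masked)[0][conf[masked].argmax()]
            commit[best] = True                            # most confident position
        tokens[commit] = pred[commit]
    return tokens

vocab, mask_id = 32, 32          # mask id sits outside the predicted vocab
out = toy_denoise(lambda t: torch.randn(t.numel(), vocab), seq_len=16, mask_id=mask_id)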

Quick Start & Requirements

Installation involves cloning the repository and installing via pip:

git clone https://github.com/inclusionAI/dInfer.git
cd dInfer
pip install .

Dependencies include PyTorch and the Hugging Face libraries. For MoE models, weights must be converted to FusedMoE format: checkpoints are downloaded with hf_transfer and converted with the provided scripts. NVIDIA GPUs are implied by the optimizations (TP, EP, CUDA Graphs), though explicit hardware requirements are not listed. Links to Hugging Face model weights and an arXiv technical report are provided.
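A hedged example of the MoE preparation flow: hf_transfer is enabled through an environment variable before downloading with huggingface_hub, and the checkpoint is then handed to the repository's conversion script. The model repo id and the script name/flags below are placeholders; consult the dInfer README for the exact commands.

# Download MoE weights with hf_transfer acceleration (requires `pip install hf_transfer`),
# then convert them with dInfer's FusedMoE conversion script.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"   # must be set before the import below

from huggingface_hub import snapshot_download

local_dir = snapshot_download("inclusionAI/LLaDA-MoE")   # placeholder repo id
print("checkpoint downloaded to", local_dir)

# Conversion step (script name and flags are placeholders; see the repo's scripts):
#   python convert_to_fused_moe.py --input <local_dir> --output <fused_dir>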

Highlighted Details

  • Achieves over 1,100 Tokens Per Second (TPS) at batch size 1 and averages 800+ TPS across six benchmarks on a single node with 8x H800 GPUs.
  • Offers significant speedups: 10x faster than Fast-dLLM and 2-3x faster than Qwen2.5-3B on vLLM, while maintaining comparable quality.
  • Supports various dLLM variants, including LLaDA, LLaDA-MoE, LLaDA2.0-mini, and LLaDA2.0-flash.
  • Features algorithmic improvements like soft diffusion iteration, hierarchical/credit decoding, and a vicinity refresh KV-cache strategy (see the sketch below).
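The vicinity refresh idea can be illustrated with a toy helper that recomputes K/V entries only inside a window around the block currently being decoded and reuses the rest; the window size, tensor shapes, and function name are assumptions for illustration, not dInfer's implementation.

# Toy "vicinity refresh": recompute cache entries only near the active block.
import torch

def vicinity_refresh(kv_cache, hidden, k_proj, v_proj, block_start, block_end, window=16):
    # kv_cache: {"k": [seq, d], "v": [seq, d]}; hidden: [seq, d] after the latest step
    lo = max(0, block_start - window)
    hi = min(hidden.size(0), block_end + window)
    with torch.no_grad():                               # inference-only cache update
        kv_cache["k"][lo:hi] = k_proj(hidden[lo:hi])    # refresh stale entries near the block
        kv_cache["v"][lo:hi] = v_proj(hidden[lo:hi])    # distant entries are reused as-is
    return kv_cache

d, seq = 8, 64
k_proj, v_proj = torch.nn.Linear(d, d), torch.nn.Linear(d, d)
cache = {"k": torch.zeros(seq, d), "v": torch.zeros(seq, d)}
cache = vicinity_refresh(cache, torch.randn(seq, d), k_proj, v_proj, block_start=32, block_end=40)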

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or notable contributors/sponsorships are detailed in the README.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, permitting commercial use and modification.

Limitations & Caveats

LLaDA2 models are limited to 4-way Tensor Parallelism (TP). Block Diffusion is exclusively supported on LLaDA2 models, not LLaDA Dense/MoE variants. The evaluation suite is currently configured only for LLaDA-MoE, with plans to extend support to LLaDA Dense/LLaDA2 models in the future.

Health Check

Last Commit: 5 days ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 8

Star History

66 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Maxime Labonne (Head of Post-Training at Liquid AI), and 1 more.

GPTFast by MDK8888

HF Transformers accelerator for faster inference
686 stars
Created 1 year ago; updated 1 year ago