dInfer by inclusionAI

Diffusion language model inference optimized for speed and efficiency

Created 2 months ago
328 stars

Top 83.1% on SourcePulse

View on GitHub
Project Summary

dInfer is an efficient and extensible inference framework for diffusion language models (dLLMs), targeting researchers and engineers seeking to optimize dLLM deployment. It modularizes inference into four key components—model, diffusion iteration manager, decoder, and KV-cache manager—enabling flexible algorithm combinations and supporting batched inference for improved throughput.
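Below is a structural sketch (not dInfer's actual code) of how the four-component split described above might look; every class name and method signature here is an illustrative assumption.

# Illustrative skeleton of the four inference components; names and
# signatures are assumptions, not dInfer's API.
from dataclasses import dataclass

class Model:
    """Wraps the dLLM forward pass: partially masked sequence in, token logits out."""
    def forward(self, tokens, kv_cache):
        ...

class IterationManager:
    """Controls how many denoising iterations run and when to stop early."""
    def __init__(self, max_steps: int = 64):
        self.max_steps = max_steps

class Decoder:
    """Chooses which masked positions to commit at each iteration."""
    def select(self, logits, masked_positions):
        ...

class KVCacheManager:
    """Stores per-layer K/V tensors and decides which entries to refresh."""
    def refresh(self, hidden_states, active_block):
        ...

@dataclass
class InferencePipeline:
    model: Model
    iterations: IterationManager
    decoder: Decoder
    cache: KVCacheManager

Because the pieces are independent, a new decoding or caching algorithm can be swapped in without touching the rest of the pipeline, which is the flexibility the summary refers to.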

How It Works

dInfer employs a modular architecture with distinct components for model loading, diffusion iteration management, decoding strategies, and KV-cache handling. Algorithmic innovations include soft diffusion iteration for smoother denoising, hierarchical and credit decoding for committing multiple tokens in parallel, and a vicinity refresh strategy for KV-cache management. On the systems side, it adds Tensor Parallelism (TP) and Expert Parallelism (EP) for GPU utilization, dynamic batching, PyTorch compilation, NVIDIA CUDA Graphs for efficient kernel execution, and loop unrolling to minimize CUDA stream launch latency across diffusion iterations.
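As a concrete toy illustration of parallel decoding in a diffusion LLM, the loop below commits every masked position whose prediction confidence clears a threshold at each iteration; the threshold, step count, and the random stand-in "model" are assumptions, not dInfer's hierarchical/credit algorithms.

# Toy denoising loop: commit all high-confidence masked positions per step.
import torch

def toy_denoise(logits_fn, seq_len, mask_id, steps=8, threshold=0.9):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for _ in range(steps):
        masked = tokens == mask_id
        if not masked.any():
            break                                          # fully decoded
        probs = torch.softmax(logits_fn(tokens), dim=-1)   # [seq_len, vocab]
        conf, pred = probs.max(dim=-1)
        commit = masked & (conf >= threshold)              # parallel commits
        if not commit.any():                               # always make progress:
            commit = torch.zeros_like(masked)              # fall back to the single
            best = torch.where(masked)[0][conf[masked].argmax()]
            commit[best] = True                            # most confident position
        tokens[commit] = pred[commit]
    return tokens

vocab, mask_id = 32, 32          # mask id sits outside the predicted vocab
out = toy_denoise(lambda t: torch.randn(t.numel(), vocab), seq_len=16, mask_id=mask_id)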

Quick Start & Requirements

Installation involves cloning the repository and installing via pip:

git clone https://github.com/inclusionAI/dInfer.git
cd dInfer
pip install .

Dependencies include PyTorch and the Hugging Face libraries. For MoE models, weights must be converted to FusedMoE format: checkpoints are downloaded with hf_transfer and converted with the provided scripts. NVIDIA GPUs are implied by the optimizations (TP, EP, CUDA Graphs), though explicit hardware requirements are not listed. Links to Hugging Face model weights and an arXiv technical report are provided.
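A hedged example of the MoE preparation flow: hf_transfer is enabled through an environment variable before downloading with huggingface_hub, and the checkpoint is then handed to the repository's conversion script. The model repo id and the script name/flags below are placeholders; consult the dInfer README for the exact commands.

# Download MoE weights with hf_transfer acceleration (requires `pip install hf_transfer`),
# then convert them with dInfer's FusedMoE conversion script.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"   # must be set before the import below

from huggingface_hub import snapshot_download

local_dir = snapshot_download("inclusionAI/LLaDA-MoE")   # placeholder repo id
print("checkpoint downloaded to", local_dir)

# Conversion step (script name and flags are placeholders; see the repo's scripts):
#   python convert_to_fused_moe.py --input <local_dir> --output <fused_dir>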

Highlighted Details

  • Achieves over 1,100 Tokens Per Second (TPS) at batch size 1 and averages 800+ TPS across six benchmarks on a single node with 8x H800 GPUs.
  • Offers significant speedups: 10x faster than Fast-dLLM and 2-3x faster than Qwen2.5-3B on vLLM, while maintaining comparable quality.
  • Supports various dLLM variants, including LLaDA, LLaDA-MoE, LLaDA2.0-mini, and LLaDA2.0-flash.
  • Features algorithmic improvements like soft diffusion iteration, hierarchical/credit decoding, and a vicinity refresh KV-cache strategy (see the sketch below).
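The vicinity refresh idea can be illustrated with a toy helper that recomputes K/V entries only inside a window around the block currently being decoded and reuses the rest; the window size, tensor shapes, and function name are assumptions for illustration, not dInfer's implementation.

# Toy "vicinity refresh": recompute cache entries only near the active block.
import torch

def vicinity_refresh(kv_cache, hidden, k_proj, v_proj, block_start, block_end, window=16):
    # kv_cache: {"k": [seq, d], "v": [seq, d]}; hidden: [seq, d] after the latest step
    lo = max(0, block_start - window)
    hi = min(hidden.size(0), block_end + window)
    with torch.no_grad():                               # inference-only cache update
        kv_cache["k"][lo:hi] = k_proj(hidden[lo:hi])    # refresh stale entries near the block
        kv_cache["v"][lo:hi] = v_proj(hidden[lo:hi])    # distant entries are reused as-is
    return kv_cache

d, seq = 8, 64
k_proj, v_proj = torch.nn.Linear(d, d), torch.nn.Linear(d, d)
cache = {"k": torch.zeros(seq, d), "v": torch.zeros(seq, d)}
cache = vicinity_refresh(cache, torch.randn(seq, d), k_proj, v_proj, block_start=32, block_end=40)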

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or notable contributors/sponsorships are detailed in the README.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, permitting commercial use and modification.

Limitations & Caveats

LLaDA2 models are limited to 4-way Tensor Parallelism (TP). Block Diffusion is exclusively supported on LLaDA2 models, not LLaDA Dense/MoE variants. The evaluation suite is currently configured only for LLaDA-MoE, with plans to extend support to LLaDA Dense/LLaDA2 models in the future.

Health Check

Last Commit: 5 days ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 8

Star History

66 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Maxime Labonne (Head of Post-Training at Liquid AI), and 1 more.

GPTFast by MDK8888

HF Transformers accelerator for faster inference
686 stars
Created 1 year ago; updated 1 year ago