inclusionAI/dInfer: Diffusion language model inference optimized for speed and efficiency
dInfer is an efficient and extensible inference framework for diffusion language models (dLLMs), targeting researchers and engineers seeking to optimize dLLM deployment. It modularizes inference into four key components—model, diffusion iteration manager, decoder, and KV-cache manager—enabling flexible algorithm combinations and supporting batched inference for improved throughput.
How It Works
dInfer employs a modular architecture with distinct components for model loading, diffusion iteration management, decoding strategies, and KV-cache handling. Algorithmic innovations include soft diffusion iteration for smoother denoising, hierarchical and credit decoding for committing multiple tokens in parallel, and a vicinity refresh strategy for KV-cache management. On the systems side, it incorporates Tensor Parallelism (TP) and Expert Parallelism (EP) for multi-GPU utilization, dynamic batching, PyTorch compilation, NVIDIA CUDA Graphs for low-overhead kernel launches, and a loop-unrolling mechanism that minimizes CUDA stream latency across diffusion iterations.
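To make the four-component decomposition concrete, here is a minimal, self-contained Python sketch of how such a pipeline could compose. All class and method names below are hypothetical stand-ins for illustration, not dInfer's actual API; the toy "model" simply returns uniform confidences in place of a real dLLM forward pass.

```python
# Illustrative sketch only: hypothetical stand-ins for dInfer's four
# components (model, iteration manager, decoder, KV-cache manager).
from dataclasses import dataclass, field

@dataclass
class KVCacheManager:
    """Caches key/value states; a 'vicinity refresh' drops entries at
    newly decoded positions so nearby states get recomputed."""
    cache: dict = field(default_factory=dict)

    def refresh_vicinity(self, positions):
        for p in positions:
            self.cache.pop(p, None)  # evict so the next step recomputes

class IterationManager:
    """Drives the denoising loop over a fixed number of diffusion steps."""
    def __init__(self, steps: int):
        self.steps = steps

    def schedule(self):
        return range(self.steps)

class Decoder:
    """Selects which masked positions to commit each step (e.g. the
    highest-confidence ones, enabling parallel decoding)."""
    def select(self, scores, k: int):
        return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def run_inference(model, seq_len=8, steps=4, per_step=2):
    it, dec, kv = IterationManager(steps), Decoder(), KVCacheManager()
    masked = set(range(seq_len))
    for _ in it.schedule():
        positions = sorted(masked)
        scores = model(positions)                 # per-position confidences
        chosen = [positions[i] for i in dec.select(scores, per_step)]
        masked -= set(chosen)                     # commit tokens in parallel
        kv.refresh_vicinity(chosen)               # vicinity refresh
    return masked  # positions still masked (empty when fully decoded)

# Toy "model": uniform confidences, standing in for a real dLLM forward pass.
print(run_inference(lambda pos: [1.0] * len(pos)))
```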
Quick Start & Requirements
Installation involves cloning the repository and installing via pip:
git clone https://github.com/inclusionAI/dInfer.git
cd dInfer
pip install .
Dependencies include PyTorch and the Hugging Face libraries. For MoE models, the weights must first be converted to the FusedMoE format, using hf_transfer to fetch the checkpoints and the provided conversion scripts to convert them. The TP, EP, and CUDA Graphs optimizations effectively assume NVIDIA GPUs. The README links to the Hugging Face model weights and an arXiv technical report.
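A sketch of the weight-fetching step is shown below. The hf_transfer acceleration via the HF_HUB_ENABLE_HF_TRANSFER environment variable and snapshot_download are real huggingface_hub mechanisms, but the repo id and the conversion-script path are examples, not verified against the dInfer README.

```python
import os
# Enable accelerated downloads; requires `pip install hf_transfer`.
# Must be set before importing huggingface_hub.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="inclusionAI/LLaDA-MoE-7B-A1B-Instruct",  # example model id
    local_dir="./llada-moe",
)
print("weights downloaded to", local_dir)

# MoE checkpoints must then be converted to the FusedMoE layout with the
# scripts shipped in the dInfer repo (hypothetical invocation shown):
#   python scripts/convert_to_fused_moe.py --src ./llada-moe --dst ./llada-moe-fused
```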
Maintenance & Community
The README does not mention community channels (e.g., Discord, Slack), notable contributors, or sponsorships.
Licensing & Compatibility
The project is licensed under the Apache 2.0 license, permitting commercial use and modification.
Limitations & Caveats
LLaDA2 models are limited to 4-way Tensor Parallelism (TP). Block Diffusion is exclusively supported on LLaDA2 models, not LLaDA Dense/MoE variants. The evaluation suite is currently configured only for LLaDA-MoE, with plans to extend support to LLaDA Dense/LLaDA2 models in the future.
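These compatibility rules can be summarized in a small validation sketch. The option names below are illustrative, not dInfer's actual configuration surface.

```python
# Encodes the compatibility constraints above; names are hypothetical.
def validate_config(model_family: str, tp_degree: int, block_diffusion: bool):
    if model_family == "LLaDA2" and tp_degree > 4:
        raise ValueError("LLaDA2 supports at most 4-way Tensor Parallelism")
    if block_diffusion and model_family != "LLaDA2":
        raise ValueError("Block Diffusion is only supported on LLaDA2 models")

validate_config("LLaDA2", tp_degree=4, block_diffusion=True)      # OK
validate_config("LLaDA-MoE", tp_degree=8, block_diffusion=False)  # OK
```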