WeDLM (Tencent): Fastest diffusion language model for accelerated inference
Top 60.0% on SourcePulse
WeDLM is a diffusion language model designed for high-speed inference, addressing the performance bottlenecks of traditional diffusion models by integrating standard causal attention. It targets researchers and engineers requiring fast, production-ready LLM deployment, offering significant wall-clock speedups over established autoregressive inference engines like vLLM, while maintaining competitive accuracy.
How It Works
WeDLM employs "Topological Reordering" to perform parallel mask recovery under standard causal attention. This core innovation ensures native KV cache compatibility with optimizations like FlashAttention, PagedAttention, and CUDA Graphs. By enabling parallel prediction within a causal framework, it translates theoretical speedups into tangible performance gains against optimized autoregressive baselines, and allows direct initialization from pre-trained AR models.
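The reordering step can be sketched in a few lines of Python. The snippet below is a toy illustration of the idea, not WeDLM's code: `predict_fn`, the confidence-based acceptance rule, and the step budget are all assumptions standing in for the model's real causal forward pass and decoding schedule.

```python
# Toy sketch of the idea only, NOT WeDLM's implementation: `predict_fn`, the
# confidence-based commit rule, and the step budget are stand-ins for the
# real causal-attention forward pass and decoding schedule.

def topological_reorder(tokens, mask_id):
    """Place known tokens before masked ones, keeping their original slots."""
    known = [i for i, t in enumerate(tokens) if t != mask_id]
    masked = [i for i, t in enumerate(tokens) if t == mask_id]
    return known + masked, len(known)

def parallel_mask_recovery(tokens, mask_id, predict_fn, max_steps=8):
    """Fill masked slots over a few steps. After reordering, every masked
    position sits behind all known tokens, so a plain causal model can score
    all of them in one forward pass and keep reusing its KV cache."""
    tokens = list(tokens)
    for _ in range(max_steps):
        order, n_known = topological_reorder(tokens, mask_id)
        if n_known == len(tokens):
            break
        reordered = [tokens[i] for i in order]
        # predict_fn returns one candidate token and one confidence per masked slot.
        preds, conf = predict_fn(reordered, n_known)
        # Commit only the most confident half this step (toy acceptance rule).
        ranked = sorted(range(len(preds)), key=lambda j: conf[j], reverse=True)
        for j in ranked[: max(1, len(preds) // 2)]:
            tokens[order[n_known + j]] = preds[j]
    return tokens

if __name__ == "__main__":
    MASK = -1
    # Dummy predictor: proposes token 7 for every masked slot with equal confidence.
    dummy = lambda seq, n_known: ([7] * (len(seq) - n_known), [1.0] * (len(seq) - n_known))
    print(parallel_mask_recovery([3, MASK, 5, MASK, MASK], MASK, dummy))
```

The point of the reordering is that the masked suffix attends to the entire known prefix under ordinary causal masking, which is what keeps KV caching and FlashAttention-style kernels usable.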
Quick Start & Requirements
Automated install: run bash install.sh, which handles PyTorch and flash-attn compilation. Manual install: pip install torch==2.8.0+cu129, then pip install psutil ninja packaging, then pip install flash-attn==2.7.4.post1 --no-build-isolation, and finally pip install -e . inside the cloned repo. Docker: docker pull aiweiliu/wedlm:v3 and run the container. Interfaces include a web demo (python web_demo.py) and a Python API (the wedlm library).
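The README excerpt above does not show the Python API itself, so no usage example is reproduced here; the short script below is only a hypothetical post-install sanity check, assuming the version pins from the manual install and that the packages expose __version__.

```python
# Hypothetical post-install sanity check; the version pins come from the
# manual install commands above, everything else is an assumption.
import importlib

def check(module, expected=None):
    try:
        mod = importlib.import_module(module)
    except ImportError as exc:
        print(f"[missing] {module}: {exc}")
        return
    version = getattr(mod, "__version__", "unknown")
    ok = expected is None or str(version).startswith(expected)
    print(f"[{'ok' if ok else 'version mismatch'}] {module} {version}")

check("torch", expected="2.8.0")        # pinned as torch==2.8.0+cu129
check("flash_attn", expected="2.7.4")   # pinned as flash-attn==2.7.4.post1
check("wedlm")                          # installed via `pip install -e .`

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    pass
```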
Maintenance & Community
No specific details on maintainers, community channels (Discord/Slack), or roadmap were found in the provided README.
Licensing & Compatibility
Licensed under the Apache 2.0 license, permitting commercial use and modification.
Limitations & Caveats
Speedup performance is task-dependent, with the most significant gains observed in structured, low-entropy tasks (e.g., math, code). Open-ended tasks show more moderate speedups (1.5-2x). Aggressive speed optimization may involve a quality-speed tradeoff.