WeDLM by Tencent

Fastest diffusion language model for accelerated inference

Created 2 weeks ago

New!

525 stars

Top 60.0% on SourcePulse

View on GitHub
Project Summary

WeDLM is a diffusion language model designed for high-speed inference, addressing the performance bottlenecks of traditional diffusion models by integrating standard causal attention. It targets researchers and engineers requiring fast, production-ready LLM deployment, offering significant wall-clock speedups over established autoregressive inference engines like vLLM, while maintaining competitive accuracy.

How It Works

WeDLM employs "Topological Reordering" to perform parallel mask recovery under standard causal attention. Because decoding remains causal, the model keeps native KV-cache compatibility and can reuse inference optimizations such as FlashAttention, PagedAttention, and CUDA Graphs. Parallel prediction within a causal framework turns the theoretical speedups of diffusion decoding into measured gains over optimized autoregressive baselines, and it also allows direct initialization from pre-trained AR models.
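The reordering idea can be illustrated with a toy decode loop. The sketch below is a conceptual illustration only, assuming that "Topological Reordering" amounts to appending newly committed tokens to the causal prefix while tracking their original logical positions; the predictor, the confidence threshold, and every name here are hypothetical stand-ins, not WeDLM's actual implementation.

```python
# Conceptual sketch only (not WeDLM's code): illustrates how parallel mask
# recovery can be made compatible with a strictly causal, append-only KV cache.
import random

def toy_predict(prefix, masked_positions):
    """Hypothetical stand-in for a parallel model call: returns a
    (token, confidence) pair for every still-masked logical position,
    conditioned on the causal prefix fed to the model so far."""
    return {p: (f"tok{p}", random.random()) for p in masked_positions}

def decode(prompt_tokens, num_new, threshold=0.7, max_steps=64):
    physical = list(prompt_tokens)   # causal order actually fed to the model
    pending = list(range(num_new))   # logical positions still masked
    placed = {}                      # logical position -> committed token

    for _ in range(max_steps):
        if not pending:
            break
        preds = toy_predict(physical, pending)
        # Commit every prediction above the confidence threshold this step.
        committed = [p for p in pending if preds[p][1] >= threshold]
        if not committed:            # avoid stalling: force the single best one
            committed = [max(pending, key=lambda p: preds[p][1])]
        for p in committed:
            placed[p] = preds[p][0]
            physical.append(preds[p][0])   # append-only => KV cache stays valid
        pending = [p for p in pending if p not in placed]

    # Undo the reordering to recover the logical left-to-right output.
    return [placed[p] for p in sorted(placed)]

print(decode(["<prompt>"], num_new=8))
```

Because the physical sequence only ever grows at the end, the causal mask and cached keys/values never need to be recomputed, which is why this style of decoding composes with FlashAttention, PagedAttention, and CUDA Graphs.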

Quick Start & Requirements

  • Installation (recommended): bash install.sh (handles PyTorch and flash-attn compilation).
  • Installation (manual): pip install torch==2.8.0+cu129, then pip install psutil ninja packaging, then pip install flash-attn==2.7.4.post1 --no-build-isolation, then pip install -e . inside the cloned repo.
  • Installation (Docker): docker pull aiweiliu/wedlm:v3, then run the container.
  • Prerequisites: PyTorch 2.8.0+cu129 (or compatible), CUDA >= 11.8 (install.sh defaults to 12.9), ninja, psutil, packaging. A quick environment check is sketched after this list.
  • Demo/API: Interactive Web Demo (python web_demo.py), Python API (wedlm library).
  • Links: Project Page, Paper.
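After installation, it can help to confirm the pinned versions before launching the demo, since a failed flash-attn build otherwise only surfaces at runtime. The snippet below is a minimal sketch, not taken from the README; it only checks the prerequisites listed above.

```python
# Minimal environment check (not from the README): confirms the pinned
# PyTorch/CUDA build and that flash-attn is importable before launching
# web_demo.py or using the wedlm Python API.
import torch

print("torch:", torch.__version__)            # install notes pin 2.8.0+cu129
print("cuda available:", torch.cuda.is_available())
print("cuda runtime:", torch.version.cuda)    # install.sh defaults to 12.9

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # install notes pin 2.7.4.post1
except ImportError as exc:
    print("flash-attn missing or failed to build:", exc)
```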

Highlighted Details

  • Achieves 3-6x speedup over vLLM on GSM8K and MATH benchmarks, with up to 10x on highly deterministic tasks like sequential counting.
  • WeDLM-8B-Instruct demonstrates strong performance across benchmarks (e.g., 92.92 on ARC-C, 92.27 on GSM8K, 75.00 on HumanEval), often exceeding its base AR model.
  • Supports direct initialization from Qwen2.5 and Qwen3 models.
  • Model Zoo includes 7B and 8B parameter base and instruct versions with 32k context length.

Maintenance & Community

The README does not list maintainers, community channels (Discord/Slack), or a roadmap.

Licensing & Compatibility

Licensed under the Apache 2.0 license, permitting commercial use and modification.

Limitations & Caveats

Speedup performance is task-dependent, with the most significant gains observed in structured, low-entropy tasks (e.g., math, code). Open-ended tasks show more moderate speedups (1.5-2x). Aggressive speed optimization may involve a quality-speed tradeoff.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 18
  • Star History: 534 stars in the last 18 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab
Parallel decoding algorithm for faster LLM inference
0.2% · 1k stars · Created 2 years ago · Updated 10 months ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab
Speculative decoding research paper for faster LLM inference
0.9% · 2k stars · Created 2 years ago · Updated 3 weeks ago