WeDLM by Tencent

Fastest diffusion language model for accelerated inference

Created 2 weeks ago

New!

525 stars

Top 60.0% on SourcePulse

View on GitHub
Project Summary

WeDLM is a diffusion language model designed for high-speed inference, addressing the performance bottlenecks of traditional diffusion models by integrating standard causal attention. It targets researchers and engineers requiring fast, production-ready LLM deployment, offering significant wall-clock speedups over established autoregressive inference engines like vLLM, while maintaining competitive accuracy.

How It Works

WeDLM employs "Topological Reordering" to perform parallel mask recovery under standard causal attention. Because decoding remains causal, the model keeps native KV-cache compatibility and can reuse inference optimizations such as FlashAttention, PagedAttention, and CUDA Graphs. Parallel prediction within a causal framework turns the theoretical speedups of diffusion decoding into measured gains over optimized autoregressive baselines, and it also allows direct initialization from pre-trained AR models.
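The reordering idea can be illustrated with a toy decode loop. The sketch below is a conceptual illustration only, assuming that "Topological Reordering" amounts to appending newly committed tokens to the causal prefix while tracking their original logical positions; the predictor, the confidence threshold, and every name here are hypothetical stand-ins, not WeDLM's actual implementation.

```python
# Conceptual sketch only (not WeDLM's code): illustrates how parallel mask
# recovery can be made compatible with a strictly causal, append-only KV cache.
import random

def toy_predict(prefix, masked_positions):
    """Hypothetical stand-in for a parallel model call: returns a
    (token, confidence) pair for every still-masked logical position,
    conditioned on the causal prefix fed to the model so far."""
    return {p: (f"tok{p}", random.random()) for p in masked_positions}

def decode(prompt_tokens, num_new, threshold=0.7, max_steps=64):
    physical = list(prompt_tokens)   # causal order actually fed to the model
    pending = list(range(num_new))   # logical positions still masked
    placed = {}                      # logical position -> committed token

    for _ in range(max_steps):
        if not pending:
            break
        preds = toy_predict(physical, pending)
        # Commit every prediction above the confidence threshold this step.
        committed = [p for p in pending if preds[p][1] >= threshold]
        if not committed:            # avoid stalling: force the single best one
            committed = [max(pending, key=lambda p: preds[p][1])]
        for p in committed:
            placed[p] = preds[p][0]
            physical.append(preds[p][0])   # append-only => KV cache stays valid
        pending = [p for p in pending if p not in placed]

    # Undo the reordering to recover the logical left-to-right output.
    return [placed[p] for p in sorted(placed)]

print(decode(["<prompt>"], num_new=8))
```

Because the physical sequence only ever grows at the end, the causal mask and cached keys/values never need to be recomputed, which is why this style of decoding composes with FlashAttention, PagedAttention, and CUDA Graphs.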

Quick Start & Requirements

  • Installation (recommended): bash install.sh (handles PyTorch and flash-attn compilation).
  • Installation (manual): pip install torch==2.8.0+cu129, then pip install psutil ninja packaging, then pip install flash-attn==2.7.4.post1 --no-build-isolation, then pip install -e . inside the cloned repo.
  • Installation (Docker): docker pull aiweiliu/wedlm:v3, then run the container.
  • Prerequisites: PyTorch 2.8.0+cu129 (or compatible), CUDA >= 11.8 (install.sh defaults to 12.9), ninja, psutil, packaging. A quick environment check is sketched after this list.
  • Demo/API: Interactive Web Demo (python web_demo.py), Python API (wedlm library).
  • Links: Project Page, Paper.
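After installation, it can help to confirm the pinned versions before launching the demo, since a failed flash-attn build otherwise only surfaces at runtime. The snippet below is a minimal sketch, not taken from the README; it only checks the prerequisites listed above.

```python
# Minimal environment check (not from the README): confirms the pinned
# PyTorch/CUDA build and that flash-attn is importable before launching
# web_demo.py or using the wedlm Python API.
import torch

print("torch:", torch.__version__)            # install notes pin 2.8.0+cu129
print("cuda available:", torch.cuda.is_available())
print("cuda runtime:", torch.version.cuda)    # install.sh defaults to 12.9

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # install notes pin 2.7.4.post1
except ImportError as exc:
    print("flash-attn missing or failed to build:", exc)
```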

Highlighted Details

  • Achieves 3-6x speedup over vLLM on GSM8K and MATH benchmarks, with up to 10x on highly deterministic tasks like sequential counting.
  • WeDLM-8B-Instruct demonstrates strong performance across benchmarks (e.g., 92.92 on ARC-C, 92.27 on GSM8K, 75.00 on HumanEval), often exceeding its base AR model.
  • Supports direct initialization from Qwen2.5 and Qwen3 models.
  • Model Zoo includes 7B and 8B parameter base and instruct versions with 32k context length.

Maintenance & Community

The README does not list maintainers, community channels (Discord/Slack), or a roadmap.

Licensing & Compatibility

Licensed under the Apache 2.0 license, permitting commercial use and modification.

Limitations & Caveats

Speedup performance is task-dependent, with the most significant gains observed in structured, low-entropy tasks (e.g., math, code). Open-ended tasks show more moderate speedups (1.5-2x). Aggressive speed optimization may involve a quality-speed tradeoff.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 18
  • Star History: 534 stars in the last 18 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab
Parallel decoding algorithm for faster LLM inference
0.2% · 1k stars · Created 2 years ago · Updated 10 months ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab
Speculative decoding research paper for faster LLM inference
0.9% · 2k stars · Created 2 years ago · Updated 3 weeks ago