chiennv2000: LLM inference accelerated via dual-view diffusion decoding
Top 76.7% on SourcePulse
Orthrus offers a novel dual-architecture framework for fast, lossless Large Language Model (LLM) inference. It addresses the sequential bottleneck of traditional autoregressive decoding by enabling parallel token generation, which makes it suited to researchers and power users who need high-throughput LLM deployment without compromising output fidelity. The primary benefit is faster inference: up to a 7.8x speedup, while guaranteeing that generation matches the base model's exact predictive distribution.
How It Works
Orthrus unifies the exact generation fidelity of autoregressive LLMs with the high-speed parallel token generation of diffusion models. Its core innovation lies in a dual-view approach where both autoregressive and diffusion views attend to the same Key-Value (KV) cache, resulting in only an O(1) memory overhead. This is achieved by fine-tuning only 16% of the base LLM's parameters while keeping the original model frozen. This design avoids the redundant memory usage common in speculative decoding methods, leading to higher token acceptance rates and faster inference, especially at scale.
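The "lossless" property described above can be illustrated with a minimal sketch of draft-and-verify decoding under greedy sampling. This is not Orthrus's actual implementation: the names `accept_draft`, `generate`, `toy_ar_next`, and `toy_propose` are hypothetical, the "models" are toy functions over integers, and the autoregressive check is done token by token for clarity, whereas a real system would verify the whole draft in one batched forward pass over the shared KV cache.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins for the two views (not Orthrus APIs):
#   NextToken: autoregressive view, maps a token prefix to its greedy next token.
#   Proposer:  parallel (diffusion-like) view, maps a prefix to several draft tokens.
NextToken = Callable[[Sequence[int]], int]
Proposer = Callable[[Sequence[int]], List[int]]


def accept_draft(prefix: Sequence[int], draft: Sequence[int], ar_next: NextToken) -> List[int]:
    """Keep the longest draft prefix the autoregressive view agrees with.

    At the first disagreement the autoregressive token is emitted instead and
    the rest of the draft is discarded. If every draft token is accepted, one
    extra autoregressive token is appended.
    """
    ctx = list(prefix)
    out: List[int] = []
    for tok in draft:
        target = ar_next(ctx)  # sequential here; batched in a real system
        if tok != target:
            out.append(target)
            return out
        out.append(tok)
        ctx.append(tok)
    out.append(ar_next(ctx))
    return out


def generate(prefix: Sequence[int], propose: Proposer, ar_next: NextToken,
             max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; return the tokens and the number of verify steps."""
    tokens = list(prefix)
    steps = 0
    while len(tokens) - len(prefix) < max_new:
        tokens.extend(accept_draft(tokens, propose(tokens), ar_next))
        steps += 1
    return tokens[: len(prefix) + max_new], steps


def toy_ar_next(ctx: Sequence[int]) -> int:
    """Toy 'base model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_propose(ctx: Sequence[int], k: int = 4) -> List[int]:
    """Toy drafter: mostly right, but deliberately wrong whenever the answer is 5."""
    draft: List[int] = []
    last = ctx[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 5:
            nxt = 0  # injected mistake, so some drafts get rejected
        draft.append(nxt)
        last = nxt
    return draft


if __name__ == "__main__":
    tokens, steps = generate([2], toy_propose, toy_ar_next, max_new=12)
    print(tokens, steps)
```

Because every emitted token is either confirmed or supplied by the autoregressive function, the output is identical to plain greedy decoding regardless of how good the drafter is; draft quality only changes how many verify steps are needed.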
Quick Start & Requirements
Install with the following commands (uv is recommended for dependency resolution):
uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation (or pip install "flash-attn-4[cu13]")
torch.bfloat16 and flash_attention_2 (or flash_attention_4) are utilized for performance.
Highlighted Details
Maintenance & Community
Native integration with vLLM and SGLang is planned for a future release. No other community channels or specific contributor details are provided in the README.
Licensing & Compatibility
The license type and compatibility for commercial use are not explicitly stated in the provided README content.
Limitations & Caveats
The README does not detail specific limitations, alpha status, or known bugs. Future integrations with popular inference frameworks like vLLM and SGLang are noted as "coming soon." Performance claims are based on specific model backbones (Qwen3) and benchmarks.