orthrus  by chiennv2000

LLM inference accelerated via dual-view diffusion decoding

Created 1 week ago

367 stars

Top 76.7% on SourcePulse

Project Summary

Orthrus is a dual-architecture framework for fast, lossless Large Language Model (LLM) inference. It targets the sequential bottleneck of autoregressive decoding by generating multiple tokens in parallel, and is aimed at researchers and power users who need high-throughput LLM deployment without any loss in output fidelity. The project reports up to 7.8x inference speedup while guaranteeing that generation matches the base model's exact predictive distribution.

How It Works

Orthrus combines the exact generation fidelity of autoregressive LLMs with the parallel token generation of diffusion models. Its core idea is a dual-view design in which the autoregressive view and the diffusion view attend to the same Key-Value (KV) cache, so the second view adds only O(1) memory overhead. Training is parameter-efficient: only 16% of the parameters are fine-tuned, and the original base model stays frozen. According to the project, sharing the cache avoids the redundant memory usage common in speculative decoding methods, and the design yields higher token acceptance rates and faster inference, especially at scale.
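To make the "lossless" property concrete, here is a minimal sketch of a generic draft-and-verify loop under greedy decoding. It is not taken from the Orthrus code: the names verify_block, generate, toy_target and toy_drafter are hypothetical, the models are toy functions over integers, and a real system would check all drafted tokens in one parallel forward pass rather than the sequential loop used here for readability.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins (assumptions, not Orthrus APIs):
#   NextToken: the base model's greedy next token for a given prefix.
#   Drafter:   a parallel proposer that returns k candidate tokens at once.
NextToken = Callable[[List[int]], int]
Drafter = Callable[[List[int], int], List[int]]


def verify_block(prefix: List[int], draft: List[int], target_next: NextToken) -> List[int]:
    """Keep the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: fall back to the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # every draft matched: one extra token
    return accepted


def generate(prompt: List[int], drafter: Drafter, target_next: NextToken,
             max_new: int, block: int = 4) -> Tuple[List[int], int]:
    """Return (tokens, number of verify steps). Output equals plain greedy decoding."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        out.extend(verify_block(out, drafter(out, block), target_next))
        steps += 1
    return out[: len(prompt) + max_new], steps


def greedy(prompt: List[int], target_next: NextToken, max_new: int) -> List[int]:
    """Baseline: one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def toy_target(seq: List[int]) -> int:
    """Toy 'base model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_drafter(seq: List[int], k: int) -> List[int]:
    """Toy proposer: correct except for a deliberately wrong last token."""
    toks = [(seq[-1] + i) % 10 for i in range(1, k + 1)]
    toks[-1] = (toks[-1] + 5) % 10
    return toks


if __name__ == "__main__":
    tokens, steps = generate([0], toy_drafter, toy_target, max_new=8)
    print(tokens, "in", steps, "steps; greedy needs 8")
```

Because every emitted token is either confirmed or replaced by the target's own choice, the output is identical to ordinary greedy decoding; only the number of steps changes. Extending this to sampling requires a probabilistic acceptance rule, which this sketch does not cover.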

Quick Start & Requirements

  • Installation (uv is recommended for dependency resolution):
      uv pip install -e .
      uv pip install ninja packaging
      uv pip install flash-attn --no-build-isolation
      (alternative to the previous command: pip install "flash-attn-4[cu13]")
  • Prerequisites: A CUDA-enabled GPU is required. torch.bfloat16 and flash_attention_2 (or flash_attention_4) are used for performance.
  • Demo: An instant Colab notebook is provided for quick testing.
  • Links: Model checkpoints are available on HuggingFace.

Highlighted Details

  • Achieves up to 7.8x inference speedup on generation tasks.
  • Guarantees strictly lossless generation, matching the base model's predictive distribution.
  • Adds no redundant memory overhead, because both views share a single KV cache.
  • Trains parameter-efficiently: only 16% of the parameters are fine-tuned while the base LLM stays frozen.
  • Outperforms speculative decoding methods like EAGLE-3 and DFlash in speed and throughput, particularly with increasing context lengths.
  • Supports native inference on Apple Silicon via MLX.
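The shared-cache point can be illustrated with standard KV cache arithmetic. The sketch below is only a back-of-the-envelope calculation: the function name kv_cache_bytes and all model dimensions (36 layers, 8 KV heads, head size 128, a 1-layer separate draft model, 32,768-token context) are hypothetical values chosen for illustration, not figures from the README.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a KV cache: one K and one V tensor per layer, bf16 by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical dimensions, for illustration only.
SEQ_LEN = 32_768
base_cache = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
draft_cache = kv_cache_bytes(layers=1, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

# A separate draft model keeps its own cache, which grows with context length.
separate_total = base_cache + draft_cache
# A second view that reads the same cache adds no per-token cache memory.
shared_total = base_cache

if __name__ == "__main__":
    gib = 1024 ** 3
    print(f"base cache:        {base_cache / gib:.3f} GiB")
    print(f"separate drafter: +{draft_cache / gib:.3f} GiB")
    print(f"shared view:      +{(shared_total - base_cache) / gib:.3f} GiB")
```

The extra cache of a separate drafter scales linearly with sequence length, which is why a fixed, context-independent overhead matters more as contexts get longer.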

Maintenance & Community

Native integration with vLLM and SGLang is planned for a future release. The README provides no community channels or specific contributor details.

Licensing & Compatibility

The provided README content does not state the license or whether commercial use is permitted.

Limitations & Caveats

The README does not list specific limitations, alpha status, or known bugs. Integrations with inference frameworks such as vLLM and SGLang are marked "coming soon." Performance claims are based on specific model backbones (Qwen3) and benchmarks.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 9
  • Star History: 367 stars in the last 13 days
