orthrus by chiennv2000

LLM inference accelerated via dual-view diffusion decoding

Created 1 month ago

459 stars

Top 65.1% on SourcePulse

Project Summary

Orthrus is a dual-architecture framework for fast, lossless Large Language Model (LLM) inference. It addresses the sequential bottleneck of autoregressive decoding by generating tokens in parallel, which suits researchers and power users who need high-throughput LLM deployment without compromising output fidelity. The reported benefit is an inference speedup of up to 7.8x while guaranteeing that generation matches the base model's exact predictive distribution.

How It Works

Orthrus combines the exact generation fidelity of autoregressive LLMs with the high-speed parallel token generation of diffusion models. Its core design is a dual-view approach in which the autoregressive and diffusion views attend to the same Key-Value (KV) cache, so the added memory overhead is only O(1). According to the README, this requires fine-tuning only 16% of the parameters while the base LLM stays frozen. The shared cache avoids the redundant memory usage common in speculative decoding methods, and the project reports higher token acceptance rates and faster inference, especially at scale.
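To make the "lossless" idea concrete, here is a minimal sketch of generic draft-and-verify decoding in the greedy case. It is not Orthrus's actual algorithm: the summary above does not describe the verification rule, so the acceptance logic, the toy model, and every name below (toy_base_model, toy_drafter, draft_and_verify_decode) are assumptions made for illustration. A drafter proposes several tokens at once, the base model checks them, and the output is kept identical to plain autoregressive decoding.

```python
from typing import List


def toy_base_model(prefix: List[int]) -> int:
    # Hypothetical deterministic "base model": the next token depends on the prefix.
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_drafter(prefix: List[int], k: int) -> List[int]:
    # Hypothetical parallel drafter: proposes k tokens in one call and is
    # deliberately wrong whenever its context length is a multiple of 3.
    out, ctx = [], list(prefix)
    for _ in range(k):
        tok = toy_base_model(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 50  # injected drafting error
        out.append(tok)
        ctx.append(tok)
    return out


def autoregressive_decode(prompt: List[int], n_new: int) -> List[int]:
    # Reference path: one base-model call per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(toy_base_model(seq))
    return seq


def draft_and_verify_decode(prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    seq = list(prompt)
    target_len = len(prompt) + n_new
    while len(seq) < target_len:
        draft = toy_drafter(seq, k)
        ctx = list(seq)
        # A real system would score all drafted positions in one forward pass;
        # this sketch loops over them for readability.
        for tok in draft:
            expected = toy_base_model(ctx)
            ctx.append(expected)  # always keep the base model's own token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            ctx.append(toy_base_model(ctx))  # whole draft accepted: one extra token
        seq = ctx
    return seq[:target_len]


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = draft_and_verify_decode(prompt, 20)
    slow = autoregressive_decode(prompt, 20)
    print(fast == slow)
```

Because every kept token is the base model's own prediction, drafting errors cost speed but never change the output. Matching a full sampling distribution, as the project claims, needs a probabilistic acceptance rule that this greedy sketch does not cover.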

Quick Start & Requirements

Installation (uv is recommended for dependency resolution):
    uv pip install -e .
    uv pip install ninja packaging
    uv pip install flash-attn --no-build-isolation
    Alternative to the last command: pip install "flash-attn-4[cu13]"
Prerequisites: Requires a CUDA-enabled GPU. torch.bfloat16 and flash_attention_2 (or flash_attention_4) are utilized for performance.
Demo: An instant Colab notebook is provided for quick testing.
Links: Model checkpoints are available on HuggingFace.
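As a rough illustration of the prerequisites above, the following sketch chooses load settings based on whether the flash_attn package is importable. It assumes the checkpoints load through the Hugging Face transformers from_pretrained interface, which this summary does not confirm. The helper name pick_load_settings, the "sdpa" fallback, and the CHECKPOINT_ID placeholder are illustrative and not taken from the README.

```python
import importlib.util
from typing import Dict, Optional


def pick_load_settings(flash_available: Optional[bool] = None) -> Dict[str, str]:
    """Choose a dtype name and attention backend for loading a checkpoint.

    Hypothetical helper: falls back to PyTorch's built-in "sdpa" attention
    when the flash_attn package cannot be found.
    """
    if flash_available is None:
        # find_spec returns None when the top-level package is not installed.
        flash_available = importlib.util.find_spec("flash_attn") is not None
    return {
        "dtype": "bfloat16",
        "attn_implementation": "flash_attention_2" if flash_available else "sdpa",
    }


if __name__ == "__main__":
    settings = pick_load_settings()
    print(settings)
    # Assumed usage with transformers (not verified against the Orthrus README):
    # import torch
    # from transformers import AutoModelForCausalLM
    # model = AutoModelForCausalLM.from_pretrained(
    #     CHECKPOINT_ID,  # placeholder: take the real ID from the HuggingFace page
    #     torch_dtype=getattr(torch, settings["dtype"]),
    #     attn_implementation=settings["attn_implementation"],
    # )
```

The project's own README and Colab notebook remain the authoritative source for the actual loading code.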

Highlighted Details

Achieves up to 7.8x inference speedup on generation tasks.
Guarantees strictly lossless generation, matching the base model's predictive distribution.
Offers zero redundant memory overhead due to a shared KV cache across dual views.
Parameter-efficient fine-tuning (16% of parameters) while keeping the base LLM frozen.
Outperforms speculative decoding methods like EAGLE-3 and DFlash in speed and throughput, particularly with increasing context lengths.
Supports native inference on Apple Silicon via MLX.

Maintenance & Community

Native integration with vLLM and SGLang is planned for a future release. The README provides no community channels or specific contributor details.

Licensing & Compatibility

The license type and compatibility for commercial use are not explicitly stated in the provided README content.

Limitations & Caveats

The README does not detail specific limitations, alpha status, or known bugs. Future integrations with popular inference frameworks like vLLM and SGLang are noted as "coming soon." Performance claims are based on specific model backbones (Qwen3) and benchmarks.
