Adaptive LLM computation with dynamic recursion
Mixture-of-Recursions (MoR) addresses two efficiency bottlenecks in adaptive computation for LLMs: missing KV caches and inefficient batched inference. It targets researchers and practitioners who want to improve LLM inference throughput and resource usage through dynamic, token-level computation depth.
How It Works
MoR introduces a unified framework with an end-to-end trained routing mechanism that dynamically assigns optimal recursion depths to each token. It enhances this with a recursion-wise KV caching strategy, selectively storing KV pairs to resolve missing cache problems while optimizing memory. This approach tackles both key challenges simultaneously, unlike prior methods that addressed them separately.
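The routing-plus-caching idea above can be illustrated with a minimal, framework-free sketch. This is a hypothetical toy, not the repository's API: `route_depths`, `run_recursions`, and the per-depth cache layout are assumptions made for illustration; the real model learns the router end-to-end and stores actual attention KV pairs.

```python
# Toy sketch of MoR-style token-level recursion routing (hypothetical names,
# not the repo's API). A shared block is applied recursively; a router assigns
# each token a recursion depth, and the recursion-wise cache stores state only
# for tokens still active at each depth.

def route_depths(scores, max_depth):
    """Map per-token router scores in [0, 1) to integer depths 1..max_depth."""
    return [min(int(s * max_depth) + 1, max_depth) for s in scores]

def run_recursions(tokens, scores, max_depth, block):
    """Apply `block` recursively; deeper tokens get more recursion steps."""
    depths = route_depths(scores, max_depth)
    kv_cache = {d: {} for d in range(1, max_depth + 1)}  # recursion-wise cache
    hidden = list(tokens)
    for d in range(1, max_depth + 1):
        for i, h in enumerate(hidden):
            if depths[i] >= d:              # token still routed to this depth
                hidden[i] = block(h)
                kv_cache[d][i] = hidden[i]  # cache only active tokens' state
    return hidden, depths, kv_cache

# Example: two tokens, one shallow and one deep, with a trivial "+1" block.
hidden, depths, cache = run_recursions([0.0, 0.0], [0.1, 0.9], 3, lambda h: h + 1)
```

Note how the cache at depth 2 holds an entry only for the token routed to depth 3; this selective storage is what resolves the missing-cache problem without caching every token at every depth.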
Quick Start & Requirements
pip install -r requirements.txt
Specific versions are recommended: torch==2.6.0+cu124, flash_attn==2.7.4.post1, transformers==4.52.4.
Highlighted Details
The implementation builds on LlamaForCausalLM.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is presented as research with potential for further optimization, including data loading speed and integration with FlexAttention. The provided checkpoint download script has a noted potential bug.