mixture_of_recursions by raymin0223

Adaptive LLM computation with dynamic recursion

Created 3 months ago · 440 stars · Top 68.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Mixture-of-Recursions (MoR) addresses two efficiency bottlenecks in adaptive computation for LLMs: missing KV caches and inefficient batched inference. It targets researchers and practitioners seeking to improve LLM inference throughput and resource usage through dynamic, token-level computation depth.

How It Works

MoR introduces a unified framework in which an end-to-end trained router dynamically assigns a recursion depth to each token. It pairs this with a recursion-wise KV caching strategy that selectively stores key-value pairs for the tokens active at a given recursion step, resolving the missing-cache problem while keeping memory in check. Unlike prior methods that addressed routing and caching separately, MoR tackles both challenges simultaneously.
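For intuition, here is a minimal, self-contained PyTorch sketch of the idea, not the repo's implementation: a weight-tied block is re-applied up to a maximum depth, a learned router decides per token whether to recurse again, and cache entries are kept per recursion step only for still-active tokens. All names here (e.g. `MoRBlockSketch`) are hypothetical.

```python
import torch
import torch.nn as nn

class MoRBlockSketch(nn.Module):
    """Illustrative Mixture-of-Recursions-style routing (a sketch, not the
    authors' code). One shared block is re-applied up to max_depth times;
    a router decides per token whether to recurse again."""

    def __init__(self, d_model: int = 64, max_depth: int = 3):
        super().__init__()
        # Weight-tied block shared across all recursion depths.
        self.shared = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.router = nn.Linear(d_model, 1)  # score > 0 means "recurse again"
        self.max_depth = max_depth

    def forward(self, x: torch.Tensor):
        # x: [batch, seq, d_model]; all tokens start active at depth 0.
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        kv_cache = []  # recursion-wise cache: one entry per depth
        for depth in range(self.max_depth):
            h = self.shared(x)
            # Only still-active tokens take the update; exited tokens keep
            # their current state.
            x = torch.where(active.unsqueeze(-1), h, x)
            # Cache entries only for tokens active at this depth (hidden
            # states stand in here for per-head K/V tensors).
            kv_cache.append((depth, active.clone(), x[active]))
            # Router decides which tokens continue to the next recursion.
            active = active & (self.router(x).squeeze(-1) > 0)
            if not active.any():
                break
        return x, kv_cache

# Usage: route a toy batch through the recursive block.
out, cache = MoRBlockSketch()(torch.randn(2, 8, 64))
```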

Quick Start & Requirements

  • Install: pip install -r requirements.txt (pinned versions torch==2.6.0+cu124, flash_attn==2.7.4.post1, and transformers==4.52.4 recommended).
  • Prerequisites: Python 3.12, CUDA 12.4, H100/A100 GPUs for training.
  • Dataset: FineWeb-Edu dataset (download script provided).
  • Links: Pretrained Checkpoints
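As a quick sanity check that the pinned environment matches the README's recommendations, a short Python snippet (hypothetical, not part of the repo):

```python
import torch
import flash_attn
import transformers

# Expected pins from the README: torch 2.6.0+cu124, flash_attn 2.7.4.post1,
# transformers 4.52.4, CUDA 12.4.
print(torch.__version__, torch.version.cuda)
print(flash_attn.__version__)
print(transformers.__version__)
assert torch.cuda.is_available(), "training targets H100/A100 GPUs"
```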

Highlighted Details

  • Achieves up to 2x greater inference throughput compared to standard transformers at similar accuracy.
  • Reduces total training FLOPs and memory requirements.
  • Built upon the Llama architecture, modifying LlamaForCausalLM.
  • Offers both Expert-choice and Token-choice routing variants; the sketch below contrasts the two.
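As a rough illustration of the difference (a hedged sketch under assumed shapes, not the repo's API): expert-choice routing fixes the compute budget per recursion step by letting each step keep its top-k tokens, while token-choice routing lets each token commit to a depth up front.

```python
import torch

def expert_choice(scores: torch.Tensor, capacity: int) -> torch.Tensor:
    # scores: [num_tokens] router scores for one recursion step.
    # The step ("expert") keeps its top-`capacity` tokens, so per-step
    # compute is fixed regardless of how the scores are distributed.
    keep = torch.topk(scores, k=capacity).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[keep] = True
    return mask  # True = token recurses at this step

def token_choice(logits: torch.Tensor) -> torch.Tensor:
    # logits: [num_tokens, num_depths]. Each token commits to one
    # recursion depth up front; per-depth load can be unbalanced.
    return logits.argmax(dim=-1)  # depth index per token

scores = torch.randn(10)
print(expert_choice(scores, capacity=4))  # exactly 4 tokens recurse
print(token_choice(torch.randn(10, 3)))   # each token picks its own depth
```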

Maintenance & Community

  • The project is associated with KAIST AI, Mila, Google DeepMind, and Google Research.
  • No specific community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license, so reuse rights (especially for commercial use) are unclear until one is added; the Google affiliations may add further licensing considerations.

Limitations & Caveats

The project is research code; the README flags room for further optimization, including data-loading speed and FlexAttention integration, and notes a potential bug in the checkpoint download script.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 34 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab
Parallel decoding algorithm for faster LLM inference
Top 0.2% on SourcePulse · 1k stars
Created 1 year ago · Updated 6 months ago

Starred by Carol Willing (core contributor to CPython and Jupyter), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo
Inference framework for distributed generative AI model serving
Top 1.0% on SourcePulse · 5k stars
Created 6 months ago · Updated 15 hours ago