mixture_of_recursions  by raymin0223

Adaptive LLM computation with dynamic recursion

Created 4 months ago
498 stars

Top 62.2% on SourcePulse

Project Summary

Mixture-of-Recursions (MoR) addresses two efficiency bottlenecks in adaptive computation for LLMs: the missing KV cache problem and inefficient batched inference. It targets researchers and practitioners who want to improve LLM inference throughput and resource usage through dynamic, token-level computation depth.

How It Works

MoR introduces a unified framework in which a router, trained end-to-end with the model, dynamically assigns a recursion depth to each token. It pairs this with a recursion-wise KV caching strategy that selectively stores KV pairs, which resolves the missing-cache problem while reducing memory use. Unlike prior methods that addressed the two challenges separately, this approach tackles both at once.
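The idea of token-level recursion depth can be illustrated with a minimal sketch. It is not the project's implementation: the names `run_recursions`, `depths`, and `shared_block` are hypothetical, the depths are passed in by hand rather than produced by a trained router, and attention is not modeled. It only shows one shared block being reapplied to the tokens that are still active, while recording which tokens would have KV pairs stored at each recursion step.

```python
from typing import Callable, List, Tuple


def run_recursions(
    tokens: List[float],
    depths: List[int],
    shared_block: Callable[[float], float],
    max_depth: int,
) -> Tuple[List[float], List[List[int]]]:
    """Reapply one shared block; each token stops at its assigned depth.

    `depths` stands in for the router's output (an assumption of this sketch).
    """
    states = list(tokens)
    # kv_cache[step] lists the token indices whose KV pairs would be
    # stored at that recursion step (only tokens still active there).
    kv_cache: List[List[int]] = []
    for step in range(max_depth):
        active = [i for i, d in enumerate(depths) if d > step]
        kv_cache.append(active)
        for i in active:
            states[i] = shared_block(states[i])
    return states, kv_cache


if __name__ == "__main__":
    # Toy "block" that doubles its input; token 1 recurses three times.
    states, cache = run_recursions([1.0, 1.0, 1.0], [1, 3, 2], lambda x: x * 2, 3)
    print(states)
    print(cache)
```

In this toy setup, later recursion steps see fewer tokens, so fewer entries are cached per step, which is the memory effect the selective caching is meant to achieve.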

Quick Start & Requirements

  • Install: pip install -r requirements.txt. Recommended versions: torch==2.6.0+cu124, flash_attn==2.7.4.post1, transformers==4.52.4.
  • Prerequisites: Python 3.12, CUDA 12.4, H100/A100 GPUs for training.
  • Dataset: FineWeb-Edu dataset (download script provided).
  • Links: Pretrained Checkpoints
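After installing, it may help to confirm that the environment matches the recommended versions listed above. The following is a small sketch using only the standard library; the helper name `check_versions` is hypothetical, and the distribution names used for lookup (for example `flash_attn`) are assumed to match what pip registered on your system.

```python
from importlib import metadata

# Recommended versions as listed in the install notes above.
RECOMMENDED = {
    "torch": "2.6.0+cu124",
    "flash_attn": "2.7.4.post1",
    "transformers": "4.52.4",
}


def check_versions(recommended=RECOMMENDED):
    """Return {package: (installed_or_None, wanted, matches)}."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_versions().items():
        status = "ok" if ok else "mismatch"
        print(f"{name}: installed={installed} recommended={wanted} [{status}]")
```

A mismatch is not necessarily an error, since the versions are only described as recommended.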

Highlighted Details

  • Achieves up to 2x the inference throughput of standard transformers at similar accuracy.
  • Reduces total training FLOPs and memory requirements.
  • Built upon the Llama architecture, modifying LlamaForCausalLM.
  • Offers both Expert-choice and Token-choice routing versions.
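The two routing styles named in the last bullet can be contrasted with a short sketch in the general mixture-of-experts sense. This is an illustration under stated assumptions, not the repository's router: `token_choice`, `expert_choice`, and `capacity` are hypothetical names, and `scores[t][e]` is an assumed router score for token `t` and option `e` (in MoR the options would relate to recursion steps).

```python
from typing import List


def token_choice(scores: List[List[float]]) -> List[int]:
    """Each token picks its highest-scoring option (argmax per row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def expert_choice(scores: List[List[float]], capacity: int) -> List[List[int]]:
    """Each option picks its `capacity` highest-scoring tokens (top-k per column)."""
    num_options = len(scores[0])
    picks = []
    for e in range(num_options):
        ranked = sorted(range(len(scores)), key=lambda t: scores[t][e], reverse=True)
        picks.append(sorted(ranked[:capacity]))
    return picks


if __name__ == "__main__":
    scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
    print(token_choice(scores))
    print(expert_choice(scores, capacity=2))
```

The design trade-off in general: token-choice lets every token decide for itself, so load per option can be uneven, while expert-choice fixes the number of tokens per option, which keeps batches balanced but means a token may be picked by several options or by none.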

Maintenance & Community

  • The project is associated with KAIST AI, Mila, Google DeepMind, and Google Research.
  • No specific community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license, so the terms for reuse, including commercial use, are unclear. The Google affiliations may carry additional licensing considerations.

Limitations & Caveats

The project is presented as research code with room for further optimization, including faster data loading and integration with FlexAttention. A potential bug has been noted in the provided checkpoint download script.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 31 stars in the last 30 days
