mixture_of_recursions by raymin0223

Adaptive LLM computation with dynamic recursion

Created 3 months ago · 440 stars · Top 68.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Mixture-of-Recursions (MoR) addresses two efficiency bottlenecks in adaptive computation for LLMs: missing KV caches and inefficient batched inference. It targets researchers and practitioners seeking to improve LLM inference throughput and resource usage through dynamic, token-level computation depth.

How It Works

MoR introduces a unified framework in which an end-to-end trained router dynamically assigns a recursion depth to each token. It pairs this with a recursion-wise KV caching strategy that selectively stores key-value pairs for the tokens active at a given recursion step, resolving the missing-cache problem while keeping memory in check. Unlike prior methods that addressed routing and caching separately, MoR tackles both challenges simultaneously.
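For intuition, here is a minimal, self-contained PyTorch sketch of the idea, not the repo's implementation: a weight-tied block is re-applied up to a maximum depth, a learned router decides per token whether to recurse again, and cache entries are kept per recursion step only for still-active tokens. All names here (e.g. `MoRBlockSketch`) are hypothetical.

```python
import torch
import torch.nn as nn

class MoRBlockSketch(nn.Module):
    """Illustrative Mixture-of-Recursions-style routing (a sketch, not the
    authors' code). One shared block is re-applied up to max_depth times;
    a router decides per token whether to recurse again."""

    def __init__(self, d_model: int = 64, max_depth: int = 3):
        super().__init__()
        # Weight-tied block shared across all recursion depths.
        self.shared = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.router = nn.Linear(d_model, 1)  # score > 0 means "recurse again"
        self.max_depth = max_depth

    def forward(self, x: torch.Tensor):
        # x: [batch, seq, d_model]; all tokens start active at depth 0.
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        kv_cache = []  # recursion-wise cache: one entry per depth
        for depth in range(self.max_depth):
            h = self.shared(x)
            # Only still-active tokens take the update; exited tokens keep
            # their current state.
            x = torch.where(active.unsqueeze(-1), h, x)
            # Cache entries only for tokens active at this depth (hidden
            # states stand in here for per-head K/V tensors).
            kv_cache.append((depth, active.clone(), x[active]))
            # Router decides which tokens continue to the next recursion.
            active = active & (self.router(x).squeeze(-1) > 0)
            if not active.any():
                break
        return x, kv_cache

# Usage: route a toy batch through the recursive block.
out, cache = MoRBlockSketch()(torch.randn(2, 8, 64))
```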

Quick Start & Requirements

  • Install: pip install -r requirements.txt (pinned versions torch==2.6.0+cu124, flash_attn==2.7.4.post1, and transformers==4.52.4 recommended).
  • Prerequisites: Python 3.12, CUDA 12.4, H100/A100 GPUs for training.
  • Dataset: FineWeb-Edu dataset (download script provided).
  • Links: Pretrained Checkpoints
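As a quick sanity check that the pinned environment matches the README's recommendations, a short Python snippet (hypothetical, not part of the repo):

```python
import torch
import flash_attn
import transformers

# Expected pins from the README: torch 2.6.0+cu124, flash_attn 2.7.4.post1,
# transformers 4.52.4, CUDA 12.4.
print(torch.__version__, torch.version.cuda)
print(flash_attn.__version__)
print(transformers.__version__)
assert torch.cuda.is_available(), "training targets H100/A100 GPUs"
```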

Highlighted Details

  • Achieves up to 2x greater inference throughput compared to standard transformers at similar accuracy.
  • Reduces total training FLOPs and memory requirements.
  • Built upon the Llama architecture, modifying LlamaForCausalLM.
  • Offers both Expert-choice and Token-choice routing variants; the sketch below contrasts the two.
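As a rough illustration of the difference (a hedged sketch under assumed shapes, not the repo's API): expert-choice routing fixes the compute budget per recursion step by letting each step keep its top-k tokens, while token-choice routing lets each token commit to a depth up front.

```python
import torch

def expert_choice(scores: torch.Tensor, capacity: int) -> torch.Tensor:
    # scores: [num_tokens] router scores for one recursion step.
    # The step ("expert") keeps its top-`capacity` tokens, so per-step
    # compute is fixed regardless of how the scores are distributed.
    keep = torch.topk(scores, k=capacity).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[keep] = True
    return mask  # True = token recurses at this step

def token_choice(logits: torch.Tensor) -> torch.Tensor:
    # logits: [num_tokens, num_depths]. Each token commits to one
    # recursion depth up front; per-depth load can be unbalanced.
    return logits.argmax(dim=-1)  # depth index per token

scores = torch.randn(10)
print(expert_choice(scores, capacity=4))  # exactly 4 tokens recurse
print(token_choice(torch.randn(10, 3)))   # each token picks its own depth
```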

Maintenance & Community

  • The project is associated with KAIST AI, Mila, Google DeepMind, and Google Research.
  • No specific community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license, so reuse rights (especially for commercial use) are unclear until one is added; the Google affiliations may add further licensing considerations.

Limitations & Caveats

The project is research code; the README flags room for further optimization, including data-loading speed and FlexAttention integration, and notes a potential bug in the checkpoint download script.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 34 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab
Parallel decoding algorithm for faster LLM inference
Top 0.2% on SourcePulse · 1k stars
Created 1 year ago · Updated 6 months ago

Starred by Carol Willing (core contributor to CPython and Jupyter), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo
Inference framework for distributed generative AI model serving
Top 1.0% on SourcePulse · 5k stars
Created 6 months ago · Updated 15 hours ago