mixture_of_recursions by raymin0223

Adaptive LLM computation with dynamic recursion

created 1 month ago
342 stars

Top 81.9% on sourcepulse

Project Summary

Mixture-of-Recursions (MoR) addresses two efficiency bottlenecks in adaptive computation for LLMs: the missing KV cache problem and inefficient batched inference. It targets researchers and practitioners who want to optimize LLM inference throughput and resource usage by enabling dynamic, token-level computation depth.

How It Works

MoR introduces a unified framework with an end-to-end trained routing mechanism that dynamically assigns an optimal recursion depth to each token. It pairs this with a recursion-wise KV caching strategy that selectively stores KV pairs, resolving the missing-cache problem while reducing memory use. Unlike prior methods that addressed these challenges separately, this approach tackles both simultaneously.
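The idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not the repo's actual implementation: `route_depths`, `run_recursions`, and the `tok + 1` stand-in for the shared transformer block are all hypothetical, and the real router is a trained module rather than a score threshold.

```python
# Toy sketch of token-level recursion-depth routing with recursion-wise
# KV caching. All names and logic here are illustrative assumptions.

def route_depths(scores, max_depth):
    """Map each token's router score in [0, 1) to a recursion depth 1..max_depth."""
    return [min(int(s * max_depth) + 1, max_depth) for s in scores]

def run_recursions(tokens, depths, max_depth):
    """Apply a shared recursion step only to tokens whose assigned depth
    has not yet been reached, caching 'KV' entries per recursion step."""
    kv_cache = {d: [] for d in range(1, max_depth + 1)}
    states = list(tokens)
    for d in range(1, max_depth + 1):
        for i in range(len(states)):
            if depths[i] >= d:        # token is still active at this depth
                states[i] += 1        # stand-in for the shared block
                kv_cache[d].append(i) # store KV only for active tokens
    return states, kv_cache
```

Note that the cache for deeper recursion steps holds entries only for the few tokens routed that far, which is the memory saving the recursion-wise strategy is after.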

Quick Start & Requirements

  • Install: pip install -r requirements.txt (pinned versions torch==2.6.0+cu124, flash_attn==2.7.4.post1, and transformers==4.52.4 are recommended).
  • Prerequisites: Python 3.12, CUDA 12.4, H100/A100 GPUs for training.
  • Dataset: FineWeb-Edu dataset (download script provided).
  • Links: Pretrained Checkpoints

Highlighted Details

  • Achieves up to 2x higher inference throughput than standard transformers at comparable accuracy.
  • Reduces total training FLOPs and memory requirements.
  • Built upon the Llama architecture, modifying LlamaForCausalLM.
  • Offers both Expert-choice and Token-choice routing versions.
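The two routing versions listed above differ in who makes the selection. A minimal contrast, with hypothetical helper names (the repo's real routers are trained modules, not these heuristics):

```python
# Illustrative contrast between the two routing modes; both functions
# and their signatures are assumptions, not the repo's API.

def expert_choice(scores, capacity):
    """Each recursion step ("expert") picks its top-`capacity` tokens,
    guaranteeing a fixed compute budget per step."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:capacity])

def token_choice(scores, max_depth):
    """Each token independently commits to a recursion depth up front,
    so per-step load can vary across a batch."""
    return [min(int(s * max_depth) + 1, max_depth) for s in scores]
```

Expert-choice keeps per-step load balanced at the cost of tokens possibly being dropped mid-recursion; token-choice keeps each token's path consistent but can leave steps under- or over-subscribed.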

Maintenance & Community

  • The project is associated with KAIST AI, Mila, Google DeepMind, and Google Research.
  • No specific community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Given the Google affiliations, verify licensing terms before any commercial use.

Limitations & Caveats

The project is presented as research code with room for further optimization, including data-loading speed and integration with FlexAttention. The provided checkpoint download script has a noted potential bug.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 7
  • Star History: 350 stars in the last 90 days

Explore Similar Projects

HALOs by ContextualAI (0.2%, 873 stars)

  • Library for aligning LLMs using human-aware loss functions
  • Created 1 year ago; updated 2 weeks ago
  • Starred by Stas Bekman (author of the Machine Learning Engineering Open Book; Research Engineer at Snowflake)

TinyLlama by jzhang38 (0.3%, 9k stars)

  • Tiny pretraining project for a 1.1B Llama model
  • Created 1 year ago; updated 1 year ago
  • Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp and comma.ai), and 10 more