TransMLA by fxmeng

Post-training method converts GQA-based LLMs to MLA models

created 7 months ago
333 stars

Top 83.6% on sourcepulse

View on GitHub
Project Summary

TransMLA enables the conversion of existing Grouped-Query Attention (GQA) Large Language Models (LLMs) into Multi-Head Latent Attention (MLA) models. By caching compressed low-rank latent KV states instead of full per-head keys and values, the conversion shrinks the KV cache, easing memory and communication bottlenecks and accelerating inference for researchers and developers working with large models.

How It Works

TransMLA employs a post-training conversion method to transform GQA models into MLA equivalents. MLA factorizes the key/value projections into low-rank matrices, so only a compressed latent KV state needs to be cached; this significantly reduces the memory footprint of the KV cache compared to standard attention. An up-projection matrix restores per-head keys and values from the latent, preserving expressiveness while trading a small amount of extra computation for a large reduction in communication overhead.
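
As a rough illustration (not the repository's code), the sketch below shows one way a latent-KV attention layer can cache a single low-rank vector per token and reconstruct per-head keys and values with up-projections. All dimensions are assumed for the example, and RoPE handling and the decoupled positional path are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentKVAttention(nn.Module):
        """Attention layer that caches one compressed latent vector per token (sketch)."""
        def __init__(self, hidden=4096, n_heads=32, head_dim=128, kv_lora_rank=256):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, head_dim
            self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
            # Low-rank down-projection: its output is the only thing cached.
            self.kv_down = nn.Linear(hidden, kv_lora_rank, bias=False)
            # Up-projections restore full per-head keys/values from the latent.
            self.k_up = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)
            self.v_up = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)
            self.o_proj = nn.Linear(n_heads * head_dim, hidden, bias=False)

        def forward(self, x):
            b, t, _ = x.shape
            q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            latent = self.kv_down(x)  # (b, t, kv_lora_rank): the per-token KV cache entry
            k = self.k_up(latent).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            v = self.v_up(latent).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.o_proj(out.transpose(1, 2).reshape(b, t, -1)), latent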

Quick Start & Requirements

  • Install:
    conda create -n transmla python=3.12.8
    conda activate transmla
    pip install vllm==0.8.4 accelerate==1.3.0 ipykernel
    
  • Run (an illustrative cache-size comparison follows this list):
    python main.py --ppl-eval-batch-size 1 --cal-batch-size 1 --dim2head 4 --q-lora-rank 512 --kv-lora-rank 256 --v-mqa-dim 64
    
  • Prerequisites: Python 3.12.8, vLLM (0.8.4), accelerate (1.3.0).
  • Resources: The setup above uses a Conda environment; GPU and memory requirements for conversion and inference are not specified.
  • Docs: Technical report available at https://huggingface.co/papers/2502.07864.
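
For intuition on the memory saving, here is a back-of-the-envelope per-token, per-layer cache-size comparison. The GQA head count and head dimension are assumed for illustration and are not taken from the repository; the latent rank matches the --kv-lora-rank flag above, and any positional (RoPE) components that MLA may also cache are ignored.

    # Illustrative per-token, per-layer KV cache size in fp16 (assumed GQA dims).
    bytes_fp16 = 2
    num_kv_heads, head_dim = 8, 128      # assumed GQA configuration
    kv_lora_rank = 256                   # matches --kv-lora-rank above

    gqa_bytes = 2 * num_kv_heads * head_dim * bytes_fp16  # keys + values
    mla_bytes = kv_lora_rank * bytes_fp16                 # one shared latent vector

    print(f"GQA cache per token/layer:    {gqa_bytes} bytes")  # 4096
    print(f"Latent cache per token/layer: {mla_bytes} bytes")  # 512
    print(f"Reduction: {gqa_bytes / mla_bytes:.0f}x")          # 8x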

Highlighted Details

  • Converts GQA models to MLA post-training (a generic low-rank factorization sketch follows this list).
  • Reduces KV cache size via low-rank KV caching.
  • Incorporates an up-projection matrix for expressiveness.
  • Recent updates (v3) include PCA across RoPE and further KV cache reduction.
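
The snippet below is a generic illustration of how a post-training conversion can extract low-rank factors from an existing projection weight via truncated SVD, splitting it into a shared down-projection and an up-projection. It is a sketch of the general idea under assumed shapes, not TransMLA's PCA-based procedure.

    import torch

    def low_rank_factorize(w: torch.Tensor, rank: int):
        """Approximate w (out_dim x in_dim) as up @ down with the given rank."""
        u, s, vh = torch.linalg.svd(w, full_matrices=False)
        up = u[:, :rank] * s[:rank]   # (out_dim, rank) up-projection factor
        down = vh[:rank, :]           # (rank, in_dim) shared down-projection
        return up, down

    w_k = torch.randn(1024, 4096)     # assumed GQA key-projection weight (illustrative)
    up, down = low_rank_factorize(w_k, rank=256)
    err = torch.linalg.norm(w_k - up @ down) / torch.linalg.norm(w_k)
    print(f"relative approximation error: {err:.3f}")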

Maintenance & Community

  • The project is actively developed, with updates in January, February, and April 2025.
  • Planned features include RoPE compatibility, Absorb operation support, efficient generation mode, vLLM integration, broader model support (LLaMA, Mistral, Gemma2), and fine-tuning on R1 distillation datasets.
  • No community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is still under active development: several planned features, including vLLM integration and an efficient generation mode, are not yet available, so it may not be production-ready for all use cases. Support for models beyond Qwen2.5 and LLaMA-3 also remains a future goal.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 97 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Consistency_LLM by hao-ai-lab

  • Parallel decoder for efficient LLM inference
  • 0% · 397 stars
  • created 1 year ago · updated 8 months ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab

  • Parallel decoding algorithm for faster LLM inference
  • 0.1% · 1k stars
  • created 1 year ago · updated 4 months ago