TransMLA by fxmeng

Post-training method converts GQA-based LLMs to MLA models

Created 9 months ago
384 stars

Top 74.4% on SourcePulse

View on GitHub
Project Summary

TransMLA enables the conversion of existing Grouped-Query Attention (GQA) Large Language Models (LLMs) into Multi-Head Latent Attention (MLA) models. This addresses communication bottlenecks in LLMs by reducing the KV cache size through low-rank latent caching, thereby accelerating inference and improving efficiency for researchers and developers working with large models.

How It Works

TransMLA employs a post-training conversion method to transform GQA models into MLA equivalents. MLA factorizes the key and value projections into low-rank matrices, so only compressed latent KV states need to be cached, which significantly reduces the memory footprint of the KV cache compared to standard attention mechanisms. An up-projection matrix restores the full key/value dimensions at attention time, enhancing expressiveness while balancing extra computation against reduced communication overhead.
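
The shape of this caching scheme can be sketched in a few lines of NumPy. This is purely illustrative: the dimensions, weight names, and random initialization below are assumptions rather than TransMLA's actual implementation; only the latent rank mirrors the --kv-lora-rank flag used in the Quick Start.

    # Illustrative MLA-style low-rank KV caching (NumPy sketch, not TransMLA code).
    # All dimensions are assumptions except kv_lora_rank (mirrors --kv-lora-rank).
    import numpy as np

    d_model, kv_lora_rank = 2048, 256
    n_heads, head_dim = 16, 128
    rng = np.random.default_rng(0)

    W_dkv = rng.standard_normal((d_model, kv_lora_rank)) / np.sqrt(d_model)                  # KV down-projection
    W_uk = rng.standard_normal((kv_lora_rank, n_heads * head_dim)) / np.sqrt(kv_lora_rank)   # K up-projection
    W_uv = rng.standard_normal((kv_lora_rank, n_heads * head_dim)) / np.sqrt(kv_lora_rank)   # V up-projection

    def decode_step(h, latent_cache):
        """Cache only the compressed latent; up-project to full K/V on the fly."""
        latent_cache.append(h @ W_dkv)            # store 256 floats per token, not full K+V (2 x 16 x 128 = 4096)
        latents = np.stack(latent_cache)          # (seq_len, kv_lora_rank)
        return latents @ W_uk, latents @ W_uv     # full keys and values, recomputed each step

    cache = []
    for _ in range(4):                            # a few decoding steps
        k, v = decode_step(rng.standard_normal(d_model), cache)
    print(k.shape, v.shape)                       # (4, 2048) (4, 2048)

Because only the (seq_len, kv_lora_rank) latents persist between steps, the cache grows with the latent rank rather than with the number of KV heads.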

Quick Start & Requirements

  • Install:
    conda create -n transmla python=3.12.8
    conda activate transmla
    pip install vllm==0.8.4 accelerate==1.3.0 ipykernel
    
  • Run (a smoke-test sketch follows this list):
    python main.py --ppl-eval-batch-size 1 --cal-batch-size 1 --dim2head 4 --q-lora-rank 512 --kv-lora-rank 256 --v-mqa-dim 64
    
  • Prerequisites: Python 3.12.8, vLLM (0.8.4), accelerate (1.3.0).
  • Resources: a Conda environment is assumed; GPU and memory requirements for conversion and inference are not specified.
  • Docs: Technical report available at https://huggingface.co/papers/2502.07864.
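
Once a converted checkpoint has been saved, a minimal smoke test with the pinned vLLM could look like the sketch below. The checkpoint path ./transmla-converted is a hypothetical placeholder, not a path documented by the project.

    # Hypothetical smoke test with the pinned vLLM (0.8.4); the checkpoint
    # path is a placeholder assumption, not a directory documented by TransMLA.
    from vllm import LLM, SamplingParams

    llm = LLM(model="./transmla-converted")   # hypothetical save path of the converted model
    params = SamplingParams(temperature=0.0, max_tokens=64)
    out = llm.generate(["Explain Multi-Head Latent Attention in one sentence."], params)
    print(out[0].outputs[0].text)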

Highlighted Details

  • Converts GQA models to MLA post-training.
  • Reduces KV cache size via low-rank KV caching (see the back-of-envelope comparison after this list).
  • Incorporates an up-projection matrix for expressiveness.
  • Recent updates (v3) include PCA across RoPE and further KV cache reduction.
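
To make the cache reduction concrete, here is a back-of-envelope comparison under stated assumptions: the GQA figures assume a LLaMA-3-8B-like layout (8 KV heads of dimension 128), and the MLA rank mirrors the --kv-lora-rank 256 flag from the Quick Start.

    # Per-token, per-layer KV cache in fp16 bytes; GQA layout is an assumed
    # LLaMA-3-8B-like configuration, the MLA rank mirrors --kv-lora-rank 256.
    n_kv_heads, head_dim, kv_lora_rank, bytes_fp16 = 8, 128, 256, 2

    gqa_bytes = 2 * n_kv_heads * head_dim * bytes_fp16   # K and V: 4096 bytes
    mla_bytes = kv_lora_rank * bytes_fp16                # latent only: 512 bytes
    print(f"GQA: {gqa_bytes} B/token/layer, MLA latent: {mla_bytes} B/token/layer "
          f"({gqa_bytes // mla_bytes}x smaller)")

This ignores any additional per-token state an MLA model may keep (e.g., decoupled RoPE components), so the exact ratio depends on the configuration.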

Maintenance & Community

  • The project is actively developed, with updates in January, February, and April 2025.
  • Planned features include RoPE compatibility, Absorb operation support, efficient generation mode, vLLM integration, broader model support (LLaMA, Mistral, Gemma2), and fine-tuning on R1 distillation datasets.
  • No community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is still under active development: several features, including vLLM integration and an efficient generation mode, are planned but not yet implemented, so it may not be production-ready for all use cases. Broader model coverage beyond Qwen2.5 and LLaMA-3 is likewise a stated future goal.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 15
  • Star History: 33 stars in the last 30 days
