TransMLA by fxmeng

Post-training method converts GQA-based LLMs to MLA models

created 7 months ago
333 stars

Top 83.6% on sourcepulse

View on GitHub
Project Summary

TransMLA enables the conversion of existing Grouped-Query Attention (GQA) Large Language Models (LLMs) into Multi-Head Latent Attention (MLA) models. By caching compressed low-rank latent KV states instead of full per-head keys and values, the conversion shrinks the KV cache, easing memory and communication bottlenecks and accelerating inference for researchers and developers working with large models.

How It Works

TransMLA employs a post-training conversion method to transform GQA models into MLA equivalents. MLA factorizes the key/value projections into low-rank matrices, so only a compressed latent KV state needs to be cached; this significantly reduces the memory footprint of the KV cache compared to standard attention. An up-projection matrix restores per-head keys and values from the latent, preserving expressiveness while trading a small amount of extra computation for a large reduction in communication overhead.
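
As a rough illustration (not the repository's code), the sketch below shows one way a latent-KV attention layer can cache a single low-rank vector per token and reconstruct per-head keys and values with up-projections. All dimensions are assumed for the example, and RoPE handling and the decoupled positional path are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentKVAttention(nn.Module):
        """Attention layer that caches one compressed latent vector per token (sketch)."""
        def __init__(self, hidden=4096, n_heads=32, head_dim=128, kv_lora_rank=256):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, head_dim
            self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
            # Low-rank down-projection: its output is the only thing cached.
            self.kv_down = nn.Linear(hidden, kv_lora_rank, bias=False)
            # Up-projections restore full per-head keys/values from the latent.
            self.k_up = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)
            self.v_up = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)
            self.o_proj = nn.Linear(n_heads * head_dim, hidden, bias=False)

        def forward(self, x):
            b, t, _ = x.shape
            q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            latent = self.kv_down(x)  # (b, t, kv_lora_rank): the per-token KV cache entry
            k = self.k_up(latent).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            v = self.v_up(latent).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.o_proj(out.transpose(1, 2).reshape(b, t, -1)), latent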

Quick Start & Requirements

  • Install:
    conda create -n transmla python=3.12.8
    conda activate transmla
    pip install vllm==0.8.4 accelerate==1.3.0 ipykernel
    
  • Run (an illustrative cache-size comparison follows this list):
    python main.py --ppl-eval-batch-size 1 --cal-batch-size 1 --dim2head 4 --q-lora-rank 512 --kv-lora-rank 256 --v-mqa-dim 64
    
  • Prerequisites: Python 3.12.8, vLLM (0.8.4), accelerate (1.3.0).
  • Resources: The setup above uses a Conda environment; GPU and memory requirements for conversion and inference are not specified.
  • Docs: Technical report available at https://huggingface.co/papers/2502.07864.
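
For intuition on the memory saving, here is a back-of-the-envelope per-token, per-layer cache-size comparison. The GQA head count and head dimension are assumed for illustration and are not taken from the repository; the latent rank matches the --kv-lora-rank flag above, and any positional (RoPE) components that MLA may also cache are ignored.

    # Illustrative per-token, per-layer KV cache size in fp16 (assumed GQA dims).
    bytes_fp16 = 2
    num_kv_heads, head_dim = 8, 128      # assumed GQA configuration
    kv_lora_rank = 256                   # matches --kv-lora-rank above

    gqa_bytes = 2 * num_kv_heads * head_dim * bytes_fp16  # keys + values
    mla_bytes = kv_lora_rank * bytes_fp16                 # one shared latent vector

    print(f"GQA cache per token/layer:    {gqa_bytes} bytes")  # 4096
    print(f"Latent cache per token/layer: {mla_bytes} bytes")  # 512
    print(f"Reduction: {gqa_bytes / mla_bytes:.0f}x")          # 8x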

Highlighted Details

  • Converts GQA models to MLA post-training (a generic low-rank factorization sketch follows this list).
  • Reduces KV cache size via low-rank KV caching.
  • Incorporates an up-projection matrix for expressiveness.
  • Recent updates (v3) include PCA across RoPE and further KV cache reduction.
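
The snippet below is a generic illustration of how a post-training conversion can extract low-rank factors from an existing projection weight via truncated SVD, splitting it into a shared down-projection and an up-projection. It is a sketch of the general idea under assumed shapes, not TransMLA's PCA-based procedure.

    import torch

    def low_rank_factorize(w: torch.Tensor, rank: int):
        """Approximate w (out_dim x in_dim) as up @ down with the given rank."""
        u, s, vh = torch.linalg.svd(w, full_matrices=False)
        up = u[:, :rank] * s[:rank]   # (out_dim, rank) up-projection factor
        down = vh[:rank, :]           # (rank, in_dim) shared down-projection
        return up, down

    w_k = torch.randn(1024, 4096)     # assumed GQA key-projection weight (illustrative)
    up, down = low_rank_factorize(w_k, rank=256)
    err = torch.linalg.norm(w_k - up @ down) / torch.linalg.norm(w_k)
    print(f"relative approximation error: {err:.3f}")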

Maintenance & Community

  • The project is actively developed, with updates in January, February, and April 2025.
  • Planned features include RoPE compatibility, Absorb operation support, efficient generation mode, vLLM integration, broader model support (LLaMA, Mistral, Gemma2), and fine-tuning on R1 distillation datasets.
  • No community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is still under active development: several planned features, including vLLM integration and an efficient generation mode, are not yet available, so it may not be production-ready for all use cases. Support for models beyond Qwen2.5 and LLaMA-3 also remains a future goal.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 97 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Consistency_LLM by hao-ai-lab

  • Parallel decoder for efficient LLM inference
  • 0% · 397 stars
  • created 1 year ago · updated 8 months ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab

  • Parallel decoding algorithm for faster LLM inference
  • 0.1% · 1k stars
  • created 1 year ago · updated 4 months ago