Post-training method converts GQA-based LLMs to MLA models
TransMLA enables the conversion of existing Grouped-Query Attention (GQA) Large Language Models (LLMs) into Multi-Head Latent Attention (MLA) models. This addresses communication bottlenecks in LLM inference by shrinking the KV cache through low-rank latent caching, accelerating inference and improving efficiency for researchers and developers working with large models.
How It Works
TransMLA is a post-training conversion method that transforms GQA models into MLA equivalents. MLA factors the key and value projections into low-rank matrices, so only a compressed latent KV state needs to be cached, which substantially reduces KV-cache memory compared with standard attention mechanisms. An up-projection matrix restores the full head dimension at attention time, preserving expressiveness while trading a small amount of extra computation for reduced memory and communication overhead; a minimal sketch of this latent-KV idea follows.
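The sketch below illustrates the latent-KV idea in PyTorch: hidden states are down-projected to a small rank-r latent, only that latent is cached, and per-head keys and values are recovered with up-projections at attention time. The module and parameter names (LatentKVAttention, kv_lora_rank, and so on) are illustrative assumptions rather than TransMLA's actual implementation, and the causal mask is omitted for brevity.

```python
# Minimal sketch of MLA-style low-rank latent KV caching (illustrative only;
# names and structure are assumptions, not TransMLA's actual code).
import torch
import torch.nn as nn


class LatentKVAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, kv_lora_rank=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Down-projection: compress hidden states into a small latent KV state.
        self.kv_down = nn.Linear(d_model, kv_lora_rank, bias=False)
        # Up-projections: expand the cached latent back to full keys/values.
        self.k_up = nn.Linear(kv_lora_rank, d_model, bias=False)
        self.v_up = nn.Linear(kv_lora_rank, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        # Only the rank-r latent is cached, not full per-head keys and values.
        latent = self.kv_down(x)                      # (b, t, r)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        s = latent.shape[1]

        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)

        # Standard scaled dot-product attention (causal mask omitted for brevity).
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent             # latent doubles as the new cache
```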
Quick Start & Requirements
conda create -n transmla python=3.12.8
conda activate transmla
pip install vllm==0.8.4 accelerate==1.3.0 ipykernel
python main.py --ppl-eval-batch-size 1 --cal-batch-size 1 --dim2head 4 --q-lora-rank 512 --kv-lora-rank 256 --v-mqa-dim 64
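Once conversion finishes, the resulting checkpoint should load like any Hugging Face model. The snippet below is a hedged usage sketch; the directory name transmla-converted is a hypothetical placeholder, not a path produced by the command above.

```python
# Usage sketch: load a converted checkpoint with Hugging Face transformers.
# "transmla-converted" is a hypothetical output directory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "transmla-converted"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, trust_remote_code=True, torch_dtype="auto"
)

prompt = "Explain Multi-Head Latent Attention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```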
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is still under active development with several planned features, including support for vLLM and efficient generation modes, indicating it may not be production-ready for all use cases. Compatibility with a wider range of models beyond Qwen2.5 and LLaMA-3 is also a future goal.