Mixture-of-Head Attention for efficient Transformers
Top 94.3% on SourcePulse
This project introduces Mixture-of-Head Attention (MoH), a novel architecture that treats attention heads as experts within a Mixture-of-Experts framework. It improves inference efficiency and model performance by letting each token select the attention heads best suited to it and by replacing the standard summation over heads with a weighted summation. MoH applies to Vision Transformers (ViT), Diffusion Transformers (DiT), and Large Language Models (LLMs), and targets researchers and practitioners in computer vision and natural language processing.
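In symbols, and only as a schematic reading of the description above (the notation H_i for head outputs, W_i^O for slices of the output projection, and g_i for routing weights is ours, not the project's), the change replaces the uniform head summation of standard attention with a routed, weighted one:

```latex
% Standard multi-head attention: every head contributes with weight 1.
\mathrm{MHA}(x) \;=\; \sum_{i=1}^{h} H_i(x)\, W_i^{O}

% Mixture-of-Head attention: each token activates a subset of heads,
% combined with learned routing weights g_i (g_i = 0 for unselected heads).
\mathrm{MoH}(x) \;=\; \sum_{i=1}^{h} g_i(x)\, H_i(x)\, W_i^{O}
```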
How It Works
MoH replaces the standard multi-head attention mechanism with one in which attention heads act as specialized experts. A learned router lets each token dynamically select a subset of heads, and the selected heads' outputs are combined via a weighted sum. Compared with traditional multi-head attention, this uses parameters more efficiently and adds flexibility, delivering performance gains even when only a reduced number of attention heads is activated per token.
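A minimal PyTorch sketch of this idea, assuming a Top-K router over heads inside a single self-attention block; module and parameter names (MoHAttention, heads_per_token, etc.) are illustrative, and details the official SkyworkAI code may add (such as always-active shared heads or auxiliary routing losses) are omitted:

```python
# Sketch of mixture-of-head attention with a per-token Top-K head router.
# Not the official implementation; shapes and routing details are assumptions.
import torch
import torch.nn as nn

class MoHAttention(nn.Module):
    def __init__(self, dim, num_heads=8, heads_per_token=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.k = heads_per_token                      # heads activated per token
        self.qkv = nn.Linear(dim, dim * 3)
        self.router = nn.Linear(dim, num_heads)       # per-token scores over heads
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, H, N, d)

        # Standard scaled dot-product attention, computed per head.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        heads = attn.softmax(dim=-1) @ v              # (B, H, N, d)

        # Router: each token picks its Top-K heads and weights them.
        scores = self.router(x)                       # (B, N, H)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        gates = torch.zeros_like(scores).scatter(
            -1, topk_idx, topk_val.softmax(dim=-1))   # zero for unselected heads

        # Weighted summation of head outputs replaces the uniform summation
        # implicit in concatenate-then-project multi-head attention.
        heads = heads.permute(0, 2, 1, 3)             # (B, N, H, d)
        out = (gates.unsqueeze(-1) * heads).reshape(B, N, C)
        return self.proj(out)
```

Because a module like this keeps the same input and output shapes as standard multi-head attention, it can act as a drop-in replacement inside ViT, DiT, or LLM blocks, which is how the project applies the idea across those architectures.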
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is associated with SkyworkAI; its latest updates (October 2024) concern LLaMA3-8B model availability and tokenizer configuration. Related projects include MoE++ and Chat-UniVi.
Licensing & Compatibility
The majority of the project is released under the Apache 2.0 license. However, the service is a research preview intended for non-commercial use only, subject to LLaMA's model license, OpenAI's data terms, and ShareGPT's privacy practices.
Limitations & Caveats
The "service" aspect is explicitly stated as a research preview for non-commercial use, imposing restrictions beyond the Apache 2.0 license due to dependencies on other model licenses and data terms.
Last updated 10 months ago · Inactive