MoH by SkyworkAI

Mixture-of-Head attention for efficient Transformers

Created 11 months ago
274 stars

Top 94.3% on SourcePulse

View on GitHub
Project Summary

This project introduces Mixture-of-Head Attention (MoH), a novel architecture that treats attention heads as experts within a Mixture-of-Experts framework. It improves inference efficiency and model performance by letting each token select the attention heads most relevant to it and by combining head outputs with a weighted summation rather than the standard summation of multi-head attention. MoH is applicable to Vision Transformers (ViT), Diffusion Transformers (DiT), and Large Language Models (LLMs), targeting researchers and practitioners in computer vision and natural language processing.

How It Works

MoH replaces the standard multi-head attention mechanism with a system where attention heads are treated as specialized experts. Each token dynamically selects a subset of these heads, guided by learned routing mechanisms, and their outputs are combined via a weighted sum. This approach allows for more efficient parameter utilization and increased model flexibility compared to traditional multi-head attention, as demonstrated by performance gains even when using a reduced number of attention heads.
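A minimal PyTorch sketch of this mechanism is shown below. It assumes a simplified routing scheme (a few always-active shared heads plus Top-K routed heads with softmax-normalized gates); the class name, hyperparameters, and router design are illustrative and are not taken from the repository's actual implementation.

```python
# Sketch of Mixture-of-Head attention under assumed routing details:
# shared heads are always active, the remaining heads are Top-K routed
# per token, and head outputs are combined by a weighted summation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoHAttention(nn.Module):
    def __init__(self, dim, num_heads=8, num_shared=2, top_k=3):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.hd = num_heads, dim // num_heads
        self.shared, self.top_k = num_shared, top_k
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_heads)  # scores every head for every token

    def forward(self, x):
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.hd).transpose(1, 2)
        k = k.view(B, N, self.h, self.hd).transpose(1, 2)
        v = v.view(B, N, self.h, self.hd).transpose(1, 2)

        # Ordinary scaled dot-product attention, computed for all heads.
        heads = F.scaled_dot_product_attention(q, k, v)  # (B, h, N, hd)
        heads = heads.transpose(1, 2)                    # (B, N, h, hd)

        # Token-wise routing: shared heads are always active; only the
        # Top-K routed heads receive non-zero gates.
        logits = self.router(x)                          # (B, N, h)
        gates = torch.zeros_like(logits)
        gates[..., :self.shared] = F.softmax(logits[..., :self.shared], dim=-1)
        routed_logits = logits[..., self.shared:]
        top_val, top_idx = routed_logits.topk(self.top_k, dim=-1)
        gates[..., self.shared:] = torch.zeros_like(routed_logits).scatter(
            -1, top_idx, F.softmax(top_val, dim=-1)
        )

        # Weighted summation of head outputs instead of the uniform
        # concatenation/summation of standard multi-head attention.
        mixed = (heads * gates.unsqueeze(-1)).reshape(B, N, D)
        return self.out(mixed)
```

With these illustrative defaults (8 heads, 2 shared, Top-3 routed), each token activates 5 of the 8 heads, i.e. roughly 60% of them, which mirrors the reduced-head regime described above.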

Quick Start & Requirements

Highlighted Details

  • Outperforms standard multi-head attention while activating only 50%-90% of the attention heads across ViT, DiT, and LLM benchmarks.
  • Pre-trained models such as LLaMA3-8B can be effectively continue-tuned into MoH models, achieving improved accuracy (e.g., +2.4% across 14 benchmarks for MoH-LLaMA3-8B with 75% of the heads activated).
  • Demonstrates flexible head-assignment patterns that adapt to different tasks and data categories.
  • The weighted summation of head outputs introduces additional flexibility and performance potential beyond standard summation (see the sketch after this list).
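To make the last point concrete, the contrast with standard multi-head attention can be written as follows. This is a simplified formulation with assumed notation (H^i is the output of head i, W_O^i the corresponding slice of the output projection), not an excerpt from the paper or repository.

```latex
% Standard multi-head attention sums head outputs uniformly; MoH weights them
% with per-token routing scores g_i, which are zero for unactivated heads.
\[
\mathrm{MHA}(X) \;=\; \sum_{i=1}^{h} H^{i} W_O^{i}
\qquad\longrightarrow\qquad
\mathrm{MoH}(X) \;=\; \sum_{i=1}^{h} g_{i}\, H^{i} W_O^{i},
\]
\[
g_{i} > 0 \ \text{for shared and Top-}K\text{ routed heads},
\qquad g_{i} = 0 \ \text{otherwise}.
\]
```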

Maintenance & Community

The project is associated with SkyworkAI and has recent updates (October 2024) regarding LLaMA3-8B model availability and tokenizer configuration. Related projects include MoE++ and Chat-UniVi.

Licensing & Compatibility

The majority of the project is released under the Apache 2.0 license. However, the service is a research preview intended for non-commercial use only, subject to LLaMA's model license, OpenAI's data terms, and ShareGPT's privacy practices.

Limitations & Caveats

The "service" aspect is explicitly stated as a research preview for non-commercial use, imposing restrictions beyond the Apache 2.0 license due to dependencies on other model licenses and data terms.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

Top 0.2% on SourcePulse · 6k stars
Transformer library with extensive experimental features
Created 4 years ago
Updated 5 days ago