MoE-plus-plus by SkyworkAI

Accelerating Mixture-of-Experts models with zero-computation techniques

Created 1 year ago
254 stars

Top 99.0% on SourcePulse

View on GitHub

Project Summary

MoE++ addresses the computational inefficiency of Mixture-of-Experts (MoE) models by introducing "zero-computation experts" and "gating residuals." The approach delivers better performance and 1.1x to 2.1x higher expert forward throughput than vanilla MoE models of the same size. It is aimed at researchers and engineers seeking to optimize large language models and provides a foundation for more efficient MoE architectures.

How It Works

MoE++ integrates three types of zero-computation experts: the zero expert (discard), copy expert (skip), and constant expert (replace). These experts require negligible computation, allowing for flexible allocation of computational resources. The system also employs gating residuals, which enable tokens to consider previous layer routing decisions when selecting experts. This mechanism facilitates reduced computation for simpler tokens and allows more complex tokens to utilize a greater number of experts, thereby enhancing overall performance and efficiency.
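
The following is a minimal, hypothetical PyTorch sketch of these ideas, not the repository's implementation: the class name, expert layout, and top-k routing details are illustrative assumptions. It combines standard FFN experts with zero (discard), copy (skip), and constant (replace) experts, plus an optional gating residual that adds the previous layer's routing scores.

```python
import torch
import torch.nn as nn
from typing import Optional


class ZeroComputationMoELayer(nn.Module):
    """Hypothetical sketch of an MoE++-style layer: FFN experts plus
    zero (discard), copy (skip), and constant (replace) experts."""

    def __init__(self, d_model: int, n_ffn_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.ffn_experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_ffn_experts)
        )
        # Learnable vector used by the constant expert to replace a token.
        self.constant_vector = nn.Parameter(torch.zeros(d_model))
        # Router scores the FFN experts plus the three zero-computation experts.
        self.n_ffn = n_ffn_experts
        self.n_experts = n_ffn_experts + 3
        self.router = nn.Linear(d_model, self.n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor, prev_logits: Optional[torch.Tensor] = None):
        # x: (num_tokens, d_model)
        logits = self.router(x)
        if prev_logits is not None:
            # Gating residual: add the previous layer's routing scores.
            logits = logits + prev_logits
        weights = logits.softmax(dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, w = topk_idx[:, slot], topk_w[:, slot : slot + 1]
            for e in range(self.n_experts):
                mask = idx == e
                if not mask.any():
                    continue
                if e < self.n_ffn:                      # regular FFN expert
                    expert_out = self.ffn_experts[e](x[mask])
                elif e == self.n_ffn:                   # zero expert: discard the token
                    expert_out = torch.zeros_like(x[mask])
                elif e == self.n_ffn + 1:               # copy expert: pass the token through
                    expert_out = x[mask]
                else:                                   # constant expert: replace with a vector
                    expert_out = self.constant_vector.expand_as(x[mask])
                out[mask] = out[mask] + w[mask] * expert_out
        # Return the logits so the next layer can apply the gating residual.
        return out, logits
```

Because the zero, copy, and constant branches involve no matrix multiplications, tokens routed to them add essentially no FLOPs, which is where the flexible allocation of computation comes from.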

Quick Start & Requirements

Inference can be performed with the Hugging Face transformers library. The base model, MoE++7B-Base, is hosted at Chat-UniVi/MoE-Plus-Plus-7B; loading it requires trust_remote_code=True, and device_map='auto' lets the weights be spread across multiple GPUs, as in the sketch below. Training code is built upon Skywork-MoE and will be released after approval; evaluation uses the EleutherAI Language Model Evaluation Harness.
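
A minimal inference sketch consistent with this description; the dtype, prompt, and generation settings are illustrative assumptions rather than the repository's exact snippet.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Chat-UniVi/MoE-Plus-Plus-7B"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,   # assumed dtype; adjust to your hardware
    device_map="auto",            # spread weights across available GPUs
    trust_remote_code=True,       # required for the custom MoE++ modeling code
)

prompt = "Mixture-of-Experts models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```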

Highlighted Details

  • Low Computing Overhead: MoE++ models exhibit lower computational complexity than vanilla MoE models with equivalent parameter counts.
  • High Performance & Throughput: Achieves superior performance and 1.1x to 2.1x greater expert forward throughput compared to standard MoE models of similar size.
  • Deployment Friendly: Zero-computation experts have minimal parameters, simplifying deployment across GPUs and mitigating communication overhead and load imbalance issues.
  • Flexible Computation Allocation: Optimizes resource usage by allowing simpler tokens to consume fewer experts, freeing up capacity for more complex tokens.
  • Stable Routing: Gating residuals contribute to stable routing by reducing the variance of routing scores across layers (see the sketch after this list).
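
To illustrate the routing points above, a short usage sketch that reuses the hypothetical ZeroComputationMoELayer from the How It Works section, threading each layer's routing logits into the next as the gating residual.

```python
import torch
import torch.nn as nn

# Hypothetical stack reusing the ZeroComputationMoELayer sketch above.
# Passing each layer's routing logits to the next (the gating residual)
# lets scores accumulate across depth, damping their layer-to-layer variance.
layers = nn.ModuleList(ZeroComputationMoELayer(d_model=512) for _ in range(4))

x = torch.randn(16, 512)   # 16 tokens with hidden size 512
prev_logits = None
for layer in layers:
    x, prev_logits = layer(x, prev_logits)
```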

Maintenance & Community

The repository encourages users to watch it for the latest updates, and GitHub issues are provided for questions and bug tracking. Related projects, including Skywork-MoE, MoH, and Chat-UniVi, are also highlighted.

Licensing & Compatibility

The project is primarily licensed under Apache 2.0. However, it is designated as a research preview for non-commercial use only, subject to the LLaMA model license, OpenAI's data terms of use, and ShareGPT's privacy practices.

Limitations & Caveats

The chat model inference is marked as "Coming Soon." The release of training code is contingent on the open-sourcing of Skywork-MoE. The non-commercial use restriction is a significant caveat for adoption.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days
