MoE-Infinity by EfficientMoE

Cost-effective, fast MoE model inference library

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
Project Summary

Summary
MoE-Infinity is a PyTorch library for cost-effective, fast, and user-friendly inference of Mixture-of-Experts (MoE) models. It works around GPU memory constraints by offloading experts to host memory, enabling users to serve large MoE models efficiently and to achieve significant latency improvements over existing offloading-based systems.

How It Works
The library minimizes the overhead of MoE expert offloading through three techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching. Together these allow memory-constrained GPUs to serve large MoE models. It also integrates LLM acceleration kernels such as FlashAttention and supports multi-GPU environments with OS-level optimizations, targeting state-of-the-art latency in resource-limited settings.
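The toy sketch below illustrates the idea behind activation-aware caching and prefetching; it is not MoE-Infinity's actual implementation. Experts that recent traces show as most frequently activated are kept in (or prefetched into) a fixed number of GPU slots, while the rest stay in host memory.

    # Toy illustration of activation-aware expert caching (not MoE-Infinity's code).
    from collections import Counter

    class ActivationAwareExpertCache:
        def __init__(self, gpu_slots: int):
            self.gpu_slots = gpu_slots            # number of experts that fit on the GPU
            self.activation_counts = Counter()    # traced activation frequency per expert
            self.resident = set()                 # expert ids currently held on the GPU

        def record_activation(self, expert_id: int) -> None:
            # Trace which experts the router selected for recent tokens.
            self.activation_counts[expert_id] += 1

        def plan_prefetch(self) -> list:
            # Prefetch the hottest experts that are not yet resident on the GPU.
            hottest = [e for e, _ in self.activation_counts.most_common(self.gpu_slots)]
            to_fetch = [e for e in hottest if e not in self.resident]
            self.resident = set(hottest)
            return to_fetch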

Quick Start & Requirements
Installation is recommended in a Python 3.9 virtual environment. Install from PyPI (pip install moe-infinity) or from source by cloning the GitHub repository and running pip install -e . from the repository root. The libstdcxx-ng=12 package is a prerequisite. For better performance, install FlashAttention (>=2.5.2) using FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn. The library is HuggingFace compatible.
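A minimal usage sketch after installation is shown below, assuming a HuggingFace-style workflow. The offload_path setting is referenced in the caveats further down; the MoE entry point and the device_memory_ratio key are assumptions to verify against the repository README.

    # Hedged sketch of loading and running a supported MoE checkpoint.
    # The MoE class and device_memory_ratio key are assumptions; offload_path
    # is the setting referenced in this project's caveats.
    import os
    from transformers import AutoTokenizer
    from moe_infinity import MoE   # assumed entry point; check the README

    checkpoint = "google/switch-large-128"   # one of the supported checkpoint families
    config = {
        "offload_path": os.path.expanduser("~/moe-infinity-offload/switch-large-128"),
        "device_memory_ratio": 0.75,         # assumed: fraction of GPU memory kept for experts
    }

    model = MoE(checkpoint, config)          # assumed constructor signature
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    inputs = tokenizer("translate English to German: How old are you?",
                       return_tensors="pt").input_ids.to("cuda:0")
    outputs = model.generate(inputs)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))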

Highlighted Details
MoE-Infinity achieves state-of-the-art latency in resource-constrained GPU environments. On a single A5000 (24GB), it reports significantly lower per-token latency for models such as Switch-large-128 (0.130s vs. Accelerate's 1.043s) and NLLB-MoE-54B (0.119s vs. Accelerate's 3.071s), and competitive results for Mixtral-8x7b (0.735s vs. Ollama's 0.903s). Supported checkpoints include Deepseek-V2, Google Switch Transformers, Meta NLLB-MoE, and Mixtral. An OpenAI-compatible server is included.
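Since the server is OpenAI-compatible, it can be queried with the standard openai client, as in the hedged sketch below; the base_url, port, and model id are assumptions for illustration, and only essential request fields are supported (see Limitations & Caveats).

    # Hedged sketch: querying the bundled OpenAI-compatible server.
    # The endpoint and model id below are assumptions; check the repository docs.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed endpoint
    resp = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",   # assumed model id
        messages=[{"role": "user", "content": "Summarize what MoE expert offloading does."}],
    )
    print(resp.choices[0].message.content)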

Maintenance & Community
Authors include Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Future plans involve supporting vLLM as an inference runtime and implementing expert parallelism for distributed inference. No specific community channels are detailed in the provided text.

Licensing & Compatibility
The specific open-source license for MoE-Infinity is not explicitly stated in the provided README text. Compatibility with HuggingFace models and workflows is highlighted.

Limitations & Caveats
The current open-sourced version omits distributed inference support, prioritizing ease of use for HuggingFace users. The offload_path must be unique per MoE model to avoid unexpected behavior, vLLM runtime support is planned but not yet implemented, and the OpenAI-compatible server supports only essential request fields.
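A short sketch of the offload_path caveat, with an illustrative (not prescribed) directory layout:

    # Keep offload_path unique per MoE model; reusing one path across models
    # may lead to the unexpected behavior the caveat above warns about.
    import os

    base = os.path.expanduser("~/moe-infinity-offload")
    mixtral_config  = {"offload_path": os.path.join(base, "mixtral-8x7b")}
    deepseek_config = {"offload_path": os.path.join(base, "deepseek-v2")}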

Health Check
Last Commit: 2 days ago
Responsiveness: 1 day
Pull Requests (30d): 1
Issues (30d): 2

Star History
17 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler
LLM inference engine for blazing fast performance
Created 1 year ago, updated 1 week ago
6k stars
Top 0.1% on SourcePulse