Cost-effective, fast MoE model inference library
Summary MoE-Infinity is a PyTorch library for cost-effective, fast, and user-friendly inference of Mixture-of-Experts (MoE) models. It addresses memory constraints on GPUs by offloading experts to host memory, enabling users to serve large MoE models efficiently and achieve significant latency improvements.
How It Works The library minimizes the overhead of MoE expert offloading through three novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching. Together these allow memory-constrained GPUs to serve large MoE models. It also integrates LLM acceleration techniques such as FlashAttention and supports multi-GPU environments with OS-level optimizations, aiming for state-of-the-art latency in resource-limited settings.
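The sketch below illustrates the activation-aware caching idea in isolation: experts stay on the GPU according to how often they have been activated, and the least-activated resident expert is evicted on a miss. It is a toy model, not MoE-Infinity's implementation; the class and function names are hypothetical, and the real system combines caching with tracing-driven prefetching.

```python
from collections import defaultdict

class ActivationAwareExpertCache:
    """Toy GPU-resident expert cache that evicts the least-activated expert.

    Purely illustrative: MoE-Infinity's actual design also uses expert
    activation tracing to drive prefetching across layers; this sketch only
    shows eviction by activation count.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity                 # max experts kept on the GPU
        self.resident = {}                       # expert_id -> expert weights on GPU
        self.activations = defaultdict(int)      # expert_id -> observed activation count

    def record_activation(self, expert_id):
        # Expert activation tracing: count how often each expert is routed to.
        self.activations[expert_id] += 1

    def fetch(self, expert_id, load_from_host):
        # Return a GPU-resident expert, loading (and possibly evicting) on a miss.
        self.record_activation(expert_id)
        if expert_id in self.resident:
            return self.resident[expert_id]      # cache hit: no host-to-GPU transfer
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the fewest recorded activations.
            victim = min(self.resident, key=lambda e: self.activations[e])
            del self.resident[victim]
        self.resident[expert_id] = load_from_host(expert_id)
        return self.resident[expert_id]
```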
Quick Start & Requirements
Installation is recommended in a Python 3.9 virtual environment. Install via PyPI (`pip install moe-infinity`) or from source by cloning the GitHub repository and running `pip install -e .`. A prerequisite is `libstdcxx-ng=12`. For enhanced performance, install FlashAttention (>=2.5.2) with `FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn`. The library is HuggingFace-compatible.
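As a quick sanity check after installation, the following sketch shows the intended HuggingFace-style usage. The `MoE` entry point and the `offload_path` / `device_memory_ratio` config keys follow the project's documented interface but should be verified against the current README; the checkpoint name and offload directory are placeholders.

```python
import torch
from transformers import AutoTokenizer
from moe_infinity import MoE  # entry point name assumed from the project docs

checkpoint = "google/switch-base-16"        # placeholder MoE checkpoint
offload_dir = "/tmp/moe-infinity-offload"   # must be unique per model (see Limitations)

config = {
    "offload_path": offload_dir,
    "device_memory_ratio": 0.75,  # fraction of GPU memory for experts (assumed key)
}

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = MoE(checkpoint, config)

inputs = tokenizer("translate English to German: Hello, world.", return_tensors="pt")
outputs = model.generate(inputs.input_ids.to("cuda:0"), max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```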
Highlighted Details MoE-Infinity achieves state-of-the-art latency in resource-constrained GPU environments. On a single A5000 (24GB), it shows significantly lower per-token latency for models like Switch-large-128 (0.130s vs. Accelerate's 1.043s) and NLLB-MoE-54B (0.119s vs. Accelerate's 3.071s). It also offers competitive results for Mixtral-8x7b (0.735s vs. Ollama's 0.903s) and supports checkpoints including Deepseek-V2, Google Switch Transformers, Meta NLLB-MoE, and Mixtral. An OpenAI-compatible server is included.
Maintenance & Community Authors include Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Future plans involve supporting vLLM as an inference runtime and implementing expert parallelism for distributed inference. No specific community channels are detailed in the provided text.
Licensing & Compatibility The specific open-source license for MoE-Infinity is not explicitly stated in the provided README text. Compatibility is highlighted with HuggingFace models and workflows.
Limitations & Caveats
The current open-sourced version omits distributed inference support, prioritizing HuggingFace user-friendliness. The `offload_path` must be unique per MoE model to avoid unexpected behavior, and vLLM runtime support is planned but not yet implemented. The OpenAI-compatible server supports only essential fields.
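Because the bundled server speaks the OpenAI API but honors only essential fields, a client should stick to the basic chat-completion parameters. The sketch below uses the standard `openai` Python client; the endpoint, API key, and model name are assumptions to adapt to your own deployment.

```python
from openai import OpenAI

# Endpoint and model name are placeholders; point them at your running
# MoE-Infinity server and the checkpoint it was launched with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Summarize what a Mixture-of-Experts model is."}],
    max_tokens=128,  # keep to essential fields; extra OpenAI parameters may be ignored
)
print(response.choices[0].message.content)
```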