MoE-Infinity by EfficientMoE

Cost-effective, fast MoE model inference library

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
Project Summary

Summary
MoE-Infinity is a PyTorch library for cost-effective, fast, and user-friendly inference of Mixture-of-Experts (MoE) models. It works around GPU memory constraints by offloading experts to host memory, enabling users to serve large MoE models efficiently and to achieve significant latency improvements over existing offloading-based systems.

How It Works
The library minimizes the overhead of MoE expert offloading through three techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching. Together these allow memory-constrained GPUs to serve large MoE models. It also integrates LLM acceleration kernels such as FlashAttention and supports multi-GPU environments with OS-level optimizations, targeting state-of-the-art latency in resource-limited settings.
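The toy sketch below illustrates the idea behind activation-aware caching and prefetching; it is not MoE-Infinity's actual implementation. Experts that recent traces show as most frequently activated are kept in (or prefetched into) a fixed number of GPU slots, while the rest stay in host memory.

    # Toy illustration of activation-aware expert caching (not MoE-Infinity's code).
    from collections import Counter

    class ActivationAwareExpertCache:
        def __init__(self, gpu_slots: int):
            self.gpu_slots = gpu_slots            # number of experts that fit on the GPU
            self.activation_counts = Counter()    # traced activation frequency per expert
            self.resident = set()                 # expert ids currently held on the GPU

        def record_activation(self, expert_id: int) -> None:
            # Trace which experts the router selected for recent tokens.
            self.activation_counts[expert_id] += 1

        def plan_prefetch(self) -> list:
            # Prefetch the hottest experts that are not yet resident on the GPU.
            hottest = [e for e, _ in self.activation_counts.most_common(self.gpu_slots)]
            to_fetch = [e for e in hottest if e not in self.resident]
            self.resident = set(hottest)
            return to_fetch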

Quick Start & Requirements
Installation is recommended in a Python 3.9 virtual environment. Install from PyPI (pip install moe-infinity) or from source by cloning the GitHub repository and running pip install -e . from the repository root. The libstdcxx-ng=12 package is a prerequisite. For better performance, install FlashAttention (>=2.5.2) using FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn. The library is HuggingFace compatible.
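A minimal usage sketch after installation is shown below, assuming a HuggingFace-style workflow. The offload_path setting is referenced in the caveats further down; the MoE entry point and the device_memory_ratio key are assumptions to verify against the repository README.

    # Hedged sketch of loading and running a supported MoE checkpoint.
    # The MoE class and device_memory_ratio key are assumptions; offload_path
    # is the setting referenced in this project's caveats.
    import os
    from transformers import AutoTokenizer
    from moe_infinity import MoE   # assumed entry point; check the README

    checkpoint = "google/switch-large-128"   # one of the supported checkpoint families
    config = {
        "offload_path": os.path.expanduser("~/moe-infinity-offload/switch-large-128"),
        "device_memory_ratio": 0.75,         # assumed: fraction of GPU memory kept for experts
    }

    model = MoE(checkpoint, config)          # assumed constructor signature
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    inputs = tokenizer("translate English to German: How old are you?",
                       return_tensors="pt").input_ids.to("cuda:0")
    outputs = model.generate(inputs)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))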

Highlighted Details
MoE-Infinity achieves state-of-the-art latency in resource-constrained GPU environments. On a single A5000 (24GB), it reports significantly lower per-token latency for models such as Switch-large-128 (0.130s vs. Accelerate's 1.043s) and NLLB-MoE-54B (0.119s vs. Accelerate's 3.071s), and competitive results for Mixtral-8x7b (0.735s vs. Ollama's 0.903s). Supported checkpoints include Deepseek-V2, Google Switch Transformers, Meta NLLB-MoE, and Mixtral. An OpenAI-compatible server is included.
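Since the server is OpenAI-compatible, it can be queried with the standard openai client, as in the hedged sketch below; the base_url, port, and model id are assumptions for illustration, and only essential request fields are supported (see Limitations & Caveats).

    # Hedged sketch: querying the bundled OpenAI-compatible server.
    # The endpoint and model id below are assumptions; check the repository docs.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed endpoint
    resp = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",   # assumed model id
        messages=[{"role": "user", "content": "Summarize what MoE expert offloading does."}],
    )
    print(resp.choices[0].message.content)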

Maintenance & Community
Authors include Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Future plans involve supporting vLLM as an inference runtime and implementing expert parallelism for distributed inference. No specific community channels are detailed in the provided text.

Licensing & Compatibility
The specific open-source license for MoE-Infinity is not explicitly stated in the provided README text. Compatibility with HuggingFace models and workflows is highlighted.

Limitations & Caveats
The current open-sourced version omits distributed inference support, prioritizing ease of use for HuggingFace users. The offload_path must be unique per MoE model to avoid unexpected behavior, vLLM runtime support is planned but not yet implemented, and the OpenAI-compatible server supports only essential request fields.
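A short sketch of the offload_path caveat, with an illustrative (not prescribed) directory layout:

    # Keep offload_path unique per MoE model; reusing one path across models
    # may lead to the unexpected behavior the caveat above warns about.
    import os

    base = os.path.expanduser("~/moe-infinity-offload")
    mixtral_config  = {"offload_path": os.path.join(base, "mixtral-8x7b")}
    deepseek_config = {"offload_path": os.path.join(base, "deepseek-v2")}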

Health Check
Last Commit: 2 days ago
Responsiveness: 1 day
Pull Requests (30d): 1
Issues (30d): 2

Star History
17 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler
LLM inference engine for blazing fast performance
Created 1 year ago, updated 1 week ago
6k stars
Top 0.1% on SourcePulse