Fiddler (efeslab): Fast local inference for large Mixture-of-Experts LLMs
Fiddler is a fast inference system for running large Mixture-of-Experts (MoE) large language models (LLMs) on local devices with limited GPU memory. It targets researchers and power users who want to deploy unquantized MoE models efficiently, and it offers a significant performance uplift over existing offloading techniques by intelligently orchestrating CPU and GPU resources.
How It Works
Fiddler's core innovation lies in its CPU-GPU orchestration strategy for MoE inference. Instead of merely offloading model weights to CPU memory for storage while computation remains GPU-bound, Fiddler shifts the computation of expert layers to the CPU. When an expert is needed but not in GPU memory, Fiddler copies the activation values (which are considerably smaller than weights) from GPU to CPU, performs the expert computation on the CPU, and then transfers the output activations back to the GPU. This approach drastically reduces data transfer overhead, making inference faster despite the CPU's slower computation speeds.
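As a rough illustration of that data flow, the sketch below mimics how a single Mixtral-style expert could be evaluated on the CPU while only activations cross the PCIe bus. The function and parameter names are illustrative assumptions, not Fiddler's actual API.

# Minimal sketch of the CPU-GPU orchestration idea (not Fiddler's real code):
# expert weights stay resident in host RAM, and only the small activation
# tensors are copied in each direction.
import torch
import torch.nn.functional as F

def run_expert_on_cpu(activations_gpu: torch.Tensor,
                      w1_cpu: torch.Tensor,   # gate projection weights, kept in host RAM
                      w2_cpu: torch.Tensor,   # down projection weights, kept in host RAM
                      w3_cpu: torch.Tensor    # up projection weights, kept in host RAM
                      ) -> torch.Tensor:
    # 1. Copy the activations (KBs to MBs) from GPU to CPU, instead of
    #    copying gigabytes of expert weights in the other direction.
    x = activations_gpu.to("cpu")

    # 2. Run a Mixtral-style SwiGLU expert entirely on the CPU.
    hidden = F.silu(x @ w1_cpu.T) * (x @ w3_cpu.T)
    out = hidden @ w2_cpu.T

    # 3. Copy the equally small output activations back to the GPU.
    return out.to(activations_gpu.device)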
Quick Start & Requirements
pip install -r requirements.txt
python src/fiddler/infer.py --model <path/to/mixtral/model> --input <prompt>
Maintenance & Community
This repository is explicitly labeled as a proof-of-concept and is under heavy construction; its last recorded activity was about a year ago and it is marked as inactive. The roadmap includes expanding support to other MoE models (e.g., DeepSeek-MoE, OpenMoE, Switch Transformer), quantized models, and AVX512_BF16. No community channels (such as Discord or Slack) are listed.
Licensing & Compatibility
The license type is not specified in the provided README.
Limitations & Caveats
The system is currently a research prototype supporting only the 16-bit Mixtral-8x7B model. Performance degrades notably on CPUs without AVX512 support, because expert computation on the CPU currently relies on PyTorch implementations.
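If you are unsure whether your machine hits the slower path, a simple check like the one below (Linux only, reading /proc/cpuinfo, unrelated to Fiddler's own code) reports whether the CPU exposes the AVX-512 foundation flag.

# Fiddler-independent check (Linux only) for AVX-512 support on the host CPU;
# absence of the flag suggests the slower fallback behavior described above.
def has_avx512() -> bool:
    with open("/proc/cpuinfo") as f:
        return "avx512f" in f.read()

if __name__ == "__main__":
    print("AVX-512 detected:", has_avx512())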