fiddler by efeslab

Fast local inference for large Mixture-of-Experts LLMs

Created 1 year ago · 254 stars · Top 99.1% on SourcePulse

Project Summary

Fiddler is a fast inference system designed for running large Mixture-of-Experts (MoE) Large Language Models (LLMs) on local devices with limited GPU memory. It targets researchers and power users seeking to deploy unquantized MoE models efficiently, offering a significant performance uplift over existing offloading techniques by intelligently orchestrating CPU and GPU resources.

How It Works

Fiddler's core innovation lies in its CPU-GPU orchestration strategy for MoE inference. Instead of merely offloading model weights to CPU memory for storage while computation remains GPU-bound, Fiddler shifts the computation of expert layers to the CPU. When an expert is needed but not in GPU memory, Fiddler copies the activation values (which are considerably smaller than weights) from GPU to CPU, performs the expert computation on the CPU, and then transfers the output activations back to the GPU. This approach drastically reduces data transfer overhead, making inference faster despite the CPU's slower computation speeds.
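
A minimal sketch of this orchestration, assuming a hypothetical per-expert wrapper (OrchestratedExpert is illustrative, not Fiddler's actual API):

    import torch

    class OrchestratedExpert:
        """Sketch of Fiddler-style CPU-GPU orchestration for one MoE expert.

        Hypothetical wrapper: experts that fit in GPU memory run there; the
        rest keep their weights in CPU RAM and run on the CPU, so only the
        small activation tensors ever cross the PCIe bus.
        """

        def __init__(self, expert: torch.nn.Module, on_gpu: bool):
            self.on_gpu = on_gpu
            self.expert = expert.cuda() if on_gpu else expert.cpu()

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            if self.on_gpu:
                return self.expert(hidden)        # normal GPU path
            hidden_cpu = hidden.to("cpu")         # copy small activations GPU -> CPU
            out_cpu = self.expert(hidden_cpu)     # compute with weights that never leave CPU RAM
            return out_cpu.to(hidden.device)      # copy small outputs back CPU -> GPU

For a single decoded token the activation is on the order of kilobytes, while one Mixtral-8x7B expert's 16-bit weights are hundreds of megabytes, which is why moving the computation rather than the weights pays off.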

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Run: python src/fiddler/infer.py --model <path/to/mixtral/model> --input <prompt>
  • Prerequisites: a GPU (tested with 24GB VRAM), a multi-core CPU (tested with Intel Skylake/Cascade Lake), and PyTorch. A CPU with AVX512 support is recommended for best performance (see the check below).
  • Links: arXiv preprint
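
Because the CPU path benefits from AVX512, it is worth confirming which SIMD level your PyTorch build dispatches to (torch.backends.cpu.get_cpu_capability is available in PyTorch 2.0 and later):

    import torch

    # Prints the best SIMD level PyTorch's CPU kernels will use,
    # e.g. "AVX512", "AVX2", or "DEFAULT".
    print(torch.backends.cpu.get_cpu_capability())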

Highlighted Details

  • Achieves >3 tokens/s inference for unquantized Mixtral-8x7B (>90GB) on a single 24GB GPU.
  • Demonstrates an order of magnitude speedup compared to DeepSpeed-MII and Mixtral offloading.
  • Offers average speedups of 19.4x (vs. DeepSpeed-MII) and 8.2x (vs. Mixtral offloading) on a Quadro RTX 6000 + Intel Skylake setup.
  • Provides average speedups of 22.5x (vs. DeepSpeed-MII) and 10.1x (vs. Mixtral offloading) on an L4 GPU + Intel Cascade Lake setup.

Maintenance & Community

This repository is explicitly labeled as a proof-of-concept and is under heavy construction. The roadmap includes expanding support to other MoE models (e.g., DeepSeek-MoE, OpenMoE, Switch Transformer), quantized models, and AVX512_BF16. No community channels (like Discord or Slack) are listed.

Licensing & Compatibility

The license type is not specified in the provided README.

Limitations & Caveats

The system is currently a research prototype that supports only the 16-bit Mixtral-8x7B model. Performance degrades noticeably on CPUs without AVX512, since the CPU-side expert computation currently relies on PyTorch's CPU kernels.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

LLM inference engine for diverse applications

  • Top 0.7% on SourcePulse · 995 stars
  • Created 2 years ago · Updated 16 hours ago