fiddler by efeslab

Fast local inference for large Mixture-of-Experts LLMs

Created 1 year ago · 254 stars · Top 99.1% on SourcePulse

Project Summary

Fiddler is a fast inference system designed for running large Mixture-of-Experts (MoE) Large Language Models (LLMs) on local devices with limited GPU memory. It targets researchers and power users seeking to deploy unquantized MoE models efficiently, offering a significant performance uplift over existing offloading techniques by intelligently orchestrating CPU and GPU resources.

How It Works

Fiddler's core innovation lies in its CPU-GPU orchestration strategy for MoE inference. Instead of merely offloading model weights to CPU memory for storage while computation remains GPU-bound, Fiddler shifts the computation of expert layers to the CPU. When an expert is needed but not in GPU memory, Fiddler copies the activation values (which are considerably smaller than weights) from GPU to CPU, performs the expert computation on the CPU, and then transfers the output activations back to the GPU. This approach drastically reduces data transfer overhead, making inference faster despite the CPU's slower computation speeds.
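
A minimal sketch of this orchestration, assuming a hypothetical per-expert wrapper (OrchestratedExpert is illustrative, not Fiddler's actual API):

    import torch

    class OrchestratedExpert:
        """Sketch of Fiddler-style CPU-GPU orchestration for one MoE expert.

        Hypothetical wrapper: experts that fit in GPU memory run there; the
        rest keep their weights in CPU RAM and run on the CPU, so only the
        small activation tensors ever cross the PCIe bus.
        """

        def __init__(self, expert: torch.nn.Module, on_gpu: bool):
            self.on_gpu = on_gpu
            self.expert = expert.cuda() if on_gpu else expert.cpu()

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            if self.on_gpu:
                return self.expert(hidden)        # normal GPU path
            hidden_cpu = hidden.to("cpu")         # copy small activations GPU -> CPU
            out_cpu = self.expert(hidden_cpu)     # compute with weights that never leave CPU RAM
            return out_cpu.to(hidden.device)      # copy small outputs back CPU -> GPU

For a single decoded token the activation is on the order of kilobytes, while one Mixtral-8x7B expert's 16-bit weights are hundreds of megabytes, which is why moving the computation rather than the weights pays off.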

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Run: python src/fiddler/infer.py --model <path/to/mixtral/model> --input <prompt>
  • Prerequisites: a GPU (tested with 24GB VRAM), a multi-core CPU (tested with Intel Skylake/Cascade Lake), and PyTorch. A CPU with AVX512 support is recommended for best performance (see the check below).
  • Links: arXiv preprint
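
Because the CPU path benefits from AVX512, it is worth confirming which SIMD level your PyTorch build dispatches to (torch.backends.cpu.get_cpu_capability is available in PyTorch 2.0 and later):

    import torch

    # Prints the best SIMD level PyTorch's CPU kernels will use,
    # e.g. "AVX512", "AVX2", or "DEFAULT".
    print(torch.backends.cpu.get_cpu_capability())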

Highlighted Details

  • Achieves >3 tokens/s inference for unquantized Mixtral-8x7B (>90GB) on a single 24GB GPU.
  • Demonstrates an order of magnitude speedup compared to DeepSpeed-MII and Mixtral offloading.
  • Offers average speedups of 19.4x (vs. DeepSpeed-MII) and 8.2x (vs. Mixtral offloading) on a Quadro RTX 6000 + Intel Skylake setup.
  • Provides average speedups of 22.5x (vs. DeepSpeed-MII) and 10.1x (vs. Mixtral offloading) on an L4 GPU + Intel Cascade Lake setup.

Maintenance & Community

This repository is explicitly labeled as a proof-of-concept and is under heavy construction. The roadmap includes expanding support to other MoE models (e.g., DeepSeek-MoE, OpenMoE, Switch Transformer), quantized models, and AVX512_BF16. No community channels (like Discord or Slack) are listed.

Licensing & Compatibility

The license type is not specified in the provided README.

Limitations & Caveats

The system is currently a research prototype that supports only the 16-bit Mixtral-8x7B model. Performance degrades noticeably on CPUs without AVX512, since the CPU-side expert computation currently relies on PyTorch's CPU kernels.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

LLM inference engine for diverse applications

  • Top 0.7% on SourcePulse · 995 stars
  • Created 2 years ago · Updated 16 hours ago