FlexLLMGen by FMInference

High-throughput generation engine for LLMs with limited GPU memory

Created 2 years ago
9,381 stars

Top 5.5% on SourcePulse

View on GitHub
Project Summary

FlexLLMGen is a high-throughput generative inference engine designed for running large language models on a single GPU, even with limited memory. It targets throughput-oriented applications like batch processing, data wrangling, and benchmarking, enabling cost-effective LLM deployment on commodity hardware.

How It Works

FlexLLMGen employs I/O-efficient offloading of model weights, activations, and the KV cache across GPU, CPU, and disk. It uses a linear programming optimizer to find efficient tensor placement and access patterns. A key innovation is its block schedule, which improves I/O efficiency and overlaps computation with data transfers, yielding higher throughput than row-by-row scheduling. It also supports 4-bit compression of both weights and KV cache with minimal accuracy loss.
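
The device split and the compute/transfer overlap can be pictured with a short sketch. The Python mock-up below is illustrative only: OffloadPolicy, load_layer_weights, and run_layer are hypothetical names rather than FlexLLMGen's actual API, and the real engine chooses placements with its optimizer instead of hard-coded percentages.

  from concurrent.futures import ThreadPoolExecutor
  from dataclasses import dataclass

  @dataclass
  class OffloadPolicy:
      # Percent of each tensor class kept on GPU vs. CPU; in the real system,
      # whatever is not resident on GPU or CPU spills to disk.
      weights_gpu: int = 20
      weights_cpu: int = 80
      cache_gpu: int = 0
      cache_cpu: int = 100

  def load_layer_weights(layer_id: int, policy: OffloadPolicy) -> str:
      # Stand-in for fetching one layer's weights from CPU RAM or disk.
      return f"weights[{layer_id}]"

  def run_layer(layer_id: int, weights: str, batch: str) -> str:
      # Stand-in for the GPU computation of one transformer layer.
      return f"layer{layer_id}({weights}, {batch})"

  def generate_block(num_layers: int, batch: str, policy: OffloadPolicy) -> str:
      # Overlap I/O with compute: prefetch layer i+1 while layer i executes.
      with ThreadPoolExecutor(max_workers=1) as io:
          pending = io.submit(load_layer_weights, 0, policy)
          for i in range(num_layers):
              weights = pending.result()
              if i + 1 < num_layers:
                  pending = io.submit(load_layer_weights, i + 1, policy)
              batch = run_layer(i, weights, batch)
      return batch

  print(generate_block(4, "prompt_batch", OffloadPolicy()))

In FlexLLMGen itself, the analogous placement split is exposed through the --percent flag mentioned under Limitations & Caveats.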

Quick Start & Requirements

  • Install via pip (pip install flexllmgen) or build from source.
  • Requires PyTorch >= 1.12 (a quick environment check is sketched after this list).
  • Examples provided for OPT-1.3B, OPT-30B (CPU offloading), and OPT-175B (disk offloading, requires Alpa format weights).
  • Official docs and examples: https://github.com/FMInference/FlexLLMGen
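
As a minimal sanity check before running the examples above, the snippet below verifies the PyTorch version requirement and that the package is importable. It assumes the pip package installs a module named flexllmgen, which is inferred from the package name rather than confirmed by the README.

  # Quick environment check before running the OPT examples.
  # Assumption: the pip package exposes an importable module named "flexllmgen".
  import importlib.util
  import torch

  major, minor = (int(x) for x in torch.__version__.split(".")[:2])
  assert (major, minor) >= (1, 12), "FlexLLMGen requires PyTorch >= 1.12"
  print("PyTorch", torch.__version__, "- OK")
  print("flexllmgen importable:", importlib.util.find_spec("flexllmgen") is not None)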

Highlighted Details

  • Achieves significantly higher throughput than Hugging Face Accelerate, DeepSpeed ZeRO-Inference, and Petals on OPT-30B and OPT-175B, especially with compression.
  • Enables running large models like OPT-175B on a single GPU with SSD offloading.
  • Integrates with the HELM benchmark framework.
  • Supports a Hugging Face Transformers-like generation API (the call pattern is sketched after this list).
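
The "Transformers-like generation API" refers to the familiar tokenize-then-generate call pattern. The snippet below shows that pattern using Hugging Face Transformers itself, purely as a reference point for the call style; it does not reproduce FlexLLMGen's own class or function names.

  # The Hugging Face call pattern that FlexLLMGen's API is said to resemble.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
  model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

  inputs = tokenizer(["Paris is the capital of"], return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
  print(tokenizer.batch_decode(outputs, skip_special_tokens=True))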

Maintenance & Community

  • Roadmap includes support for multi-GPU, more models (BLOOM, CodeGen, GLM), and Apple Silicon/AMD GPUs.
  • Release of the cost model and policy optimizer is planned.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Significantly slower than in-memory inference on powerful GPUs, particularly for small batches.
  • Primarily optimized for single-GPU, throughput-oriented batch processing.
  • Manual tuning of the offloading strategy (the --percent flag) is currently required; a toy sketch of the trade-off follows this list.
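
To make the tuning burden concrete, here is a toy sketch of the kind of trade-off the --percent flag controls. Every number and name below is hypothetical; it does not implement FlexLLMGen's cost model, and the flag's exact semantics should be taken from the project documentation.

  # Toy search over placement splits: keep as much on the GPU as fits a budget.
  CANDIDATE_SPLITS = [
      # (weights_on_gpu_%, kv_cache_on_gpu_%); the remainder is assumed offloaded.
      (100, 100), (50, 100), (20, 80), (0, 50), (0, 0),
  ]

  WEIGHTS_GB = 60.0     # hypothetical total weight size
  CACHE_GB = 30.0       # hypothetical peak KV-cache size for the batch
  GPU_BUDGET_GB = 24.0  # hypothetical GPU memory budget

  def gpu_usage_gb(weights_pct: int, cache_pct: int) -> float:
      return WEIGHTS_GB * weights_pct / 100 + CACHE_GB * cache_pct / 100

  feasible = [(w, c) for w, c in CANDIDATE_SPLITS
              if gpu_usage_gb(w, c) <= GPU_BUDGET_GB]
  best = max(feasible, key=lambda split: gpu_usage_gb(*split))
  print("chosen split:", best, f"({gpu_usage_gb(*best):.1f} GB on GPU)")
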
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
9 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

Explore Similar Projects

S-LoRA by S-LoRA

0.1%
2k
System for scalable LoRA adapter serving
Created 2 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.1%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai

0.1%
12k
Efficient CUDA kernels for MLA decoding
Created 10 months ago
Updated 3 weeks ago