FlexLLMGen by FMInference

High-throughput generation engine for LLMs with limited GPU memory

Created 2 years ago
9,381 stars

Top 5.5% on SourcePulse

View on GitHub
Project Summary

FlexLLMGen is a high-throughput generative inference engine designed for running large language models on a single GPU, even with limited memory. It targets throughput-oriented applications like batch processing, data wrangling, and benchmarking, enabling cost-effective LLM deployment on commodity hardware.

How It Works

FlexLLMGen employs I/O-efficient offloading of model weights, activations, and the KV cache across GPU, CPU, and disk. It uses a linear programming optimizer to find efficient tensor placement and access patterns. A key innovation is its block schedule, which improves I/O efficiency and overlaps computation with data transfers, yielding higher throughput than row-by-row scheduling. It also supports 4-bit compression of both weights and KV cache with minimal accuracy loss.
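
The device split and the compute/transfer overlap can be pictured with a short sketch. The Python mock-up below is illustrative only: OffloadPolicy, load_layer_weights, and run_layer are hypothetical names rather than FlexLLMGen's actual API, and the real engine chooses placements with its optimizer instead of hard-coded percentages.

  from concurrent.futures import ThreadPoolExecutor
  from dataclasses import dataclass

  @dataclass
  class OffloadPolicy:
      # Percent of each tensor class kept on GPU vs. CPU; in the real system,
      # whatever is not resident on GPU or CPU spills to disk.
      weights_gpu: int = 20
      weights_cpu: int = 80
      cache_gpu: int = 0
      cache_cpu: int = 100

  def load_layer_weights(layer_id: int, policy: OffloadPolicy) -> str:
      # Stand-in for fetching one layer's weights from CPU RAM or disk.
      return f"weights[{layer_id}]"

  def run_layer(layer_id: int, weights: str, batch: str) -> str:
      # Stand-in for the GPU computation of one transformer layer.
      return f"layer{layer_id}({weights}, {batch})"

  def generate_block(num_layers: int, batch: str, policy: OffloadPolicy) -> str:
      # Overlap I/O with compute: prefetch layer i+1 while layer i executes.
      with ThreadPoolExecutor(max_workers=1) as io:
          pending = io.submit(load_layer_weights, 0, policy)
          for i in range(num_layers):
              weights = pending.result()
              if i + 1 < num_layers:
                  pending = io.submit(load_layer_weights, i + 1, policy)
              batch = run_layer(i, weights, batch)
      return batch

  print(generate_block(4, "prompt_batch", OffloadPolicy()))

In FlexLLMGen itself, the analogous placement split is exposed through the --percent flag mentioned under Limitations & Caveats.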

Quick Start & Requirements

  • Install via pip (pip install flexllmgen) or build from source.
  • Requires PyTorch >= 1.12 (a quick environment check is sketched after this list).
  • Examples provided for OPT-1.3B, OPT-30B (CPU offloading), and OPT-175B (disk offloading, requires Alpa format weights).
  • Official docs and examples: https://github.com/FMInference/FlexLLMGen
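
As a minimal sanity check before running the examples above, the snippet below verifies the PyTorch version requirement and that the package is importable. It assumes the pip package installs a module named flexllmgen, which is inferred from the package name rather than confirmed by the README.

  # Quick environment check before running the OPT examples.
  # Assumption: the pip package exposes an importable module named "flexllmgen".
  import importlib.util
  import torch

  major, minor = (int(x) for x in torch.__version__.split(".")[:2])
  assert (major, minor) >= (1, 12), "FlexLLMGen requires PyTorch >= 1.12"
  print("PyTorch", torch.__version__, "- OK")
  print("flexllmgen importable:", importlib.util.find_spec("flexllmgen") is not None)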

Highlighted Details

  • Achieves significantly higher throughput than Hugging Face Accelerate, DeepSpeed ZeRO-Inference, and Petals on OPT-30B and OPT-175B, especially with compression.
  • Enables running large models like OPT-175B on a single GPU with SSD offloading.
  • Integrates with the HELM benchmark framework.
  • Supports a Hugging Face Transformers-like generation API (the call pattern is sketched after this list).
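
The "Transformers-like generation API" refers to the familiar tokenize-then-generate call pattern. The snippet below shows that pattern using Hugging Face Transformers itself, purely as a reference point for the call style; it does not reproduce FlexLLMGen's own class or function names.

  # The Hugging Face call pattern that FlexLLMGen's API is said to resemble.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
  model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

  inputs = tokenizer(["Paris is the capital of"], return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
  print(tokenizer.batch_decode(outputs, skip_special_tokens=True))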

Maintenance & Community

  • Roadmap includes support for multi-GPU, more models (BLOOM, CodeGen, GLM), and Apple Silicon/AMD GPUs.
  • Release of the cost model and policy optimizer is planned.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Significantly slower than in-memory inference on powerful GPUs, particularly for small batches.
  • Primarily optimized for single-GPU, throughput-oriented batch processing.
  • Manual tuning of the offloading strategy (the --percent flag) is currently required; a toy sketch of the trade-off follows this list.
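
To make the tuning burden concrete, here is a toy sketch of the kind of trade-off the --percent flag controls. Every number and name below is hypothetical; it does not implement FlexLLMGen's cost model, and the flag's exact semantics should be taken from the project documentation.

  # Toy search over placement splits: keep as much on the GPU as fits a budget.
  CANDIDATE_SPLITS = [
      # (weights_on_gpu_%, kv_cache_on_gpu_%); the remainder is assumed offloaded.
      (100, 100), (50, 100), (20, 80), (0, 50), (0, 0),
  ]

  WEIGHTS_GB = 60.0     # hypothetical total weight size
  CACHE_GB = 30.0       # hypothetical peak KV-cache size for the batch
  GPU_BUDGET_GB = 24.0  # hypothetical GPU memory budget

  def gpu_usage_gb(weights_pct: int, cache_pct: int) -> float:
      return WEIGHTS_GB * weights_pct / 100 + CACHE_GB * cache_pct / 100

  feasible = [(w, c) for w, c in CANDIDATE_SPLITS
              if gpu_usage_gb(w, c) <= GPU_BUDGET_GB]
  best = max(feasible, key=lambda split: gpu_usage_gb(*split))
  print("chosen split:", best, f"({gpu_usage_gb(*best):.1f} GB on GPU)")
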
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
9 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

Explore Similar Projects

S-LoRA by S-LoRA

0.1%
2k
System for scalable LoRA adapter serving
Created 2 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.1%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai

0.1%
12k
Efficient CUDA kernels for MLA decoding
Created 10 months ago
Updated 3 weeks ago