FlexLLMGen by FMInference

High-throughput generation engine for LLMs with limited GPU memory

created 2 years ago
9,354 stars

Top 5.5% on sourcepulse

Project Summary

FlexLLMGen is a high-throughput generative inference engine designed for running large language models on a single GPU, even with limited memory. It targets throughput-oriented applications like batch processing, data wrangling, and benchmarking, enabling cost-effective LLM deployment on commodity hardware.

How It Works

FlexLLMGen employs IO-efficient offloading of model weights, activations, and KV cache across GPU, CPU, and disk. It uses a linear programming optimizer to find efficient tensor placement and access patterns. A key innovation is its block scheduling approach, which improves I/O efficiency and overlaps computation with data transfers, yielding higher throughput than row-by-row schedules. It also supports 4-bit compression of weights and the KV cache with minimal accuracy loss.
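
The difference between row-by-row and block scheduling can be shown with a toy sketch. The code below is a conceptual illustration only, not FlexLLMGen's actual implementation: load_weights and compute are placeholders for the real I/O and GPU work, and the block variant shows how loading each layer's weights once per block of micro-batches, while prefetching the next layer, reduces transfers and overlaps them with compute.

```python
# Conceptual sketch only -- not FlexLLMGen code. It contrasts a row-by-row
# schedule with a block ("zig-zag") schedule for offloaded inference.
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4
NUM_BATCHES = 6      # micro-batches in the effective batch
BLOCK_SIZE = 3       # micro-batches processed per weight load

def load_weights(layer):
    # Placeholder for a CPU/disk -> GPU weight transfer.
    return f"weights[{layer}]"

def compute(layer, batch, weights):
    # Placeholder for running one layer on one micro-batch on the GPU.
    return f"out(layer={layer}, batch={batch}, {weights})"

def row_by_row_schedule():
    # Every layer's weights are (re)loaded for every micro-batch:
    # NUM_BATCHES * NUM_LAYERS transfers, none overlapped with compute.
    for batch in range(NUM_BATCHES):
        for layer in range(NUM_LAYERS):
            compute(layer, batch, load_weights(layer))

def block_schedule():
    # Each layer's weights are loaded once per block of micro-batches, and the
    # next layer's weights are prefetched while the current layer computes:
    # (NUM_BATCHES / BLOCK_SIZE) * NUM_LAYERS transfers, overlapped with compute.
    with ThreadPoolExecutor(max_workers=1) as io:
        for start in range(0, NUM_BATCHES, BLOCK_SIZE):
            prefetch = io.submit(load_weights, 0)
            for layer in range(NUM_LAYERS):
                weights = prefetch.result()
                if layer + 1 < NUM_LAYERS:
                    prefetch = io.submit(load_weights, layer + 1)
                for batch in range(start, min(start + BLOCK_SIZE, NUM_BATCHES)):
                    compute(layer, batch, weights)

if __name__ == "__main__":
    row_by_row_schedule()
    block_schedule()
```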

Quick Start & Requirements

  • Install via pip (pip install flexllmgen) or from source.
  • Requires PyTorch >= 1.12.
  • Examples are provided for OPT-1.3B, OPT-30B (CPU offloading), and OPT-175B (disk offloading; requires weights converted to the Alpa format); see the launch sketch after this list.
  • Official docs and examples: https://github.com/FMInference/FlexLLMGen
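
As a quick-start illustration, the sketch below launches the OPT-1.3B example from Python. The module path flexllmgen.flex_opt and its flags are assumptions based on the upstream FlexGen-style CLI; verify them against the repository README before relying on them.

```python
# Hedged quick-start sketch: the entry point and flag below are assumptions
# based on the upstream FlexGen-style CLI; check the FlexLLMGen README.
import subprocess

cmd = [
    "python", "-m", "flexllmgen.flex_opt",  # assumed module path
    "--model", "facebook/opt-1.3b",         # smallest bundled example
]
subprocess.run(cmd, check=True)
```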

Highlighted Details

  • Achieves significantly higher throughput than Hugging Face Accelerate, DeepSpeed ZeRO-Inference, and Petals on OPT-30B and OPT-175B, especially with compression.
  • Enables running large models like OPT-175B on a single GPU with SSD offloading.
  • Integrates with the HELM benchmark framework.
  • Supports a Hugging Face Transformers-like generation API (illustrated below).
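
For context, the snippet below shows the standard Hugging Face Transformers generation pattern that FlexLLMGen's API is described as resembling. It uses plain Transformers with facebook/opt-125m rather than FlexLLMGen classes, whose exact names are not assumed here.

```python
# Reference pattern only: standard Hugging Face Transformers generation.
# FlexLLMGen is described as exposing a similar generate() interface, with
# offloading handled behind the scenes; its exact class names are not shown.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

prompts = ["Paris is the capital of"] * 4          # a small batch of prompts
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```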

Maintenance & Community

  • Roadmap includes support for multi-GPU, more models (BLOOM, CodeGen, GLM), and Apple Silicon/AMD GPUs.
  • A cost model and policy optimizer are planned.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Significantly slower than in-memory inference on powerful GPUs, particularly for small batches.
  • Primarily optimized for single-GPU, throughput-oriented batch processing.
  • Manual tuning of the offloading strategy (via the --percent flag) is currently required; see the sketch after this list.
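
Because the offloading split must be tuned by hand, a simple sweep over candidate --percent policies is one way to pick a setting. In the upstream FlexGen CLI, --percent takes six integers (weight GPU/CPU, KV cache GPU/CPU, activation GPU/CPU percentages, with the remainder of each pair going to disk); the entry point and flag semantics are assumed to carry over to FlexLLMGen.

```python
# Hedged sketch of a manual sweep over offloading policies. The module path
# and the six-number --percent semantics follow the upstream FlexGen CLI and
# are assumed to apply to FlexLLMGen; verify against the repository README.
import subprocess

candidate_policies = [
    (100, 0, 100, 0, 100, 0),  # everything on GPU (only fits small models)
    (20, 80, 0, 100, 0, 100),  # partial weight offload, cache/activations on CPU
    (0, 100, 0, 100, 0, 100),  # full CPU offloading
]

for policy in candidate_policies:
    cmd = [
        "python", "-m", "flexllmgen.flex_opt",  # assumed entry point
        "--model", "facebook/opt-30b",
        "--percent", *map(str, policy),
    ]
    subprocess.run(cmd, check=True)  # compare the throughput reported by each run
```
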
Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 66 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (Author of SGLang) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

llm-analysis by cli99

  • 0.2% · 441 stars
  • CLI tool for LLM latency/memory analysis during training/inference
  • created 2 years ago, updated 3 months ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

  • 1.0% · 402 stars
  • Lightweight training framework for model pre-training
  • created 1 year ago, updated 1 week ago