High-throughput generation engine for LLMs with limited GPU memory
FlexLLMGen is a high-throughput generative inference engine designed for running large language models on a single GPU, even with limited memory. It targets throughput-oriented applications like batch processing, data wrangling, and benchmarking, enabling cost-effective LLM deployment on commodity hardware.
How It Works
FlexLLMGen employs IO-efficient offloading of model weights, activations, and the KV cache across GPU memory, CPU memory, and disk. A linear-programming-based optimizer searches for the best tensor placement and access pattern within a given hardware budget. A key innovation is its block scheduling approach, which improves I/O efficiency by reusing loaded weights across a block of batches and overlapping computation with data transfers, yielding higher throughput than row-by-row schedules. It also supports 4-bit compression of both weights and the KV cache with minimal accuracy loss.
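The intuition behind block scheduling can be seen in the following minimal sketch. It is illustrative only: the names (run_block, load_layer_weights, run_layer, batch.hidden) are hypothetical, not FlexLLMGen's actual API. By walking a block of batches in layer-major order, each layer's weights are fetched from CPU or disk once per block rather than once per batch, and the next layer's load is issued while the current layer computes.

```python
# Hypothetical sketch of block scheduling with compute/transfer overlap.
# All names here are illustrative, not FlexLLMGen's real interface.
from concurrent.futures import ThreadPoolExecutor

def run_block(layers, batches, load_layer_weights, run_layer):
    """Process a block of batches in layer-major order: load each layer's
    weights once, reuse them for every batch in the block, and prefetch
    the next layer's weights while the current layer computes."""
    if not layers:
        return
    with ThreadPoolExecutor(max_workers=1) as io:
        prefetch = io.submit(load_layer_weights, layers[0])
        for i, layer in enumerate(layers):
            weights = prefetch.result()  # wait for this layer's weights
            if i + 1 < len(layers):
                # Issue the next load so the transfer overlaps compute
                # (real GPU kernels and copies release the GIL, so the
                # background thread can make progress during compute).
                prefetch = io.submit(load_layer_weights, layers[i + 1])
            for batch in batches:  # amortize one weight load over the block
                batch.hidden = run_layer(layer, weights, batch.hidden)
```

A row-by-row schedule would instead run each batch through all layers before moving to the next batch, reloading every layer's weights once per batch; for offloaded weights that multiplies I/O traffic by the block size.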
Quick Start & Requirements
pip install flexllmgen (or install from source)
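Once installed, a small generation run can be launched from the command line. The module path and model name below follow the FlexGen-style CLI and are an assumption, not a confirmed interface:

python3 -m flexllmgen.flex_opt --model facebook/opt-1.3b

This downloads the OPT-1.3B weights from Hugging Face and runs generation with the default offloading policy.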
Highlighted Details
Maintenance & Community
The last recorded commit was about 9 months ago, and the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
Manual configuration of the offloading policy (via the --percent flag) is currently required.
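Assuming the flag keeps its FlexGen-style semantics, --percent takes six integers: the percentage of weights on GPU and CPU, of KV cache on GPU and CPU, and of activations on GPU and CPU, with the remainder of each pair spilling to disk. For example:

python3 -m flexllmgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0

places all weights in CPU memory while keeping the KV cache and activations on the GPU.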