Transformers-compatible library for LLM compression, optimized for vLLM deployment
This library provides a streamlined workflow for compressing large language models (LLMs) using various quantization and sparsity techniques, specifically targeting optimized deployment with vLLM. It's designed for researchers and engineers looking to reduce model size and improve inference speed without significant accuracy loss.
How It Works
LLM Compressor integrates seamlessly with Hugging Face models, allowing users to apply post-training quantization (PTQ) methods such as GPTQ, AWQ, and SmoothQuant, as well as sparsity algorithms. It supports weight-only quantization (e.g., W4A16) and weight-plus-activation quantization (e.g., INT8 W8A8 and FP8), along with semi-structured and unstructured sparsity. Compression schemes are defined as recipes, making runs flexible and reproducible. Compressed models are saved in safetensors format, directly compatible with vLLM for efficient inference.
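As a minimal sketch of the recipe-based workflow, a one-shot W4A16 GPTQ run might look like the following; the exact import path for oneshot, as well as the model ID and calibration dataset shown here, are illustrative and may differ across library versions.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Recipe: 4-bit weight-only GPTQ on all Linear layers, leaving the LM head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# One-shot post-training quantization with a small calibration set;
# the compressed model is written to output_dir in safetensors format.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative Hugging Face model ID
    dataset="open_platypus",                     # illustrative calibration dataset
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```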
Quick Start & Requirements
pip install llmcompressor
Requires transformers, accelerate, and safetensors. A GPU with CUDA is recommended for quantization.
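Once compressed, the saved checkpoint can be served directly by vLLM. A minimal sketch, assuming the illustrative output directory from the example above:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the compressed checkpoint directory (or a Hugging Face repo ID).
llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-W4A16")  # illustrative path from the sketch above

outputs = llm.generate(["What does quantization do?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```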
Maintenance & Community
The project is actively maintained by the vLLM team. Contributions are welcomed via GitHub issues and pull requests.
Licensing & Compatibility
The project is released under the Apache 2.0 license, permitting commercial use and integration with closed-source applications.
Limitations & Caveats
While the toolkit is comprehensive, the optimal compression scheme depends on the specific model and task and requires experimentation, as detailed in the documentation.