llm-compressor by vllm-project

Transformers-compatible library for LLM compression, optimized for vLLM deployment

created 1 year ago
1,713 stars

Top 25.4% on sourcepulse

View on GitHub
Project Summary

This library provides a streamlined workflow for compressing large language models (LLMs) using various quantization and sparsity techniques, specifically targeting optimized deployment with vLLM. It's designed for researchers and engineers looking to reduce model size and improve inference speed without significant accuracy loss.

How It Works

LLM Compressor integrates seamlessly with Hugging Face models, allowing users to apply post-training quantization (PTQ) methods like GPTQ, AWQ, and SmoothQuant, as well as sparsity algorithms. It supports weight-only and activation quantization (W8A8, W4A16, FP8) and semi-structured/unstructured sparsity. The library uses a recipe-based approach to define the compression scheme, making it flexible and reproducible. Compressed models are saved in safetensors format, directly compatible with vLLM for efficient inference.
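
For illustration, here is a minimal sketch of the recipe-based one-shot flow, modeled on the examples documented in the repository. The model name, calibration dataset, and exact import paths are illustrative assumptions and may differ between llmcompressor versions.

```python
# Sketch of a recipe-based one-shot compression run (illustrative; import
# paths, model, and dataset choices may differ between llmcompressor versions).
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# A recipe is an ordered list of modifiers describing the compression scheme.
recipe = [
    # Shift activation outliers into the weights so activations quantize cleanly.
    SmoothQuantModifier(smoothing_strength=0.8),
    # Quantize Linear layers to INT8 weights and activations, skipping the LM head.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# One-shot post-training quantization: calibrates on a small dataset and writes
# a compressed safetensors checkpoint that vLLM can load directly.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model choice
    dataset="open_platypus",                     # illustrative calibration set
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

Because the recipe fully specifies the compression scheme, re-applying it to the same checkpoint reproduces the run.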

Quick Start & Requirements

  • Install: pip install llmcompressor
  • Prerequisites: Python 3.8+, Hugging Face transformers, accelerate, safetensors. GPU with CUDA is recommended for quantization.
  • Examples: End-to-End Examples
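
As a quick-start sketch: FP8 dynamic quantization needs no calibration data, so it is the shortest path from a Hugging Face checkpoint to a vLLM-ready model. The scheme name and arguments below follow the repository's FP8 example and may change between versions; the model choice is illustrative.

```python
# Quick-start sketch: FP8 dynamic quantization (no calibration dataset needed).
# Scheme and argument names follow the project's FP8 example and may differ
# between llmcompressor versions.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    recipe=QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    ),
    output_dir="Meta-Llama-3-8B-Instruct-FP8-Dynamic",
)
```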

Highlighted Details

  • Supports weight and activation quantization schemes, including W8A8 (INT8), FP8, and weight-only W4A16.
  • Implements GPTQ, AWQ, SmoothQuant, and SparseGPT algorithms.
  • Integrates with Hugging Face models and the safetensors format for vLLM compatibility (see the serving sketch after this list).
  • Offers a recipe-based API for defining custom compression schemes.
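
A sketch of serving a compressed checkpoint with vLLM. It assumes the output directory written by the one-shot example above; vLLM's LLM and SamplingParams APIs are standard, but support for a given scheme depends on the vLLM build.

```python
# Serving sketch: load the compressed safetensors checkpoint directly in vLLM.
# Assumes the output directory from the one-shot example above and a vLLM
# build that supports the chosen quantization scheme.
from vllm import LLM, SamplingParams

llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-W8A8")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain weight-only quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```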

Maintenance & Community

The project is actively maintained by the vLLM team. Contributions are welcomed via GitHub issues and pull requests.

Licensing & Compatibility

The project is released under the Apache 2.0 license, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

While the library is comprehensive, the optimal compression scheme depends on the specific model and task, so some experimentation, as detailed in the documentation, is usually required.

Health Check

  • Last commit: 20 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 66
  • Issues (30d): 41

Star History

433 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

ctransformers by marella

Top 0.1%, 2k stars
Python bindings for fast Transformer model inference
created 2 years ago, updated 1 year ago

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

Top 0.1%, 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
created 1 year ago, updated 2 months ago