Transformers-compatible library for LLM compression, optimized for vLLM deployment
This library provides a streamlined workflow for compressing large language models (LLMs) using various quantization and sparsity techniques, specifically targeting optimized deployment with vLLM. It's designed for researchers and engineers looking to reduce model size and improve inference speed without significant accuracy loss.
How It Works
LLM Compressor integrates seamlessly with Hugging Face models, allowing users to apply post-training quantization (PTQ) methods such as GPTQ, AWQ, and SmoothQuant, as well as sparsity algorithms. It supports weight-only quantization (e.g., W4A16) and weight-plus-activation quantization (e.g., INT8 W8A8 and FP8), along with semi-structured and unstructured sparsity. Compression schemes are defined as recipes, making runs flexible and reproducible. Compressed models are saved in safetensors format, directly compatible with vLLM for efficient inference.
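As a minimal sketch of the recipe-based workflow, a one-shot W4A16 GPTQ run might look like the following; the exact import path for oneshot, as well as the model ID and calibration dataset shown here, are illustrative and may differ across library versions.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Recipe: 4-bit weight-only GPTQ on all Linear layers, leaving the LM head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# One-shot post-training quantization with a small calibration set;
# the compressed model is written to output_dir in safetensors format.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative Hugging Face model ID
    dataset="open_platypus",                     # illustrative calibration dataset
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```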
Quick Start & Requirements
pip install llmcompressor
Requires transformers, accelerate, and safetensors. A GPU with CUDA is recommended for quantization.
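Once compressed, the saved checkpoint can be served directly by vLLM. A minimal sketch, assuming the illustrative output directory from the example above:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the compressed checkpoint directory (or a Hugging Face repo ID).
llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-W4A16")  # illustrative path from the sketch above

outputs = llm.generate(["What does quantization do?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```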
Maintenance & Community
The project is actively maintained by the vLLM team. Contributions are welcomed via GitHub issues and pull requests.
Licensing & Compatibility
The project is released under the Apache 2.0 license, permitting commercial use and integration with closed-source applications.
Limitations & Caveats
While the toolkit is comprehensive, the optimal compression scheme depends on the specific model and task and requires experimentation, as detailed in the documentation.