compressed-tensors by vllm-project

Efficient safetensors extension for compressed LLM tensor storage

Created 1 year ago
255 stars

Top 98.7% on SourcePulse

Project Summary

compressed-tensors tackles the fragmentation in LLM model compression by extending the safetensors format into a unified, extensible solution. It enables efficient storage and management of diverse quantized and sparse tensor data, supporting popular techniques like GPTQ, AWQ, SmoothQuant, INT8, FP8, and various sparsity patterns. This library benefits developers and researchers by simplifying the integration of multiple compression methods, streamlining deployment pipelines, and reducing the overhead associated with managing disparate storage formats.

How It Works

The core innovation lies in extending safetensors to create a single, consistent format capable of representing a wide array of compression schemes. It supports granular quantization options, including weight-only (e.g., W4A16), activation (e.g., W8A8), KV cache, and non-uniform quantization across different layers. Additionally, it handles both unstructured and semi-structured sparsity patterns. This unified approach simplifies experimentation and deployment by abstracting away the complexities of individual compression techniques.
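To make the schemes above concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 weight quantization, the kind of scheme (e.g. W8A8) whose artifacts a unified checkpoint format has to persist side by side: the int8 values plus a float scale. This is illustrative only and is not compressed-tensors' actual storage layout or API.

```python
# Symmetric per-tensor INT8 quantization: a quantized checkpoint must store
# both the int8 values and the scale needed to recover approximate floats.
# Illustrative sketch only; not compressed-tensors' actual format or API.

def quantize_int8(weights):
    """Map float weights to int8 values and a single float scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    qvals = [max(-128, min(127, round(w / scale))) for w in weights]
    return qvals, scale

def dequantize_int8(qvals, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in qvals]

weights = [0.5, -1.27, 0.02, 1.0]
qvals, scale = quantize_int8(weights)
restored = dequantize_int8(qvals, scale)
# Each restored weight lies within one quantization step (scale) of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

A real checkpoint generalizes this idea: per-channel or per-group scales, zero points for asymmetric schemes, and distinct configurations per layer, all of which the unified format must record alongside the packed values.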

Quick Start & Requirements

  • Installation:
    • Stable release: pip install compressed-tensors
    • Nightly release: pip install --pre compressed-tensors
    • From source: git clone https://github.com/vllm-project/compressed-tensors && cd compressed-tensors && pip install -e .
  • Prerequisites: PyTorch, Hugging Face Transformers. CUDA is required for the provided Post-Training Quantization (PTQ) examples.
  • Resources: Links to example directories and notebooks for in-depth tutorials are available within the repository.

Highlighted Details

  • Unified checkpoint format supporting diverse compression schemes (GPTQ, AWQ, SmoothQuant, INT8, FP8, etc.).
  • Flexible quantization support: weight-only, activation, KV cache, and non-uniform schemes.
  • Comprehensive sparsity handling: unstructured and semi-structured (e.g., 2:4) patterns.
  • Designed for seamless integration with Hugging Face models and PyTorch ecosystems.
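The semi-structured 2:4 pattern mentioned above can be sketched in a few lines: in every group of four weights, only the two largest-magnitude values survive, so a compressed checkpoint needs to store just the kept values plus a keep-mask. This is a conceptual example, not compressed-tensors' on-disk bitmask format.

```python
# 2:4 semi-structured sparsity: each group of four weights keeps its two
# largest-magnitude entries; the rest are zeroed. The compressed form is
# half the values plus a boolean mask. Conceptual sketch only; this is
# not compressed-tensors' actual on-disk representation.

def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each group of four."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(g if j in keep else 0.0 for j, g in enumerate(group))
    return pruned

def pack_2_of_4(pruned):
    """Store only nonzero values plus a keep-mask (the compressed form)."""
    mask = [w != 0.0 for w in pruned]
    values = [w for w in pruned if w != 0.0]
    return values, mask

dense = [0.9, -0.1, 0.05, -1.2, 0.3, 0.0, -0.7, 0.2]
pruned = prune_2_of_4(dense)        # [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
values, mask = pack_2_of_4(pruned)  # 4 values + an 8-entry mask instead of 8 floats
assert len(values) == len(dense) // 2
```

The 2:4 pattern matters in practice because NVIDIA Ampere-and-later GPUs can execute it on sparse tensor cores, which is why a checkpoint format aimed at vLLM deployment treats it as a first-class layout.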

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The README does not explicitly state the project's license type or provide compatibility notes for commercial use or integration with closed-source projects.

Limitations & Caveats

The README focuses on the library's capabilities and does not explicitly detail limitations, alpha status, or known bugs. The advanced quantization examples require a CUDA-enabled environment.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
50
Issues (30d)
3
Star History
18 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 20 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Dan Guido (Cofounder of Trail of Bits), and 6 more.

llm-compressor by vllm-project

0.7%
3k
Transformers-compatible library for LLM compression, optimized for vLLM deployment
Created 1 year ago
Updated 22 hours ago