compressed-tensors by vllm-project

Efficient safetensors extension for compressed LLM tensor storage

Created 1 year ago
255 stars

Top 98.7% on SourcePulse

Project Summary

compressed-tensors tackles the fragmentation in LLM model compression by extending the safetensors format into a unified, extensible solution. It enables efficient storage and management of diverse quantized and sparse tensor data, supporting popular techniques like GPTQ, AWQ, SmoothQuant, INT8, FP8, and various sparsity patterns. This library benefits developers and researchers by simplifying the integration of multiple compression methods, streamlining deployment pipelines, and reducing the overhead associated with managing disparate storage formats.

How It Works

The core innovation lies in extending safetensors to create a single, consistent format capable of representing a wide array of compression schemes. It supports granular quantization options, including weight-only (e.g., W4A16), activation (e.g., W8A8), KV cache, and non-uniform quantization across different layers. Additionally, it handles both unstructured and semi-structured sparsity patterns. This unified approach simplifies experimentation and deployment by abstracting away the complexities of individual compression techniques.
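To make the schemes above concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 weight quantization, the kind of scheme (e.g. W8A8) whose artifacts a unified checkpoint format has to persist side by side: the int8 values plus a float scale. This is illustrative only and is not compressed-tensors' actual storage layout or API.

```python
# Symmetric per-tensor INT8 quantization: a quantized checkpoint must store
# both the int8 values and the scale needed to recover approximate floats.
# Illustrative sketch only; not compressed-tensors' actual format or API.

def quantize_int8(weights):
    """Map float weights to int8 values and a single float scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    qvals = [max(-128, min(127, round(w / scale))) for w in weights]
    return qvals, scale

def dequantize_int8(qvals, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in qvals]

weights = [0.5, -1.27, 0.02, 1.0]
qvals, scale = quantize_int8(weights)
restored = dequantize_int8(qvals, scale)
# Each restored weight lies within one quantization step (scale) of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

A real checkpoint generalizes this idea: per-channel or per-group scales, zero points for asymmetric schemes, and distinct configurations per layer, all of which the unified format must record alongside the packed values.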

Quick Start & Requirements

  • Installation:
    • Stable release: pip install compressed-tensors
    • Nightly release: pip install --pre compressed-tensors
    • From source: git clone https://github.com/vllm-project/compressed-tensors && cd compressed-tensors && pip install -e .
  • Prerequisites: PyTorch, Hugging Face Transformers. CUDA is required for the provided Post-Training Quantization (PTQ) examples.
  • Resources: Links to example directories and notebooks for in-depth tutorials are available within the repository.

Highlighted Details

  • Unified checkpoint format supporting diverse compression schemes (GPTQ, AWQ, SmoothQuant, INT8, FP8, etc.).
  • Flexible quantization support: weight-only, activation, KV cache, and non-uniform schemes.
  • Comprehensive sparsity handling: unstructured and semi-structured (e.g., 2:4) patterns.
  • Designed for seamless integration with Hugging Face models and PyTorch ecosystems.
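The semi-structured 2:4 pattern mentioned above can be sketched in a few lines: in every group of four weights, only the two largest-magnitude values survive, so a compressed checkpoint needs to store just the kept values plus a keep-mask. This is a conceptual example, not compressed-tensors' on-disk bitmask format.

```python
# 2:4 semi-structured sparsity: each group of four weights keeps its two
# largest-magnitude entries; the rest are zeroed. The compressed form is
# half the values plus a boolean mask. Conceptual sketch only; this is
# not compressed-tensors' actual on-disk representation.

def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each group of four."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(g if j in keep else 0.0 for j, g in enumerate(group))
    return pruned

def pack_2_of_4(pruned):
    """Store only nonzero values plus a keep-mask (the compressed form)."""
    mask = [w != 0.0 for w in pruned]
    values = [w for w in pruned if w != 0.0]
    return values, mask

dense = [0.9, -0.1, 0.05, -1.2, 0.3, 0.0, -0.7, 0.2]
pruned = prune_2_of_4(dense)        # [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
values, mask = pack_2_of_4(pruned)  # 4 values + an 8-entry mask instead of 8 floats
assert len(values) == len(dense) // 2
```

The 2:4 pattern matters in practice because NVIDIA Ampere-and-later GPUs can execute it on sparse tensor cores, which is why a checkpoint format aimed at vLLM deployment treats it as a first-class layout.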

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The README does not explicitly state the project's license type or provide compatibility notes for commercial use or integration with closed-source projects.

Limitations & Caveats

The README focuses on the library's capabilities and does not explicitly detail limitations, alpha status, or known bugs. The advanced quantization examples require a CUDA-enabled environment.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
50
Issues (30d)
3
Star History
18 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 20 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Dan Guido (Cofounder of Trail of Bits), and 6 more.

llm-compressor by vllm-project

0.7%
3k
Transformers-compatible library for LLM compression, optimized for vLLM deployment
Created 1 year ago
Updated 22 hours ago