paroquant by z-lab

Efficient LLM inference via novel quantization

Created 9 months ago

319 stars

Top 84.7% on SourcePulse

Summary

ParoQuant, presented at ICLR 2026, is a state-of-the-art INT4 quantization method for efficient Large Language Model (LLM) inference. Aimed at engineers and researchers, it reduces model size and computational overhead while preserving accuracy, making LLMs faster and more accessible to deploy for reasoning tasks.
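To put the size reduction in concrete terms, the sketch below estimates weight storage at different bit widths. The 8-billion parameter count, the group size of 128, and the 16-bit scale per group are illustrative assumptions, not ParoQuant's documented configuration, and the helper name is hypothetical. The estimate covers weights only and ignores activations and the KV cache.

```python
def quantized_size_gb(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    If group_size is given, each group of weights is assumed to share one
    scale of scale_bits bits, which adds scale_bits / group_size bits per weight.
    """
    bits = float(bits_per_weight)
    if group_size is not None:
        bits += scale_bits / group_size
    return num_params * bits / 8 / 1e9


# Hypothetical 8B-parameter model, used only for illustration.
params = 8e9
fp16_gb = quantized_size_gb(params, 16)
int4_gb = quantized_size_gb(params, 4, group_size=128)
print(f"FP16: {fp16_gb:.3f} GB, INT4 (group 128): {int4_gb:.3f} GB")
```

Under these assumptions the per-group scales add 0.125 bits per weight, so INT4 storage is slightly more than a quarter of the FP16 footprint.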

How It Works

The core innovation is learned pairwise rotations, which suppress the weight outliers that would otherwise inflate the quantization range and waste INT4 precision. This lets ParoQuant bring INT4 accuracy close to that of FP16 models, an improvement over earlier quantization methods. The design is also optimized for fast inference, with speeds competitive with established techniques such as AWQ.
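The effect of a rotation on outliers can be illustrated with a toy example. The sketch below is not ParoQuant's implementation: it assumes that a "pairwise rotation" can be modeled as a plane (Givens) rotation of two weight columns, it uses a fixed 45-degree angle where ParoQuant learns its rotations, and it uses plain per-row round-to-nearest INT4 quantization. All function names are hypothetical.

```python
import numpy as np


def givens_rotate_pair(w, i, j, theta):
    """Rotate columns i and j of w by angle theta (an orthogonal transform)."""
    out = w.copy()
    c, s = np.cos(theta), np.sin(theta)
    out[:, i] = c * w[:, i] - s * w[:, j]
    out[:, j] = s * w[:, i] + c * w[:, j]
    return out


def quantize_int4(w):
    """Symmetric per-row round-to-nearest INT4; returns the dequantized weights."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale


# One row with a single large outlier in column 0.
w = np.array([[8.0, 0.0, 0.5, -0.5]])
theta = np.pi / 4  # fixed angle for illustration; ParoQuant learns its rotations

w_rot = givens_rotate_pair(w, 0, 1, theta)

err_plain = np.sum((quantize_int4(w) - w) ** 2)
err_rot = np.sum((quantize_int4(w_rot) - w_rot) ** 2)
print("max |w| before/after rotation:", np.abs(w).max(), np.abs(w_rot).max())
print("squared quantization error before/after:", err_plain, err_rot)
```

In this toy case the rotation spreads the single outlier across two columns, so the per-row scale shrinks and the small weights are quantized less coarsely. Because the rotation is orthogonal, it can be undone exactly with the opposite angle.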

Quick Start & Requirements

Install via pip: pip install "paroquant[vllm]" for NVIDIA GPUs or pip install "paroquant[mlx]" for Apple Silicon. NVIDIA GPU setups require CUDA 12.9 or 13.0 together with the associated vLLM and PyTorch versions. Docker images are available for chat and API serving on NVIDIA GPUs, and models are hosted on Hugging Face.
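As a small convenience, the choice between the two extras can be scripted. The sketch below is a hypothetical helper, not part of ParoQuant: it assumes that macOS on arm64 means Apple Silicon and that every other platform is an NVIDIA GPU host, which will not hold for machines without a supported GPU or CUDA version.

```python
import platform


def pick_paroquant_extra(system, machine):
    """Return the pip extra name for a platform (assumption-based heuristic)."""
    if system == "Darwin" and machine == "arm64":
        return "mlx"   # Apple Silicon
    return "vllm"      # assumed NVIDIA GPU host


def install_command(system=None, machine=None):
    """Build the pip command for the given platform, or for the current one."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return f'pip install "paroquant[{pick_paroquant_extra(system, machine)}]"'


print(install_command())
```

The helper only prints the command; running it, and checking the CUDA, vLLM, and PyTorch versions, is left to the user.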

Highlighted Details

Achieves state-of-the-art INT4 quantization for LLMs.
Employs learned pairwise rotations to mitigate weight outlier impact.
Minimizes the accuracy gap between INT4 and FP16 precision.
Delivers inference speeds competitive with AWQ.
Supports deployment on NVIDIA GPUs (via vLLM integration) and Apple Silicon (via MLX).

Maintenance & Community

The project's main branch is under active development. Reproducibility for the ICLR 2026 paper is guaranteed on a legacy branch. No specific community channels (e.g., Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. This absence prevents a definitive assessment of compatibility for commercial use or integration into closed-source projects.

Limitations & Caveats

The primary limitation is that reproducibility is not guaranteed on the main development branch; users who need stable, paper-verified results must use the legacy branch. In addition, the missing license information poses a significant adoption risk for commercial applications.
