paroquant by z-lab

Efficient LLM inference via novel quantization

Created 7 months ago
294 stars

Top 89.7% on SourcePulse

Project Summary

ParoQuant, presented at ICLR 2026, is a state-of-the-art INT4 quantization method for efficient Large Language Model (LLM) inference. Aimed at engineers and researchers, it reduces model size and computational overhead while preserving accuracy, which makes LLM deployment for reasoning tasks faster and more accessible.
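As a rough illustration of the size reduction, the arithmetic below compares FP16 and INT4 weight storage. The 7-billion-parameter count, the group size of 128, and the 4 bytes of per-group metadata (an FP16 scale plus an FP16 zero point) are assumptions chosen for the example, not figures from the project.

```python
def weight_bytes(n_params, bits, group_size=None, meta_bytes_per_group=0):
    """Approximate weight storage in bytes, plus optional per-group metadata."""
    total = n_params * bits / 8
    if group_size:
        total += (n_params / group_size) * meta_bytes_per_group
    return total

# Hypothetical model size and quantization layout (assumptions, see lead-in).
n_params = 7_000_000_000
fp16 = weight_bytes(n_params, 16)
int4 = weight_bytes(n_params, 4, group_size=128, meta_bytes_per_group=4)
ratio = fp16 / int4

print(f"FP16: {fp16 / 1e9:.2f} GB, INT4: {int4 / 1e9:.2f} GB, ratio: {ratio:.2f}x")
```

Under these assumptions the ratio is a little below 4x, because the per-group metadata adds a quarter of a bit per weight.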

How It Works

The core technique is learned pairwise rotations, which suppress weight outliers in LLMs. With the outliers suppressed, ParoQuant's INT4 models reach accuracy comparable to FP16 models, a significant improvement over traditional quantization methods. The design is also optimized for inference speed, which is competitive with established techniques such as AWQ.
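The idea of a pairwise rotation can be sketched in NumPy as follows. This is a minimal illustration, not the project's implementation: it rotates a single pair of input channels, picks the angle by brute-force search as a stand-in for learning, and uses a simple per-row round-to-nearest INT4 quantizer. All names, shapes, and the choice of channels 3 and 5 are hypothetical.

```python
import numpy as np

def quantize_int4(w):
    # Symmetric round-to-nearest INT4 with one scale per output row;
    # returns dequantized weights so errors can be compared directly.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)
    return np.clip(np.round(w / scale), -8, 7) * scale

def pair_rotation(n, i, j, theta):
    # n x n orthogonal matrix that rotates only coordinates i and j.
    r = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    r[i, i], r[i, j] = c, -s
    r[j, i], r[j, j] = s, c
    return r

def quant_mse(w):
    # Mean squared error introduced by quantizing w.
    return float(np.mean((w - quantize_int4(w)) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(8, 16))  # toy weight matrix (out, in)
w[:, 3] *= 20.0                           # make input channel 3 an outlier
x = rng.normal(size=16)                   # toy activation vector

# Search the rotation angle for the pair (3, 5); angle 0 means "no rotation".
angles = np.linspace(0.0, np.pi / 2, 46)
errors = [quant_mse(w @ pair_rotation(16, 3, 5, t)) for t in angles]
best_theta = float(angles[int(np.argmin(errors))])
rot = pair_rotation(16, 3, 5, best_theta)

base_err = quant_mse(w)
best_err = min(errors)

# The rotation is orthogonal, so rotating the weights and counter-rotating
# the input leaves the full-precision layer output unchanged.
y_ref = w @ x
y_rot = (w @ rot) @ (rot.T @ x)

print(f"best angle: {best_theta:.3f} rad")
print(f"quantization MSE without rotation: {base_err:.3e}")
print(f"quantization MSE with rotation:    {best_err:.3e}")
print("outputs match:", np.allclose(y_ref, y_rot))
```

Because the search includes the zero angle, the rotated error here can never exceed the unrotated one. A single pair can only spread an outlier across two channels, so a practical method would need many such rotations.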

Quick Start & Requirements

Install with pip: pip install "paroquant[vllm]" for NVIDIA GPUs or pip install "paroquant[mlx]" for Apple Silicon. NVIDIA GPU setups require CUDA 12.9 or 13.0 and the associated vLLM/PyTorch versions. Docker images are available for chat and API serving on NVIDIA GPUs. Models are hosted on Hugging Face.

Highlighted Details

  • Achieves state-of-the-art INT4 quantization for LLMs.
  • Employs learned pairwise rotations to mitigate weight outlier impact.
  • Minimizes the accuracy gap between INT4 and FP16 precision.
  • Delivers inference speeds competitive with AWQ.
  • Supports deployment on NVIDIA GPUs (via vLLM integration) and Apple Silicon (via MLX).

Maintenance & Community

The project's main branch is under active development. Reproducibility for the ICLR 2026 paper is guaranteed on a legacy branch. No specific community channels (e.g., Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. This absence prevents a definitive assessment of compatibility for commercial use or integration into closed-source projects.

Limitations & Caveats

The primary limitation is that reproducibility is not guaranteed on the main development branch; users who need stable, paper-verified results must use the legacy branch. The absence of explicit licensing information is also a significant adoption risk for commercial applications.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 9
  • Star History: 65 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Zack Li (Cofounder of Nexa AI), and 4 more.

smoothquant by mit-han-lab

0.1%
2k
Post-training quantization research paper for large language models
Created 3 years ago
Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.3%
4k
Weight quantization research paper for LLM compression/acceleration
Created 3 years ago
Updated 10 months ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

1.1%
18k
Inference optimization for LLMs on low-resource hardware
Created 3 years ago
Updated 2 months ago