apex-quant  by mudler

MoE-aware mixed-precision quantization for LLMs

Created 1 month ago
314 stars

Top 85.7% on SourcePulse

Project Summary

APEX is an MoE-aware mixed-precision quantization technique for llama.cpp that aims to reduce model size and speed up inference while preserving, and in some benchmarks improving, accuracy. It targets engineers and researchers deploying large Mixture-of-Experts (MoE) models, and reports up to 2x smaller models than other quantization methods along with better scores across various benchmarks.

How It Works

APEX moves beyond uniform quantization by classifying tensors according to their role in the MoE architecture (routed expert, shared expert, attention/SSM) and assigning precision per role. It leverages MoE sparsity, where only a few experts are active per token, and applies a layer-wise precision gradient: sensitive edge layers get higher precision, while middle layers are compressed more aggressively. "I-variants" further improve real-world accuracy by building the calibration imatrix from a diverse dataset (chat, code, reasoning, tool-calling) instead of Wikipedia, which leads to lower KL divergence and better downstream task performance.
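The role classification and layer-wise gradient described above can be sketched roughly as follows. This is an illustrative sketch only, not APEX's actual recipe: the tensor-name patterns assume llama.cpp-style GGUF names, and the `ROLE_TYPES` table, the `edge_layers` parameter, and the specific quantization types chosen per role are hypothetical placeholders.

```python
import re

# Assumed llama.cpp-style GGUF tensor-name patterns (not taken from APEX itself).
ROLE_PATTERNS = [
    ("routed_expert", re.compile(r"ffn_(gate|up|down)_exps")),
    ("shared_expert", re.compile(r"ffn_(gate|up|down)_shexp")),
    ("attention_ssm", re.compile(r"\.(attn|ssm)_")),
]

# Hypothetical (edge-layer type, middle-layer type) per role; placeholders only.
ROLE_TYPES = {
    "routed_expert": ("Q5_K", "Q3_K"),
    "shared_expert": ("Q8_0", "Q6_K"),
    "attention_ssm": ("Q8_0", "Q6_K"),
    "other": ("Q8_0", "Q8_0"),
}


def classify(name):
    """Return the role of a tensor based on its name."""
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(name):
            return role
    return "other"


def layer_index(name):
    """Extract the block index from names like 'blk.12.attn_q.weight'."""
    match = re.match(r"blk\.(\d+)\.", name)
    return int(match.group(1)) if match else None


def pick_type(name, n_layers, edge_layers=2):
    """Pick a quantization type from the tensor's role and layer position."""
    edge_type, middle_type = ROLE_TYPES[classify(name)]
    idx = layer_index(name)
    if idx is None:
        # Tensors outside the layer stack (e.g. embeddings) use the edge type.
        return edge_type
    is_edge = idx < edge_layers or idx >= n_layers - edge_layers
    return edge_type if is_edge else middle_type
```

With these placeholder settings, `pick_type("blk.15.ffn_down_exps.weight", 32)` returns `"Q3_K"`, while the same tensor in block 0 returns `"Q5_K"`; the real tier definitions live in the repository's scripts.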

Quick Start & Requirements

Clone the repository (https://github.com/mudler/apex-quant.git) and run the provided ./scripts/quantize.sh script on an F16 GGUF model, for example: ./scripts/quantize.sh --i-quality model-f16.gguf model-apex-i-quality.gguf. Full pipelines that start from a HuggingFace model ID are also supported. Quantizing requires an existing F16 GGUF model and a compatible llama.cpp build. The published benchmarks were run on NVIDIA hardware with CUDA 13.0. APEX-quantized models work out of the box with LocalAI (https://github.com/mudler/LocalAI).

Highlighted Details

  • Reports perplexity comparable to or better than Q8_0 at half the size, and outperforms F16 in the reported benchmarks.
  • Significantly outperforms Unsloth Dynamic 2.0 (UD) quantizations in size, perplexity, and speed, with up to 2x smaller model sizes.
  • Provides five quantization tiers, from "I-Quality" (21.3 GB) to "Mini" (12.2 GB), catering to various deployment scenarios from maximum accuracy to consumer GPU inference.
  • "I-variants" improve downstream accuracy and reduce KL divergence by using a diverse calibration imatrix.
  • Includes an upstream fix for llama.cpp enabling accuracy benchmarks on hybrid MoE models.
  • Optional integration with TurboQuant+ provides ~14% prompt processing speedup at 8K context via KV cache compression.
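
The KL divergence mentioned above is not defined in this summary. A minimal sketch, assuming the standard definition (per-token KL between the next-token distribution of a reference model such as F16 and that of the quantized model, averaged over tokens), might look like this. The function names and the plain-list logit format are hypothetical and are not part of APEX or llama.cpp.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(ref_rows, quant_rows):
    """Average per-token KL over a sequence of logit rows."""
    values = [kl_divergence(r, q) for r, q in zip(ref_rows, quant_rows)]
    return sum(values) / len(values)
```

A value near zero means the quantized model's token distribution closely tracks the reference; identical logits give exactly zero divergence up to floating-point error.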

Maintenance & Community

Developed by the LocalAI team and built on llama.cpp by Georgi Gerganov and contributors. No dedicated community channels (Discord, Slack) are listed.

Licensing & Compatibility

Released under the MIT License, which permits commercial use and linking. Quantization works with stock llama.cpp; no custom build is required. APEX GGUF models are directly usable with LocalAI.

Limitations & Caveats

Primarily targets MoE models compatible with llama.cpp. Accuracy benchmark evaluation on certain hybrid MoE models requires a specific llama.cpp patch. "I-variants" may show slightly higher perplexity on standard benchmarks like wikitext due to their focus on broader real-world task performance.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 6
  • Star history: 90 stars in the last 30 days
