apex-quant by localai-org

MoE-aware mixed-precision quantization for LLMs

Created 3 months ago

384 stars

Top 74.1% on SourcePulse

2 Experts Love This Project

Ettore Di Giacinto

Author of LocalAI

Wing Lian

Founder of Axolotl AI

Project Summary

APEX is a MoE-aware mixed-precision quantization technique for llama.cpp, designed to reduce model size and accelerate inference while preserving or improving accuracy. It targets engineers and researchers deploying large Mixture-of-Experts (MoE) models. The project reports up to 2x smaller files than competing quantizations, along with better perplexity and speed in its benchmarks.

How It Works

APEX moves beyond uniform quantization by classifying tensors based on their role within MoE architectures (routed expert, shared expert, attention/SSM) and assigning precision adaptively. It leverages MoE sparsity, where only a few experts are active per token, and applies a layer-wise precision gradient, giving higher precision to sensitive edge layers and compressing middle layers more aggressively. "I-variants" further enhance real-world accuracy by using a diverse calibration dataset (chat, code, reasoning, tool-calling) instead of Wikipedia, leading to lower KL divergence and better downstream task performance.
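
The role-based classification and layer-wise precision gradient described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual rule set: the tensor-name patterns are assumptions loosely modelled on llama.cpp GGUF naming, and the `PRECISION` table, the `edge` width, and all function names are hypothetical. The source does not state which quantization type APEX assigns to which tensor.

```python
import re

# ASSUMPTION: name patterns loosely modelled on llama.cpp GGUF tensor names.
ROLE_PATTERNS = [
    ("routed_expert", re.compile(r"\.ffn_(gate|up|down)_exps\.")),
    ("shared_expert", re.compile(r"\.ffn_(gate|up|down)_shexp\.")),
    ("attention_ssm", re.compile(r"\.(attn|ssm)_")),
]

# HYPOTHETICAL table: (role, is_edge_layer) -> quantization type.
# Chosen only to show the idea; not the values APEX uses.
PRECISION = {
    ("routed_expert", True): "Q6_K",
    ("routed_expert", False): "Q4_K",   # sparse, mid-stack: compress hardest
    ("shared_expert", True): "Q8_0",
    ("shared_expert", False): "Q6_K",
    ("attention_ssm", True): "Q8_0",
    ("attention_ssm", False): "Q8_0",
}


def classify(name):
    """Return the role of a tensor based on its name."""
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(name):
            return role
    return "other"


def layer_index(name):
    """Extract the block index from names such as 'blk.12.attn_q.weight'."""
    match = re.match(r"blk\.(\d+)\.", name)
    return int(match.group(1)) if match else None


def is_edge_layer(layer, n_layers, edge=2):
    """Treat the first and last `edge` layers as precision-sensitive."""
    return layer < edge or layer >= n_layers - edge


def assign_precision(name, n_layers, edge=2):
    """Pick a quantization type from the tensor's role and layer position."""
    role = classify(name)
    layer = layer_index(name)
    if role == "other" or layer is None:
        return "Q8_0"  # conservative default for embeddings, norms, routers
    return PRECISION[(role, is_edge_layer(layer, n_layers, edge))]


if __name__ == "__main__":
    for tensor in ["blk.0.ffn_down_exps.weight",
                   "blk.20.ffn_down_exps.weight",
                   "blk.20.ffn_up_shexp.weight",
                   "blk.20.attn_q.weight"]:
        print(tensor, "->", assign_precision(tensor, n_layers=40))
```

The point of the sketch is the two-key lookup: routed experts tolerate the most compression because only a few are active per token, while the layer position decides how far each role is pushed.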

Quick Start & Requirements

Clone the repository (https://github.com/mudler/apex-quant.git) and run the provided ./scripts/quantize.sh script on an F16 GGUF model, for example: ./scripts/quantize.sh --i-quality model-f16.gguf model-apex-i-quality.gguf. Full pipelines starting from a HuggingFace model ID are also supported. Prerequisites are an existing F16 GGUF model and a compatible llama.cpp build. Benchmarks were conducted on NVIDIA hardware with CUDA 13.0. APEX-quantized models work out of the box with LocalAI (https://github.com/mudler/LocalAI).
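
For scripted use, the two quick-start steps can be wrapped in a small driver. This is a sketch under stated assumptions: only the repository URL, the script path, and the `--i-quality` flag come from the text above, while the function names, the `apex-quant` checkout directory, and the `profile_flag` parameter are hypothetical. No other flags of quantize.sh are assumed.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/mudler/apex-quant.git"


def clone_command(dest="apex-quant"):
    """Build the git command that clones the repository into `dest`."""
    return ["git", "clone", REPO_URL, dest]


def quantize_command(src, dst, profile_flag="--i-quality"):
    """Build the quantize.sh invocation shown in the quick start.

    ASSUMPTION: `--i-quality` is the only flag documented here; pass
    other values for `profile_flag` only if the repository documents them.
    """
    return ["./scripts/quantize.sh", profile_flag, str(src), str(dst)]


def run_quick_start(src, dst, repo_dir="apex-quant"):
    """Clone the repository if needed, then quantize `src` into `dst`."""
    if not Path(src).is_file():
        raise FileNotFoundError(f"F16 GGUF model not found: {src}")
    if not Path(repo_dir).is_dir():
        subprocess.run(clone_command(repo_dir), check=True)
    # Resolve paths first, because the script runs from inside the checkout.
    cmd = quantize_command(Path(src).resolve(), Path(dst).resolve())
    subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    print(" ".join(quantize_command("model-f16.gguf",
                                    "model-apex-i-quality.gguf")))
```

Keeping command construction separate from execution makes the invocation easy to inspect or log before anything is run.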

Highlighted Details

Achieves perplexity comparable to or better than Q8_0 at half the size, and outperforms F16.
Significantly outperforms Unsloth Dynamic 2.0 (UD) quantizations in size, perplexity, and speed, with up to 2x smaller model files.
Provides five quantization tiers, from "I-Quality" (21.3 GB) to "Mini" (12.2 GB), catering to various deployment scenarios from maximum accuracy to consumer GPU inference.
"I-variants" improve downstream accuracy and reduce KL divergence by using a diverse calibration imatrix.
Includes an upstream fix for llama.cpp enabling accuracy benchmarks on hybrid MoE models.
Optional integration with TurboQuant+ provides ~14% prompt processing speedup at 8K context via KV cache compression.
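
The two accuracy metrics cited in these highlights, perplexity and KL divergence, can be illustrated with a short sketch. It uses the standard textbook definitions on toy inputs; it is not the project's evaluation code, and the function names and example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(P || Q) in nats: how far the quantized distribution Q drifts
    from the reference distribution P for one token position."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # Toy logits for a 3-token vocabulary (made-up numbers).
    reference = softmax([2.0, 1.0, 0.1])   # e.g. the F16 model
    quantized = softmax([1.9, 1.1, 0.1])   # e.g. a quantized model
    print("KL divergence:", kl_divergence(reference, quantized))
    print("Perplexity:", perplexity([math.log(0.5), math.log(0.25)]))
```

Perplexity scores a model against a fixed text, whereas KL divergence compares the quantized model's output distribution directly with the reference model's. A quantization can therefore score slightly worse on one metric while improving on the other.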

Maintenance & Community

Developed by the LocalAI team and built on llama.cpp by Georgi Gerganov and contributors. No specific community channels (Discord, Slack) are listed.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and linking. Quantization works with stock llama.cpp; no custom build is required. APEX GGUF models are directly usable with LocalAI.

Limitations & Caveats

Primarily targets MoE models compatible with llama.cpp. Accuracy benchmark evaluation on certain hybrid MoE models requires a specific llama.cpp patch. "I-variants" may show slightly higher perplexity on standard benchmarks like wikitext due to their focus on broader real-world task performance.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Star History

36 stars in the last 30 days