mudler/apex-quant: MoE-aware mixed-precision quantization for LLMs
APEX is an MoE-aware mixed-precision quantization technique for llama.cpp, designed to reduce model size and speed up inference while preserving, and in some cases improving, accuracy. It targets engineers and researchers deploying large Mixture-of-Experts (MoE) models. Reported benefits include up to 2x size reduction compared to other methods and improved metrics across various benchmarks.
How It Works
APEX moves beyond uniform quantization by classifying tensors based on their role within MoE architectures (routed expert, shared expert, attention/SSM) and assigning precision adaptively. It leverages MoE sparsity, where only a few experts are active per token, and applies a layer-wise precision gradient, giving higher precision to sensitive edge layers and compressing middle layers more aggressively. "I-variants" further enhance real-world accuracy by using a diverse calibration dataset (chat, code, reasoning, tool-calling) instead of Wikipedia, leading to lower KL divergence and better downstream task performance.
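The classification and layer-wise gradient described above can be sketched in a few lines of Python. This is a minimal illustration, not APEX's actual rule set. The tensor-name patterns are loosely modeled on llama.cpp GGUF naming and are assumptions. The precision table, the `edge` width, and every function name are hypothetical.

```python
import re

# Hypothetical role patterns, loosely modeled on llama.cpp GGUF tensor names.
ROLE_PATTERNS = [
    ("routed_expert", re.compile(r"ffn_(gate|up|down)_exps")),
    ("shared_expert", re.compile(r"ffn_(gate|up|down)_shexp")),
    ("attention_ssm", re.compile(r"(attn_|ssm_)")),
]

# Hypothetical precision table: (edge-layer type, middle-layer type) per role.
# Routed experts are compressed hardest because only a few are active per token.
PRECISION = {
    "routed_expert": ("Q5_K", "Q3_K"),
    "shared_expert": ("Q8_0", "Q6_K"),
    "attention_ssm": ("Q8_0", "Q6_K"),
    "other": ("Q8_0", "Q8_0"),
}


def classify(name):
    """Return the role of a tensor based on its name."""
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(name):
            return role
    return "other"


def layer_index(name):
    """Extract the block index from names like 'blk.12.attn_q.weight'."""
    match = re.match(r"blk\.(\d+)\.", name)
    return int(match.group(1)) if match else None


def is_edge(layer, n_layers, edge=2):
    """True for the first and last `edge` layers, which get higher precision."""
    return layer < edge or layer >= n_layers - edge


def assign(name, n_layers, edge=2):
    """Pick a quantization type from the tensor's role and layer position."""
    edge_type, middle_type = PRECISION[classify(name)]
    layer = layer_index(name)
    if layer is None:
        # Tensors outside the blocks (embeddings, output head) stay at edge precision.
        return edge_type
    return edge_type if is_edge(layer, n_layers, edge) else middle_type


if __name__ == "__main__":
    for tensor in ("blk.0.ffn_gate_exps.weight",
                   "blk.16.ffn_down_exps.weight",
                   "blk.16.attn_q.weight"):
        print(tensor, "->", assign(tensor, n_layers=32))
```

The two-level edge/middle split stands in for the gradient. A real scheme could use more than two levels or a per-layer sensitivity measure.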
Quick Start & Requirements
Clone the repository (https://github.com/mudler/apex-quant.git) and run the provided ./scripts/quantize.sh script on an F16 GGUF model, for example:
./scripts/quantize.sh --i-quality model-f16.gguf model-apex-i-quality.gguf
Full pipelines starting from HuggingFace model IDs are also supported. Requirements: an F16 GGUF model and a compatible llama.cpp build. Benchmarks were conducted on NVIDIA hardware with CUDA 13.0. APEX-quantized models work out of the box with LocalAI (https://github.com/mudler/LocalAI).
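For batch jobs, the quantize command can be assembled from Python. The sketch below builds only the invocation given in the example above. The helper names are hypothetical, the default script path assumes the current directory is the cloned repository, and no flags beyond --i-quality are assumed to exist.

```python
import subprocess
from pathlib import Path


def build_quantize_command(src, dst, script="./scripts/quantize.sh"):
    """Build the argument list for the documented --i-quality invocation."""
    if Path(src).suffix != ".gguf" or Path(dst).suffix != ".gguf":
        raise ValueError("both input and output are expected to be .gguf files")
    return [script, "--i-quality", str(src), str(dst)]


def run_quantize(src, dst):
    """Run the script from the repository root; raises if it exits non-zero."""
    subprocess.run(build_quantize_command(src, dst), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to execute.
    print(" ".join(build_quantize_command("model-f16.gguf",
                                          "model-apex-i-quality.gguf")))
```

Passing the arguments as a list avoids shell quoting problems with model paths that contain spaces.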
Highlighted Details
A llama.cpp patch enables accuracy benchmarks on hybrid MoE models.
Maintenance & Community
Developed by the LocalAI team, built upon llama.cpp by Georgi Gerganov and contributors. No specific community channels (Discord, Slack) are detailed in the provided text.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and linking. Quantization works with stock llama.cpp; no custom build is required. APEX GGUF models are directly usable with LocalAI.
Limitations & Caveats
Primarily targets MoE models compatible with llama.cpp. Accuracy benchmark evaluation on certain hybrid MoE models requires a specific llama.cpp patch. "I-variants" may show slightly higher perplexity on standard benchmarks like wikitext due to their focus on broader real-world task performance.
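Perplexity and KL divergence measure different things, which is why the two can disagree. Perplexity scores the probability a model gives to the reference tokens of one corpus. KL divergence compares the quantized model's full next-token distribution with that of the full-precision model. The sketch below shows both computations on toy numbers. The distributions are invented for illustration and are not APEX results.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the reference tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Toy next-token distributions over a 3-token vocabulary (made-up values).
    full_precision = [0.70, 0.20, 0.10]
    quantized = [0.65, 0.23, 0.12]
    print("KL(full || quantized):", kl_divergence(full_precision, quantized))

    # Probabilities a model assigned to the correct tokens of a short text.
    print("perplexity:", perplexity([0.5, 0.25, 0.5, 0.25]))
```

In practice both metrics are averaged over many token positions, and KL divergence needs the full-precision model's outputs on the same inputs.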