Optimized quantization and inference library for local LLM execution
ExLlamaV3 is an early preview of a library for optimized inference of quantized Large Language Models (LLMs) on consumer GPUs. It aims to provide a more modular and extensible framework than its predecessor, ExLlamaV2, to support a wider range of modern LLM architectures and enable efficient tensor-parallel inference. The project targets researchers and power users seeking to run LLMs locally with reduced VRAM and improved performance.
How It Works
ExLlamaV3 introduces a new quantization format, EXL3, based on QTIP (trellis-coded quantization with incoherence processing). Conversion is designed to be streamlined and efficient: Hessians are computed on the fly, and a fused Viterbi kernel performs quantization in a single step. For inference, the library uses a Marlin-inspired GEMM kernel that aims to keep latency close to the memory-bandwidth bound. The from-scratch rewrite addresses ExLlamaV2's limitations around multi-GPU tensor parallelism and its Llama-centric architecture.
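To give a feel for the Viterbi step: trellis quantization constrains which codebook entries consecutive weights may take, and a Viterbi dynamic program finds the lowest-error sequence of entries that respects those constraints. The toy Python sketch below is illustrative only — the codebook, trellis, and cost function are made up for the example and are not EXL3's actual format or fused kernel:

```python
def viterbi_quantize(x, codebook, transitions):
    """Quantize sequence x to codebook entries, where transitions[s]
    lists the states reachable from state s (the trellis constraint).
    Returns the index path with minimal total squared error."""
    n, k = len(x), len(codebook)
    # cost[s] = best cost of any path ending in state s at the current step
    cost = [(x[0] - codebook[s]) ** 2 for s in range(k)]
    back = []  # back[t][s] = best predecessor of state s at step t
    for t in range(1, n):
        new_cost = [float("inf")] * k
        prev = [0] * k
        for s in range(k):
            for nxt in transitions[s]:
                c = cost[s] + (x[t] - codebook[nxt]) ** 2
                if c < new_cost[nxt]:
                    new_cost[nxt] = c
                    prev[nxt] = s
        cost = new_cost
        back.append(prev)
    # Backtrack from the cheapest final state
    s = min(range(k), key=cost.__getitem__)
    path = [s]
    for prev in reversed(back):
        s = prev[s]
        path.append(s)
    return path[::-1]

# Toy trellis: 3-level codebook, consecutive indices may differ by at most 1.
codebook = [-1.0, 0.0, 1.0]
trellis = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
print(viterbi_quantize([0.9, -0.9, 0.9], codebook, trellis))  # → [2, 1, 2]
```

Note that greedy per-element rounding would pick indices [2, 0, 2], which the trellis forbids (no 2 → 0 transition); the Viterbi search instead finds the cheapest legal path [2, 1, 2]. EXL3 applies this idea at scale, fused into a single quantization kernel.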
Quick Start & Requirements
Install the dependencies, then the package itself:
pip install -r requirements.txt
pip install .
To install without precompiling the extension, use:
EXLLAMA_NOCOMPILE=1 pip install .
Alternatively, run scripts directly from the repo. To convert a model to EXL3:
python convert.py -i <input_model> -o <output_dir> -w <working_dir>
To try the chat example:
python examples/chat.py -m <model_path> -mode <model_type>
Highlighted Details
Maintenance & Community
The project is an early preview with active development. Specific contributors or community channels (like Discord/Slack) are not detailed in the README. Integration with TabbyAPI is planned.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
This is an early preview release; the framework is not fully optimized, with potential performance issues on Ampere GPUs and CPU bottlenecks on slower processors. AMD GPU (ROCm) support is missing. Tensor parallelism and multimodal support are yet to be added. FlashAttention-2 is a hard requirement, with plans to switch to FlashInfer. No release builds are available yet.