exllamav3 by turboderp-org

Optimized quantization and inference library for local LLM execution

Created 5 months ago
495 stars

Top 62.5% on SourcePulse

Project Summary

ExLlamaV3 is an early preview of a library for optimized inference of quantized Large Language Models (LLMs) on consumer GPUs. It aims to provide a more modular and extensible framework than its predecessor, ExLlamaV2, to support a wider range of modern LLM architectures and enable efficient tensor-parallel inference. The project targets researchers and power users seeking to run LLMs locally with reduced VRAM and improved performance.

How It Works

ExLlamaV3 introduces a new quantization format, EXL3, based on QTIP. The format is designed for streamlined, efficient conversion, using on-the-fly Hessian computation and a fused Viterbi kernel so that quantization completes in a single step. For inference, the library uses a Marlin-inspired GEMM kernel that aims for memory-bandwidth-bound latency. The from-scratch rewrite addresses ExLlamaV2's limitations around multi-GPU tensor parallelism and its Llama-centric architecture.
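For intuition, below is a minimal sketch of the Hessian-weighted proxy loss that QTIP/GPTQ-style layer quantizers minimize during calibration, assuming the standard formulation with H accumulated from calibration activations; it illustrates the general idea only and is not exllamav3's actual fused Viterbi or GEMM kernels.

```python
# Minimal sketch of the Hessian-weighted proxy loss used by QTIP/GPTQ-style
# layer quantizers; standard formulation, not exllamav3's actual fused kernels.
import torch

def accumulate_hessian(activation_batches, hidden_dim):
    """Accumulate H = sum_i X_i^T X_i on the fly from calibration activations."""
    H = torch.zeros(hidden_dim, hidden_dim)
    for x in activation_batches:       # x: (tokens, hidden_dim)
        H += x.T @ x
    return H

def proxy_error(W, Q, H):
    """Layer-wise error ||X W - X Q||_F^2 = tr((W - Q)^T H (W - Q))."""
    D = W - Q
    return torch.trace(D.T @ H @ D)

# Toy usage with random tensors standing in for calibration data and a quantizer.
hidden, out = 64, 32
batches = [torch.randn(128, hidden) for _ in range(4)]
W = torch.randn(hidden, out)
Q = W.round()                          # stand-in for a real trellis/EXL3 quantizer
H = accumulate_hessian(batches, hidden)
print(proxy_error(W, Q, H).item())
```

Accumulating H incrementally over calibration batches, as above, is the spirit of "on-the-fly Hessian computation": the weighting statistics are gathered during conversion rather than precomputed and stored.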

Quick Start & Requirements

  • Installation: pip install -r requirements.txt followed by pip install .
  • Prerequisites: a CUDA 12.4 (or later) build of PyTorch and the CUDA Toolkit; FlashAttention-2 is currently required.
  • JIT Mode: EXLLAMA_NOCOMPILE=1 pip install . or run scripts directly from the repo.
  • Conversion: python convert.py -i <input_model> -o <output_dir> -w <working_dir>
  • Example Chat: python examples/chat.py -m <model_path> -mode <model_type> (a combined conversion-and-chat sketch follows this list)
  • Documentation: https://github.com/turboderp-org/exllamav3 (links to benchmarks and format write-up are mentioned but not directly provided in README)
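Putting the documented commands together, the following hypothetical wrapper chains conversion and chat; the model directories and the prompt-mode value are placeholders rather than values taken from the README.

```python
# Hypothetical wrapper chaining the documented convert.py and examples/chat.py
# entry points; the paths and the mode value below are placeholders.
import subprocess

src = "/models/Llama-3.1-8B-Instruct"        # original HF-format model (placeholder path)
dst = "/models/Llama-3.1-8B-Instruct-exl3"   # output directory for the EXL3 weights
work = "/tmp/exl3_work"                      # scratch space used during conversion
mode = "llama"                               # placeholder; use the prompt mode matching your model

# Run from the repository root: quantize the model, then start the example chat client.
subprocess.run(["python", "convert.py", "-i", src, "-o", dst, "-w", work], check=True)
subprocess.run(["python", "examples/chat.py", "-m", dst, "-mode", mode], check=True)
```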

Highlighted Details

  • EXL3 conversion aims to be significantly faster than other SOTA quantization techniques, with a 70B model converting in a matter of hours on a single RTX 4090.
  • Achieves coherent generation from Llama-3.1-70B at 1.6 bpw and enables inference in under 16 GB of VRAM with specific configurations (a rough footprint estimate follows this list).
  • Designed for easier extension to other frameworks such as HF Transformers and vLLM, because quantized models retain the original file structure.
  • Supports JIT compilation and running scripts directly from the repository without installation.
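As a sanity check on the VRAM figure above, the snippet below is a back-of-the-envelope estimate of the quantized weight footprint; the parameter count is approximate, and KV cache, any higher-precision layers, and runtime overhead are ignored, so real usage will be somewhat higher.

```python
# Rough weight footprint behind the "under 16 GB" figure; overhead ignored.
params = 70.6e9   # Llama-3.1-70B parameter count (approximate)
bpw = 1.6         # bits per weight in the quoted EXL3 configuration
weight_gib = params * bpw / 8 / 1024**3
print(f"~{weight_gib:.1f} GiB of quantized weights")  # roughly 13 GiB, under the 16 GB budget
```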

Maintenance & Community

The project is an early preview under active development. The README does not list specific contributors or community channels (such as Discord or Slack). Integration with TabbyAPI is planned.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

This is an early preview release; the framework is not fully optimized, with potential performance issues on Ampere GPUs and CPU bottlenecks on slower processors. AMD GPU (ROCm) support is missing. Tensor parallelism and multimodal support are yet to be added. FlashAttention-2 is a hard requirement, with plans to switch to FlashInfer. No release builds are available yet.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 10
  • Star History: 24 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer
307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888
687 stars
HF Transformers accelerator for faster inference
Created 1 year ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab
3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago