exllamav3 by turboderp-org

Optimized quantization and inference library for local LLM execution

created 3 months ago
456 stars

Top 67.3% on sourcepulse

View on GitHub
Project Summary

ExLlamaV3 is an early preview of a library for optimized inference of quantized Large Language Models (LLMs) on consumer GPUs. It aims to provide a more modular and extensible framework than its predecessor, ExLlamaV2, to support a wider range of modern LLM architectures and enable efficient tensor-parallel inference. The project targets researchers and power users seeking to run LLMs locally with reduced VRAM and improved performance.

How It Works

ExLlamaV3 introduces a new quantization format, EXL3, based on QTIP. Conversion is designed to be streamlined and efficient, using on-the-fly Hessian computation and a fused Viterbi kernel so that quantization happens in a single step. For inference, the library uses a Marlin-inspired GEMM kernel that aims for memory-bound latency. The codebase is a from-scratch rewrite intended to address ExLlamaV2's limitations around multi-GPU tensor parallelism and its Llama-centric architecture.
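For intuition, the sketch below illustrates the general idea behind Hessian-aware quantization, the family of methods that QTIP and EXL3 belong to: accumulate a Hessian proxy from calibration activations and measure quantization error weighted by it. This is an illustrative NumPy toy with made-up shapes and a plain round-to-nearest quantizer; it is not ExLlamaV3's actual API or its fused Viterbi/trellis quantizer.

import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, n_samples = 64, 32, 256

W = rng.standard_normal((out_features, in_features)).astype(np.float32)   # a weight matrix
X = rng.standard_normal((n_samples, in_features)).astype(np.float32)      # calibration activations

# On-the-fly Hessian proxy accumulated from calibration data: H = X^T X / n
H = X.T @ X / n_samples

def quantize_rtn(w, bits=2):
    # Toy round-to-nearest quantizer on a uniform grid (stand-in for the real quantizer)
    levels = 2 ** bits
    scale = np.abs(w).max() / (levels / 2)
    return np.clip(np.round(w / scale), -levels // 2, levels // 2 - 1) * scale

Q = quantize_rtn(W)

# Hessian-weighted reconstruction error, the quantity such methods try to minimize:
# tr((W - Q) H (W - Q)^T), i.e. the output error under the calibration distribution.
E = W - Q
weighted_err = float(np.trace(E @ H @ E.T))
plain_err = float(np.sum(E ** 2))
print(f"plain squared error: {plain_err:.3f}   Hessian-weighted error: {weighted_err:.3f}")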

Quick Start & Requirements

  • Installation: pip install -r requirements.txt followed by pip install .
  • Prerequisites: PyTorch with CUDA 12.4 or later, CUDA Toolkit. FlashAttention-2 is currently required.
  • JIT Mode: EXLLAMA_NOCOMPILE=1 pip install . or run scripts directly from the repo.
  • Conversion: python convert.py -i <input_model> -o <output_dir> -w <working_dir> (a scripted version of this command is sketched after this list)
  • Example Chat: python examples/chat.py -m <model_path> -mode <model_type>
  • Documentation: https://github.com/turboderp-org/exllamav3 (links to benchmarks and format write-up are mentioned but not directly provided in README)
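
As a convenience, the documented conversion command can be scripted. The sketch below only wraps the flags listed above (-i, -o, -w) with subprocess; all paths are placeholders, and no additional converter options are assumed.

import subprocess
from pathlib import Path

# Placeholders: point these at a real unquantized model and writable directories.
input_model = Path("models/Llama-3.1-70B-Instruct")
output_dir = Path("models/Llama-3.1-70B-exl3")
working_dir = Path("work/llama-70b")

output_dir.mkdir(parents=True, exist_ok=True)
working_dir.mkdir(parents=True, exist_ok=True)

# Same command as the Conversion bullet above, run from the repository root.
subprocess.run(
    ["python", "convert.py",
     "-i", str(input_model),
     "-o", str(output_dir),
     "-w", str(working_dir)],
    check=True,
)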

Highlighted Details

  • EXL3 quantization aims to be significantly faster than other SOTA techniques, with a 70B model conversion taking hours on a single RTX 4090.
  • Achieves coherent generation at 1.6 bpw for Llama-3.1-70B and enables inference in under 16 GB of VRAM with specific configurations (see the back-of-the-envelope VRAM check after this list).
  • Designed for easier extension to other frameworks like HF Transformers and vLLM due to retained file structure.
  • Supports JIT compilation and running scripts directly from the repository without installation.
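
The VRAM claim above is easy to sanity-check: at a given bits-per-weight (bpw), the quantized weights of a 70B-parameter model occupy roughly params × bpw / 8 bytes. The snippet below ignores KV cache, activations, and per-layer overhead, so real usage sits somewhat higher.

def weight_gib(n_params: float, bpw: float) -> float:
    # Bytes needed for the weights alone, converted to GiB.
    return n_params * bpw / 8 / 1024**3

for bpw in (1.6, 2.0, 4.0):
    print(f"70B @ {bpw} bpw ≈ {weight_gib(70e9, bpw):.1f} GiB of weights")

# 1.6 bpw works out to about 13 GiB of weights, consistent with the
# "under 16 GB VRAM" figure once cache and overhead are added.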

Maintenance & Community

The project is an early preview under active development. The README does not detail specific contributors or community channels (such as Discord or Slack). Integration with TabbyAPI is planned.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

This is an early preview release; the framework is not fully optimized, with potential performance issues on Ampere GPUs and CPU bottlenecks on slower processors. AMD GPU (ROCm) support is missing. Tensor parallelism and multimodal support are yet to be added. FlashAttention-2 is a hard requirement, with plans to switch to FlashInfer. No release builds are available yet.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 10

Star History

124 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1%
3k
High-performance 4-bit diffusion model inference engine
created 8 months ago
updated 14 hours ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai

0.4%
15k
Framework for LLM inference optimization experimentation
created 1 year ago
updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 16 more.

flash-attention by Dao-AILab

0.7%
19k
Fast, memory-efficient attention implementation
created 3 years ago
updated 18 hours ago