exllamav3 by turboderp-org

Optimized quantization and inference library for local LLM execution

Created 5 months ago
495 stars

Top 62.5% on SourcePulse

Project Summary

ExLlamaV3 is an early preview of a library for optimized inference of quantized Large Language Models (LLMs) on consumer GPUs. It aims to provide a more modular and extensible framework than its predecessor, ExLlamaV2, to support a wider range of modern LLM architectures and enable efficient tensor-parallel inference. The project targets researchers and power users seeking to run LLMs locally with reduced VRAM and improved performance.

How It Works

ExLlamaV3 introduces a new quantization format, EXL3, based on QTIP. The format is designed for streamlined, efficient conversion, using on-the-fly Hessian computation and a fused Viterbi kernel so that quantization completes in a single step. For inference, the library uses a Marlin-inspired GEMM kernel that aims for memory-bandwidth-bound latency. The from-scratch rewrite addresses ExLlamaV2's limitations around multi-GPU tensor parallelism and its Llama-centric architecture.
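For intuition, below is a minimal sketch of the Hessian-weighted proxy loss that QTIP/GPTQ-style layer quantizers minimize during calibration, assuming the standard formulation with H accumulated from calibration activations; it illustrates the general idea only and is not exllamav3's actual fused Viterbi or GEMM kernels.

```python
# Minimal sketch of the Hessian-weighted proxy loss used by QTIP/GPTQ-style
# layer quantizers; standard formulation, not exllamav3's actual fused kernels.
import torch

def accumulate_hessian(activation_batches, hidden_dim):
    """Accumulate H = sum_i X_i^T X_i on the fly from calibration activations."""
    H = torch.zeros(hidden_dim, hidden_dim)
    for x in activation_batches:       # x: (tokens, hidden_dim)
        H += x.T @ x
    return H

def proxy_error(W, Q, H):
    """Layer-wise error ||X W - X Q||_F^2 = tr((W - Q)^T H (W - Q))."""
    D = W - Q
    return torch.trace(D.T @ H @ D)

# Toy usage with random tensors standing in for calibration data and a quantizer.
hidden, out = 64, 32
batches = [torch.randn(128, hidden) for _ in range(4)]
W = torch.randn(hidden, out)
Q = W.round()                          # stand-in for a real trellis/EXL3 quantizer
H = accumulate_hessian(batches, hidden)
print(proxy_error(W, Q, H).item())
```

Accumulating H incrementally over calibration batches, as above, is the spirit of "on-the-fly Hessian computation": the weighting statistics are gathered during conversion rather than precomputed and stored.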

Quick Start & Requirements

  • Installation: pip install -r requirements.txt followed by pip install .
  • Prerequisites: a CUDA 12.4 (or later) build of PyTorch and the CUDA Toolkit; FlashAttention-2 is currently required.
  • JIT Mode: EXLLAMA_NOCOMPILE=1 pip install . or run scripts directly from the repo.
  • Conversion: python convert.py -i <input_model> -o <output_dir> -w <working_dir>
  • Example Chat: python examples/chat.py -m <model_path> -mode <model_type> (a combined conversion-and-chat sketch follows this list)
  • Documentation: https://github.com/turboderp-org/exllamav3 (links to benchmarks and format write-up are mentioned but not directly provided in README)
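Putting the documented commands together, the following hypothetical wrapper chains conversion and chat; the model directories and the prompt-mode value are placeholders rather than values taken from the README.

```python
# Hypothetical wrapper chaining the documented convert.py and examples/chat.py
# entry points; the paths and the mode value below are placeholders.
import subprocess

src = "/models/Llama-3.1-8B-Instruct"        # original HF-format model (placeholder path)
dst = "/models/Llama-3.1-8B-Instruct-exl3"   # output directory for the EXL3 weights
work = "/tmp/exl3_work"                      # scratch space used during conversion
mode = "llama"                               # placeholder; use the prompt mode matching your model

# Run from the repository root: quantize the model, then start the example chat client.
subprocess.run(["python", "convert.py", "-i", src, "-o", dst, "-w", work], check=True)
subprocess.run(["python", "examples/chat.py", "-m", dst, "-mode", mode], check=True)
```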

Highlighted Details

  • EXL3 conversion aims to be significantly faster than other SOTA quantization techniques, with a 70B model converting in a matter of hours on a single RTX 4090.
  • Achieves coherent generation from Llama-3.1-70B at 1.6 bpw and enables inference in under 16 GB of VRAM with specific configurations (a rough footprint estimate follows this list).
  • Designed for easier extension to other frameworks such as HF Transformers and vLLM, because quantized models retain the original file structure.
  • Supports JIT compilation and running scripts directly from the repository without installation.
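As a sanity check on the VRAM figure above, the snippet below is a back-of-the-envelope estimate of the quantized weight footprint; the parameter count is approximate, and KV cache, any higher-precision layers, and runtime overhead are ignored, so real usage will be somewhat higher.

```python
# Rough weight footprint behind the "under 16 GB" figure; overhead ignored.
params = 70.6e9   # Llama-3.1-70B parameter count (approximate)
bpw = 1.6         # bits per weight in the quoted EXL3 configuration
weight_gib = params * bpw / 8 / 1024**3
print(f"~{weight_gib:.1f} GiB of quantized weights")  # roughly 13 GiB, under the 16 GB budget
```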

Maintenance & Community

The project is an early preview under active development. The README does not list specific contributors or community channels (such as Discord or Slack). Integration with TabbyAPI is planned.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

This is an early preview release; the framework is not fully optimized, with potential performance issues on Ampere GPUs and CPU bottlenecks on slower processors. AMD GPU (ROCm) support is missing. Tensor parallelism and multimodal support are yet to be added. FlashAttention-2 is a hard requirement, with plans to switch to FlashInfer. No release builds are available yet.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 10
  • Star History: 24 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer
307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888
687 stars
HF Transformers accelerator for faster inference
Created 1 year ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab
3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago