exllamav2 by turboderp-org

Inference library for running LLMs locally on consumer GPUs

created 1 year ago
4,255 stars

Top 11.7% on sourcepulse

Project Summary

ExLlamaV2 is a high-performance inference library designed for running large language models (LLMs) locally on consumer-grade GPUs. It targets users who want to deploy LLMs on their own hardware, offering significant speedups and memory efficiency through advanced quantization techniques and optimized kernels.

How It Works

ExLlamaV2 utilizes a novel EXL2 quantization format, supporting bitrates from 2 to 8 bits per weight, with the ability to mix quantization levels within a model. This allows for fine-grained control over the trade-off between model size, VRAM usage, and accuracy. It also incorporates features like paged attention via Flash Attention 2.5.7+, dynamic batching, prompt caching, and K/V cache deduplication for further performance gains.
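The paging and deduplication idea can be sketched in plain Python. This is a simplified model, not ExLlamaV2's actual implementation: the `PagedCache` class, page size, and prefix-keying scheme are illustrative. It shows why batched sequences that share a prompt prefix can also share K/V cache storage, page by page.

```python
PAGE_SIZE = 4  # tokens per cache page (real systems use larger pages, e.g. 256)

def page_keys(tokens: list[int]) -> list[tuple[int, ...]]:
    """Key each full page by the entire prefix up to and including it,
    so a page is shared only when everything before it also matches."""
    keys = []
    for end in range(PAGE_SIZE, len(tokens) + 1, PAGE_SIZE):
        keys.append(tuple(tokens[:end]))
    return keys

class PagedCache:
    """Toy page table: identical prefix pages are stored only once."""
    def __init__(self):
        self.pages: dict = {}  # prefix key -> page id

    def allocate(self, tokens: list[int]) -> list[int]:
        ids = []
        for key in page_keys(tokens):
            if key not in self.pages:        # only unseen prefixes get new storage
                self.pages[key] = len(self.pages)
            ids.append(self.pages[key])
        return ids

cache = PagedCache()
a = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8])     # two fresh pages
b = cache.allocate([1, 2, 3, 4, 9, 10, 11, 12])  # reuses the first page
print(a, b, len(cache.pages))  # -> [0, 1] [0, 2] 3
```

Only three pages are stored for the two eight-token sequences, because the shared first page is deduplicated; diverging suffixes still get their own pages.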

Quick Start & Requirements

  • Install: pip install exllamav2 (JIT version) or install from source/wheels.
  • Prerequisites: CUDA Toolkit, GCC/Visual Studio, compatible PyTorch version.
  • Resources: Requires modern consumer GPUs. Performance varies by GPU and model size.
  • Docs: Wiki

Highlighted Details

  • Supports EXL2 quantization (2-8 bits per weight), with better accuracy than GPTQ at comparable model sizes.
  • Achieves high throughput (e.g., 200+ t/s on 4090 for 7B models).
  • Enables running large models (e.g., 70B) on 24GB VRAM with 2.55 bits/weight.
  • Offers OpenAI-compatible API via TabbyAPI integration.
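The 24GB figure above is consistent with simple arithmetic: weight storage is roughly parameter count × bits per weight ÷ 8. A quick back-of-the-envelope check (decimal gigabytes, weights only; the K/V cache and activations need additional headroom):

```python
def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate VRAM for the quantized weights alone (decimal GB)."""
    return n_params * bits_per_weight / 8 / 1e9

# 70B model at 2.55 bits/weight fits the weights in ~22.3 GB,
# leaving a little room on a 24 GB card for cache and activations.
print(round(weight_vram_gb(70e9, 2.55), 1))  # -> 22.3

# A 7B model at 4 bits/weight needs only ~3.5 GB for weights.
print(round(weight_vram_gb(7e9, 4.0), 1))    # -> 3.5
```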

Maintenance & Community

  • Active development, with recent additions such as the dynamic generator.
  • Community support via Discord: https://discord.gg/NSFwVuCjRq
  • Several EXL2-quantized models available on Hugging Face.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

  • Prebuilt wheels require matching PyTorch and CUDA versions.
  • Some advanced features might require compiling C++ extensions.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 3
  • Star History: 120 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

Explore Similar Projects

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine

Top 2.1% · 3k stars · created 8 months ago · updated 14 hours ago