Inference library for running LLMs locally on consumer GPUs
ExLlamaV2 is a high-performance inference library designed for running large language models (LLMs) locally on consumer-grade GPUs. It targets users who want to deploy LLMs on their own hardware, offering significant speedups and memory efficiency through advanced quantization techniques and optimized kernels.
How It Works
ExLlamaV2 uses its own EXL2 quantization format, which supports bit rates from 2 to 8 bits per weight and can mix quantization levels within a single model, giving fine-grained control over the trade-off between model size, VRAM usage, and accuracy. On the inference side, it adds paged attention (via Flash Attention 2.5.7 or later), dynamic batching, prompt caching, and K/V cache deduplication for further performance gains.
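To make the size trade-off concrete, here is a rough back-of-envelope sketch of how the bits-per-weight setting maps to weight memory. It ignores the K/V cache, activations, and per-layer overhead, and the 70B parameter count is only illustrative:

```python
def weight_vram_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate VRAM needed for the quantized weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3

# A hypothetical 70B-parameter model at a few EXL2 bit rates:
for bpw in (2.5, 4.0, 6.0, 8.0):
    print(f"{bpw:>4.1f} bpw -> ~{weight_vram_gib(70e9, bpw):.1f} GiB of weights")
# 2.5 bpw -> ~20.4 GiB, 4.0 bpw -> ~32.6 GiB, 6.0 bpw -> ~48.9 GiB, 8.0 bpw -> ~65.2 GiB
```

Because layers can be quantized at different levels, the effective bits per weight of an EXL2 model can land anywhere in this range rather than only at whole-number settings.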
Quick Start & Requirements
Install the JIT version with pip install exllamav2, or install from source or from prebuilt wheels.
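A minimal quick-start sketch is shown below; the class names follow the library's documented loading and dynamic-generator API, while the model path, sequence length, and prompt are placeholders, and a locally available EXL2-quantized model directory is assumed:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Point the config at a local EXL2-quantized model directory (placeholder path)
config = ExLlamaV2Config("/path/to/model-exl2")
model = ExLlamaV2(config)

# Allocate the K/V cache lazily, then split the model across available GPUs
cache = ExLlamaV2Cache(model, max_seq_len=8192, lazy=True)
model.load_autosplit(cache, progress=True)
tokenizer = ExLlamaV2Tokenizer(config)

# The dynamic generator handles batching, prompt caching, and paged attention internally
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
output = generator.generate(prompt="Once upon a time,", max_new_tokens=200, add_bos=True)
print(output)
```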
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats