exllamav2 by turboderp-org

Inference library for running LLMs locally on consumer GPUs

Created 2 years ago
4,315 stars

Top 11.4% on SourcePulse

Project Summary

ExLlamaV2 is a high-performance inference library designed for running large language models (LLMs) locally on consumer-grade GPUs. It targets users who want to deploy LLMs on their own hardware, offering significant speedups and memory efficiency through advanced quantization techniques and optimized kernels.

How It Works

ExLlamaV2 uses its own EXL2 quantization format, which supports 2 to 8 bits per weight and can mix quantization levels within a single model, giving fine-grained control over the trade-off between model size, VRAM usage, and accuracy. It also incorporates paged attention (via Flash Attention 2.5.7+), dynamic batching, prompt caching, and K/V cache deduplication for further performance gains.
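To see why mixing quantization levels matters, here is a back-of-the-envelope sketch (not part of the exllamav2 API) that estimates the weight-storage footprint of a model whose layer groups are quantized at different bitrates. The layer sizes and bitrates below are hypothetical, purely for illustration:

```python
# Hypothetical layer groups: (parameter count, bits per weight).
# EXL2 lets different parts of a model use different bitrates;
# the effective size is the parameter-weighted average.
layers = [
    (1_000_000_000, 4.0),  # e.g. weights kept at higher precision
    (2_000_000_000, 2.5),  # e.g. weights quantized more aggressively
]

total_params = sum(n for n, _ in layers)
total_bits = sum(n * b for n, b in layers)

avg_bpw = total_bits / total_params  # effective average bits per weight
size_gb = total_bits / 8 / 1e9       # weight storage in GB (decimal)

print(f"average bits/weight: {avg_bpw:.2f}")  # -> 3.00
print(f"approx. weight size: {size_gb:.3f} GB")  # -> 1.125
```

The average bitrates advertised for EXL2 models (e.g. 2.55 bits/weight) are exactly this kind of parameter-weighted mean across differently-quantized layers.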

Quick Start & Requirements

  • Install: pip install exllamav2 (JIT build) or install from source or prebuilt wheels.
  • Prerequisites: CUDA Toolkit, GCC/Visual Studio, compatible PyTorch version.
  • Resources: Requires modern consumer GPUs. Performance varies by GPU and model size.
  • Docs: Wiki

Highlighted Details

  • Supports EXL2 quantization (2-8 bits per weight), reported to outperform GPTQ in accuracy at comparable sizes.
  • Achieves high throughput (e.g., 200+ tokens/s on an RTX 4090 for 7B models).
  • Enables running large models (e.g., 70B) on 24 GB of VRAM at 2.55 bits/weight.
  • Offers an OpenAI-compatible API via TabbyAPI integration.
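The "70B on 24 GB" figure can be sanity-checked with simple arithmetic: at 2.55 bits per weight, the weights alone fit with room to spare. Note this counts only weight storage; the K/V cache and activations also consume VRAM, so this is a lower bound rather than a guarantee:

```python
# Sanity check: storage for 70B parameters at 2.55 bits/weight.
params = 70e9
bits_per_weight = 2.55

weight_bytes = params * bits_per_weight / 8
weight_gib = weight_bytes / 2**30  # binary GiB, as GPU VRAM is reported

print(f"weights: {weight_gib:.1f} GiB of 24 GiB")  # -> weights: 20.8 GiB of 24 GiB
```

That leaves roughly 3 GiB of headroom on a 24 GiB card for the K/V cache and runtime overhead, which is why context length still has to be budgeted carefully at this quantization level.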

Maintenance & Community

  • Active development, with recent features such as the dynamic generator.
  • Community support via Discord: https://discord.gg/NSFwVuCjRq
  • Several EXL2-quantized models available on Hugging Face.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

  • Prebuilt wheels require matching PyTorch and CUDA versions.
  • Some advanced features might require compiling C++ extensions.
Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 45 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Top 0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

Top 0.2% · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago · Updated 3 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

Top 1.1% · 58k stars
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago · Updated 12 hours ago