exllamav2 by turboderp-org

Inference library for running LLMs locally on consumer GPUs

Created 2 years ago

4,409 stars

Top 11.0% on SourcePulse

12 Experts Love This Project

tobi

Cofounder of Shopify

shimmyshimmer

Cofounder of Unsloth

JustinLin610

Core Maintainer at Alibaba Qwen

chiphuyen

Author of "AI Engineering", "Designing Machine Learning Systems"

and 8 more!

Project Summary

ExLlamaV2 is a high-performance inference library designed for running large language models (LLMs) locally on consumer-grade GPUs. It targets users who want to deploy LLMs on their own hardware, offering significant speedups and memory efficiency through advanced quantization techniques and optimized kernels.

How It Works

ExLlamaV2 utilizes a novel EXL2 quantization format, supporting bitrates from 2 to 8 bits per weight, with the ability to mix quantization levels within a model. This allows for fine-grained control over the trade-off between model size, VRAM usage, and accuracy. It also incorporates features like paged attention via Flash Attention 2.5.7+, dynamic batching, prompt caching, and K/V cache deduplication for further performance gains.

Quick Start & Requirements

Install: pip install exllamav2 (JIT version) or install from source/wheels.
Prerequisites: CUDA Toolkit, GCC/Visual Studio, compatible PyTorch version.
Resources: Requires modern consumer GPUs. Performance varies by GPU and model size.
Docs: Wiki

Highlighted Details

Supports EXL2 quantization (2-8 bits), outperforming GPTQ.
Achieves high throughput (e.g., 200+ t/s on 4090 for 7B models).
Enables running large models (e.g., 70B) on 24GB VRAM with 2.55 bits/weight.
Offers OpenAI-compatible API via TabbyAPI integration.

Maintenance & Community

Active development with new features like dynamic generator.
Community support via Discord: https://discord.gg/NSFwVuCjRq
Several EXL2-quantized models available on Hugging Face.

Licensing & Compatibility

License: MIT.
Compatible with commercial use and closed-source applications.

Limitations & Caveats

Prebuilt wheels require matching PyTorch and CUDA versions.
Some advanced features might require compiling C++ extensions.

Health Check

Last Commit

1 month ago

Responsiveness

1 day

Pull Requests (30d)

1

Issues (30d)

0

Star History

31 stars in the last 30 days

Explore Similar Projects

Starred by

Yineng Zhang

Yineng Zhang(Inference Lead at SGLang; Research Scientist at Together AI).

dash-infer by modelscope

LLM inference engine for optimized performance across diverse hardware

Created 1 year ago

Updated 5 months ago

Starred by

Ying Sheng

Ying Sheng(Coauthor of SGLang).

ScaleLLM by vectorch-ai

LLM inference system for production environments

Created 2 years ago

Updated 3 weeks ago

Starred by

Jeff Hammerbacher

Jeff Hammerbacher(Cofounder of Cloudera) and

Benjamin Bolte

Benjamin Bolte(Cofounder of K-Scale Labs).

llama2.rs by srush

Rust library for fast Llama2 inference on CPU

Created 2 years ago

Updated 2 years ago

candle-vllm by EricLBuehler

Platform for local LLM inference and serving with OpenAI API compatibility

Created 2 years ago

Updated 3 days ago

LLM-Viewer by hahnyuan

LLM analysis tool for inference performance on diverse hardware

Created 2 years ago

Updated 1 year ago

InferLLM by MegEngine

Lightweight LLM inference framework

Created 2 years ago

Updated 1 year ago

Starred by

Yineng Zhang

Yineng Zhang(Inference Lead at SGLang; Research Scientist at Together AI) and

Yaowei Zheng

Yaowei Zheng(Author of LLaMA-Factory).

ZhiLight by zhihu

LLM inference engine for Llama and variants, optimized for PCIe GPUs

Created 1 year ago

Updated 6 months ago

Starred by

Yaowei Zheng

Yaowei Zheng(Author of LLaMA-Factory) and

Ying Sheng

Ying Sheng(Coauthor of SGLang).

GPTQModel by ModelCloud

LLM compression toolkit for accelerated CPU/GPU inference

Created 1 year ago

Updated 1 day ago

Starred by

Luis Capelo

Luis Capelo(Cofounder of Lightning AI),

Michael Han

Michael Han(Cofounder of Unsloth), and

4 more.

aphrodite-engine by aphrodite-engine

LLM inference engine for serving HuggingFace models at scale

Created 2 years ago

Updated 4 days ago

Starred by

Andrej Karpathy

Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n),

Jeff Hammerbacher

Jeff Hammerbacher(Cofounder of Cloudera), and

4 more.

gemma_pytorch by google

PyTorch implementation for Google's Gemma models

Created 1 year ago

Updated 7 months ago

Starred by

Wing Lian

Wing Lian(Founder of Axolotl AI) and

Chip Huyen

Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Inference optimization for LLMs on low-resource hardware

Created 2 years ago

Updated 4 months ago

Starred by

Andrej Karpathy

Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n),

Clement Delangue

Clement Delangue(Cofounder of Hugging Face), and

60 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

Created 2 years ago

Updated 9 hours ago

Feedback? Help us improve.