exllama by turboderp

Llama implementation for memory-efficient quantized weights

created 2 years ago
2,891 stars

Top 16.8% on sourcepulse

Project Summary

ExLlama is a standalone Python/C++/CUDA implementation of Llama for 4-bit GPTQ quantized weights, targeting modern NVIDIA GPUs. It offers a memory-efficient and fast inference solution for users looking to run large language models locally with reduced VRAM requirements.

How It Works

ExLlama uses a custom C++/CUDA backend for optimized inference, specifically designed to handle 4-bit GPTQ quantized models. This approach bypasses the overhead of the standard Hugging Face Transformers stack, yielding significant improvements in speed and memory usage, particularly for larger models. The CUDA extension is compiled and cached on first run.
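To make the memory savings concrete, here is an illustrative sketch of how 4-bit quantized weights are typically stored and dequantized: eight 4-bit values packed into each 32-bit word, with a shared scale and zero point per group. This mirrors the general GPTQ storage scheme, not ExLlama's actual CUDA kernels.

```python
# Illustrative sketch of 4-bit weight packing/unpacking (GPTQ-style),
# NOT exllama's actual kernel code. Eight 4-bit values fit in one
# 32-bit word, so the weight tensor shrinks 4x versus FP16.

def pack_int4(values):
    """Pack a list of 4-bit integers (0..15) into 32-bit words."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= (v & 0xF) << (4 * j)  # nibble j lives in bits 4j..4j+3
        words.append(word)
    return words

def unpack_int4(words, scale, zero):
    """Unpack and dequantize: w = (q - zero) * scale."""
    out = []
    for word in words:
        for j in range(8):
            q = (word >> (4 * j)) & 0xF
            out.append((q - zero) * scale)
    return out

qs = [3, 7, 0, 15, 8, 1, 12, 5]
packed = pack_int4(qs)                              # one 32-bit word
restored = unpack_int4(packed, scale=0.1, zero=8)   # approximate FP weights
```

In the real format the scale and zero point are stored per group of weights (e.g. every 128 values), which is the metadata overhead that keeps 4-bit models slightly above the theoretical 4x reduction.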

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.9+, PyTorch (tested with 2.0.1 and a 2.1.0 nightly build with cu118), safetensors 0.3.2, sentencepiece, ninja.
  • CUDA 11.7 or 11.8 is recommended.
  • Development is on RTX 4090/3090-Ti; 30-series NVIDIA GPUs are well supported. Older GPUs with poor FP16 support are not recommended. ROCm (HIP) is theoretically supported but untested.
  • Setup involves cloning the repo, installing dependencies, and running benchmarks or examples. CUDA extension compilation may take time on first run.
  • Docs: https://github.com/turboderp/exllama
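After setup, inference follows a load-config-model-tokenizer-generator pattern modeled on the repo's bundled example scripts. The module and class names below (model.ExLlama, tokenizer.ExLlamaTokenizer, generator.ExLlamaGenerator) are assumptions drawn from the repo layout, and the file names inside the model directory are placeholders; verify both against the repo before use. A CUDA GPU and a local 4-bit GPTQ model are required, so the imports are deferred into the function.

```python
# Hedged sketch of a minimal load-and-generate flow, based on the
# pattern in exllama's example scripts. Names and paths are assumptions;
# check the repo's examples for the authoritative API.
import os

def generate(model_dir, prompt, max_new_tokens=128):
    # Deferred imports: these modules come from the cloned exllama repo,
    # not from pip, and importing them requires a working CUDA setup.
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
    from tokenizer import ExLlamaTokenizer
    from generator import ExLlamaGenerator

    config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
    config.model_path = os.path.join(model_dir, "model.safetensors")  # placeholder name
    model = ExLlama(config)  # first run compiles and caches the CUDA extension
    tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
    cache = ExLlamaCache(model)
    generator = ExLlamaGenerator(model, tokenizer, cache)
    return generator.generate_simple(prompt, max_new_tokens=max_new_tokens)
```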

Highlighted Details

  • Significantly lower VRAM usage for 4-bit GPTQ models compared to standard implementations.
  • High inference speeds, with benchmarks showing thousands of tokens/sec on high-end GPUs.
  • Supports sharded models and includes a utility for splitting large safetensors files.
  • A preliminary ExLlamaV2 is available, offering further improvements.

Maintenance & Community

The project is actively developed by turboderp. A web UI is available, and a Python module for integration is maintained by jllllll.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is described as a "work in progress." Support for older GPUs (pre-Pascal) is limited, and ROCm support is untested. The web UI's JavaScript is noted as experimental.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 30 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine
2.1% · 3k stars · created 8 months ago · updated 13 hours ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (author of SGLang).

fastllm by ztxz16

High-performance C++ LLM inference library
0.4% · 4k stars · created 2 years ago · updated 2 weeks ago