exllama by turboderp

Llama implementation for memory-efficient quantized weights

created 2 years ago
2,891 stars

Top 16.8% on sourcepulse

Project Summary

ExLlama is a standalone Python/C++/CUDA implementation of Llama for 4-bit GPTQ quantized weights, targeting modern NVIDIA GPUs. It offers a memory-efficient and fast inference solution for users looking to run large language models locally with reduced VRAM requirements.

How It Works

ExLlama uses a custom C++/CUDA backend for optimized inference, specifically designed to handle 4-bit GPTQ quantized models. This approach bypasses the overhead of the standard Hugging Face Transformers stack, yielding significant improvements in speed and memory usage, particularly for larger models. The CUDA extension is compiled and cached on first run.
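To make the memory savings concrete, here is an illustrative sketch of how 4-bit quantized weights are typically stored and dequantized: eight 4-bit values packed into each 32-bit word, with a shared scale and zero point per group. This mirrors the general GPTQ storage scheme, not ExLlama's actual CUDA kernels.

```python
# Illustrative sketch of 4-bit weight packing/unpacking (GPTQ-style),
# NOT exllama's actual kernel code. Eight 4-bit values fit in one
# 32-bit word, so the weight tensor shrinks 4x versus FP16.

def pack_int4(values):
    """Pack a list of 4-bit integers (0..15) into 32-bit words."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= (v & 0xF) << (4 * j)  # nibble j lives in bits 4j..4j+3
        words.append(word)
    return words

def unpack_int4(words, scale, zero):
    """Unpack and dequantize: w = (q - zero) * scale."""
    out = []
    for word in words:
        for j in range(8):
            q = (word >> (4 * j)) & 0xF
            out.append((q - zero) * scale)
    return out

qs = [3, 7, 0, 15, 8, 1, 12, 5]
packed = pack_int4(qs)                              # one 32-bit word
restored = unpack_int4(packed, scale=0.1, zero=8)   # approximate FP weights
```

In the real format the scale and zero point are stored per group of weights (e.g. every 128 values), which is the metadata overhead that keeps 4-bit models slightly above the theoretical 4x reduction.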

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.9+, PyTorch (tested with 2.0.1 and a 2.1.0 nightly build with cu118), safetensors 0.3.2, sentencepiece, ninja.
  • CUDA 11.7 or 11.8 is recommended.
  • Development is on RTX 4090/3090-Ti; 30-series NVIDIA GPUs are well supported. Older GPUs with poor FP16 support are not recommended. ROCm (HIP) is theoretically supported but untested.
  • Setup involves cloning the repo, installing dependencies, and running benchmarks or examples. CUDA extension compilation may take time on first run.
  • Docs: https://github.com/turboderp/exllama
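After setup, inference follows a load-config-model-tokenizer-generator pattern modeled on the repo's bundled example scripts. The module and class names below (model.ExLlama, tokenizer.ExLlamaTokenizer, generator.ExLlamaGenerator) are assumptions drawn from the repo layout, and the file names inside the model directory are placeholders; verify both against the repo before use. A CUDA GPU and a local 4-bit GPTQ model are required, so the imports are deferred into the function.

```python
# Hedged sketch of a minimal load-and-generate flow, based on the
# pattern in exllama's example scripts. Names and paths are assumptions;
# check the repo's examples for the authoritative API.
import os

def generate(model_dir, prompt, max_new_tokens=128):
    # Deferred imports: these modules come from the cloned exllama repo,
    # not from pip, and importing them requires a working CUDA setup.
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
    from tokenizer import ExLlamaTokenizer
    from generator import ExLlamaGenerator

    config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
    config.model_path = os.path.join(model_dir, "model.safetensors")  # placeholder name
    model = ExLlama(config)  # first run compiles and caches the CUDA extension
    tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
    cache = ExLlamaCache(model)
    generator = ExLlamaGenerator(model, tokenizer, cache)
    return generator.generate_simple(prompt, max_new_tokens=max_new_tokens)
```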

Highlighted Details

  • Significantly lower VRAM usage for 4-bit GPTQ models compared to standard implementations.
  • High inference speeds, with benchmarks showing thousands of tokens/sec on high-end GPUs.
  • Supports sharded models and includes a utility for splitting large safetensors files.
  • A preliminary ExLlamaV2 is available, offering further improvements.

Maintenance & Community

The project is actively developed by turboderp. A web UI is available, and a Python module for integration is maintained by jllllll.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is described as a "work in progress." Support for older GPUs (pre-Pascal) is limited, and ROCm support is untested. The web UI's JavaScript is noted as experimental.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 30 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine
2.1% · 3k stars · created 8 months ago · updated 13 hours ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (author of SGLang).

fastllm by ztxz16

High-performance C++ LLM inference library
0.4% · 4k stars · created 2 years ago · updated 2 weeks ago