Memory-efficient Llama implementation for 4-bit GPTQ quantized weights
ExLlama is a standalone Python/C++/CUDA implementation of Llama for 4-bit GPTQ quantized weights, targeting modern NVIDIA GPUs. It provides fast, memory-efficient inference for running large language models locally with reduced VRAM requirements.
How It Works
ExLlama uses a custom C++/CUDA backend designed specifically for 4-bit GPTQ quantized models. This avoids the overhead of the standard Hugging Face Transformers path, yielding significant improvements in speed and memory usage, particularly for larger models. The CUDA extension is compiled and cached on first run, so no separate build step is required.
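The compile-on-first-run behavior is the pattern provided by PyTorch's JIT C++/CUDA extension loader. The sketch below illustrates that general mechanism only; the module and file names are hypothetical and not taken from the repository.

# Illustrative sketch of JIT-compiling and caching a CUDA extension with PyTorch.
# The extension name and source file names below are hypothetical.
from torch.utils.cpp_extension import load

ext = load(
    name="exllama_ext",                       # build is cached (e.g. under ~/.cache/torch_extensions)
    sources=["exllama_ext/exllama_ext.cpp",   # hypothetical C++ binding file
             "exllama_ext/q4_matmul.cu"],     # hypothetical 4-bit matmul kernel
    verbose=True,                             # show compiler output during the first (slow) build
)
# Later runs reuse the cached build, so only the first launch pays the compile cost.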
Quick Start & Requirements
git clone https://github.com/turboderp/exllama
cd exllama
pip install -r requirements.txt
Because the CUDA extension is compiled on first run, a CUDA-enabled PyTorch installation and a working CUDA toolkit are required.
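After installation, inference follows the pattern of the repository's example scripts. The sketch below is modeled on that pattern; the model paths are placeholders, and the class and module names are assumptions drawn from the repo's Python sources.

# Minimal generation sketch, assuming ExLlama's Python classes as used in its example scripts.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/llama-13b-4bit-gptq"                # placeholder path to a GPTQ model
config = ExLlamaConfig(model_dir + "/config.json")       # model configuration
config.model_path = model_dir + "/model.safetensors"     # 4-bit GPTQ weights

model = ExLlama(config)                                  # triggers the CUDA extension build on first run
tokenizer = ExLlamaTokenizer(model_dir + "/tokenizer.model")
cache = ExLlamaCache(model)                              # key/value cache for incremental decoding
generator = ExLlamaGenerator(model, tokenizer, cache)

generator.settings.temperature = 0.8
print(generator.generate_simple("Llamas are", max_new_tokens=64))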
Highlighted Details
Maintenance & Community
The project is actively developed by turboderp. A web UI is available, and a Python module for integration is maintained by jllllll.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is described as a "work in progress." Support for older GPUs (pre-Pascal) is limited, and ROCm support is untested. The web UI's JavaScript is noted as experimental.