Web UI for local Llama 2 inference
This project provides a Gradio-based web UI for running Llama 2 models locally on a range of hardware, including CPU and GPU across Linux, Windows, and macOS. It aims to simplify deploying and interacting with Llama 2 variants, offering an OpenAI-compatible API and serving as a backend for generative AI applications.
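As a quick illustration of the OpenAI-compatible API, a generative AI application can point the standard openai Python client at the locally served endpoint. The base URL, port, API key, and model name below are placeholder assumptions for illustration, not project defaults; use whatever your local server actually exposes.

```python
# Sketch of calling a local OpenAI-compatible endpoint with the openai client.
# base_url, api_key, and model are placeholders, not verified project defaults.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder; use the model your server loads
    messages=[{"role": "user", "content": "Give me a one-line summary of Llama 2."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```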
How It Works
The project supports multiple backends for inference: transformers (with bitsandbytes for 8-bit quantization), AutoGPTQ (for 4-bit quantization), and llama.cpp (for GGML/GGUF formats). This flexibility lets users choose the best trade-off between performance, VRAM usage, and model precision for their hardware. The llama2-wrapper library abstracts these backends, providing a unified interface for model loading, inference, and API serving.
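A minimal sketch of using the library directly is shown below. The LLAMA2_WRAPPER class, the get_prompt helper, and the model_path/backend_type parameters are assumptions based on the interface described above; check the repository for the exact names and defaults.

```python
# Hedged sketch: class and parameter names are assumptions, not a verified API reference.
from llama2_wrapper import LLAMA2_WRAPPER, get_prompt

llama2 = LLAMA2_WRAPPER(
    model_path="./models/llama-2-7b-chat.Q4_0.gguf",  # local GGUF/GGML file
    backend_type="llama.cpp",  # alternatives: "transformers" (8-bit), "gptq" (4-bit)
)

prompt = get_prompt("Compare 4-bit and 8-bit quantization in one paragraph.")
print(llama2(prompt))  # run inference through the unified interface
```

Switching backend_type (together with a matching model file) is how the same application code trades precision for VRAM.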
Quick Start & Requirements
Install the library from PyPI:
pip install llama2-wrapper
Or clone the repository and install its dependencies to run the web UI:
git clone https://github.com/liltom-eth/llama2-webui.git && cd llama2-webui && pip install -r requirements.txt
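Once installed from source, the Gradio UI and the OpenAI-compatible API server are launched from the repository root. The entry points below are assumptions about the project layout; check the repository README for the exact commands and flags.
python app.py
python -m llama2_wrapper.server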
Specific bitsandbytes versions may be needed for older NVIDIA GPUs or Windows. For GGML/GGUF models, a working llama-cpp-python installation is required.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
bitsandbytes version compatibility can be an issue on older NVIDIA GPUs, potentially requiring downgrades. Known issues have also been reported around bitsandbytes and Mac Metal acceleration.