Web app for describing scenes using a local LLM
LLaVaVision is a web application that provides a "Be My Eyes" experience, enabling users to get visual descriptions of their surroundings. It targets users who need assistance with visual interpretation and leverages a local LLM backend for privacy and offline capabilities.
How It Works
The application uses the llama.cpp project to run the BakLLaVA-1 multimodal model. It captures video input, processes frames through the LLM to generate textual descriptions, and then uses the Web Speech API to narrate those descriptions to the user. This keeps visual data local and private, with no reliance on external cloud services.
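For illustration, the request below is a minimal sketch of how a captured frame could be sent to a running llama.cpp server for description. The /completion endpoint, the image_data field, and the [img-10] prompt tag follow the server's LLaVA-style multimodal API of that period and are assumptions here, not code taken from LLaVaVision itself.

# Hedged sketch: ask a local llama.cpp server (assumed on port 8080, started
# with the BakLLaVA weights and their mmproj file) to describe one frame.
# base64 -w0 (GNU coreutils) disables line wrapping so the JSON stays valid.
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d "{
        \"prompt\": \"USER: [img-10] Describe this image concisely.\\nASSISTANT:\",
        \"n_predict\": 128,
        \"image_data\": [{\"data\": \"$(base64 -w0 frame.jpg)\", \"id\": 10}]
      }"

The app's browser side would then hand the returned text to the Web Speech API (speechSynthesis) to read it aloud.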
Quick Start & Requirements
Install the Python dependencies with pip install -r requirements.txt. The app requires llama.cpp to be built with CUDA support (-DLLAMA_CUBLAS=ON) and the llama.cpp server to be running. Download mmproj-model-f16.gguf and a quantized model (e.g., ggml-model-q4_k.gguf). Building llama.cpp and downloading the models can take time, and the web app requires dummy certificates for HTTPS.
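The commands below sketch this setup end to end. The build steps, server flags (--mmproj, -ngl, --port), file paths, and the openssl invocation for the dummy certificates are common-usage assumptions, not a verbatim copy of the project's instructions.

# Build llama.cpp with CUDA support (option name as given above; newer
# llama.cpp releases have since renamed it).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

# Start the server with the quantized BakLLaVA weights and the multimodal
# projector (paths assumed; adjust to wherever you saved the .gguf files).
./bin/server -m ../models/ggml-model-q4_k.gguf \
  --mmproj ../models/mmproj-model-f16.gguf -ngl 99 --port 8080

# In the LLaVaVision checkout: install Python dependencies and generate
# self-signed ("dummy") certificates so the app can serve over HTTPS.
pip install -r requirements.txt
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout key.pem -out cert.pem -subj "/CN=localhost"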
Highlighted Details
Uses llama.cpp for efficient local LLM inference.
Maintenance & Community
The project is a personal creation by @lxe, inspired by other open-source multimodal projects. No specific community channels or roadmap are detailed in the README.
Licensing & Compatibility
The repository itself is not explicitly licensed in the README. However, it depends on llama.cpp (MIT License) and uses models from Hugging Face, whose licenses should be checked. Compatibility for commercial use depends on the underlying model and llama.cpp licenses.
Limitations & Caveats
The application requires a machine with approximately 5GB of RAM/VRAM for the q4_k model version. HTTPS is mandatory for mobile video capture, which means certificates must be generated. The project is described as having been built in about an hour, suggesting it is a proof of concept rather than a production-ready application.