Web app for describing scenes using a local LLM
LLaVaVision is a web application that provides a "Be My Eyes" experience, enabling users to get visual descriptions of their surroundings. It targets users who need assistance with visual interpretation and leverages a local LLM backend for privacy and offline capabilities.
How It Works
The application uses the llama.cpp project to run the BakLLaVA-1 multimodal model. It captures video input, processes frames through the LLM to generate textual descriptions, and then uses the Web Speech API to narrate those descriptions to the user. This keeps visual data local and private, with no reliance on external cloud services.
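For illustration, the request below is a minimal sketch of how a captured frame could be sent to a running llama.cpp server for description. The /completion endpoint, the image_data field, and the [img-10] prompt tag follow the server's LLaVA-style multimodal API of that period and are assumptions here, not code taken from LLaVaVision itself.

# Hedged sketch: ask a local llama.cpp server (assumed on port 8080, started
# with the BakLLaVA weights and their mmproj file) to describe one frame.
# base64 -w0 (GNU coreutils) disables line wrapping so the JSON stays valid.
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d "{
        \"prompt\": \"USER: [img-10] Describe this image concisely.\\nASSISTANT:\",
        \"n_predict\": 128,
        \"image_data\": [{\"data\": \"$(base64 -w0 frame.jpg)\", \"id\": 10}]
      }"

The app's browser side would then hand the returned text to the Web Speech API (speechSynthesis) to read it aloud.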
Quick Start & Requirements
Install the Python dependencies with pip install -r requirements.txt. The app requires llama.cpp to be built with CUDA support (-DLLAMA_CUBLAS=ON) and the llama.cpp server to be running. Download mmproj-model-f16.gguf and a quantized model (e.g., ggml-model-q4_k.gguf). Building llama.cpp and downloading the models can take time, and the web app requires dummy certificates for HTTPS.
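The commands below sketch this setup end to end. The build steps, server flags (--mmproj, -ngl, --port), file paths, and the openssl invocation for the dummy certificates are common-usage assumptions, not a verbatim copy of the project's instructions.

# Build llama.cpp with CUDA support (option name as given above; newer
# llama.cpp releases have since renamed it).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

# Start the server with the quantized BakLLaVA weights and the multimodal
# projector (paths assumed; adjust to wherever you saved the .gguf files).
./bin/server -m ../models/ggml-model-q4_k.gguf \
  --mmproj ../models/mmproj-model-f16.gguf -ngl 99 --port 8080

# In the LLaVaVision checkout: install Python dependencies and generate
# self-signed ("dummy") certificates so the app can serve over HTTPS.
pip install -r requirements.txt
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout key.pem -out cert.pem -subj "/CN=localhost"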
Highlighted Details
Uses llama.cpp for efficient local LLM inference.
Maintenance & Community
The project is a personal creation by @lxe, inspired by other open-source multimodal projects. No specific community channels or roadmap are detailed in the README.
Licensing & Compatibility
The repository itself is not explicitly licensed in the README. However, it depends on llama.cpp (MIT License) and uses models from Hugging Face, whose licenses should be checked. Compatibility for commercial use depends on the underlying model and llama.cpp licenses.
Limitations & Caveats
The application requires a machine with approximately 5GB of RAM/VRAM for the q4_k model version. HTTPS is mandatory for mobile video capture, which means certificates must be generated. The project is described as having been built in about an hour, suggesting it is a proof of concept rather than a production-ready application.