localGPT-Vision by PromtEngineer

Vision-language RAG pipeline for document Q&A

created 10 months ago
593 stars

Top 55.7% on sourcepulse

Project Summary

This project provides an end-to-end Retrieval-Augmented Generation (RAG) system for vision-based document interaction. Users chat with PDFs and images: the system indexes the visual content of each page, retrieves relevant pages with ColQwen/ColPali models, and generates answers with one of several Vision Language Models (VLMs). It suits documents where visual layout and elements matter as much as the text.

How It Works

The RAG pipeline skips traditional text extraction, chunking, and text embedding. Instead, it uses ColQwen and ColPali models to create embeddings directly from document page images, capturing visual cues such as layout and figures. At query time, the query is matched against these visual embeddings to retrieve the most relevant pages. The retrieved page images, together with the user's query, are then passed to a selected VLM (e.g., Qwen2-VL, Gemini, GPT-4o), which generates the response from both the text and the visual content of those pages.
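
The retrieval step can be illustrated with a small sketch. ColPali-family models represent a page as many patch vectors and a query as many token vectors, and score a page by late interaction (MaxSim): each query vector takes its best match among the page's vectors, and those maxima are summed. The code below is only a toy illustration of that scoring rule in plain Python. The function names (`normalize`, `maxsim_score`, `retrieve`), the page ids, and the two-dimensional vectors are all made up for the example and are not taken from the project's code, which may organize this step differently.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products behave like cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: sum, over query vectors, of the best dot product with any page vector."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, p)) for p in page_vecs)
    return total


def retrieve(query_vecs, pages, k=3):
    """Return the ids of the k pages with the highest MaxSim score."""
    scored = [(maxsim_score(query_vecs, vecs), page_id) for page_id, vecs in pages.items()]
    scored.sort(reverse=True)
    return [page_id for _, page_id in scored[:k]]


# Toy data (hypothetical): two query-token vectors and three pages of patch vectors.
query = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
pages = {
    "page_1": [normalize([1.0, 0.1]), normalize([0.1, 1.0])],
    "page_2": [normalize([1.0, 0.0]), normalize([1.0, 0.2])],
    "page_3": [normalize([-1.0, 0.0]), normalize([0.0, -1.0])],
}

top_pages = retrieve(query, pages, k=2)
print(top_pages)
```

In a real pipeline the vectors would come from the embedding model, and the images of the top-ranked pages would be sent to the VLM along with the question.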

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment (conda create -n localgpt-vision python=3.10), activate it (conda activate localgpt-vision), install dependencies (pip install -r requirements.txt), and install the dev version of Transformers (pip uninstall transformers && pip install git+https://github.com/huggingface/transformers).
  • Prerequisites: Anaconda/Miniconda, Python 3.10+, Git. For debugging on Ubuntu/Debian, libpoppler-cpp-dev, poppler-utils, cmake, pkgconfig, and python3-poppler-qt5 are required. API keys for Gemini, OpenAI, and Groq may be needed.
  • Run: python app.py
  • Access: http://localhost:5050/
  • Docs: https://github.com/PromtEngineer/localGPT-Vision
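
Before running the app, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch of such a check, not part of the project. The helper name `check_environment` is hypothetical, and the environment variable names for the API keys are assumptions made for illustration; consult the README for the names the app actually reads. `pdftoppm` is used as a stand-in for detecting poppler-utils, which ships that binary.

```python
import os
import shutil
import sys

# ASSUMPTION: these variable names are illustrative guesses, not confirmed by the project.
ASSUMED_API_KEY_VARS = ["GEMINI_API_KEY", "OPENAI_API_KEY", "GROQ_API_KEY"]
REQUIRED_TOOLS = ["git", "conda"]
OPTIONAL_TOOLS = ["pdftoppm"]  # installed by poppler-utils


def check_environment(env=None, which=shutil.which, version=None):
    """Report which prerequisites appear to be missing on this machine."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    return {
        "python_ok": tuple(version) >= (3, 10),
        "missing_tools": [t for t in REQUIRED_TOOLS if which(t) is None],
        "missing_optional_tools": [t for t in OPTIONAL_TOOLS if which(t) is None],
        "missing_api_keys": [v for v in ASSUMED_API_KEY_VARS if not env.get(v)],
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A missing API key is only a problem if the corresponding hosted model is selected; local models do not need one.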

Highlighted Details

  • End-to-end vision-based RAG pipeline.
  • Indexes documents using visual embeddings (ColQwen/ColPali), eliminating text extraction and chunking.
  • Supports multiple VLMs: Qwen2-VL-7B-Instruct, LLAMA-3.2-11B-Vision, Pixtral-12B-2409, Molmo-7B-O-0924, Google Gemini, OpenAI GPT-4o, LLAMA-3.2 with Ollama.
  • Session management for persistent chat history and indexes.

Maintenance & Community

  • Open to contributions via pull requests.
  • Star History link provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

Response quality is highly dependent on the chosen VLM and the resolution of document images. Debugging setup instructions are specific to Ubuntu/Debian-based systems.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 56 stars in the last 90 days
