Vision-language RAG pipeline for document Q&A
This project provides an end-to-end Retrieval-Augmented Generation (RAG) system focused on vision-based document interaction. It allows users to chat with PDFs and images by indexing their visual content, retrieving relevant pages with ColQwen/ColPali models, and generating answers with a choice of Vision Language Models (VLMs). This approach benefits users who need to query documents where the visual layout and elements are as important as the text.
How It Works
The system employs a RAG pipeline that bypasses traditional text chunking and embedding. Instead, it uses ColQwen and ColPali models to create embeddings directly from document page images, capturing visual cues such as layout and figures. At query time, these visual embeddings are matched to retrieve the most relevant pages. The retrieved page images, along with the user's query, are then fed to the selected VLM (e.g., Qwen2-VL, Gemini, GPT-4o) for response generation, giving the model a more holistic view of the document's content.
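As a rough illustration (not code from this repository), the sketch below approximates that flow using the byaldi library as a ColPali-style retriever and the OpenAI API standing in for the VLM. The checkpoint name, file path, index name, and the assumption that byaldi returns retrieved pages as base64-encoded images are illustrative placeholders, not details taken from this project.

```python
# Hedged sketch of the retrieve-then-generate flow (assumptions noted above).
# Assumes byaldi and openai are installed, OPENAI_API_KEY is set, and sample.pdf exists.
from byaldi import RAGMultiModalModel
from openai import OpenAI

# 1. Build visual embeddings for each page image (no text chunking).
retriever = RAGMultiModalModel.from_pretrained("vidore/colpali")  # placeholder checkpoint
retriever.index(
    input_path="sample.pdf",           # placeholder document
    index_name="demo_index",
    store_collection_with_index=True,  # keep page images so they can be passed to the VLM
    overwrite=True,
)

# 2. Retrieve the page whose visual embedding best matches the query.
query = "What does Figure 2 show?"
results = retriever.search(query, k=1)
page_b64 = results[0].base64  # base64-encoded page image (assumed attribute)

# 3. Hand the page image plus the query to a VLM (GPT-4o here) for the answer.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

In the actual application the VLM backend is selectable (Qwen2-VL, Gemini, GPT-4o, among others), and retrieval typically returns several candidate pages rather than a single one.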
Quick Start & Requirements
Create a conda environment (conda create -n localgpt-vision python=3.10), activate it (conda activate localgpt-vision), install dependencies (pip install -r requirements.txt), and install the development version of Transformers (pip uninstall transformers && pip install git+https://github.com/huggingface/transformers). The system packages libpoppler-cpp-dev, poppler-utils, cmake, pkgconfig, and python3-poppler-qt5 are required. API keys for Gemini, OpenAI, and Groq may be needed, depending on the chosen VLM backend. Launch the app with python app.py.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Response quality is highly dependent on the chosen VLM and the resolution of document images. Debugging setup instructions are specific to Ubuntu/Debian-based systems.