Vision-language RAG pipeline for document Q&A
This project provides an end-to-end Retrieval-Augmented Generation (RAG) system focused on vision-based document interaction. It allows users to chat with PDFs and images by indexing their visual content, retrieving relevant pages with ColQwen/ColPali models, and generating answers with a choice of Vision Language Models (VLMs). This approach benefits users who need to query documents where the visual layout and elements are as important as the text.
How It Works
The system employs a RAG pipeline that bypasses traditional text chunking and embedding. Instead, it uses ColQwen and ColPali models to create embeddings directly from document page images, capturing visual cues such as layout and figures. At query time, these visual embeddings are matched to retrieve the most relevant pages. The retrieved page images, along with the user's query, are then fed to the selected VLM (e.g., Qwen2-VL, Gemini, GPT-4o) for response generation, giving the model a more holistic view of the document's content.
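As a rough illustration (not code from this repository), the sketch below approximates that flow using the byaldi library as a ColPali-style retriever and the OpenAI API standing in for the VLM. The checkpoint name, file path, index name, and the assumption that byaldi returns retrieved pages as base64-encoded images are illustrative placeholders, not details taken from this project.

```python
# Hedged sketch of the retrieve-then-generate flow (assumptions noted above).
# Assumes byaldi and openai are installed, OPENAI_API_KEY is set, and sample.pdf exists.
from byaldi import RAGMultiModalModel
from openai import OpenAI

# 1. Build visual embeddings for each page image (no text chunking).
retriever = RAGMultiModalModel.from_pretrained("vidore/colpali")  # placeholder checkpoint
retriever.index(
    input_path="sample.pdf",           # placeholder document
    index_name="demo_index",
    store_collection_with_index=True,  # keep page images so they can be passed to the VLM
    overwrite=True,
)

# 2. Retrieve the page whose visual embedding best matches the query.
query = "What does Figure 2 show?"
results = retriever.search(query, k=1)
page_b64 = results[0].base64  # base64-encoded page image (assumed attribute)

# 3. Hand the page image plus the query to a VLM (GPT-4o here) for the answer.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

In the actual application the VLM backend is selectable (Qwen2-VL, Gemini, GPT-4o, among others), and retrieval typically returns several candidate pages rather than a single one.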
Quick Start & Requirements
Create a conda environment (conda create -n localgpt-vision python=3.10), activate it (conda activate localgpt-vision), install dependencies (pip install -r requirements.txt), and install the development version of Transformers (pip uninstall transformers && pip install git+https://github.com/huggingface/transformers). The system packages libpoppler-cpp-dev, poppler-utils, cmake, pkgconfig, and python3-poppler-qt5 are required. API keys for Gemini, OpenAI, and Groq may be needed, depending on the chosen VLM backend. Launch the app with python app.py.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Response quality is highly dependent on the chosen VLM and the resolution of document images. Debugging setup instructions are specific to Ubuntu/Debian-based systems.