PixelRAG by StarTrail-org

Visual RAG for documents and web pages

Created 3 weeks ago

3,589 stars

Top 13.3% on SourcePulse

1 Expert Loves This Project

Matei Zaharia

Cofounder of Databricks

Project Summary

PixelRAG addresses a limitation of text-based retrieval systems: parsing a document into text discards its visual structure. It instead searches and retrieves directly from the rendered visual content. It targets engineers, researchers, and power users who need to extract information from visually rich sources such as web pages and PDFs, and it aims to give more accurate, context-aware answers by preserving that structure.

How It Works

PixelRAG's core innovation lies in rendering documents into screenshot tiles rather than parsing them into text chunks. This approach preserves visual elements such as tables, charts, and layout. A Qwen3-VL-Embedding model, fine-tuned on screenshot data, then embeds these images into a vector space, allowing for retrieval based on visual similarity and content. This method ensures that information embedded in visual structures is not lost during the retrieval process.
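The tile-then-retrieve flow described above can be sketched in a few lines. This is an illustration of the general idea only, not PixelRAG's code: the function names `tile_spans`, `cosine`, and `top_k` are hypothetical, the tile height and overlap are arbitrary, and the short vectors stand in for the embeddings a vision-language model would produce.

```python
import math


def tile_spans(page_height, tile_height, overlap=0):
    """Return (top, bottom) pixel-row spans that cover a rendered page.

    Hypothetical helper: each span is the crop box of one screenshot tile.
    """
    if tile_height <= 0 or not 0 <= overlap < tile_height:
        raise ValueError("need tile_height > 0 and 0 <= overlap < tile_height")
    spans = []
    top = 0
    step = tile_height - overlap
    while top < page_height:
        bottom = min(top + tile_height, page_height)
        spans.append((top, bottom))
        if bottom == page_height:
            break  # the last tile reached the end of the page
        top += step
    return spans


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, tile_vecs, k=3):
    """Return indices of the k tiles most similar to the query, best first."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(tile_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # ties broken by tile index
    return [i for _, i in scored[:k]]


if __name__ == "__main__":
    # A 2500-row page cut into 1000-row tiles with a 200-row overlap.
    print(tile_spans(2500, 1000, overlap=200))
    # Stand-in embeddings: one short vector per tile, plus a query vector.
    tiles = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    print(top_k([1.0, 0.0], tiles, k=2))
```

In a real pipeline the vectors would come from the embedding model and the linear scan would be replaced by a vector index; the ranking logic stays the same.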

Quick Start & Requirements

Primary Install: pip install pixelrag
Prerequisites:
- Core functionality: pip install pixelrag (includes Playwright/CDP for rendering).
- Embedding/indexing: pip install 'pixelrag[embed]'.
- Serving: pip install 'pixelrag[serve]'.
- Full pipeline orchestration: pip install 'pixelrag[index]'.
- Training: a separate, pinned environment (torch==2.9.1+cu129, transformers==4.57.1, cuDNN 9.20) managed with uv inside the train/ directory.
- Hardware: a GPU is recommended for embedding and training.
Links:
- Live hosted API: https://api.pixelrag.ai
- Demo notebook: Available in Colab.
- Claude plugin installation: claude plugin marketplace add StarTrail-org/PixelRAG
- Pre-built Wikipedia index download: huggingface-cli download StarTrail-org/pixelrag-faiss-indexes --repo-type dataset --include "search_index_normed_v2/*" --local-dir ./index

Highlighted Details

- Provides a live, hosted API (https://api.pixelrag.ai) that serves a pre-built index of 8.28 million Wikipedia pages and requires no setup.
- Integrates as a Claude Code plugin (pixelbrowse skill), allowing Claude to screenshot pages and interpret visual content directly.
- Supports image-based queries for visual search.
- The pipeline is general-purpose and can render various document types, including web pages, PDFs, and images.

Maintenance & Community

Developed by Berkeley SkyLab, BAIR, and the Berkeley NLP Group. Notable contributors include Rulin Shao. Support was provided by Claude Code and OpenAI Codex. No specific community channels (e.g., Discord, Slack) or roadmap links are detailed in the provided README.

Licensing & Compatibility

The project is licensed under the Apache-2.0 license. Apache-2.0 is a permissive, non-copyleft license: it allows commercial use and integration into closed-source applications, provided its attribution and notice conditions are met.

Limitations & Caveats

The training environment is managed as a separate uv project within the train/ directory and requires specific, pinned dependencies, potentially complicating setup. The reproducibility status for data curation visualization is marked as "TBD". The Claude plugin executes the pixelshot command locally on the user's machine.
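Because the training setup depends on exact versions, it can help to confirm that the active environment matches the pins before launching a run. The following is a minimal sketch, assuming the pins quoted in this summary are the ones the train/ project expects; `installed_versions` and `pin_mismatches` are hypothetical helper names, and only the standard-library `importlib.metadata` is used.

```python
from importlib import metadata

# Pins as quoted in this summary. cuDNN 9.20 is also listed there,
# but this sketch does not check it.
PINS = {"torch": "2.9.1+cu129", "transformers": "4.57.1"}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def pin_mismatches(installed, pins=PINS):
    """Return {name: (wanted, found)} for every pin that is not met exactly."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    problems = pin_mismatches(installed_versions(PINS))
    if problems:
        for name, (wanted, found) in problems.items():
            print(f"{name}: expected {wanted}, found {found}")
    else:
        print("All pinned packages match.")
```

Run it with the interpreter from the training environment; an empty result means the two checked packages match their pins exactly.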

Health Check

Last Commit

14 hours ago

Responsiveness

Inactive

Star History

3,593 stars in the last 25 days