PageIndex by VectifyAI

Document index system for reasoning-based RAG

created 4 months ago
1,130 stars

Top 34.6% on sourcepulse

Project Summary

PageIndex addresses the limitations of traditional vector-based RAG for long, professional documents by enabling reasoning-based retrieval. It transforms lengthy documents into hierarchical, semantic tree structures, allowing LLMs to navigate and extract information more precisely than similarity-based methods. It suits users working with complex documents such as financial reports, legal texts, or technical manuals, where accuracy and relevance are paramount.

How It Works

PageIndex builds a semantic tree from each document, akin to a table of contents optimized for LLMs. Rather than chunking arbitrarily, it segments nodes along the document's natural structure, and each node holds a summary and precise page references. This hierarchy, inspired by tree search algorithms, lets an LLM traverse the document logically, supporting multi-step reasoning and pinpoint retrieval in domain-specific tasks.
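To make the tree idea concrete, here is a minimal Python sketch of a node that carries a summary and a page range, plus a top-down descent that picks the most relevant child at each level. It is an illustration only and not PageIndex's actual data model or code: the names Node, keyword_score, and find_section are hypothetical, and the keyword-overlap scorer is a stand-in for the LLM judgment the project relies on.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One section of the document tree (hypothetical schema)."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


def keyword_score(query: str, node: Node) -> int:
    """Stand-in for an LLM relevance judgment: count query words
    that also appear in the node's title or summary."""
    words = set(re.findall(r"\w+", f"{node.title} {node.summary}".lower()))
    return sum(1 for w in re.findall(r"\w+", query.lower()) if w in words)


def find_section(root: Node, query: str,
                 score: Callable[[str, Node], int] = keyword_score) -> Node:
    """Walk down the tree, following the best-scoring child at each level.
    Stops at a leaf, or when no child looks relevant to the query."""
    node = root
    while node.children:
        best = max(node.children, key=lambda child: score(query, child))
        if score(query, best) == 0:
            break
        node = best
    return node


# A toy tree for an invented annual report.
tree = Node("Annual Report", "Full report", 1, 120, [
    Node("Business Overview", "Company products, markets and strategy", 1, 30),
    Node("Financial Statements",
         "Balance sheet, income statement and cash flow", 31, 90, [
             Node("Income Statement", "Revenue, expenses and net income", 31, 50),
             Node("Balance Sheet", "Assets, liabilities and equity", 51, 70),
         ]),
    Node("Risk Factors", "Market, regulatory and operational risks", 91, 120),
])

hit = find_section(tree, "net income revenue")
print(hit.title, hit.start_page, hit.end_page)  # Income Statement 31 50
```

Because each node records its page range, the descent ends with a specific span of pages to read instead of a similarity-ranked list of chunks. Swapping keyword_score for a function that asks an LLM to rate each child's summary would bring the sketch closer to the reasoning-based retrieval described above.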

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Set OpenAI API key: Create a .env file with CHATGPT_API_KEY=your_openai_key_here.
  • Run PageIndex: python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
  • Prerequisites: Python 3, OpenAI API key.
  • Documentation: https://github.com/VectifyAI/PageIndex
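
The quick-start steps above can be wrapped in a small pre-flight check. The sketch below only reuses the variable name CHATGPT_API_KEY, the script name run_pageindex.py, and the --pdf_path flag listed above. The helper names has_api_key and build_command are hypothetical, and how the script itself loads the .env file is not stated here, so treat the check as a convenience and not as part of the tool.

```python
from typing import List


def has_api_key(env_text: str) -> bool:
    """Return True if the .env text assigns a non-empty CHATGPT_API_KEY."""
    for line in env_text.splitlines():
        line = line.strip()
        # Skip comments and lines that are not assignments.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == "CHATGPT_API_KEY" and value.strip():
            return True
    return False


def build_command(pdf_path: str) -> List[str]:
    """Build the quick-start command as an argument list."""
    return ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
```

A caller could read the .env file, confirm has_api_key returns True, and then pass build_command("/path/to/your/document.pdf") to subprocess.run from the repository root.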

Highlighted Details

  • Achieved 98.7% accuracy on the FinanceBench benchmark when integrated with the Mafin 2.5 RAG model.
  • Generates hierarchical tree structures with precise page referencing, enabling logical document traversal.
  • Supports massive documents (hundreds or thousands of pages) without arbitrary chunking.
  • Offers a cloud-hosted API with advanced OCR for complex PDFs.

Maintenance & Community

  • Early beta development; welcomes contributions and issue reporting.
  • Contact: Discord server and email.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Designed for use with LLMs; commercial use implications are unclear due to the unstated license.

Limitations & Caveats

The project is in early beta, and users may encounter instability because PDF structures vary widely. For a more stable and accurate experience, especially with complex PDFs, the README recommends the project's hosted API. The license is not specified, which may limit commercial adoption.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 469 stars in the last 90 days
