PageIndex by VectifyAI

Document index system for reasoning-based RAG

Created 5 months ago
2,537 stars

Top 18.4% on SourcePulse

Project Summary

PageIndex addresses the limitations of traditional vector-based RAG on long, professional documents by replacing similarity search with reasoning-based retrieval. It transforms a lengthy document into a hierarchical, semantic tree structure that an LLM can navigate to locate information more precisely than similarity-based methods allow. It is aimed at users working with complex documents such as financial reports, legal texts, or technical manuals, where accuracy and relevance are paramount.

How It Works

PageIndex creates a semantic tree structure from a document, akin to an LLM-optimized table of contents. It avoids arbitrary chunking by segmenting nodes along the document's natural structure, and each node contains a summary and precise page references. This hierarchical approach, inspired by tree search algorithms, lets an LLM traverse the document logically, section by section, which supports multi-step reasoning and pinpoint retrieval in domain-specific tasks.
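
The tree described above can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the Node class, its field names, and the navigate function are hypothetical and are not PageIndex's actual schema or API. A toy keyword-overlap score stands in for the LLM's judgment about which section to open next.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical tree node; field names are illustrative, not PageIndex's schema."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


def keyword_overlap(query: str, node: Node) -> int:
    """Toy stand-in for the LLM's relevance judgment: count shared words."""
    query_words = set(query.lower().split())
    node_words = set((node.title + " " + node.summary).lower().split())
    return len(query_words & node_words)


def navigate(root: Node, query: str,
             score: Callable[[str, Node], int] = keyword_overlap) -> Node:
    """Descend from the root, following the best-scoring child at each level."""
    current = root
    while current.children:
        current = max(current.children, key=lambda child: score(query, child))
    return current


# A tiny hand-built tree resembling a table of contents with page ranges.
report = Node("Annual Report", "Full-year filing", 1, 120, [
    Node("Business Overview", "Products, markets, strategy", 1, 30),
    Node("Financial Statements", "Audited statements and notes", 31, 120, [
        Node("Income Statement", "Revenue, expenses, net income", 31, 45),
        Node("Balance Sheet", "Assets, liabilities, equity", 46, 60),
    ]),
])

hit = navigate(report, "net income and revenue")
print(hit.title, hit.start_page, hit.end_page)
```

In this sketch the descent is greedy and returns a single leaf with its page range. A reasoning-based system would presumably let the model inspect the summaries at each level and possibly follow several branches, so treat this only as a picture of the data shape.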

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Set OpenAI API key: Create a .env file with CHATGPT_API_KEY=your_openai_key_here.
  • Run PageIndex: python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
  • Prerequisites: Python 3, OpenAI API key.
  • Documentation: https://github.com/VectifyAI/PageIndex
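
Before running the steps above, it can help to confirm that the .env file really defines CHATGPT_API_KEY. The following is a minimal pre-flight sketch using only the standard library. The helper name read_env_value is hypothetical, and this is not how PageIndex itself loads the key.

```python
from typing import Optional


def read_env_value(env_text: str, name: str) -> Optional[str]:
    """Return the value for `name` from KEY=VALUE text, or None if absent or empty."""
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            value = value.strip().strip('"').strip("'")
            return value or None
    return None


# Sample .env contents; in practice, read the text from the .env file on disk.
sample = "# local secrets\nCHATGPT_API_KEY=your_openai_key_here\n"
print(read_env_value(sample, "CHATGPT_API_KEY") is not None)
```

If the function returns None, the key is missing or empty and the run would likely fail at the first OpenAI call.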

Highlighted Details

  • Achieved 98.7% accuracy on the FinanceBench benchmark when integrated with the Mafin 2.5 RAG model.
  • Generates hierarchical tree structures with precise page referencing, enabling logical document traversal.
  • Supports massive documents (hundreds or thousands of pages) without arbitrary chunking.
  • Offers a cloud-hosted API with advanced OCR for complex PDFs.

Maintenance & Community

  • Early beta development; welcomes contributions and issue reporting.
  • Contact: Discord server and email.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Designed for use with LLMs; commercial use implications are unclear due to the unstated license.

Limitations & Caveats

The project is in early beta, and users may encounter instability because PDF documents vary widely in structure. For a more stable and accurate experience, especially with complex PDFs, the README recommends the project's hosted API. The license is not specified, which may hinder commercial adoption.

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: 1 week
  • Pull requests (30d): 5
  • Issues (30d): 2
  • Star history: 1,401 stars in the last 30 days
