PageIndex by VectifyAI

Document index system for reasoning-based RAG

Created 9 months ago

5,064 stars

Top 9.8% on SourcePulse

1 Expert Loves This Project

Elvis Saravia

Founder of DAIR.AI

Project Summary

PageIndex addresses the limitations of traditional vector-based RAG for long, professional documents by enabling reasoning-based retrieval. It transforms lengthy documents into hierarchical, semantic tree structures, allowing LLMs to navigate and extract information more precisely than similarity-based methods. This is ideal for users working with complex documents like financial reports, legal texts, or technical manuals where accuracy and relevance are paramount.

How It Works

PageIndex builds a semantic tree from each document, similar to a table of contents optimized for LLMs. Instead of chunking arbitrarily, it segments nodes along the document's natural structure, and each node carries a summary and exact page references. Inspired by tree search algorithms, this hierarchy lets an LLM traverse the document logically, which supports multi-step reasoning and pinpoint retrieval in domain-specific tasks.
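
To make the idea concrete, here is a minimal sketch of such a tree and a top-down traversal. It is an illustration, not PageIndex's actual data model or API: the `Node` fields, the `descend` function, and the sample report are all hypothetical, and the `keyword_overlap` judge is a simple stand-in for the LLM reasoning step that would decide which branch to follow.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One entry in a hypothetical table-of-contents tree (not PageIndex's schema)."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


def words(text: str) -> set:
    """Lowercase alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_overlap(query: str, node: Node) -> int:
    """Stand-in relevance judge: number of query words that appear in the
    node's title or summary. A real system would ask an LLM instead."""
    return len(words(query) & words(f"{node.title} {node.summary}"))


def descend(root: Node, query: str,
            judge: Callable[[str, Node], int] = keyword_overlap) -> List[Node]:
    """Walk from the root to a leaf, following the best-scoring child at
    each level, and return the path of nodes visited."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: judge(query, child))
        path.append(node)
    return path


# Hypothetical sample document structure.
report = Node("Annual Report", "Full-year company filing", 1, 120, [
    Node("Business Overview", "Products, markets and strategy", 1, 30),
    Node("Financial Statements",
         "Audited income statement, balance sheet and notes", 31, 120, [
             Node("Income Statement", "Revenue, expenses and net income", 31, 40),
             Node("Balance Sheet", "Assets, liabilities and equity", 41, 55),
         ]),
])

path = descend(report, "What was net income?")
leaf = path[-1]
print(" > ".join(n.title for n in path),
      f"(pages {leaf.start_page}-{leaf.end_page})")
```

Because each node carries its page range, the traversal ends with an exact span to read (here, the income statement pages) rather than a set of similarity-ranked chunks. Swapping `judge` for an LLM call would be the natural next step in this sketch.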

Quick Start & Requirements

Install dependencies: pip3 install -r requirements.txt
Set OpenAI API key: Create a .env file with CHATGPT_API_KEY=your_openai_key_here.
Run PageIndex: python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
Prerequisites: Python 3, OpenAI API key.
Documentation: https://github.com/VectifyAI/PageIndex
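
The steps above can be wrapped in a small pre-flight helper. This is a sketch under stated assumptions: `read_api_key` and `build_command` are hypothetical helper names, not part of PageIndex; only the `CHATGPT_API_KEY` variable, the `run_pageindex.py` script, and the `--pdf_path` flag come from the steps listed above.

```python
from pathlib import Path
from typing import List, Optional


def read_api_key(env_path: Path) -> Optional[str]:
    """Return the CHATGPT_API_KEY value from a .env file, or None if the
    file or the key is missing. (Hypothetical helper, not PageIndex code.)"""
    if not env_path.is_file():
        return None
    for line in env_path.read_text().splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name.strip() == "CHATGPT_API_KEY" and value.strip():
            return value.strip()
    return None


def build_command(pdf_path: str) -> List[str]:
    """Build the documented run command for a single PDF."""
    return ["python3", "run_pageindex.py", "--pdf_path", pdf_path]


if __name__ == "__main__":
    if read_api_key(Path(".env")) is None:
        print("CHATGPT_API_KEY not found in .env")
    # Print the command rather than executing it, so the sketch has no side effects.
    print(" ".join(build_command("/path/to/your/document.pdf")))
```

The list returned by `build_command` can be passed to `subprocess.run(..., check=True)` from the repository root once the key check passes.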

Highlighted Details

Achieved 98.7% accuracy on the FinanceBench benchmark when integrated with the Mafin 2.5 RAG model.
Generates hierarchical tree structures with precise page referencing, enabling logical document traversal.
Supports massive documents (hundreds or thousands of pages) without arbitrary chunking.
Offers a cloud-hosted API with advanced OCR for complex PDFs.

Maintenance & Community

Early beta development; welcomes contributions and issue reporting.
Contact: Available via Discord server and email.

Licensing & Compatibility

License: Not explicitly stated in the README.
Compatibility: Designed for use with LLMs; commercial use implications are unclear due to the unstated license.

Limitations & Caveats

The project is in early beta, and results may be unstable because PDF structures vary widely. For a more stable and accurate experience, especially with complex PDFs, the README recommends the hosted API. The license is not specified, which may hinder commercial adoption.

Health Check

Last Commit

3 days ago

Responsiveness

1 week

Star History

829 stars in the last 30 days