File parser for improved LLM document chunking
Open-Parse is a Python library designed to improve document chunking for Retrieval Augmented Generation (RAG) systems by visually analyzing document layouts. It targets developers building AI applications who need to process complex documents like PDFs, offering a more semantically aware approach than traditional text splitters or basic layout parsers. The primary benefit is higher quality chunking, preserving document structure for more effective AI processing.
How It Works
Open-Parse employs a visually-driven approach, analyzing document layouts to group related content semantically. Unlike basic text splitters that discard structural information, or ML layout parsers that focus on element identification but not grouping, Open-Parse aims to intelligently chunk documents by understanding headings, sections, tables, and other structural elements. It supports basic markdown parsing and high-precision table extraction, converting tables into clean Markdown formats.
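The grouping idea can be illustrated with a small, self-contained sketch. This is not Open-Parse's actual algorithm or API: the `Element` and `Chunk` types, the `kind` labels, and `group_by_heading` are hypothetical names invented here, and the sketch assumes the layout elements have already been extracted with page numbers and vertical positions.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical layout element: text plus its role and position."""
    text: str
    kind: str    # "heading" or "text" (illustrative labels only)
    page: int
    top: float   # distance from the top of the page


@dataclass
class Chunk:
    """Hypothetical chunk: one heading and the body text that follows it."""
    heading: str
    body: list = field(default_factory=list)


def group_by_heading(elements):
    """Sort elements into reading order and start a new chunk at each heading."""
    ordered = sorted(elements, key=lambda e: (e.page, e.top))
    chunks = []
    current = None
    for el in ordered:
        if el.kind == "heading":
            current = Chunk(heading=el.text)
            chunks.append(current)
        else:
            if current is None:
                # Body text that appears before any heading gets an untitled chunk.
                current = Chunk(heading="")
                chunks.append(current)
            current.body.append(el.text)
    return chunks


# Elements deliberately listed out of order to show the positional sort.
sample = [
    Element("Results", "heading", 1, 300.0),
    Element("Intro paragraph.", "text", 1, 120.0),
    Element("Introduction", "heading", 1, 100.0),
    Element("Accuracy improved.", "text", 1, 320.0),
]

for chunk in group_by_heading(sample):
    print(chunk.heading, chunk.body)
```

A plain text splitter working on the same content would cut at a fixed character count and could separate a heading from its paragraph. Keeping each heading together with the text beneath it is the property the layout-aware approach is meant to preserve.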
Quick Start & Requirements
Install the core library with pip install openparse.
For the ML extras, run pip install "openparse[ml]" followed by openparse-download to get model weights.
OCR requires the TESSDATA_PREFIX environment variable to be set correctly.
Highlighted Details
Maintenance & Community
The project mentions sponsors and encourages reaching out for special use cases. Links to cookbooks and documentation are provided.
Licensing & Compatibility
The core library is fully open source. Table extraction relies on PyMuPDF, which is distributed under its own license. The README notes that table detection uses table-transformers, describes its performance as subpar, and raises the possibility of unitable adding support for better models.
Limitations & Caveats
The project currently uses table-transformers for table detection; the README describes its performance as subpar, which degrades downstream results. Users who need OCR must install Tesseract-OCR and its language data and configure them correctly.
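A quick pre-flight check can catch a misconfigured OCR setup before parsing starts. The sketch below is an assumption-laden illustration and not part of Open-Parse: `find_ocr_problems` is a hypothetical helper, and it assumes TESSDATA_PREFIX points directly at the directory holding Tesseract's `.traineddata` language files (older Tesseract versions may expect the parent directory instead).

```python
import os
from pathlib import Path


def find_ocr_problems(env=None):
    """Return a list of problems with the Tesseract language-data setup.

    Hypothetical helper: an empty list means the basic checks passed.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return ["TESSDATA_PREFIX is not set"]

    data_dir = Path(prefix)
    if not data_dir.is_dir():
        return [f"TESSDATA_PREFIX points to a missing directory: {prefix}"]

    # Assumption: language data lives directly in this directory.
    if not any(data_dir.glob("*.traineddata")):
        return [f"no .traineddata language files found in {prefix}"]

    return []


for problem in find_ocr_problems():
    print("OCR setup issue:", problem)
```

Passing the environment in as a mapping keeps the check easy to exercise without modifying the real process environment.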