open-parse by Filimoa

File parser for improved LLM document chunking

Created 1 year ago

3,148 stars

Top 15.1% on SourcePulse

2 Experts Love This Project

Xiaofan Luan

VP Engineering at Zilliz

Bryan Helmig

Cofounder of Zapier

Project Summary

Open-Parse is a Python library designed to improve document chunking for Retrieval Augmented Generation (RAG) systems by visually analyzing document layouts. It targets developers building AI applications who need to process complex documents like PDFs, offering a more semantically aware approach than traditional text splitters or basic layout parsers. The primary benefit is higher quality chunking, preserving document structure for more effective AI processing.

How It Works

Open-Parse takes a visually driven approach: it analyzes the document layout and groups related content semantically. Basic text splitters discard structural information, and ML layout parsers identify elements without grouping them. Open-Parse instead chunks a document according to its headings, sections, tables, and other structural elements. It supports basic Markdown parsing and high-precision table extraction, converting tables into clean Markdown.
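
To make the idea of structure-aware chunking concrete, here is a minimal standard-library sketch. It is not Open-Parse's implementation: the `Element` and `Chunk` types and the `group_by_heading` function are hypothetical names invented for illustration, and the sketch assumes each layout element has already been classified as a heading, text, or table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical layout element: a piece of text plus its detected kind."""
    kind: str  # assumed to be "heading", "text", or "table"
    text: str


@dataclass
class Chunk:
    """Hypothetical chunk: one heading and the content that follows it."""
    heading: str
    body: List[str] = field(default_factory=list)


def group_by_heading(elements: List[Element]) -> List[Chunk]:
    """Start a new chunk at each heading and attach following elements to it."""
    chunks: List[Chunk] = []
    current = Chunk(heading="")  # collects content that precedes any heading
    for el in elements:
        if el.kind == "heading":
            # Close the running chunk, unless it is the empty placeholder.
            if current.heading or current.body:
                chunks.append(current)
            current = Chunk(heading=el.text)
        else:
            current.body.append(el.text)
    if current.heading or current.body:
        chunks.append(current)
    return chunks


elements = [
    Element("heading", "1. Installation"),
    Element("text", "Unpack the unit."),
    Element("table", "| Part | Qty |"),
    Element("heading", "2. Safety"),
    Element("text", "Wear gloves."),
]
chunks = group_by_heading(elements)
for chunk in chunks:
    print(chunk.heading, "->", chunk.body)
```

A plain character-count splitter could cut between the installation text and its table. Grouping by heading keeps them in one chunk, which is the property the paragraph above describes.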

Quick Start & Requirements

Install: pip install openparse
ML Table Detection (Optional): pip install "openparse[ml]", then run openparse-download to fetch the model weights.
Prerequisites: Python 3.8+. OCR requires Tesseract-OCR and its language data, and the TESSDATA_PREFIX environment variable must be set correctly.
Documentation: https://filimoa.github.io/open-parse/
Cookbooks: https://github.com/Filimoa/open-parse/tree/main/src/cookbooks
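
The prerequisites above can be checked before installing. The sketch below uses only the Python standard library and is not part of Open-Parse. The `check_prerequisites` helper is a hypothetical name, and the sketch assumes the Tesseract executable is called `tesseract` and that TESSDATA_PREFIX should point to an existing directory.

```python
import os
import shutil
import sys
from typing import Dict, Mapping, Optional


def check_prerequisites(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report which of the documented prerequisites appear to be satisfied."""
    env = os.environ if env is None else env
    tessdata = env.get("TESSDATA_PREFIX", "")
    return {
        "python_3_8_plus": sys.version_info >= (3, 8),
        # Assumption: the OCR binary is installed under the name "tesseract".
        "tesseract_on_path": shutil.which("tesseract") is not None,
        "tessdata_prefix_set": bool(tessdata),
        # Assumption: the variable should name a directory that exists.
        "tessdata_prefix_exists": bool(tessdata) and os.path.isdir(tessdata),
    }


report = check_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

The Tesseract checks matter only if OCR is needed; the base install requires just the Python version check to pass.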

Highlighted Details

Analyzes document layout visually to produce better LLM input.
Supports basic Markdown parsing for headings, bold, and italics.
Extracts tables with high precision and converts them to Markdown.
Extensible with custom post-processing steps.
Uses pydantic for easy serialization of results.
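
As an illustration of what converting a table to Markdown involves, here is a small standard-library sketch. It does not use or reproduce Open-Parse's table extraction: `rows_to_markdown` is a hypothetical helper, and it assumes the table has already been extracted into a header row and a list of data rows.

```python
from typing import List, Sequence


def _escape(cell: object) -> str:
    """Escape pipes and flatten newlines so a cell stays on one table row."""
    return str(cell).replace("|", "\\|").replace("\n", " ").strip()


def rows_to_markdown(header: Sequence[object], rows: Sequence[Sequence[object]]) -> str:
    """Render a header and data rows as a pipe-delimited Markdown table."""
    width = len(header)
    lines: List[str] = [
        "| " + " | ".join(_escape(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        # Truncate long rows and pad short ones to match the header width.
        padded = list(row)[:width] + [""] * (width - len(row))
        lines.append("| " + " | ".join(_escape(c) for c in padded) + " |")
    return "\n".join(lines)


table_md = rows_to_markdown(
    ["Part", "Qty"],
    [["Bolt", 4], ["Washer | flat", 8], ["Nut"]],
)
print(table_md)
```

Escaping the pipe character and padding ragged rows keeps the output a valid table even when the extracted cells are messy, which is common with PDF sources.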

Maintenance & Community

The README lists sponsors and invites users with special use cases to get in touch. It also links to the cookbooks and documentation.

Licensing & Compatibility

The core library is fully open source. Table extraction relies on PyMuPDF, which is licensed separately. The README also notes that table detection uses table-transformers, describes their performance as subpar, and mentions unitable as a possible route to better models.

Limitations & Caveats

The project currently uses table-transformers for table detection, and the README notes that their subpar performance degrades downstream results. Users who need OCR must install and configure Tesseract-OCR and its language data correctly.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

15 stars in the last 30 days