open-parse by Filimoa

File parser for improved LLM document chunking

created 1 year ago
3,033 stars

Top 16.1% on sourcepulse

Project Summary

Open-Parse is a Python library that improves document chunking for Retrieval-Augmented Generation (RAG) systems by visually analyzing document layouts. It targets developers building AI applications who need to process complex documents such as PDFs, and it offers a more semantically aware approach than traditional text splitters or basic layout parsers. The primary benefit is higher-quality chunking that preserves document structure for more effective AI processing.

How It Works

Open-Parse employs a visually-driven approach, analyzing document layouts to group related content semantically. Unlike basic text splitters that discard structural information, or ML layout parsers that focus on element identification but not grouping, Open-Parse aims to intelligently chunk documents by understanding headings, sections, tables, and other structural elements. It supports basic markdown parsing and high-precision table extraction, converting tables into clean Markdown formats.
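To make the idea of layout-driven grouping concrete, here is a minimal, self-contained sketch. It is not Open-Parse's actual algorithm or API: the `TextElement` and `Chunk` types, the `group_by_heading` function, and the 14-point heading threshold are all hypothetical names and values chosen for illustration. It assumes each extracted text element carries a font size and a vertical position, and treats any element at or above the threshold as a heading that starts a new chunk.

```python
from dataclasses import dataclass, field


@dataclass
class TextElement:
    """Hypothetical stand-in for one piece of text extracted from a page."""
    text: str
    font_size: float  # point size reported by the extractor (assumed available)
    top: float        # vertical position on the page; smaller means higher up


@dataclass
class Chunk:
    """A heading plus the body text that visually follows it."""
    heading: str
    body: list[str] = field(default_factory=list)


def group_by_heading(elements, heading_min_size=14.0):
    """Group elements into chunks, starting a new chunk at each heading.

    An element counts as a heading when its font size is at least
    `heading_min_size` (an illustrative threshold, not a library default).
    """
    chunks = []
    current = Chunk(heading="")
    # Read the page top to bottom.
    for el in sorted(elements, key=lambda e: e.top):
        if el.font_size >= heading_min_size:
            # Close the previous chunk if it holds anything.
            if current.heading or current.body:
                chunks.append(current)
            current = Chunk(heading=el.text)
        else:
            current.body.append(el.text)
    if current.heading or current.body:
        chunks.append(current)
    return chunks


elements = [
    TextElement("Introduction", 18.0, 10),
    TextElement("Open-Parse groups related content.", 11.0, 30),
    TextElement("Methods", 18.0, 60),
    TextElement("Layout is analyzed visually.", 11.0, 80),
    TextElement("Tables become Markdown.", 11.0, 95),
]
chunks = group_by_heading(elements)
for chunk in chunks:
    print(chunk.heading, "->", chunk.body)
```

A plain text splitter would cut this input at a fixed character count and could separate "Methods" from its two body lines; grouping on layout cues keeps each heading with the text beneath it.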

Quick Start & Requirements

Highlighted Details

  • Visually analyzes documents for superior LLM input.
  • Supports basic markdown parsing for headings, bold, and italics.
  • High-precision table extraction into Markdown formats.
  • Extensible with custom post-processing steps.
  • Utilizes pydantic for easy serialization of results.
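
The last two highlights, custom post-processing steps and serializable results, can be pictured with a small sketch. Everything here is an assumption made for illustration: the `Node` type, the two step functions, and `run_pipeline` are hypothetical and do not reproduce Open-Parse's own classes, and the standard-library `dataclasses` and `json` modules stand in for pydantic so the example runs without extra dependencies.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Node:
    """Hypothetical parsed chunk: some text and the page it came from."""
    text: str
    page: int


def drop_short_nodes(nodes, min_chars=20):
    """Post-processing step: discard fragments such as stray page labels."""
    return [n for n in nodes if len(n.text) >= min_chars]


def merge_same_page(nodes):
    """Post-processing step: join consecutive nodes that share a page."""
    merged = []
    for n in nodes:
        if merged and merged[-1].page == n.page:
            merged[-1] = Node(merged[-1].text + "\n" + n.text, n.page)
        else:
            merged.append(n)
    return merged


def run_pipeline(nodes, steps):
    """Apply each step in order; every step maps a node list to a node list."""
    for step in steps:
        nodes = step(nodes)
    return nodes


nodes = [
    Node("Page 1", 1),  # too short, removed by drop_short_nodes
    Node("Open-Parse groups related content.", 1),
    Node("Tables are converted to Markdown.", 1),
    Node("Results serialize cleanly.", 2),
]
result = run_pipeline(nodes, [drop_short_nodes, merge_same_page])

# Serialize the processed nodes to JSON for storage or indexing.
payload = json.dumps([asdict(n) for n in result], indent=2)
print(payload)
```

Because each step takes and returns a list of nodes, steps can be reordered or replaced independently, which is the property that makes a post-processing pipeline extensible.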

Maintenance & Community

The project mentions sponsors and encourages reaching out for special use cases. Links to cookbooks and documentation are provided.

Licensing & Compatibility

The core library is fully open source. Table extraction relies on PyMuPDF, which is distributed under its own license. The README also notes that table detection uses table-transformers, describes their performance as subpar, and raises the possibility of unitable adding support for better models.

Limitations & Caveats

Table detection currently relies on table-transformers, which the README describes as performing poorly enough to affect downstream results. Users who need OCR must install and correctly configure Tesseract-OCR and its language data.
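
Since OCR depends on a working Tesseract install, a quick pre-flight check can save debugging time. The sketch below is an assumption-laden helper, not part of Open-Parse: the function name `tesseract_languages` is hypothetical, and it assumes the executable is named `tesseract` and is on `PATH`. It uses Tesseract's `--list-langs` option to report which language data files are available.

```python
import shutil
import subprocess


def tesseract_languages(binary="tesseract"):
    """Return the installed Tesseract language codes, or None if not found.

    `binary` is the executable name to look up on PATH (assumed to be
    "tesseract" on most systems).
    """
    path = shutil.which(binary)
    if path is None:
        return None  # Tesseract is not installed or not on PATH

    proc = subprocess.run(
        [path, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions print the list to stderr, so read both streams.
    lines = (proc.stdout + "\n" + proc.stderr).splitlines()
    # Skip blank lines and the "List of available languages ..." header.
    return [
        line.strip()
        for line in lines
        if line.strip() and not line.startswith("List of")
    ]


langs = tesseract_languages()
if langs is None:
    print("Tesseract not found on PATH")
else:
    print("Installed language data:", langs)
```

If the function returns `None`, or the language you need (for example `eng`) is missing from the list, the OCR path is unlikely to work until the install is fixed.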

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

100 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Paul Copplestone (cofounder of Supabase), and 2 more.

MegaParse by QuivrHQ

File parser optimized for LLM ingestion

  • 0.5%
  • 7k
  • created 1 year ago
  • updated 5 months ago