MegaParse by QuivrHQ

File parser optimized for LLM ingestion

created 1 year ago
7,053 stars

Top 7.4% on sourcepulse

Project Summary

MegaParse is an open-source Python library designed for robust document parsing, specifically optimized for ingestion by Large Language Models (LLMs). It aims to minimize information loss across various file formats, including PDFs, Word documents, and PowerPoints, making it suitable for developers and researchers building LLM-powered applications.

How It Works

MegaParse uses specialized libraries for different document types. For image and PDF processing, it relies on the external tools Poppler and Tesseract OCR. The MegaParseVision component passes documents containing images or complex layouts to a multimodal LLM (e.g., GPT-4o, Claude 3.5) for extraction. In the project's benchmark against other parsers, it has a reported similarity ratio of 0.87.
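To make the "similarity ratio" figure more concrete, here is a minimal sketch of how such a score can be computed between a hand-written reference transcription and a parser's output. It uses Python's standard difflib.SequenceMatcher. Whether the MegaParse benchmark uses exactly this metric is an assumption, and the helper names similarity_ratio and rank_parsers are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_ratio(reference: str, candidate: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two texts are identical.

    Assumption: this is a generic text-similarity metric, not necessarily
    the one used by the MegaParse benchmark.
    """
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long inputs, so the score does not depend on text length.
    return SequenceMatcher(None, reference, candidate, autojunk=False).ratio()


def rank_parsers(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Score each parser's output against the reference, best first."""
    scores = [(name, similarity_ratio(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical reference text and parser outputs, for illustration only.
    reference = "# Report\n\n| Year | Revenue |\n| 2023 | 10 |\n"
    outputs = {
        "parser_a": "# Report\n\n| Year | Revenue |\n| 2023 | 10 |\n",
        "parser_b": "Report Year Revenue 2023 10",
    }
    for name, score in rank_parsers(reference, outputs):
        print(f"{name}: {score:.2f}")
```

A score like this rewards output that preserves the reference's characters in order, so lost table structure or dropped headers lower it.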

Quick Start & Requirements

  • Install via pip: pip install megaparse
  • Requires Python >= 3.11.
  • External dependencies: Poppler, Tesseract OCR. macOS users also need libmagic (brew install libmagic).
  • An OpenAI or Anthropic API key is required and should be placed in a .env file.
  • A Makefile is provided for local development: make dev starts the API, with its documentation available at localhost:8000/docs.
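The requirements above can be checked before installation with a short preflight script. This is a sketch using only the standard library, not part of MegaParse. The executable names (pdftoppm for Poppler, tesseract for Tesseract OCR) and the variable names OPENAI_API_KEY and ANTHROPIC_API_KEY are assumptions, since the summary only says that a key belongs in a .env file.

```python
import shutil
import sys
from pathlib import Path

# Assumed executable names: Poppler ships `pdftoppm`, Tesseract ships `tesseract`.
REQUIRED_BINARIES = {"Poppler": "pdftoppm", "Tesseract OCR": "tesseract"}
# Assumed variable names; the summary only mentions "an OpenAI or Anthropic API key".
API_KEY_NAMES = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def read_env_keys(env_path: Path) -> set[str]:
    """Return the names that have a non-empty value in a simple KEY=VALUE file."""
    keys: set[str] = set()
    if not env_path.is_file():
        return keys
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if value.strip().strip("'\""):
            keys.add(name.strip())
    return keys


def check_prerequisites(env_path: Path = Path(".env")) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: list[str] = []
    if sys.version_info < (3, 11):
        problems.append("Python >= 3.11 is required")
    for label, executable in REQUIRED_BINARIES.items():
        if shutil.which(executable) is None:
            problems.append(f"{label} not found on PATH (looked for '{executable}')")
    if not read_env_keys(env_path) & set(API_KEY_NAMES):
        problems.append(f"no API key found in {env_path}")
    return problems


if __name__ == "__main__":
    found = check_prerequisites()
    for problem in found:
        print("missing:", problem)
    if not found:
        print("all prerequisites found; next step: pip install megaparse")
```

The script does not check for libmagic on macOS, because a reliable check for a shared library differs by platform.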

Highlighted Details

  • Supports Text, PDF, PowerPoint, Excel, CSV, and Word documents.
  • Extracts Tables, Table of Contents, Headers, Footers, and Images.
  • Benchmarked against unstructured and llama_parser, showing superior performance for megaparse_vision.
  • Includes an evaluation framework for community contributions and comparisons.

Maintenance & Community

The project is hosted by QuivrHQ. The README does not describe community channels (e.g., Discord/Slack) or a roadmap.

Licensing & Compatibility

The README does not explicitly state a license. QuivrHQ projects typically use permissive licenses, but users should confirm the terms in the repository itself before commercial or closed-source integration.

Limitations & Caveats

The library is marked as "In Construction", with plans to improve table checking and add structured output capabilities. Its reliance on external binaries (Poppler, Tesseract) adds setup complexity.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

730 stars in the last 90 days
