MegaParse by QuivrHQ

File parser optimized for LLM ingestion

Created 1 year ago

7,252 stars

Top 7.0% on SourcePulse

6 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Paul Copplestone

Cofounder of Supabase

Dan Guido

Cofounder of Trail of Bits

Elvis Saravia

Founder of DAIR.AI

and 2 more!

Project Summary

MegaParse is an open-source Python library designed for robust document parsing, specifically optimized for ingestion by Large Language Models (LLMs). It aims to minimize information loss across various file formats, including PDFs, Word documents, and PowerPoints, making it suitable for developers and researchers building LLM-powered applications.

How It Works

MegaParse uses specialized libraries for each document type. For image and PDF processing, it relies on the external tools Poppler and Tesseract OCR. The MegaParseVision component uses multimodal LLMs (e.g., GPT-4o, Claude 3.5) to extract information from documents containing images or complex layouts. It has a reported similarity ratio of 0.87 in the project's benchmark, which compares it with other parsers.
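The summary does not say how the similarity ratio is computed. As a rough illustration of what such a score can mean, the sketch below compares a parser's output with a reference transcription using Python's standard difflib module. The function name similarity_ratio and the sample strings are hypothetical, and the choice of difflib is an assumption rather than a confirmed detail of the project's benchmark.

```python
from difflib import SequenceMatcher


def similarity_ratio(parsed: str, reference: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two strings are identical.

    Hypothetical helper: difflib is assumed here, the project's benchmark
    may use a different metric.
    """
    return SequenceMatcher(None, parsed, reference).ratio()


# Made-up sample data: a reference Markdown table and a parser output
# that lost the table's separator row.
reference = "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
parsed = "# Title\n\n| a | b |\n| 1 | 2 |\n"

score = similarity_ratio(parsed, reference)
print(f"similarity: {score:.2f}")
```

A score near 1.0 indicates that little text was lost or altered relative to the reference; a character-level metric like this one says nothing about whether the document's structure was understood correctly.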

Quick Start & Requirements

Install via pip: pip install megaparse
Requires Python >= 3.11.
External dependencies: Poppler, Tesseract OCR. macOS users also need libmagic (brew install libmagic).
An OpenAI or Anthropic API key is required and should be placed in a .env file.
A Makefile is provided for local development (make dev), exposing API endpoints at localhost:8000/docs.
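The requirements above can be checked before installation with a short script. This is a minimal sketch using only the standard library, not part of MegaParse. It assumes Poppler is detectable through its pdftoppm binary and Tesseract through the tesseract binary, and that the API keys use the conventional variable names OPENAI_API_KEY and ANTHROPIC_API_KEY. It inspects the process environment only, so keys stored in a .env file must already have been loaded.

```python
import os
import shutil
import sys

REQUIRED_PYTHON = (3, 11)
# Assumed binary names: pdftoppm ships with Poppler, tesseract with Tesseract OCR.
REQUIRED_BINARIES = {"Poppler": "pdftoppm", "Tesseract OCR": "tesseract"}
# Assumed environment variable names for the two supported providers.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def check_environment(env=None, which=shutil.which, version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []

    if version < REQUIRED_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"{REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]}+ required"
        )

    for label, binary in REQUIRED_BINARIES.items():
        if which(binary) is None:
            problems.append(f"{label} not found on PATH (looked for '{binary}')")

    # One key from either provider is enough.
    if not any(env.get(name) for name in API_KEY_VARS):
        problems.append("no API key set: " + " or ".join(API_KEY_VARS))

    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("missing:", issue)
    print("ready" if not issues else f"{len(issues)} problem(s) found")
```

The env, which, and version parameters exist only so the check can be exercised without touching the real system; calling check_environment() with no arguments inspects the current machine.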

Highlighted Details

Supports Text, PDF, PowerPoint, Excel, CSV, and Word documents.
Extracts Tables, Table of Contents, Headers, Footers, and Images.
Benchmarked against unstructured and llama_parser, with megaparse_vision reported as the best performer.
Includes an evaluation framework for community contributions and comparisons.

Maintenance & Community

The project is hosted by QuivrHQ. The README does not describe community channels (e.g., Discord/Slack) or a roadmap.

Licensing & Compatibility

The README does not explicitly state a license. QuivrHQ typically uses permissive licenses, but users should verify the repository's license before commercial or closed-source integration.

Limitations & Caveats

The library is marked as "In Construction" with plans to improve table checking and add structured output capabilities. The reliance on external binaries (Poppler, Tesseract) adds to the setup complexity.

Health Check

Last Commit

10 months ago

Responsiveness

Inactive

Star History

31 stars in the last 30 days