liteparse by run-llama

Fast, local document parsing and screenshotting for AI

Created 5 months ago

11,471 stars

Top 4.6% on SourcePulse

7 Experts Love This Project

Travis Fischer

Founder of Agentic

Luis Capelo

Cofounder of Lightning AI

Didier Lopes

Founder of OpenBB

Dan Guido

Cofounder of Trail of Bits

and 3 more!

Project Summary

A standalone, open-source document parser designed for fast, local processing. LiteParse offers high-quality spatial text extraction with bounding boxes, making it suitable for users who require document parsing without cloud dependencies or proprietary LLM features. It provides a flexible OCR system and supports multiple input formats, running entirely on the user's machine.

How It Works

LiteParse employs PDF.js for its core spatial text parsing capabilities, enabling precise text positioning. It includes a built-in, zero-setup OCR engine using Tesseract.js, with the flexibility to integrate external HTTP OCR servers like EasyOCR or PaddleOCR. The tool can also generate high-quality page screenshots, essential for LLM agents. Outputs are available in JSON or plain text formats, including bounding box data.
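The spatial extraction step described above can be illustrated with a small sketch. It assumes text items shaped like those PDF.js returns from page.getTextContent() (a string, a six-element transform matrix, a width, and a height). The names BoxedText, toBoundingBox, and inReadingOrder are hypothetical and the output shape is not LiteParse's actual JSON schema. It also assumes unrotated text and ignores font descent, so treat it as an approximation rather than the project's implementation.

```typescript
// Shape modeled on PDF.js text items; only the fields used here are declared.
interface TextItem {
  str: string;
  // [a, b, c, d, e, f]: e and f give the text origin in PDF units,
  // measured from the bottom-left corner of the page.
  transform: [number, number, number, number, number, number];
  width: number;
  height: number;
}

// Hypothetical output record (an assumption, not LiteParse's schema).
interface BoxedText {
  text: string;
  x: number; // left edge, top-left origin
  y: number; // top edge, top-left origin
  width: number;
  height: number;
}

// Convert a bottom-left-origin PDF position into a top-left-origin box.
function toBoundingBox(item: TextItem, pageHeight: number): BoxedText {
  const x = item.transform[4];
  const baselineY = item.transform[5];
  return {
    text: item.str,
    x,
    y: pageHeight - baselineY - item.height,
    width: item.width,
    height: item.height,
  };
}

// Order boxes top to bottom, then left to right within a row.
// Rows are formed by bucketing y into bands of `rowHeight` units.
function inReadingOrder(boxes: BoxedText[], rowHeight = 2): BoxedText[] {
  const row = (b: BoxedText): number => Math.round(b.y / rowHeight);
  return boxes.slice().sort((p, q) => row(p) - row(q) || p.x - q.x);
}
```

Flipping the y axis up front means downstream consumers, such as a screenshot overlay or an LLM prompt, can share one coordinate convention with image pixels.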

Quick Start & Requirements

Primary install: Install globally via npm with npm i -g @llamaindex/liteparse. On macOS or Linux, brew install llamaindex-liteparse is an alternative.
Prerequisites: Multi-format support requires LibreOffice for Office documents (Word, PowerPoint, spreadsheets) and ImageMagick for image files.
Resource footprint: Designed to run entirely locally with no cloud dependencies.

Highlighted Details

Multi-Format Input: Automatically converts Office documents (via LibreOffice) and images (via ImageMagick) to PDF for parsing, extending beyond typical PDF-only tools.
Screenshot Generation: Capable of creating high-quality page screenshots, crucial for visual information extraction by LLM agents.
Flexible OCR: Integrates Tesseract.js out-of-the-box and supports custom HTTP OCR servers, offering adaptability for different accuracy and performance needs.
Local Execution: Operates as a standalone binary, ensuring data privacy and offline usability.
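The multi-format step listed above (Office documents and images converted to PDF before parsing) can be sketched as a routing function that picks an external converter by file extension. This is an illustration only and not LiteParse's actual code: planConversion, the Conversion type, and the extension lists are hypothetical. The command lines use LibreOffice's headless conversion flags and the ImageMagick 7 entry point magick (ImageMagick 6 installs use convert instead).

```typescript
// Hypothetical plan for turning an input file into a PDF.
type Conversion =
  | { kind: "none" } // already a PDF, parse directly
  | { kind: "convert"; command: string; args: string[] };

// Illustrative extension lists; the formats LiteParse accepts may differ.
const OFFICE_EXTENSIONS = new Set(["doc", "docx", "ppt", "pptx", "xls", "xlsx"]);
const IMAGE_EXTENSIONS = new Set(["png", "jpg", "jpeg", "tiff", "bmp"]);

function planConversion(inputPath: string, outDir: string): Conversion {
  const dot = inputPath.lastIndexOf(".");
  const ext = dot === -1 ? "" : inputPath.slice(dot + 1).toLowerCase();

  if (ext === "pdf") {
    return { kind: "none" };
  }
  if (OFFICE_EXTENSIONS.has(ext)) {
    // LibreOffice writes <basename>.pdf into outDir.
    return {
      kind: "convert",
      command: "soffice",
      args: ["--headless", "--convert-to", "pdf", "--outdir", outDir, inputPath],
    };
  }
  if (IMAGE_EXTENSIONS.has(ext)) {
    // ImageMagick takes an explicit output path; derive it from the base name.
    const baseName = inputPath.slice(inputPath.lastIndexOf("/") + 1, dot);
    return {
      kind: "convert",
      command: "magick",
      args: [inputPath, `${outDir}/${baseName}.pdf`],
    };
  }
  throw new Error(`Unsupported input format: ${inputPath}`);
}
```

Returning the command and arguments as data, rather than spawning the process inside the function, keeps the routing logic testable on machines where LibreOffice or ImageMagick is not installed.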

Maintenance & Community

The README does not list maintainers, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license. This license is generally permissive and compatible with commercial use and linking within closed-source projects.

Limitations & Caveats

For highly complex documents, such as those with dense tables, multi-column layouts, charts, handwritten text, or low-quality scans, the project recommends the cloud-based LlamaParse service for significantly better results. Multi-format parsing also depends on external tools that must be installed separately: LibreOffice for Office documents and ImageMagick for images.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive

Star History

1,634 stars in the last 30 days