liteparse by run-llama

Fast, local document parsing and screenshotting for AI

Created 1 month ago
3,404 stars

Top 14.0% on SourcePulse

Project Summary

LiteParse is a standalone, open-source document parser designed for fast, local processing. It offers high-quality spatial text extraction with bounding boxes, which suits users who need document parsing without cloud dependencies or proprietary LLM features. It provides a flexible OCR system, supports multiple input formats, and runs entirely on the user's machine.

How It Works

LiteParse employs PDF.js for its core spatial text parsing capabilities, enabling precise text positioning. It includes a built-in, zero-setup OCR engine using Tesseract.js, with the flexibility to integrate external HTTP OCR servers like EasyOCR or PaddleOCR. The tool can also generate high-quality page screenshots, essential for LLM agents. Outputs are available in JSON or plain text formats, including bounding box data.
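To illustrate why bounding boxes matter for spatial text extraction, here is a minimal TypeScript sketch that groups positioned text items into reading-order lines. The `TextItem` shape, the top-left coordinate origin, and the `groupIntoLines` helper are all assumptions made for illustration; they are not LiteParse's actual output schema or API.

```typescript
// Hypothetical shape of one extracted text item.
// LiteParse's real JSON schema may differ.
interface TextItem {
  text: string;
  x: number; // left edge (assumed origin: top-left of the page)
  y: number; // top edge
  width: number;
  height: number;
}

// Vertical center of an item's bounding box.
function centerY(item: TextItem): number {
  return item.y + item.height / 2;
}

// Group items into lines: an item joins the current line when its vertical
// center lies within `tolerance` units of the center of that line's first item.
// Each line is then ordered left to right and joined with spaces.
function groupIntoLines(items: TextItem[], tolerance = 2): string[] {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: TextItem[][] = [];

  for (const item of sorted) {
    const current = lines[lines.length - 1];
    const anchor = current?.[0];
    if (
      current !== undefined &&
      anchor !== undefined &&
      Math.abs(centerY(item) - centerY(anchor)) <= tolerance
    ) {
      current.push(item);
    } else {
      lines.push([item]);
    }
  }

  return lines.map((line) =>
    [...line]
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join(" ")
  );
}
```

A fixed tolerance is the simplest choice here; a more robust variant might scale it with each item's height so that large and small fonts are handled consistently.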

Quick Start & Requirements

  • Primary install: Install globally with npm (npm i -g @llamaindex/liteparse). macOS and Linux users can alternatively use Homebrew (brew install llamaindex-liteparse).
  • Prerequisites: Parsing non-PDF inputs requires LibreOffice for Office documents (Word, PowerPoint, spreadsheets) and ImageMagick for image files.
  • Resource footprint: Designed to run entirely locally with no cloud dependencies.
  • Relevant pages: GitHub Repository

Highlighted Details

  • Multi-Format Input: Automatically converts Office documents (via LibreOffice) and images (via ImageMagick) to PDF for parsing, extending beyond typical PDF-only tools.
  • Screenshot Generation: Capable of creating high-quality page screenshots, crucial for visual information extraction by LLM agents.
  • Flexible OCR: Integrates Tesseract.js out-of-the-box and supports custom HTTP OCR servers, offering adaptability for different accuracy and performance needs.
  • Local Execution: Operates as a standalone binary, ensuring data privacy and offline usability.
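The multi-format behavior described above amounts to routing each input to a converter before PDF parsing. The sketch below shows one way such routing could look. The `pickConverter` function, the `Converter` type, and the extension lists are hypothetical and chosen for illustration; the formats LiteParse actually supports and how it detects them are not specified here.

```typescript
// Which external tool (if any) would convert the input to PDF first.
type Converter = "none" | "libreoffice" | "imagemagick";

// Assumed extension lists, for illustration only.
const OFFICE_EXTENSIONS = new Set(["doc", "docx", "ppt", "pptx", "xls", "xlsx"]);
const IMAGE_EXTENSIONS = new Set(["png", "jpg", "jpeg", "tiff"]);

// Returns the converter for a filename, or undefined when the
// extension is missing or not recognized.
function pickConverter(filename: string): Converter | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;

  const ext = filename.slice(dot + 1).toLowerCase();
  if (ext === "pdf") return "none"; // already a PDF, parse directly
  if (OFFICE_EXTENSIONS.has(ext)) return "libreoffice";
  if (IMAGE_EXTENSIONS.has(ext)) return "imagemagick";
  return undefined;
}
```

Returning `undefined` for unknown inputs lets the caller decide whether to fail early or attempt parsing anyway.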

Maintenance & Community

No specific details regarding maintainers, sponsorships, or community channels (like Discord/Slack) were found in the provided README.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license. This license is generally permissive and compatible with commercial use and linking within closed-source projects.

Limitations & Caveats

For highly complex documents, such as those with dense tables, multi-column layouts, charts, handwritten text, or heavily degraded scans, the cloud-based LlamaParse service is recommended and is expected to give significantly better results. Multi-format parsing also requires installing external dependencies: LibreOffice for Office documents and ImageMagick for images.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 62
  • Issues (30d): 25
  • Star history: 3,435 stars in the last 30 days
