liteparse_samples by jerryjliu

Local document parsing and AI Q&A with visual source citations

Created 3 months ago

515 stars

Top 60.1% on SourcePulse

1 Expert Loves This Project

Jerry Liu

Cofounder of LlamaIndex

Project Summary

Summary

This repository provides interactive demonstrations for LiteParse, a fast, local, model-free document parsing engine developed by LlamaIndex. It is aimed at engineers and researchers evaluating document processing solutions, and it offers direct parser comparisons, visual sourcing of extracted text on the original page, and AI-assisted querying with verifiable citations.

How It Works

LiteParse employs a model-free approach for rapid, local document analysis across various formats including PDF, DOCX, PPTX, XLSX, and images. Key innovations include side-by-side parser comparisons against established libraries like PyPDF and PyMuPDF, and a "Visual Citations" feature that enables exact keyword searches with bounding box overlays directly on source PDF pages. Additionally, a Claude Code Skill integrates LiteParse for AI-powered Q&A, generating reports with cited source pages.
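
The keyword search with bounding box overlays described above can be sketched as follows. This is a minimal illustration, not LiteParse's actual API or output schema: the `TextItem` fields, the top-left coordinate origin, the case-insensitive comparison, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextItem:
    """Hypothetical parsed text span with its position on a page (not LiteParse's real schema)."""
    page: int
    text: str
    x: float       # left edge, in page units, origin assumed at top-left
    y: float       # top edge, in page units
    width: float
    height: float


def find_citations(items: List[TextItem], keyword: str) -> List[TextItem]:
    """Return every item whose text contains the keyword (substring match, case-insensitive by assumption)."""
    needle = keyword.lower()
    if not needle:
        return []
    return [item for item in items if needle in item.text.lower()]


def to_overlay(item: TextItem, page_width: float, page_height: float) -> dict:
    """Express a box as percentages of the page so it can be drawn over a rendered image at any zoom."""
    return {
        "page": item.page,
        "left": 100 * item.x / page_width,
        "top": 100 * item.y / page_height,
        "width": 100 * item.width / page_width,
        "height": 100 * item.height / page_height,
    }


# Made-up sample data standing in for parser output on a US Letter page (612 x 792 points).
items = [
    TextItem(1, "Total revenue increased", 72, 100, 200, 12),
    TextItem(1, "Operating expenses", 72, 120, 150, 12),
    TextItem(2, "Revenue by segment", 72, 90, 180, 12),
]
hits = find_citations(items, "revenue")
overlays = [to_overlay(hit, 612, 792) for hit in hits]
```

Storing overlays as percentages rather than pixels is one plausible design choice: the same boxes stay aligned when the page image is resized in the browser.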

Quick Start & Requirements

Installation: No setup is needed to view the demos; open the pre-generated HTML files (comparison/output/comparison.html, visual_citations/output/visual-citations.html). To process custom data or regenerate the outputs, run pip install -r requirements.txt, then run the Python scripts in the respective directories (comparison/, visual_citations/). The Claude Code Skill is installed with npx skills add run-llama/liteparse_samples --skill research_docs.
Prerequisites: Python 3.9+ and dependencies listed in requirements.txt (e.g., liteparse, pypdf, pymupdf).
Links: Direct URLs for LiteParse documentation or GitHub are not provided within the text.
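
As a convenience, the "open the pre-generated HTML files" step can be scripted. The sketch below assumes it is run against a local clone of the repository; the two relative paths come from the summary above, while the helper names `existing_reports` and `open_reports` are hypothetical and not part of the project.

```python
import webbrowser
from pathlib import Path
from typing import List

# Relative paths of the pre-generated demo pages named in the summary.
REPORTS = [
    "comparison/output/comparison.html",
    "visual_citations/output/visual-citations.html",
]


def existing_reports(repo_root: str) -> List[Path]:
    """Return the demo pages that are actually present under repo_root."""
    root = Path(repo_root).resolve()
    return [root / rel for rel in REPORTS if (root / rel).is_file()]


def open_reports(repo_root: str = ".") -> int:
    """Open each available demo page in the default browser and return how many were opened."""
    pages = existing_reports(repo_root)
    for page in pages:
        webbrowser.open(page.as_uri())
    return len(pages)
```

Calling `open_reports(".")` from the repository root should open whichever of the two pages exist; a return value of 0 suggests the script was run from the wrong directory or the outputs have not been generated.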

Highlighted Details

Direct comparison of LiteParse against PyPDF and PyMuPDF using real government and financial documents.
Interactive keyword search with bounding box overlays on rendered PDF pages for precise match localization.
AI-powered question answering via Claude Code Skill, producing HTML reports with highlighted source citations.
Support for parsing PDF, DOCX, PPTX, XLSX, images, and plaintext files.
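
A side-by-side parser comparison of the kind listed above could be approximated with a small harness like the one below. It is a sketch under stated assumptions, not the repository's comparison script: the statistics collected and the `compare` function are illustrative, and no LiteParse call is shown because its API is not described in this summary. The pypdf and PyMuPDF wrappers import their libraries lazily, so the harness itself runs without either installed.

```python
from typing import Callable, Dict


def pypdf_text(path: str) -> str:
    """Extract text with pypdf (imported lazily so the harness works without it)."""
    from pypdf import PdfReader
    return "\n".join((page.extract_text() or "") for page in PdfReader(path).pages)


def pymupdf_text(path: str) -> str:
    """Extract text with PyMuPDF, which is importable as `fitz`."""
    import fitz
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def compare(path: str, extractors: Dict[str, Callable[[str], str]]) -> Dict[str, dict]:
    """Run each extractor on the same file and collect simple, comparable statistics."""
    report = {}
    for name, extract in extractors.items():
        text = extract(path)
        report[name] = {
            "chars": len(text),
            "words": len(text.split()),
            "lines": len(text.splitlines()),
        }
    return report
```

With both libraries installed, `compare("report.pdf", {"pypdf": pypdf_text, "pymupdf": pymupdf_text})` would return one statistics dictionary per parser (the file name here is a placeholder). Any other parser can be added by passing another function that maps a path to a string.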

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or roadmap are present. The project is associated with LlamaIndex.

Licensing & Compatibility

The license type is not explicitly stated.

Limitations & Caveats

The "Visual Citations" search is a simple substring match; it does not support fuzzy matching or retrieval-augmented generation (RAG). Other limitations, unsupported platforms, or known issues are not detailed.
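
To make the substring-only limitation concrete, the sketch below contrasts an exact substring test with a fuzzy alternative built on Python's standard `difflib`. The fuzzy function is purely illustrative and is not part of the demo; its name, the word-window strategy, and the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def substring_hit(query: str, text: str) -> bool:
    """Exact substring match, the behavior the summary attributes to the demo."""
    return query in text


def fuzzy_hit(query: str, text: str, threshold: float = 0.8) -> bool:
    """Illustrative alternative: compare the query to each window of words of the same length."""
    words = text.split()
    n = max(1, len(query.split()))
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        if SequenceMatcher(None, query.lower(), window.lower()).ratio() >= threshold:
            return True
    return False


text = "Total appropriations for fiscal year 2024"
exact_ok = substring_hit("appropriations", text)   # correctly spelled query
exact_typo = substring_hit("apropriations", text)  # misspelled query
fuzzy_typo = fuzzy_hit("apropriations", text)      # same misspelling, fuzzy comparison
```

Here the misspelled query fails the substring test but passes the fuzzy one, which is the kind of match the current search would miss.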

Health Check

Last Commit

3 months ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days