liteparse_samples  by jerryjliu

Local document parsing and AI Q&A with visual source citations

Created 1 week ago

361 stars

Top 77.8% on SourcePulse

Summary

This repository provides interactive demonstrations for LiteParse, a fast, local, model-free document parsing engine developed by LlamaIndex. It targets engineers and researchers evaluating document processing tools, and offers direct parser comparisons, precise visual sourcing of extracted text, and AI-assisted querying with verifiable citations.

How It Works

LiteParse parses documents locally and without models, covering formats including PDF, DOCX, PPTX, XLSX, and images. The samples include a side-by-side comparison of LiteParse against established libraries such as PyPDF and PyMuPDF, and a "Visual Citations" feature that runs exact keyword searches and draws bounding box overlays directly on the source PDF pages. A Claude Code Skill also integrates LiteParse for AI-powered Q&A, generating reports that cite source pages.
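The keyword-to-bounding-box idea behind "Visual Citations" can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `TextItem` structure, the `find_matches` helper, and the case-insensitive comparison are all assumptions made for the example, and the real LiteParse output format may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextItem:
    """Hypothetical parsed text span with its bounding box on the page."""
    text: str
    x: float
    y: float
    width: float
    height: float


def find_matches(items: List[TextItem], keyword: str) -> List[Tuple[float, float, float, float]]:
    """Return the bounding box of every item whose text contains the keyword."""
    needle = keyword.lower()  # assumption: matching ignores case
    if not needle:
        return []
    return [
        (item.x, item.y, item.width, item.height)
        for item in items
        if needle in item.text.lower()
    ]


# Made-up sample items standing in for one parsed page.
items = [
    TextItem("Total revenue", 72, 100, 90, 12),
    TextItem("Net income", 72, 120, 70, 12),
]
boxes = find_matches(items, "revenue")  # one box, for the "Total revenue" item
```

Each returned tuple could then be drawn as an overlay rectangle on the rendered page. Note that matching per item, as done here, would miss a keyword that spans two separate text items.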

Quick Start & Requirements

  • Installation: The pre-generated demos need no installation; open the HTML files (comparison/output/comparison.html, visual_citations/output/visual-citations.html) directly. To process custom data or regenerate the output, run pip install -r requirements.txt, then run the Python scripts in the respective directories (comparison/, visual_citations/). The Claude Code Skill installs via npx skills add run-llama/liteparse_samples --skill research_docs.
  • Prerequisites: Python 3.9+ and dependencies listed in requirements.txt (e.g., liteparse, pypdf, pymupdf).
  • Links: Direct URLs for LiteParse documentation or GitHub are not provided within the text.
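
As a convenience, the pre-generated demo pages could be located and opened from Python. This is only a sketch using the standard library: the two relative paths come from the text above, while the helper names `find_demo_pages` and `open_demo_pages` and the assumption that the script runs against a local clone of the repository are illustrative.

```python
from pathlib import Path
import webbrowser

# Relative paths of the pre-generated demos, as listed above.
DEMO_PAGES = [
    Path("comparison/output/comparison.html"),
    Path("visual_citations/output/visual-citations.html"),
]


def find_demo_pages(repo_root="."):
    """Return the demo HTML files that exist under repo_root."""
    root = Path(repo_root)
    return [root / page for page in DEMO_PAGES if (root / page).is_file()]


def open_demo_pages(repo_root="."):
    """Open each existing demo page in the default browser and return the paths."""
    pages = find_demo_pages(repo_root)
    for page in pages:
        webbrowser.open(page.resolve().as_uri())
    return pages
```

Calling `open_demo_pages("path/to/liteparse_samples")` from a local clone would open whichever of the two pages is present; pages that are missing are skipped silently.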

Highlighted Details

  • Direct comparison of LiteParse against PyPDF and PyMuPDF using real government and financial documents.
  • Interactive keyword search with bounding box overlays on rendered PDF pages for precise match localization.
  • AI-powered question answering via Claude Code Skill, producing HTML reports with highlighted source citations.
  • Support for parsing PDF, DOCX, PPTX, XLSX, images, and plaintext files.

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or roadmap are present. The project is associated with LlamaIndex.

Licensing & Compatibility

The license type is not explicitly stated.

Limitations & Caveats

The "Visual Citations" search is a simple substring match; it does not support fuzzy matching or retrieval-augmented generation (RAG). Other limitations, unsupported platforms, and known issues are not detailed.
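
To make the substring-only limitation concrete, the sketch below contrasts a plain substring test with a simple fuzzy test built on Python's standard `difflib`. Neither function comes from the repository; `substring_match`, `fuzzy_match`, and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def substring_match(text: str, query: str) -> bool:
    """Exact substring test (case-insensitive here, as an assumption)."""
    return query.lower() in text.lower()


def fuzzy_match(text: str, query: str, threshold: float = 0.8) -> bool:
    """Slide a query-sized window over the text and accept a close-enough window."""
    text, query = text.lower(), query.lower()
    n = len(query)
    if n == 0 or len(text) < n:
        return False
    best = max(
        SequenceMatcher(None, text[i:i + n], query).ratio()
        for i in range(len(text) - n + 1)
    )
    return best >= threshold


page_text = "Total revenue for fiscal year 2023"
exact_hit = substring_match(page_text, "revenue")  # correctly spelled query
typo_exact = substring_match(page_text, "reveneu")  # transposed letters
typo_fuzzy = fuzzy_match(page_text, "reveneu")      # same typo, fuzzy test
```

With a substring-only search, the misspelled query finds nothing, whereas a fuzzy comparison would still locate the intended word. A query with OCR errors or spelling variants would behave the same way in a substring-based demo.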

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 3
  • Issues (30d): 0
  • Star history: 363 stars in the last 9 days
