omniparse by adithya-s-k

Data ingestion/parsing platform for GenAI

Created 1 year ago

6,772 stars

Top 7.5% on SourcePulse

View on GitHub

1 Expert Loves This Project

Tobi Lutke

Cofounder of Shopify

Project Summary

OmniParse is a platform designed to ingest, parse, and structure diverse data formats—from documents and multimedia to web content—making them readily compatible with Generative AI frameworks like RAG and fine-tuning. It targets developers and researchers needing a unified, local data preparation pipeline for AI applications.

How It Works

OmniParse leverages a suite of deep learning models, including Surya OCR and Florence-2 for document and image processing, and Whisper for audio/video transcription. It converts various file types into structured markdown, aiming for a high-quality, AI-friendly output. The platform is designed for local execution, fitting within a T4 GPU, and offers an interactive UI powered by Gradio.

Quick Start & Requirements

Installation: Clone the repository, create a Python 3.10 virtual environment, and run poetry install or pip install -e ..
Docker: Pull savatar101/omniparse:0.1 or build locally. Run with --gpus all if a GPU is available.
Prerequisites: Linux-based OS is required. A GPU with 8-10 GB VRAM is recommended for deep learning models.
Usage: Run python server.py with flags like --documents, --media, --web to load specific parsers.
Docs: API Endpoints

Highlighted Details

Supports approximately 20 file types including documents, images, audio, and video.
Features include table extraction, image captioning, audio/video transcription, and web crawling.
Designed for local execution, fitting within a T4 GPU.
Offers an interactive UI powered by Gradio.

Maintenance & Community

The project acknowledges contributions from Marker, Surya-OCR, Texify, and Crawl4AI. Contact is available via email at adithyaskolavi@gmail.com.

Licensing & Compatibility

OmniParse is licensed under GPL-3.0. However, the underlying Marker and Surya OCR models have weights licensed under CC-BY-NC-SA-4.0, which restricts commercial use for organizations exceeding specific revenue/funding thresholds ($5M USD gross revenue or VC funding). Commercial licenses are available for these restrictions.

Limitations & Caveats

A GPU with at least 8-10 GB VRAM is necessary. The PDF parser may not perfectly convert all equations to LaTeX and might struggle with non-English text. Table formatting and whitespace preservation can be inconsistent. The models used are smaller variants, potentially impacting performance compared to best-in-class models.

Health Check

Last Commit

1 month ago

Responsiveness

1 day

Pull Requests (30d)

Issues (30d)

Star History

37 stars in the last 30 days