Data ingestion/parsing platform for GenAI
OmniParse is a platform designed to ingest, parse, and structure diverse data formats, from documents and multimedia to web content, making them ready for Generative AI workflows such as RAG and fine-tuning. It targets developers and researchers who need a unified, local data-preparation pipeline for AI applications.
How It Works
OmniParse leverages a suite of deep learning models, including Surya OCR and Florence-2 for document and image processing, and Whisper for audio/video transcription. It converts various file types into structured markdown, aiming for a high-quality, AI-friendly output. The platform is designed for local execution, fitting within a T4 GPU, and offers an interactive UI powered by Gradio.
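As a rough sketch of that flow (the endpoint path, port, and response field below are assumptions for illustration, not the confirmed API), a client could post a file to a locally running server and get structured markdown back:

```python
import requests

# Hypothetical client for a locally running OmniParse server.
# Endpoint path, port, and response shape are assumed for illustration.
OMNIPARSE_URL = "http://localhost:8000"

def parse_to_markdown(path: str) -> str:
    """Upload a document and return the markdown the server extracts."""
    with open(path, "rb") as f:
        resp = requests.post(f"{OMNIPARSE_URL}/parse_document", files={"file": f})
    resp.raise_for_status()
    # Assumed field name; the real payload may also carry images and metadata.
    return resp.json().get("markdown", "")

if __name__ == "__main__":
    print(parse_to_markdown("report.pdf"))
```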
Quick Start & Requirements
Install via poetry install or an editable pip install (pip install -e .). Alternatively, pull the Docker image savatar101/omniparse:0.1 or build it locally, running the container with --gpus all if a GPU is available. Start the server with python server.py, passing flags such as --documents, --media, or --web to load specific parsers.
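Once the server is running, the loaded parsers can be exercised over HTTP. The sketch below is a hedged example: the /parse_media/audio and /parse_website routes and the url parameter are assumptions, not documented behavior.

```python
import requests

BASE = "http://localhost:8000"  # assumed default host/port for server.py

# Transcribe an audio file via the media parser (route name is an assumption).
with open("meeting.mp3", "rb") as f:
    audio = requests.post(f"{BASE}/parse_media/audio", files={"file": f})
audio.raise_for_status()
print(audio.json())

# Parse a web page via the web parser (route and parameter are assumptions).
page = requests.post(f"{BASE}/parse_website", json={"url": "https://example.com"})
page.raise_for_status()
print(page.json())
```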
Highlighted Details
Maintenance & Community
The project acknowledges contributions from Marker, Surya-OCR, Texify, and Crawl4AI. Contact is available via email at adithyaskolavi@gmail.com.
Licensing & Compatibility
OmniParse is licensed under GPL-3.0. However, the underlying Marker and Surya OCR model weights are licensed under CC-BY-NC-SA-4.0, which restricts commercial use by organizations exceeding specific revenue or funding thresholds ($5M USD gross revenue or VC funding). Commercial licenses are available to lift these restrictions.
Limitations & Caveats
A GPU with at least 8-10 GB VRAM is necessary. The PDF parser may not perfectly convert all equations to LaTeX and might struggle with non-English text. Table formatting and whitespace preservation can be inconsistent. The models used are smaller variants, potentially impacting performance compared to best-in-class models.