omniparse by adithya-s-k

Data ingestion/parsing platform for GenAI

Created 1 year ago
6,688 stars

Top 7.7% on SourcePulse

Project Summary

OmniParse is a platform designed to ingest, parse, and structure diverse data formats—from documents and multimedia to web content—making them readily compatible with Generative AI frameworks like RAG and fine-tuning. It targets developers and researchers needing a unified, local data preparation pipeline for AI applications.

How It Works

OmniParse leverages a suite of deep learning models, including Surya OCR and Florence-2 for document and image processing, and Whisper for audio/video transcription. It converts various file types into structured markdown, aiming for a high-quality, AI-friendly output. The platform is designed for local execution, fitting within a T4 GPU, and offers an interactive UI powered by Gradio.
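Once the server is running, clients interact with it over HTTP. The sketch below shows a minimal client; the route names (`/parse_document`, `/parse_media`), the `file` form field, and the default port are assumptions modeled on the parser flags mentioned below (`--documents`, `--media`, `--web`) — check the project's API docs for the exact endpoints.

```python
# Minimal client sketch for a locally running OmniParse server.
# Route names and port are assumptions; verify against the API docs.
import mimetypes


def pick_route(filename: str) -> str:
    """Map a file to a (hypothetical) parser route by MIME type."""
    mime, _ = mimetypes.guess_type(filename)
    if mime and mime.startswith(("audio/", "video/")):
        return "/parse_media"
    return "/parse_document"


def parse_file(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a file to the server and return the structured response."""
    import requests  # pip install requests

    with open(path, "rb") as f:
        resp = requests.post(base_url + pick_route(path), files={"file": f})
    resp.raise_for_status()
    return resp.json()  # structured markdown plus metadata
```

The routing helper keeps the example self-contained: documents and images go to the document parser, while audio/video files go to the media (Whisper) parser.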

Quick Start & Requirements

  • Installation: Clone the repository, create a Python 3.10 virtual environment, and run `poetry install` or `pip install -e .`
  • Docker: Pull `savatar101/omniparse:0.1` or build locally. Run with `--gpus all` if a GPU is available.
  • Prerequisites: A Linux-based OS is required. A GPU with 8-10 GB VRAM is recommended for the deep learning models.
  • Usage: Run `python server.py` with flags such as `--documents`, `--media`, and `--web` to load specific parsers.
  • Docs: API Endpoints
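The steps above can be condensed into a copy-paste sketch. The repository URL is inferred from the project name, and the exposed port is an assumption — confirm both against the project README:

```shell
# Local install (Linux, Python 3.10)
git clone https://github.com/adithya-s-k/omniparse.git
cd omniparse
python3.10 -m venv venv && source venv/bin/activate
poetry install            # or: pip install -e .

# Start the server, loading only the parsers you need
python server.py --documents --media --web

# Alternative: run the published Docker image
# (drop --gpus all on CPU-only machines; port 8000 is an assumption)
docker pull savatar101/omniparse:0.1
docker run --gpus all -p 8000:8000 savatar101/omniparse:0.1
```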

Highlighted Details

  • Supports approximately 20 file types including documents, images, audio, and video.
  • Features include table extraction, image captioning, audio/video transcription, and web crawling.
  • Designed for local execution, fitting within a T4 GPU.
  • Offers an interactive UI powered by Gradio.

Maintenance & Community

The project acknowledges contributions from Marker, Surya-OCR, Texify, and Crawl4AI. Contact is available via email at adithyaskolavi@gmail.com.

Licensing & Compatibility

OmniParse is licensed under GPL-3.0. However, the underlying Marker and Surya OCR model weights are licensed under CC-BY-NC-SA-4.0, which restricts commercial use for organizations exceeding specific revenue/funding thresholds ($5M USD gross revenue or VC funding). Commercial licenses are available to lift these restrictions.

Limitations & Caveats

A GPU with at least 8-10 GB VRAM is necessary. The PDF parser may not perfectly convert all equations to LaTeX and might struggle with non-English text. Table formatting and whitespace preservation can be inconsistent. The models used are smaller variants, potentially impacting performance compared to best-in-class models.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 30 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Paul Copplestone (cofounder of Supabase), and 4 more.

MegaParse by QuivrHQ

  • 7k stars · top 0.1% on SourcePulse
  • File parser optimized for LLM ingestion
  • Created 1 year ago · updated 6 months ago