omniparse by adithya-s-k

Data ingestion/parsing platform for GenAI

Created 1 year ago
6,803 stars

Top 7.5% on SourcePulse

1 Expert Loves This Project
Project Summary

OmniParse is a platform designed to ingest, parse, and structure diverse data formats—from documents and multimedia to web content—making them readily compatible with Generative AI workflows such as RAG and fine-tuning. It targets developers and researchers who need a unified, local data preparation pipeline for AI applications.

How It Works

OmniParse leverages a suite of deep learning models, including Surya OCR and Florence-2 for document and image processing, and Whisper for audio/video transcription. It converts various file types into structured markdown, aiming for a high-quality, AI-friendly output. The platform is designed for local execution, fitting within a T4 GPU, and offers an interactive UI powered by Gradio.
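Since the platform runs as a local server with HTTP endpoints, a parse request can be issued from the command line. A minimal sketch, assuming the server listens on port 8000 and exposes a document-parsing route (both the port and the endpoint path are assumptions; check the API Endpoints docs for the actual routes):

```shell
# Start the server with only the document parser loaded
python server.py --documents

# POST a PDF and receive structured markdown back.
# Endpoint path and port are assumptions, not confirmed by this summary.
curl -X POST http://localhost:8000/parse_document \
  -F "file=@report.pdf"
```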

Quick Start & Requirements

  • Installation: Clone the repository, create a Python 3.10 virtual environment, and install dependencies with poetry install (or pip install -e .).
  • Docker: Pull savatar101/omniparse:0.1 or build locally. Run with --gpus all if a GPU is available.
  • Prerequisites: Linux-based OS is required. A GPU with 8-10 GB VRAM is recommended for deep learning models.
  • Usage: Run python server.py with flags like --documents, --media, --web to load specific parsers.
  • Docs: API Endpoints
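The steps above can be sketched end to end as follows; the repository URL is inferred from the author and project name shown above, and the Docker port mapping is an assumption:

```shell
# Clone the repository and enter it
git clone https://github.com/adithya-s-k/omniparse.git
cd omniparse

# Create and activate a Python 3.10 virtual environment
python3.10 -m venv venv
source venv/bin/activate

# Install dependencies (either command works)
poetry install        # or: pip install -e .

# Start the server with the document, media, and web parsers loaded
python server.py --documents --media --web

# Alternatively, run the prebuilt Docker image (port mapping assumed)
docker run --gpus all -p 8000:8000 savatar101/omniparse:0.1
```

Omit --gpus all when no GPU is available; the deep learning parsers will be slower but the CLI flags are unchanged.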

Highlighted Details

  • Supports approximately 20 file types including documents, images, audio, and video.
  • Features include table extraction, image captioning, audio/video transcription, and web crawling.
  • Designed for local execution, fitting within a T4 GPU.
  • Offers an interactive UI powered by Gradio.

Maintenance & Community

The project acknowledges contributions from Marker, Surya-OCR, Texify, and Crawl4AI. Contact is available via email at adithyaskolavi@gmail.com.

Licensing & Compatibility

OmniParse is licensed under GPL-3.0. However, the underlying Marker and Surya OCR model weights are licensed under CC-BY-NC-SA-4.0, which restricts commercial use by organizations exceeding specific revenue/funding thresholds ($5M USD gross revenue or VC funding). Commercial licenses are available to lift these restrictions.

Limitations & Caveats

A GPU with at least 8-10 GB VRAM is necessary. The PDF parser may not perfectly convert all equations to LaTeX and might struggle with non-English text. Table formatting and whitespace preservation can be inconsistent. The models used are smaller variants, potentially impacting performance compared to best-in-class models.

Health Check
Last Commit

4 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
9 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Paul Copplestone (cofounder of Supabase), and 4 more.

MegaParse by QuivrHQ

0.1%
7k
File parser optimized for LLM ingestion
Created 1 year ago
Updated 1 year ago