omniparse  by adithya-s-k

Data ingestion/parsing platform for GenAI

Created 1 year ago · 6,656 stars

Top 7.8% on sourcepulse

Project Summary

OmniParse is a platform designed to ingest, parse, and structure diverse data formats—from documents and multimedia to web content—making them readily usable in Generative AI workflows such as retrieval-augmented generation (RAG) and fine-tuning. It targets developers and researchers who need a unified, local data-preparation pipeline for AI applications.

How It Works

OmniParse leverages a suite of deep learning models, including Surya OCR and Florence-2 for document and image processing, and Whisper for audio/video transcription. It converts various file types into structured markdown, aiming for high-quality, AI-friendly output. The platform is designed for local execution, fits on a single T4 GPU, and offers an interactive UI powered by Gradio.

Quick Start & Requirements

  • Installation: Clone the repository, create a Python 3.10 virtual environment, and run poetry install (or pip install -e .).
  • Docker: Pull savatar101/omniparse:0.1 or build locally. Run with --gpus all if a GPU is available.
  • Prerequisites: Linux-based OS is required. A GPU with 8-10 GB VRAM is recommended for deep learning models.
  • Usage: Run python server.py with flags like --documents, --media, --web to load specific parsers.
  • Docs: API Endpoints
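Once the server is running locally, parsed output can be requested over HTTP. A minimal Python sketch, assuming the default port 8000 and route names of the form /parse_<kind> (both are assumptions; check the project's API-endpoint docs for your version):

```python
import mimetypes
from pathlib import Path

# Assumed default address of a locally running `python server.py`.
OMNIPARSE_URL = "http://localhost:8000"


def parse_endpoint(kind: str = "document") -> str:
    """Build the URL for one of the assumed parser routes
    (e.g. document, media, website)."""
    return f"{OMNIPARSE_URL}/parse_{kind}"


def multipart_file(path: str) -> dict:
    """Build the multipart `files` mapping for an HTTP file upload."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    return {"file": (Path(path).name, open(path, "rb"), mime)}


# Usage (requires a running server and `pip install requests`):
# import requests
# resp = requests.post(parse_endpoint("document"),
#                      files=multipart_file("report.pdf"))
# markdown = resp.json()
```

This is a client-side sketch only; the exact route names and response shape should be confirmed against the API Endpoints documentation linked above.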

Highlighted Details

  • Supports approximately 20 file types including documents, images, audio, and video.
  • Features include table extraction, image captioning, audio/video transcription, and web crawling.
  • Designed for local execution on a single T4 GPU.
  • Offers an interactive UI powered by Gradio.
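Since the summary positions the structured markdown output as RAG-ready, a downstream step is usually to chunk it for an index. A minimal, illustrative chunker (the heading-aligned splitting strategy here is my own sketch, not part of OmniParse):

```python
import re


def chunk_markdown(md: str, max_chars: int = 1500) -> list[str]:
    """Split markdown into heading-aligned chunks for a RAG index.

    Sections start at lines beginning with '#'..'######'; sections are
    packed greedily into chunks of at most max_chars where possible.
    """
    # Zero-width split just before each heading line.
    sections = re.split(r"(?m)^(?=#{1,6} )", md)
    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) > max_chars:
            chunks.append(buf)
            buf = ""
        buf += sec
    if buf:
        chunks.append(buf)
    return [c.strip() for c in chunks if c.strip()]
```

Any real splitter (token-aware, overlap-aware) would work equally well; the point is only that heading-structured markdown makes this step trivial.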

Maintenance & Community

The project builds on and credits Marker, Surya-OCR, Texify, and Crawl4AI. Contact is available via email at adithyaskolavi@gmail.com.

Licensing & Compatibility

OmniParse is licensed under GPL-3.0. However, the underlying Marker and Surya OCR model weights are licensed under CC-BY-NC-SA-4.0, which restricts commercial use by organizations exceeding specific revenue/funding thresholds ($5M USD gross revenue or VC funding). Commercial licenses are available to lift these restrictions.

Limitations & Caveats

A GPU with at least 8-10 GB VRAM is necessary. The PDF parser may not perfectly convert all equations to LaTeX and might struggle with non-English text. Table formatting and whitespace preservation can be inconsistent. The models used are smaller variants, potentially impacting performance compared to best-in-class models.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star history: 175 stars in the last 90 days

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.
