mineru-tianshu by magicyuan876

AI data preprocessing platform for multimodal unstructured data

Created 1 year ago

501 stars

Top 62.1% on SourcePulse

Project Summary

MinerU Tianshu (天枢) is an enterprise-grade data preprocessing platform that converts unstructured, multimodal data into structured, AI-ready formats. It targets engineers, researchers, and power users who need a single pipeline for documents (PDF, Office), images, audio, video, and specialized biological formats. Its main selling points are GPU-accelerated processing, automatic splitting of large files, and integration with AI assistants through the Model Context Protocol (MCP).

How It Works

The platform pairs a Vue 3 frontend with a FastAPI backend and uses LitServe for GPU load balancing and worker management. A plugin-based engine system handles each data type: multiple OCR engines (including PaddleOCR-VL, which covers 109+ languages) for documents and images, plus dedicated processors for audio (SenseVoice) and video. Notable features include asynchronous PDF splitting for large files, experimental watermark removal, and structured JSON output that preserves document hierarchy. JWT authentication and task queuing support enterprise deployment.
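The plugin-based engine idea can be illustrated with a small registry that routes a file to a processor by extension. This is only a minimal sketch: the names `register`, `dispatch`, `document_engine`, and `audio_engine` are hypothetical and do not come from the project, and the engines return placeholder dictionaries rather than real parsing results.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry mapping file extensions to engine callables.
# None of these names come from the project; they only illustrate the idea.
Engine = Callable[[Path], dict]
_ENGINES: Dict[str, Engine] = {}


def register(*extensions: str) -> Callable[[Engine], Engine]:
    """Decorator that registers an engine for one or more file extensions."""
    def wrap(engine: Engine) -> Engine:
        for ext in extensions:
            _ENGINES[ext.lower()] = engine
        return engine
    return wrap


@register(".pdf", ".docx")
def document_engine(path: Path) -> dict:
    # Placeholder: a real engine would run layout analysis and OCR here.
    return {"engine": "document", "source": path.name}


@register(".wav", ".mp3")
def audio_engine(path: Path) -> dict:
    # Placeholder: a real engine would run speech recognition here.
    return {"engine": "audio", "source": path.name}


def dispatch(path: Path) -> dict:
    """Pick the engine registered for the file's extension and run it."""
    engine = _ENGINES.get(path.suffix.lower())
    if engine is None:
        raise ValueError(f"no engine registered for {path.suffix!r}")
    return engine(path)
```

With a registry like this, supporting a new format means adding one decorated function, and the dispatch code stays unchanged.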

Quick Start & Requirements

Docker is the recommended deployment method.

Primary Install: Execute make setup or use the provided scripts (./scripts/docker-setup.sh for Linux/Mac, scripts/docker-setup.bat for Windows).
Prerequisites: Docker 20.10+, Docker Compose 2.0+, and NVIDIA Container Toolkit (for GPU acceleration). Local development requires Node.js 18+, Python 3.8+, and optionally CUDA.
Resource Footprint: Docker Compose orchestrates all services. GPU workers are managed by LitServe, with configurable memory limits.
Links: Docker Quick Start (scripts/DOCKER_QUICK_START.txt), API Documentation (http://localhost:8000/docs), MCP Guide (backend/MCP_GUIDE.md).
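Before running the setup, it can help to confirm that installed tools meet the minimum versions listed above. The following is a rough sketch, not part of the project: `parse_version`, `meets_minimum`, and `check_tool` are hypothetical helpers, and the version-string formats are assumed to resemble the usual output of commands such as `docker --version`.

```python
import re
import subprocess
from typing import List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted version number from text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(reported: str, minimum: str) -> bool:
    """Return True if the reported version is at least the minimum."""
    have, need = parse_version(reported), parse_version(minimum)
    # Pad the shorter tuple with zeros so (2,) compares like (2, 0, 0).
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_tool(command: List[str], minimum: str) -> bool:
    """Run a version command and compare its output to the minimum.

    Returns False if the tool is not installed.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        return False
    return meets_minimum(result.stdout, minimum)
```

For example, `check_tool(["docker", "--version"], "20.10")` would cover the Docker requirement, and `check_tool(["node", "--version"], "18")` the Node.js requirement for local development.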

Highlighted Details

Multi-modal Processing: Handles PDF, Office documents, images, audio, video, FASTA, and GenBank formats.
GPU Acceleration: Leverages LitServe for intelligent GPU load balancing and multi-GPU isolation.
MCP Integration: Enables AI assistants such as Claude Desktop to invoke the document parsing services directly.
Structured Output: Generates both Markdown and detailed, hierarchical JSON outputs.
Advanced Features: Includes PDF auto-splitting for large files, experimental watermark removal, video keyframe OCR, and audio transcription with speaker diarization.
Enterprise Capabilities: Features JWT authentication, role-based access control, API key management, and SSO readiness.
Object Storage: Integrates with RustFS for S3-compatible object storage of processed images.
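The PDF auto-splitting feature can be pictured as planning page ranges and then parsing the chunks concurrently. The sketch below only illustrates that idea under stated assumptions: the chunk size, the function names `plan_chunks`, `parse_chunk`, and `parse_document`, and the stand-in parsing step are all hypothetical, and the project's actual splitting logic may differ.

```python
import asyncio
from typing import List, Tuple


def plan_chunks(total_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


async def parse_chunk(start: int, end: int) -> str:
    # Stand-in for real parsing work; a real worker would run OCR on a GPU.
    await asyncio.sleep(0)
    return f"pages {start}-{end}"


async def parse_document(total_pages: int, chunk_size: int) -> List[str]:
    """Parse all chunks concurrently; gather() keeps results in page order."""
    tasks = [parse_chunk(s, e) for s, e in plan_chunks(total_pages, chunk_size)]
    return list(await asyncio.gather(*tasks))
```

Because `asyncio.gather` returns results in the order the tasks were passed, the per-chunk outputs can be concatenated directly to rebuild the full document.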

Maintenance & Community

The README lists no community channels (such as Discord or Slack) and names no maintainers beyond crediting "MinerU Tianshu Contributors" in the license section.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, which permits commercial use and integration into closed-source projects, subject to the license terms (such as preserving copyright and license notices).

Limitations & Caveats

Watermark removal and video keyframe OCR are marked experimental (🧪), so they may have limitations or need further refinement. GPU acceleration requires a compatible NVIDIA driver and the NVIDIA Container Toolkit.
