magicyuan876
AI data preprocessing platform for multimodal unstructured data
Top 80.0% on SourcePulse
Summary
MinerU Tianshu (天枢) is an enterprise-grade data preprocessing platform that converts unstructured, multimodal data into structured, AI-ready formats. It targets engineers, researchers, and power users who need a single pipeline for documents (PDF, Office), images, audio, video, and specialized biological formats. The platform offers GPU acceleration and integrates with AI assistants through the Model Context Protocol (MCP).
How It Works
The platform uses a full-stack architecture: a Vue 3 frontend, a FastAPI backend, and LitServe for GPU load balancing and worker management. A plugin-based engine system handles the different data types, with multiple OCR engines (including PaddleOCR-VL, which covers 109+ languages) and dedicated processors for audio (SenseVoice) and video. Notable features include asynchronous PDF splitting for large files, experimental watermark removal, and structured JSON output that preserves document hierarchy. For enterprise deployment, the platform adds JWT authentication and task queuing.
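As a rough illustration of what a plugin-based engine system can look like, the sketch below routes a file to a processing function based on its suffix. It is a minimal sketch under stated assumptions, not the project's implementation: the names `register`, `dispatch`, the engine functions, and the shape of the returned dictionary are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: maps a lowercase file suffix to an engine function.
# None of these names come from the MinerU Tianshu codebase.
Engine = Callable[[Path], dict]
_ENGINES: Dict[str, Engine] = {}


def register(*suffixes: str) -> Callable[[Engine], Engine]:
    """Decorator that registers an engine function for the given suffixes."""
    def wrap(fn: Engine) -> Engine:
        for suffix in suffixes:
            _ENGINES[suffix.lower()] = fn
        return fn
    return wrap


@register(".pdf", ".docx")
def document_engine(path: Path) -> dict:
    # Placeholder: a real engine would parse the document here.
    return {"engine": "document", "source": path.name}


@register(".png", ".jpg")
def ocr_engine(path: Path) -> dict:
    # Placeholder: a real engine would run OCR here.
    return {"engine": "ocr", "source": path.name}


@register(".mp3", ".wav")
def audio_engine(path: Path) -> dict:
    # Placeholder: a real engine would transcribe audio here.
    return {"engine": "audio", "source": path.name}


def dispatch(filename: str) -> dict:
    """Pick an engine by file suffix (case-insensitive) and run it."""
    path = Path(filename)
    engine = _ENGINES.get(path.suffix.lower())
    if engine is None:
        raise ValueError(f"no engine registered for {path.suffix!r}")
    return engine(path)


if __name__ == "__main__":
    print(dispatch("report.PDF"))
```

Keeping the registry separate from the engines means a new format can be supported by adding one decorated function, without touching the dispatch logic.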
Quick Start & Requirements
Docker deployment is the recommended method for a streamlined setup.
Install: run make setup, or use the provided scripts (./scripts/docker-setup.sh for Linux/Mac, scripts/docker-setup.bat for Windows).
Links: scripts/DOCKER_QUICK_START.txt, API Documentation (http://localhost:8000/docs), MCP Guide (backend/MCP_GUIDE.md).
Highlighted Details
Maintenance & Community
The README does not explicitly detail community channels (like Discord/Slack) or specific maintenance contributors beyond acknowledging "MinerU Tianshu Contributors" in the license section.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, which permits commercial use and integration into closed-source projects, subject to the terms of the license.
Limitations & Caveats
Watermark removal and video keyframe OCR functionalities are marked as experimental (🧪), indicating they may have limitations or require further refinement. Setting up GPU acceleration necessitates specific NVIDIA drivers and the NVIDIA Container Toolkit.
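Before attempting a GPU deployment, it can help to confirm that the host exposes the basic tooling. The sketch below is an assumption-laden preflight check, not part of the project: the function names `check_host` and `list_gpus` are hypothetical. It only tests whether the `nvidia-smi` and `docker` executables are on the PATH and, if possible, lists GPUs with `nvidia-smi -L`. It does not verify that the NVIDIA Container Toolkit is installed or configured.

```python
import shutil
import subprocess
from typing import Dict, List


def check_host() -> Dict[str, bool]:
    """Report whether the nvidia-smi and docker executables are on PATH."""
    return {
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "docker": shutil.which("docker") is not None,
    }


def list_gpus() -> List[str]:
    """Return one line per GPU from `nvidia-smi -L`, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # Driver missing, command failed, or timed out: treat as no GPUs.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(check_host())
    print(list_gpus())
```

An empty list from `list_gpus` means either no NVIDIA GPU is visible or the driver is not working, so it is a hint to check the driver installation rather than a definitive diagnosis.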