Document parsing with a multimodal VLM
Dolphin is a multimodal model for document image parsing, designed for researchers and developers working with complex document layouts. It addresses challenges in analyzing and extracting information from documents containing text, figures, formulas, and tables by employing a two-stage "analyze-then-parse" paradigm.
How It Works
Dolphin uses a novel heterogeneous anchor prompting approach within a single Vision-Language Model (VLM). The first stage performs page-level layout analysis, generating a sequence of elements in natural reading order. The second stage then parses these elements efficiently in parallel, using task-specific prompts as heterogeneous anchors, enabling robust extraction of diverse document components.
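To make the two-stage flow concrete, here is a minimal Python sketch. It is not Dolphin's actual API: vlm_generate, crop_region, the prompt strings, and the "type@bbox" layout format are all illustrative assumptions standing in for the model's real inference calls and output format.

    from concurrent.futures import ThreadPoolExecutor

    def vlm_generate(image, prompt: str) -> str:
        # Stand-in for a single VLM forward pass (assumption).
        raise NotImplementedError

    def crop_region(image, bbox: str):
        # Stand-in for cropping an element's bounding box (assumption).
        raise NotImplementedError

    # Task-specific prompts play the role of stage two's heterogeneous anchors.
    PROMPTS = {
        "text": "Read the text in this region.",
        "table": "Convert this table to HTML.",
        "formula": "Transcribe this formula as LaTeX.",
    }

    def parse_page(page_image) -> list:
        # Stage 1: one generation pass yields the element sequence
        # (type and location) in natural reading order.
        layout = vlm_generate(page_image, "List the layout elements in reading order.")
        elements = [line.split("@", 1) for line in layout.splitlines()]

        # Stage 2: elements are independent of one another, so they can be
        # cropped and parsed in parallel with type-specific prompts.
        def parse_element(element):
            el_type, bbox = element
            return vlm_generate(crop_region(page_image, bbox),
                                PROMPTS.get(el_type, PROMPTS["text"]))

        with ThreadPoolExecutor() as pool:
            return list(pool.map(parse_element, elements))

Parsing elements independently is what allows the parallelism in stage two: each crop carries its own anchor prompt, so no element's output depends on another's.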
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Demo scripts are provided for both page-level and element-level parsing (demo_page.py, demo_page_hf.py, demo_element.py, demo_element_hf.py); a loading sketch follows below.
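The sketch below shows one plausible way the Hugging Face demos load and prompt the model. The model id "ByteDance/Dolphin", the AutoProcessor/VisionEncoderDecoderModel classes, and the prompt wording are assumptions here, to be verified against demo_page_hf.py in the repository.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, VisionEncoderDecoderModel

    # Model id and classes are assumptions; check demo_page_hf.py for the
    # exact loading code and prompt format.
    processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")
    model = VisionEncoderDecoderModel.from_pretrained("ByteDance/Dolphin").eval()

    image = Image.open("page.jpeg").convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Stage-1 style prompt: ask for the page layout in reading order.
    prompt_ids = processor.tokenizer(
        "Parse the reading order of this document.",
        add_special_tokens=False, return_tensors="pt"
    ).input_ids

    with torch.no_grad():
        output = model.generate(pixel_values=pixel_values,
                                decoder_input_ids=prompt_ids,
                                max_length=4096)
    print(processor.tokenizer.decode(output[0], skip_special_tokens=True))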
Highlighted Details
Maintenance & Community
The project is actively maintained, with recent updates including TensorRT-LLM and vLLM support and the release of the Fox-Page Benchmark. The accompanying paper has been accepted to ACL 2025. Users are encouraged to report "bad cases" via GitHub issues to guide model optimization.
Licensing & Compatibility
The repository does not explicitly state a license, so licensing terms should be verified before production use. The project originates from ByteDance and ships with Hugging Face integration, which suggests compatibility with common ML workflows.
Limitations & Caveats
Hardware requirements for optimal performance are not documented. The project is relatively new (its paper was published in 2025), and users are encouraged to submit challenging cases to drive ongoing improvement.