Dolphin by bytedance

Document parsing with a multimodal VLM

Created 8 months ago

8,523 stars

Top 6.0% on SourcePulse

Project Summary

Dolphin is a multimodal model for document image parsing, designed for researchers and developers working with complex document layouts. It addresses challenges in analyzing and extracting information from documents containing text, figures, formulas, and tables by employing a two-stage "analyze-then-parse" paradigm.

How It Works

Dolphin uses heterogeneous anchor prompting within a single vision-language model (VLM). In the first stage, the model performs page-level layout analysis and generates a sequence of elements in natural reading order. In the second stage, those elements serve as anchors: each is paired with a task-specific prompt for its type (text, table, formula, and so on), and the elements are parsed in parallel rather than one at a time.
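The two stages can be illustrated with a minimal Python sketch. It is not Dolphin's actual code: the StubVLM class, the prompt strings, and every function name here are hypothetical stand-ins chosen so the example runs without model weights. The sketch only shows the control flow: one layout pass that yields elements in reading order, then batched parsing in which each element's prompt depends on its type.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) region on the page

# Hypothetical prompt table: one task-specific prompt per element type.
PROMPTS: Dict[str, str] = {
    "text": "Read the text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
}


@dataclass
class Element:
    kind: str          # element type reported by the layout stage
    bbox: BBox         # anchor region for this element
    content: str = ""  # filled in by the parsing stage


class StubVLM:
    """Stand-in for the real model; returns canned output so the sketch runs."""

    def layout(self, page: str) -> List[Tuple[str, BBox]]:
        # A real model would look at the page image; this stub ignores it.
        return [
            ("text", (0, 0, 100, 20)),
            ("table", (0, 30, 100, 80)),
            ("formula", (0, 90, 100, 110)),
        ]

    def parse_batch(self, crops: Sequence[BBox], prompts: Sequence[str]) -> List[str]:
        # A real model would decode every (crop, prompt) pair in one batch.
        return [f"<{p} @ {c}>" for c, p in zip(crops, prompts)]


def analyze(page: str, vlm: StubVLM) -> List[Element]:
    """Stage 1: page-level layout analysis, elements in reading order."""
    return [Element(kind, bbox) for kind, bbox in vlm.layout(page)]


def parse(elements: List[Element], vlm: StubVLM, batch_size: int = 2) -> List[Element]:
    """Stage 2: parse elements batch by batch with type-specific prompts."""
    for start in range(0, len(elements), batch_size):
        batch = elements[start:start + batch_size]
        crops = [e.bbox for e in batch]
        # Unknown element types fall back to the plain-text prompt.
        prompts = [PROMPTS.get(e.kind, PROMPTS["text"]) for e in batch]
        for element, output in zip(batch, vlm.parse_batch(crops, prompts)):
            element.content = output
    return elements


def parse_page(page: str, vlm: StubVLM, batch_size: int = 2) -> List[Element]:
    """Run both stages; the reading order from stage 1 is preserved."""
    return parse(analyze(page, vlm), vlm, batch_size)


if __name__ == "__main__":
    for element in parse_page("page.png", StubVLM()):
        print(element.kind, element.content)
```

Because stage 2 writes results back into the list produced by stage 1, the final output keeps the reading order no matter how the elements are grouped into batches.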

Quick Start & Requirements

Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
Pre-trained Models: Download from Baidu Yun, Google Drive, or Hugging Face Hub.
Inference: Supports both page-level and element-level parsing via Python scripts (demo_page.py, demo_page_hf.py, demo_element.py, demo_element_hf.py).
Dependencies: Python plus the packages listed in requirements.txt. Hardware requirements are not stated; a GPU is likely needed for practical inference speed.
Demo: Available at Demo-Dolphin.

Highlighted Details

Two-stage analyze-then-parse approach using a single VLM.
Natural reading order element sequence generation.
Heterogeneous anchor prompting for diverse document elements.
Efficient parallel parsing mechanism.
Supports Hugging Face Transformers integration.
Added TensorRT-LLM and vLLM support for accelerated inference.
Released Fox-Page Benchmark.

Maintenance & Community

The project is actively maintained. Recent updates include TensorRT-LLM and vLLM support and the release of the Fox-Page Benchmark. The paper was accepted to ACL 2025. Users are encouraged to report "bad cases" via GitHub issues to help improve the model.

Licensing & Compatibility

The repository does not explicitly state a license, so terms of use should be confirmed before redistribution or commercial use. On the technical side, the Hugging Face integration means the model fits into common ML workflows.

Limitations & Caveats

Specific hardware requirements for optimal performance are not detailed. The project is relatively new, with its paper published in 2025, and users are encouraged to contribute challenging cases for ongoing improvement.

Health Check

Last Commit

3 weeks ago

Responsiveness

Inactive

Star History

631 stars in the last 30 days