Dolphin by bytedance

Document parsing with a multimodal VLM

Created 4 months ago
5,809 stars

Top 8.9% on SourcePulse

Project Summary

Dolphin is a multimodal model for document image parsing, aimed at researchers and developers who work with complex document layouts. It extracts content from pages that mix text, figures, formulas, and tables using a two-stage "analyze-then-parse" paradigm.

How It Works

Dolphin uses heterogeneous anchor prompting within a single Vision-Language Model (VLM). The first stage performs page-level layout analysis, generating a sequence of layout elements in natural reading order. The second stage treats those elements as anchors, pairs each with a task-specific prompt, and parses them in parallel, so text, tables, formulas, and figures are each handled by a prompt suited to their type.
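The control flow of the two stages can be sketched roughly as follows. This is an illustration only, not Dolphin's code: `Element`, `PROMPTS`, `analyze_layout`, `parse_element`, and `parse_page` are hypothetical names, the prompt strings are made up, and both model calls are replaced by stubs. A thread pool stands in for whatever batching the real model uses in stage two.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One layout element found in stage 1 (hypothetical structure)."""
    kind: str    # e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) region used as the anchor


# Hypothetical task-specific prompts, one per element type.
PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Read the formula in this region.",
}


def analyze_layout(page):
    """Stage 1 stub: return elements already in reading order.

    A real implementation would run the VLM on the full page image.
    """
    return [
        Element("text", (0, 0, 100, 20)),
        Element("table", (0, 30, 100, 80)),
        Element("formula", (0, 90, 100, 110)),
    ]


def parse_element(page, element):
    """Stage 2 stub: pick the prompt for this element type.

    A real implementation would crop `page` to `element.bbox` and
    query the same VLM with the selected prompt.
    """
    prompt = PROMPTS.get(element.kind, PROMPTS["text"])
    return f"{element.kind}@{element.bbox}: {prompt}"


def parse_page(page, max_workers=4):
    """Analyze the page, then parse all elements in parallel."""
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so reading order is kept.
        outputs = list(pool.map(lambda e: parse_element(page, e), elements))
    return list(zip(elements, outputs))


if __name__ == "__main__":
    for element, output in parse_page("page.png"):
        print(output)
```

Because each stage-two call depends only on its own anchor and prompt, the elements can be processed independently, which is what makes the parallel step possible.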

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Pre-trained Models: Download from Baidu Yun, Google Drive, or Hugging Face Hub.
  • Inference: Supports both page-level and element-level parsing via Python scripts (demo_page.py, demo_page_hf.py, demo_element.py, demo_element_hf.py).
  • Dependencies: Python plus the packages listed in requirements.txt. Hardware requirements are not stated; a GPU is likely needed for practical inference speed.
  • Demo: Available at Demo-Dolphin.
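The four demo scripts listed above cover two parsing levels, each in two variants. A small helper like the one below can make the choice explicit. The script names come from the list; the helper `pick_demo_script` is hypothetical, and reading the `_hf` suffix as "Hugging Face Transformers variant" is an assumption based on the naming. Command-line arguments are deliberately left out, since they are not given here; consult the repository's README for them.

```python
# Script names are taken from the Quick Start list; the mapping is an assumption.
DEMO_SCRIPTS = {
    ("page", False): "demo_page.py",
    ("page", True): "demo_page_hf.py",
    ("element", False): "demo_element.py",
    ("element", True): "demo_element_hf.py",
}


def pick_demo_script(level, use_hf=False):
    """Return the demo script name for a parsing level.

    level:  "page" for whole-page parsing, "element" for a single element.
    use_hf: True to select the script with the _hf suffix.
    """
    try:
        return DEMO_SCRIPTS[(level, use_hf)]
    except KeyError:
        raise ValueError(
            f"level must be 'page' or 'element', got {level!r}"
        ) from None


if __name__ == "__main__":
    print(pick_demo_script("page"))
    print(pick_demo_script("element", use_hf=True))
```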

Highlighted Details

  • Two-stage analyze-then-parse approach using a single VLM.
  • Natural reading order element sequence generation.
  • Heterogeneous anchor prompting for diverse document elements.
  • Efficient parallel parsing mechanism.
  • Supports Hugging Face Transformers integration.
  • Added TensorRT-LLM and vLLM support for accelerated inference.
  • Released Fox-Page Benchmark.

Maintenance & Community

The project is actively maintained. Recent updates include TensorRT-LLM and vLLM support and the release of the Fox-Page Benchmark. The paper was accepted to ACL 2025. Users are encouraged to report "bad cases" via GitHub issues to help improve the model.

Licensing & Compatibility

No explicit license statement was identified for the repository, so check its license file before reuse or redistribution. The Hugging Face integration indicates technical compatibility with common ML workflows, but it says nothing about licensing terms.

Limitations & Caveats

Hardware requirements are not documented. The project is recent (its paper appeared in 2025), and the maintainers ask users to submit challenging cases to guide further improvement.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 13
  • Star history: 254 stars in the last 30 days
