Multimodal LLM for OCR-free document understanding
mPLUG-DocOwl is a family of modularized multimodal large language models designed for OCR-free document understanding. It targets researchers and developers working with complex documents, offering state-of-the-art performance on tasks like visual question answering, information extraction, and chart analysis without relying on traditional OCR.
How It Works
The models employ a modular architecture that couples a visual encoder with a large language model. Key innovations include high-resolution image compression, which encodes each document page with a small, fixed budget of visual tokens (e.g., DocOwl2, an 8B model, uses 324 tokens per page), enabling efficient processing of multi-page documents. Some variants add Program-of-Thoughts reasoning for chart understanding, decomposing complex visual questions into executable steps.
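To make the compression budget concrete, the arithmetic below contrasts naive ViT patch tokenization of a high-resolution page with a fixed 324-token budget. The resolution and patch size are illustrative assumptions for the sake of example, not the models' actual configuration.

```python
# Illustrative arithmetic only: the resolution and patch size are assumed
# values, not DocOwl's exact configuration.

PATCH = 14          # a common ViT patch size (assumed)
H = W = 1344        # a plausible high-resolution document page (assumed)

naive_tokens = (H // PATCH) * (W // PATCH)  # 96 * 96 = 9216 patch tokens
budget = 324                                # per-page token budget cited for DocOwl2

print(f"naive ViT tokens : {naive_tokens}")                 # 9216
print(f"compressed budget: {budget}")                       # 324
print(f"compression ratio: {naive_tokens / budget:.1f}x")   # ~28.4x
```

At these assumed settings, the 324-token budget amounts to roughly a 28x reduction in visual tokens per page, which is what makes multi-page inputs tractable.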
Quick Start & Requirements
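Entry points differ across model versions; the snippet below is a minimal sketch assuming the mPLUG/DocOwl2 checkpoint on Hugging Face and the chat-style interface that mPLUG releases typically expose through trust_remote_code. The model ID, method name, and arguments are assumptions; consult the relevant model card for authoritative usage. A CUDA-capable GPU, PyTorch, and the transformers library are the common baseline.

```python
# Minimal sketch, assuming the mPLUG/DocOwl2 checkpoint on Hugging Face and
# a custom chat-style interface loaded via trust_remote_code. The method
# name and arguments are assumptions -- check the model card.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "mPLUG/DocOwl2"  # assumed checkpoint name
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,   # loads the repository's custom modeling code
    torch_dtype=torch.float16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Hypothetical multi-page query; the actual signature may differ per release.
answer = model.chat(
    tokenizer,
    images=["page_1.png", "page_2.png"],
    query="What is the total amount due on this invoice?",
)
print(answer)
```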
Highlighted Details
Maintenance & Community
The project is actively developed by Alibaba Group, with frequent releases of new model versions and training code. Demos on Hugging Face Spaces and ModelScope provide points of community interaction.
Licensing & Compatibility
Models and code are released primarily for research purposes. Licensing terms for commercial use should be verified per model release.
Limitations & Caveats
Hugging Face demos may be unstable due to dynamic GPU allocation. The project evolves rapidly across multiple model generations (DocOwl, DocOwl1.5, DocOwl2) and related projects (TinyChart, PaperOwl, UReader), so select a version carefully based on your task and compatibility requirements.