mPLUG-DocOwl by X-PLUG

Multimodal LLM for OCR-free document understanding

created 2 years ago
2,228 stars

Top 20.7% on sourcepulse

Project Summary

mPLUG-DocOwl is a family of modularized multimodal large language models designed for OCR-free document understanding. It targets researchers and developers working with complex documents and reports state-of-the-art results on tasks such as visual question answering, information extraction, and chart analysis, all without a traditional OCR pipeline.

How It Works

The models take a modular approach, pairing a visual encoder with a large language model. A key innovation is high-resolution image compression, which encodes each document image with a small number of visual tokens (e.g., 324 tokens in an 8B model) and makes multi-page documents efficient to process. Some models use Program-of-Thoughts for chart understanding, decomposing complex visual reasoning into executable steps.
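To make the token-compression claim concrete, the sketch below estimates the visual-token cost of a multi-page document. It assumes the 324-token figure applies to each page image; the function names, the 8,192-token context window, and the 1,024 tokens reserved for text are illustrative assumptions, not values taken from the project.

```python
# Back-of-the-envelope visual-token budget for multi-page documents.
# Assumption: each page image is compressed to a fixed number of visual tokens.
TOKENS_PER_IMAGE = 324  # figure quoted in the summary; assumed to be per page


def visual_token_budget(num_pages: int, tokens_per_image: int = TOKENS_PER_IMAGE) -> int:
    """Return the total number of visual tokens needed for `num_pages` pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * tokens_per_image


def max_pages(context_tokens: int, reserved_text_tokens: int,
              tokens_per_image: int = TOKENS_PER_IMAGE) -> int:
    """Return how many pages fit once text tokens have been set aside."""
    available = max(context_tokens - reserved_text_tokens, 0)
    return available // tokens_per_image


if __name__ == "__main__":
    # Hypothetical numbers: an 8,192-token context with 1,024 tokens kept for text.
    print(visual_token_budget(10))   # tokens needed for a 10-page document
    print(max_pages(8192, 1024))     # pages that fit in the remaining budget
```

Because the per-page cost is fixed, the budget grows linearly with page count, which is what makes multi-page inputs tractable under this assumption.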

Quick Start & Requirements

Highlighted Details

  • Reported state-of-the-art performance on document understanding benchmarks (e.g., DocVQA, InfoVQA, ChartQA).
  • OCR-free approach, directly processing visual information.
  • Modular design allows for specialized models like TinyChart for chart analysis and PaperOwl for scientific diagrams.
  • Training code and datasets are released for several models, enabling custom fine-tuning.
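The Program-of-Thoughts idea mentioned for chart understanding can be illustrated with a minimal sketch: instead of answering a numeric question directly, the model emits a short program, and the final answer comes from running it. The program string, the chart values, and the `run_program` helper below are hypothetical stand-ins for illustration only; they are not the project's API or its actual output format.

```python
# Minimal Program-of-Thoughts sketch.
# Assumption: the model returns a short Python program that stores its result
# in a variable named `answer`. The string below is hand-written, not real output.
GENERATED_PROGRAM = """
values = {"2021": 120.0, "2022": 150.0}
answer = (values["2022"] - values["2021"]) / values["2021"] * 100
"""


def run_program(program: str) -> object:
    """Execute a generated program and return the value it binds to `answer`."""
    # Builtins are emptied to limit what the program can call. This is NOT a
    # real sandbox; untrusted code needs proper isolation (e.g., a subprocess).
    namespace = {"__builtins__": {}}
    exec(program, namespace)
    return namespace["answer"]


if __name__ == "__main__":
    # Question (hypothetical): "By what percentage did the value grow from 2021 to 2022?"
    print(run_program(GENERATED_PROGRAM))
```

Delegating the arithmetic to an interpreter keeps the numeric step exact, so the model only has to read the values off the chart and choose the right computation.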

Maintenance & Community

The project is actively developed by Alibaba Group, with frequent releases and updates, including training code and new model versions. Links to Hugging Face Spaces and ModelScope provide community interaction points.

Licensing & Compatibility

Models and code are generally released for research purposes. Licensing terms vary by model release, so verify the license of each release before any commercial use.

Limitations & Caveats

Hugging Face demos may have stability issues due to dynamic GPU allocation. The project is rapidly evolving, with multiple model versions (DocOwl, DocOwl1.5, DocOwl2) and related projects (TinyChart, PaperOwl, UReader), requiring careful selection based on specific needs and compatibility.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 72 stars in the last 90 days
