MDocAgent by aiming-lab

Multi-modal multi-agent framework for document understanding

Created 11 months ago

299 stars

Top 89.2% on SourcePulse

View on GitHub

1 Expert Loves This Project

Li Jiang

Coauthor of AutoGen; Engineer at Microsoft

Project Summary

MDocAgent is a novel multi-modal, multi-agent framework designed for advanced document question answering. It addresses the complexity of real-world documents by integrating text and image understanding through specialized agents, offering a significant performance boost for researchers and engineers in document intelligence.

How It Works

The framework employs five distinct agents—general, critical, text, image, and summarizing—to collaboratively reason across both textual and visual information within documents. This multi-agent approach facilitates sophisticated information retrieval and analysis, enabling a more comprehensive understanding than single-modality or monolithic systems.

Quick Start & Requirements

Requires Python 3.12 and Conda for environment management.
Installation involves cloning the repository, activating a Conda environment (mdocagent), and running bash install.sh.
Data preparation includes downloading datasets from Hugging Face into a data directory, followed by extraction using python scripts/extract.py --config-name <dataset>.
Text and image retrieval can be configured and run via python scripts/retrieve.py --config-name <dataset>.
Inference is performed using python scripts/predict.py --config-name <dataset> run-name=<run-name>.
Evaluation requires an OpenAI API key configured in config/model/openai.yaml and is run via python scripts/eval.py --config-name <dataset> run-name=<run-name>.

Highlighted Details

Achieves a 12.1% improvement over state-of-the-art methods on five benchmarks for document question answering.

Maintenance & Community

Information regarding maintainers, community channels (e.g., Discord/Slack), or roadmap is not detailed in the provided README.

Licensing & Compatibility

The specific open-source license is not stated in the README. Compatibility for commercial use or closed-source linking cannot be determined without a license.

Limitations & Caveats

Evaluation requires an external OpenAI API key, potentially introducing costs and external dependencies. The framework's effectiveness is demonstrated on specific benchmarks, and performance on highly specialized or unconventional document types may vary.

Health Check

Last Commit

6 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

7 stars in the last 30 days