MDocAgent  by aiming-lab

Multi-modal multi-agent framework for document understanding

Created 10 months ago
278 stars

Top 93.5% on SourcePulse

GitHubView on GitHub
Project Summary

MDocAgent is a novel multi-modal, multi-agent framework designed for advanced document question answering. It addresses the complexity of real-world documents by integrating text and image understanding through specialized agents, offering a significant performance boost for researchers and engineers in document intelligence.

How It Works

The framework employs five distinct agents—general, critical, text, image, and summarizing—to collaboratively reason across both textual and visual information within documents. This multi-agent approach facilitates sophisticated information retrieval and analysis, enabling a more comprehensive understanding than single-modality or monolithic systems.

Quick Start & Requirements

  • Requires Python 3.12 and Conda for environment management.
  • Installation involves cloning the repository, activating a Conda environment (mdocagent), and running bash install.sh.
  • Data preparation includes downloading datasets from Hugging Face into a data directory, followed by extraction using python scripts/extract.py --config-name <dataset>.
  • Text and image retrieval can be configured and run via python scripts/retrieve.py --config-name <dataset>.
  • Inference is performed using python scripts/predict.py --config-name <dataset> run-name=<run-name>.
  • Evaluation requires an OpenAI API key configured in config/model/openai.yaml and is run via python scripts/eval.py --config-name <dataset> run-name=<run-name>.

Highlighted Details

  • Achieves a 12.1% improvement over state-of-the-art methods on five benchmarks for document question answering.

Maintenance & Community

Information regarding maintainers, community channels (e.g., Discord/Slack), or roadmap is not detailed in the provided README.

Licensing & Compatibility

The specific open-source license is not stated in the README. Compatibility for commercial use or closed-source linking cannot be determined without a license.

Limitations & Caveats

Evaluation requires an external OpenAI API key, potentially introducing costs and external dependencies. The framework's effectiveness is demonstrated on specific benchmarks, and performance on highly specialized or unconventional document types may vary.

Health Check
Last Commit

5 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
17 stars in the last 30 days

Explore Similar Projects

Feedback? Help us improve.