MM-REACT is a system for multimodal reasoning and action
MM-REACT is a system paradigm that integrates ChatGPT with specialized vision experts for multimodal reasoning and action on visual tasks. It targets researchers and developers working on complex visual understanding problems, enabling ChatGPT to interact with external vision APIs as a "black box" to extract specific information.
How It Works
MM-REACT leverages a "ReAct" (Reasoning and Acting) approach, where ChatGPT is prompted with image file paths as placeholders. When specific visual details are needed, ChatGPT calls upon designated "vision experts" (external APIs like Azure Computer Vision, Form Recognizer, Bing Search). The output from these experts is serialized into text and fed back to ChatGPT, facilitating a chain of reasoning and action to solve visual tasks.
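The loop described above can be sketched in a few lines. This is a minimal, self-contained illustration with stand-in functions (`fake_llm`, `image_captioning` and the `ACTION:`/`FINAL:` protocol are hypothetical names invented here); the real system prompts ChatGPT and calls Azure vision APIs.

```python
def fake_llm(prompt):
    # Stand-in for ChatGPT: asks for a vision expert until the expert's
    # serialized output appears in the prompt, then answers.
    if "<caption>" not in prompt:
        return "ACTION: image_captioning(cat.jpg)"
    return "FINAL: The image shows a cat."

def image_captioning(path):
    # Stand-in for an external vision API (e.g. Azure Computer Vision).
    return "<caption>a cat sitting on a sofa</caption>"

EXPERTS = {"image_captioning": image_captioning}

def mm_react(question, image_path, llm=fake_llm, max_steps=5):
    # The image enters the conversation only as a file-path placeholder.
    prompt = f"{question}\nImage: {image_path}"
    for _ in range(max_steps):
        reply = llm(prompt)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        # Parse "ACTION: expert(arg)" and invoke the matching expert.
        name, arg = reply[len("ACTION:"):].strip().rstrip(")").split("(")
        observation = EXPERTS[name](arg)
        # Serialize the expert output back into the prompt as plain text.
        prompt += f"\n{reply}\nObservation: {observation}"
    return "No answer within step budget."
```

Running `mm_react("What is in the image?", "cat.jpg")` performs one reason-act-observe round before answering; the actual system chains many such rounds across multiple experts.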
Quick Start & Requirements
pip install Pillow imagesize
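As a quick sanity check that both image dependencies installed correctly, the sketch below (names are my own, not from the repo) writes a tiny PNG with Pillow and reads its dimensions back with `imagesize`, which inspects only the file header rather than decoding the pixels.

```python
import os
import tempfile

from PIL import Image   # installed as "Pillow"
import imagesize

def png_dimensions():
    # Write a 64x48 white PNG to a temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "probe.png")
    Image.new("RGB", (64, 48), "white").save(path)
    # imagesize.get reads the header only and returns (width, height).
    return imagesize.get(path)
```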
Limitations & Caveats
The system relies heavily on Azure services, requiring significant setup and configuration of multiple Azure Cognitive Services and OpenAI endpoints. Support for public endpoints for Azure OpenAI is planned but not yet implemented at the time of writing.