MM-REACT is a system for multimodal reasoning and action
MM-REACT is a system paradigm that integrates ChatGPT with specialized vision experts for multimodal reasoning and action on visual tasks. It targets researchers and developers working on complex visual understanding problems, enabling ChatGPT to interact with external vision APIs as a "black box" to extract specific information.
How It Works
MM-REACT leverages a "ReAct" (Reasoning and Acting) approach, where ChatGPT is prompted with image file paths as placeholders. When specific visual details are needed, ChatGPT calls upon designated "vision experts" (external APIs like Azure Computer Vision, Form Recognizer, Bing Search). The output from these experts is serialized into text and fed back to ChatGPT, facilitating a chain of reasoning and action to solve visual tasks.
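The loop described above can be sketched in a few lines. This is a minimal, self-contained illustration with stand-in functions (`fake_llm`, `image_captioning` and the `ACTION:`/`FINAL:` protocol are hypothetical names invented here); the real system prompts ChatGPT and calls Azure vision APIs.

```python
def fake_llm(prompt):
    # Stand-in for ChatGPT: asks for a vision expert until the expert's
    # serialized output appears in the prompt, then answers.
    if "<caption>" not in prompt:
        return "ACTION: image_captioning(cat.jpg)"
    return "FINAL: The image shows a cat."

def image_captioning(path):
    # Stand-in for an external vision API (e.g. Azure Computer Vision).
    return "<caption>a cat sitting on a sofa</caption>"

EXPERTS = {"image_captioning": image_captioning}

def mm_react(question, image_path, llm=fake_llm, max_steps=5):
    # The image enters the conversation only as a file-path placeholder.
    prompt = f"{question}\nImage: {image_path}"
    for _ in range(max_steps):
        reply = llm(prompt)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        # Parse "ACTION: expert(arg)" and invoke the matching expert.
        name, arg = reply[len("ACTION:"):].strip().rstrip(")").split("(")
        observation = EXPERTS[name](arg)
        # Serialize the expert output back into the prompt as plain text.
        prompt += f"\n{reply}\nObservation: {observation}"
    return "No answer within step budget."
```

Running `mm_react("What is in the image?", "cat.jpg")` performs one reason-act-observe round before answering; the actual system chains many such rounds across multiple experts.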
Quick Start & Requirements
pip install Pillow imagesize
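As a quick sanity check that both image dependencies installed correctly, the sketch below (names are my own, not from the repo) writes a tiny PNG with Pillow and reads its dimensions back with `imagesize`, which inspects only the file header rather than decoding the pixels.

```python
import os
import tempfile

from PIL import Image   # installed as "Pillow"
import imagesize

def png_dimensions():
    # Write a 64x48 white PNG to a temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "probe.png")
    Image.new("RGB", (64, 48), "white").save(path)
    # imagesize.get reads the header only and returns (width, height).
    return imagesize.get(path)
```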
Limitations & Caveats
The system relies heavily on Azure services, requiring significant setup and configuration of multiple Azure Cognitive Services and OpenAI endpoints. Support for public endpoints for Azure OpenAI is planned but not yet implemented at the time of writing.