MM-REACT by microsoft

MM-REACT is a system for multimodal reasoning and action

created 2 years ago
954 stars

Top 39.3% on sourcepulse

Project Summary

MM-REACT is a system paradigm that integrates ChatGPT with specialized vision experts for multimodal reasoning and action on visual tasks. It targets researchers and developers working on complex visual understanding problems, enabling ChatGPT to interact with external vision APIs as a "black box" to extract specific information.

How It Works

MM-REACT leverages a "ReAct" (Reasoning and Acting) approach, where ChatGPT is prompted with image file paths as placeholders. When specific visual details are needed, ChatGPT calls upon designated "vision experts" (external APIs like Azure Computer Vision, Form Recognizer, Bing Search). The output from these experts is serialized into text and fed back to ChatGPT, facilitating a chain of reasoning and action to solve visual tasks.
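The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `Action: name[path]` syntax, the `run_react_loop` function, and the stubbed `fake_llm` and `fake_ocr` callables are all hypothetical stand-ins for ChatGPT and a real vision API.

```python
import re
from typing import Callable, Dict

# Hypothetical action syntax for this sketch; MM-REACT's real prompt format may differ.
ACTION_PATTERN = re.compile(r"^Action: (\w+)\[(.+)\]$", re.MULTILINE)


def run_react_loop(
    llm: Callable[[str], str],
    experts: Dict[str, Callable[[str], str]],
    user_message: str,
    max_steps: int = 5,
) -> str:
    """Alternate between model replies and expert calls until the model answers."""
    transcript = [f"User: {user_message}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript))
        transcript.append(f"Assistant: {reply}")
        match = ACTION_PATTERN.search(reply)
        if match is None:
            # No expert requested, so treat the reply as the final answer.
            return reply
        name, image_path = match.groups()
        expert = experts.get(name)
        # The expert's output is plain text that is appended to the conversation.
        observation = expert(image_path) if expert else f"unknown expert: {name}"
        transcript.append(f"Observation: {observation}")
    return "stopped: step limit reached"


def fake_llm(prompt: str) -> str:
    """Stub model: asks for OCR once, then answers from the observation."""
    if "Observation:" not in prompt:
        return "I need to read the text.\nAction: ocr[/tmp/receipt.png]"
    return "The receipt total is 12.50."


def fake_ocr(image_path: str) -> str:
    """Stub vision expert that returns serialized text for an image path."""
    return f"text found in {image_path}: TOTAL 12.50"


answer = run_react_loop(fake_llm, {"ocr": fake_ocr}, "What is the total on /tmp/receipt.png?")
print(answer)
```

The model never sees pixels in this sketch. It only sees the file path and the text each expert returns, which is the placeholder idea the paragraph describes.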

Quick Start & Requirements

  • Install the extra Python dependencies with pip install Pillow imagesize (the PIL imaging library is published on PyPI as Pillow).
  • Requires extensive Azure service setup: Computer Vision (Tags, Objects, Faces, Celebrities, Dense Captioning), Form Recognizer (OCR, Layout, Invoice, etc.), Bing Search, Bing Visual Search, and Azure OpenAI.
  • Environment variables must be configured for all Azure endpoints and subscription keys.
  • Code is based on Langchain; refer to Langchain for its installation and documentation.
  • Demo videos and a live demo are available on the project website.
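Because the setup depends on many endpoints and keys, it can help to check the environment before launching anything. The sketch below shows one way to do that; the variable names in `REQUIRED_VARS` are hypothetical placeholders, so substitute the names the project's README actually specifies.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical names for illustration only; use the names from the project's README.
REQUIRED_VARS = (
    "EXAMPLE_VISION_ENDPOINT",
    "EXAMPLE_VISION_KEY",
    "EXAMPLE_FORM_RECOGNIZER_ENDPOINT",
    "EXAMPLE_FORM_RECOGNIZER_KEY",
    "EXAMPLE_SEARCH_KEY",
    "EXAMPLE_OPENAI_ENDPOINT",
    "EXAMPLE_OPENAI_KEY",
)


def missing_settings(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings(REQUIRED_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```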

Highlighted Details

  • Integrates ChatGPT with specialized vision experts for multimodal reasoning.
  • Uses image file paths as placeholders for ChatGPT to interact with vision APIs.
  • Supports various vision tasks including object detection, OCR, and dense captioning.
  • Designed to be extensible with custom vision experts.
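Since expert output reaches ChatGPT as text, a custom vision expert mainly needs to turn its results into a readable string. The sketch below assumes a hypothetical detector that returns dictionaries with `label`, `confidence`, and `box` keys; neither that schema nor the `serialize_detections` helper comes from the project.

```python
from typing import Dict, List


def serialize_detections(image_path: str, detections: List[Dict]) -> str:
    """Turn structured detections into plain text a language model can read.

    Each detection is assumed (for this sketch) to look like
    {"label": str, "confidence": float, "box": (x, y, w, h)}.
    """
    lines = [f"Objects detected in {image_path}:"]
    for det in detections:
        x, y, w, h = det["box"]
        lines.append(
            f"- {det['label']} (confidence {det['confidence']:.2f}) "
            f"at x={x}, y={y}, w={w}, h={h}"
        )
    if len(lines) == 1:
        # Say so explicitly, so the model does not guess at missing output.
        lines.append("- none")
    return "\n".join(lines)


sample = [{"label": "dog", "confidence": 0.912, "box": (1, 2, 3, 4)}]
text = serialize_detections("/tmp/a.png", sample)
print(text)
```

Keeping the image path in the serialized text lets the model tie each observation back to the placeholder it was given.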

Maintenance & Community

  • Developed by Microsoft.
  • Contributions are welcome via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The system relies heavily on Azure services, requiring significant setup and configuration of multiple Azure Cognitive Services and OpenAI endpoints. Support for public endpoints for Azure OpenAI is planned but not yet implemented at the time of writing.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 90 days
