MM-REACT by microsoft

MM-REACT is a system for multimodal reasoning and action

Created 2 years ago
955 stars

Top 38.5% on SourcePulse

Project Summary

MM-REACT is a system paradigm that integrates ChatGPT with specialized vision experts for multimodal reasoning and action on visual tasks. It targets researchers and developers working on complex visual understanding problems. ChatGPT treats the external vision APIs as black boxes and calls them whenever it needs specific information extracted from an image.

How It Works

MM-REACT follows a "ReAct" (Reasoning and Acting) approach. Images are passed to ChatGPT as file paths, which serve as placeholders for the image content. When specific visual details are needed, ChatGPT calls designated "vision experts" (external APIs such as Azure Computer Vision, Form Recognizer, and Bing Search). Each expert's output is serialized into text and fed back to ChatGPT, which continues the chain of reasoning and action until the visual task is solved.
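
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's code: the "Action: name[path]" line format, the stub experts, and the scripted stand-in for ChatGPT are all hypothetical, and no Azure or OpenAI service is called.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stub experts: each takes an image path and returns plain text.
def stub_caption_expert(image_path: str) -> str:
    return f"caption: a sign on a beach ({image_path})"

def stub_ocr_expert(image_path: str) -> str:
    return "ocr: 'NO SWIMMING'"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "caption": stub_caption_expert,
    "ocr": stub_ocr_expert,
}

# Assumed action syntax for this sketch: "Action: <expert>[<image path>]".
ACTION_RE = re.compile(r"^Action: (\w+)\[(.+)\]$")

def scripted_llm(transcript: List[str]) -> str:
    """Stand-in for ChatGPT: picks the next step from how many observations exist."""
    image_path = transcript[0].split("image: ", 1)[1]
    seen = sum(line.startswith("Observation:") for line in transcript)
    if seen == 0:
        return f"Action: caption[{image_path}]"
    if seen == 1:
        return f"Action: ocr[{image_path}]"
    return "Final: The sign says NO SWIMMING."

def react_loop(question: str, image_path: str,
               llm: Callable[[List[str]], str], max_steps: int = 5) -> List[str]:
    # The model only ever sees the file path, never the pixels.
    transcript = [f"User: {question} image: {image_path}"]
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript.append(reply)
        match = ACTION_RE.match(reply)
        if match is None:
            return transcript  # no expert requested: treat the reply as final
        name, arg = match.groups()
        expert = EXPERTS.get(name)
        observation = expert(arg) if expert else f"unknown expert {name!r}"
        # Expert output goes back into the conversation as text.
        transcript.append(f"Observation: {observation}")
    return transcript
```

Calling react_loop("What does the sign say?", "/tmp/sign.png", scripted_llm) yields a transcript that alternates actions and observations and ends with the final answer. In a real deployment the scripted function would be replaced by a chat model call and the stubs by API clients.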

Quick Start & Requirements

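  • Install the extra Python dependencies with pip install PIL imagesize, as given in the README. Note that PIL is published on PyPI under the name Pillow, so pip install Pillow imagesize is likely the command that works.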
  • Requires extensive Azure service setup: Computer Vision (Tags, Objects, Faces, Celebrities, Dense Captioning), Form Recognizer (OCR, Layout, Invoice, etc.), Bing Search, Bing Visual Search, and Azure OpenAI.
  • Environment variables must be configured for all Azure endpoints and subscription keys.
  • The code is based on LangChain; refer to the LangChain documentation for installation and usage.
  • Demo videos and a live demo are available on the project website.
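
Because every Azure endpoint and key has to be present in the environment, a small pre-flight check can save a failed run. The sketch below is an assumption-laden example: the variable names are hypothetical placeholders, and the real names should be taken from the project's README.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Hypothetical names for illustration only; substitute the names the README lists.
REQUIRED_VARS = (
    "EXAMPLE_VISION_ENDPOINT",
    "EXAMPLE_VISION_KEY",
    "EXAMPLE_FORM_RECOGNIZER_ENDPOINT",
    "EXAMPLE_FORM_RECOGNIZER_KEY",
    "EXAMPLE_BING_SEARCH_KEY",
    "EXAMPLE_OPENAI_ENDPOINT",
    "EXAMPLE_OPENAI_KEY",
)

def missing_settings(required: Iterable[str] = REQUIRED_VARS,
                     env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

A startup script could call missing_settings() and stop with a clear message listing the returned names before any request is sent.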

Highlighted Details

  • Integrates ChatGPT with specialized vision experts for multimodal reasoning.
  • Uses image file paths as placeholders for ChatGPT to interact with vision APIs.
  • Supports various vision tasks including object detection, OCR, and dense captioning.
  • Designed to be extensible with custom vision experts.
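
One way to picture the extensibility point is a registry of experts that all share the same contract: image path in, text out. The sketch below is hypothetical throughout (registry, decorator, text format, and stub detections are invented for illustration) and does not reflect how the project itself registers tools.

```python
from typing import Callable, Dict, List, Tuple

# (label, (x, y, width, height)) in pixels; an assumed shape for this sketch.
Detection = Tuple[str, Tuple[int, int, int, int]]

EXPERT_REGISTRY: Dict[str, Callable[[str], str]] = {}

def register_expert(name: str):
    """Decorator that adds a path-to-text function to the registry under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        EXPERT_REGISTRY[name] = fn
        return fn
    return decorator

def serialize_detections(detections: List[Detection]) -> str:
    """Flatten structured detections into one line of text for the language model."""
    parts = [f"{label} at (x={x}, y={y}, w={w}, h={h})"
             for label, (x, y, w, h) in detections]
    return "; ".join(parts) or "no objects found"

@register_expert("objects")
def stub_object_expert(image_path: str) -> str:
    # A real expert would call a detection API with image_path; these are fixed stubs.
    detections = [("dog", (10, 20, 100, 80)), ("ball", (150, 90, 30, 30))]
    return serialize_detections(detections)
```

Adding another expert then amounts to writing one more decorated function, since the language model only ever consumes the serialized text.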

Maintenance & Community

  • Developed by Microsoft.
  • Contributions are welcome via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The system relies heavily on Azure services, requiring significant setup and configuration of multiple Azure Cognitive Services and OpenAI endpoints. Support for public endpoints for Azure OpenAI is planned but not yet implemented at the time of writing.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
