SoM by microsoft

Visual prompting method for GPT-4V and LMMs

created 1 year ago
1,433 stars

Top 29.1% on sourcepulse

View on GitHub
Project Summary

This project introduces Set-of-Mark (SoM) prompting, a technique designed to enhance the visual grounding capabilities of Large Multimodal Models (LMMs), particularly GPT-4V. It is aimed at researchers and developers who need their models to understand and reason about specific regions within images, offering a method for more precise, grounded visual understanding.

How It Works

SoM prompting involves overlaying spatially defined, numbered marks directly onto images. These marks act as explicit references, allowing users to query specific image regions through interleaved text and visual prompts. The system leverages state-of-the-art segmentation models like Mask DINO, OpenSeeD, GroundingDINO, SEEM, Semantic-SAM, and Segment Anything to generate these marks at various granularities, enabling fine-grained control over the visual grounding process.
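
To make the marking step concrete, here is a minimal sketch of a numbered-mark overlay. It is not the repository's actual renderer (which also draws mask outlines and colored region overlays); it assumes boolean masks such as those produced by SAM or SEEM, and the helper name overlay_marks is hypothetical:

    import numpy as np
    import cv2  # pip install opencv-python

    def overlay_marks(image: np.ndarray, masks: list) -> np.ndarray:
        """Draw a numbered label at the centroid of each segmentation mask."""
        marked = image.copy()
        for i, mask in enumerate(masks, start=1):  # masks: boolean HxW arrays
            ys, xs = np.nonzero(mask)
            if len(xs) == 0:
                continue
            cx, cy = int(xs.mean()), int(ys.mean())  # centroid as label anchor
            cv2.putText(marked, str(i), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX,
                        1.0, (255, 255, 255), 2, cv2.LINE_AA)
        return marked

The numbers drawn here are what the text prompt later refers to ("what is in region 3?"), which is what makes the model's answers spatially grounded.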

Quick Start & Requirements

  • Install: Requires installing several segmentation packages via pip (e.g., pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git). A make.sh script in the ops directory must be run to compile the Deformable Convolution operators.
  • Prerequisites: GPT-4V API access is required for the primary demo. Dependencies include Python, PyTorch, and various segmentation libraries.
  • Demo: Run python demo_gpt4v_som.py after setting OPENAI_API_KEY (see the request sketch after this list).
  • Resources: Downloading pretrained models is handled by sh download_ckpt.sh.
  • Links: Project Page, arXiv Paper, Hugging Face Demo.
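
For orientation, the following is a minimal sketch of the kind of GPT-4V request the demo ultimately makes: an interleaved text-and-image prompt that refers to regions by their mark numbers. It is not demo_gpt4v_som.py itself; the model name and the som_marked.png filename are illustrative assumptions, and OPENAI_API_KEY must be set in the environment:

    import base64
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Encode the mark-annotated image produced by the segmentation toolbox.
    with open("som_marked.png", "rb") as f:  # hypothetical output filename
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable GPT-4 model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Which numbered region contains the traffic light? "
                         "Answer with the mark number only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)

The reply can then be parsed for the mark number and mapped back to the corresponding mask.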

Highlighted Details

  • Enables interleaved text-and-visual prompts for precise region referencing.
  • Demonstrates significant improvements in tasks like GUI navigation, anomaly detection, and CAPTCHA solving.
  • Achieves performance comparable to specialized models on COCO panoptic segmentation.
  • Offers a toolbox for generating automatic or interactive image masks (see the SAM-based sketch below).
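
As one example of the mask-generation step, here is a minimal sketch using Segment Anything's automatic mask generator, one of the segmentation backends the toolbox supports. The checkpoint name matches the public SAM ViT-H release (download_ckpt.sh fetches the weights this project actually uses), and the input filename is an assumption:

    import cv2
    import torch
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    # Load SAM with the public ViT-H checkpoint.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    sam.to("cuda" if torch.cuda.is_available() else "cpu")

    generator = SamAutomaticMaskGenerator(sam)
    image = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2RGB)  # RGB uint8
    masks = generator.generate(image)  # list of dicts with a 'segmentation' key

    # Boolean masks, ready for the numbered-mark overlay sketched above.
    binary_masks = [m["segmentation"] for m in masks]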

Maintenance & Community

The project is led by Jianwei Yang and includes core contributors from Microsoft. A related project, SoM-LLaVA, has been released to extend SoM prompting to open-source MLLMs.

Licensing & Compatibility

The repository appears to be under a permissive license, but specific details are not explicitly stated in the README. Compatibility with commercial or closed-source applications would require verification of the exact license terms.

Limitations & Caveats

The primary demo relies on the GPT-4V API, which may have associated costs and usage restrictions. The setup involves installing multiple external segmentation models, which could lead to dependency conflicts or complex environment management.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 64 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Omar Sanseviero (DevRel at Google DeepMind), and 1 more.

EditAnything by sail-sg

Image editing research paper using segmentation and diffusion
3k stars · created 2 years ago · updated 5 months ago