SoM by microsoft

Visual prompting method for GPT-4V and LMMs

created 1 year ago
1,433 stars

Top 29.1% on sourcepulse

View on GitHub
Project Summary

This project introduces Set-of-Mark (SoM) prompting, a technique designed to enhance the visual grounding capabilities of Large Multimodal Models (LMMs), particularly GPT-4V. It is aimed at researchers and developers who need their models to understand and reason about specific regions within images, offering a method for more precise, grounded visual understanding.

How It Works

SoM prompting involves overlaying spatially defined, numbered marks directly onto images. These marks act as explicit references, allowing users to query specific image regions through interleaved text and visual prompts. The system leverages state-of-the-art segmentation models like Mask DINO, OpenSeeD, GroundingDINO, SEEM, Semantic-SAM, and Segment Anything to generate these marks at various granularities, enabling fine-grained control over the visual grounding process.
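
To make the marking step concrete, here is a minimal sketch of a numbered-mark overlay. It is not the repository's actual renderer (which also draws mask outlines and colored region overlays); it assumes boolean masks such as those produced by SAM or SEEM, and the helper name overlay_marks is hypothetical:

    import numpy as np
    import cv2  # pip install opencv-python

    def overlay_marks(image: np.ndarray, masks: list) -> np.ndarray:
        """Draw a numbered label at the centroid of each segmentation mask."""
        marked = image.copy()
        for i, mask in enumerate(masks, start=1):  # masks: boolean HxW arrays
            ys, xs = np.nonzero(mask)
            if len(xs) == 0:
                continue
            cx, cy = int(xs.mean()), int(ys.mean())  # centroid as label anchor
            cv2.putText(marked, str(i), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX,
                        1.0, (255, 255, 255), 2, cv2.LINE_AA)
        return marked

The numbers drawn here are what the text prompt later refers to ("what is in region 3?"), which is what makes the model's answers spatially grounded.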

Quick Start & Requirements

  • Install: Requires installing several segmentation packages via pip (e.g., pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git). A make.sh script in the ops directory must be run to compile the Deformable Convolution operators.
  • Prerequisites: GPT-4V API access is required for the primary demo. Dependencies include Python, PyTorch, and various segmentation libraries.
  • Demo: Run python demo_gpt4v_som.py after setting OPENAI_API_KEY (see the request sketch after this list).
  • Resources: Downloading pretrained models is handled by sh download_ckpt.sh.
  • Links: Project Page, arXiv Paper, Hugging Face Demo.
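
For orientation, the following is a minimal sketch of the kind of GPT-4V request the demo ultimately makes: an interleaved text-and-image prompt that refers to regions by their mark numbers. It is not demo_gpt4v_som.py itself; the model name and the som_marked.png filename are illustrative assumptions, and OPENAI_API_KEY must be set in the environment:

    import base64
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Encode the mark-annotated image produced by the segmentation toolbox.
    with open("som_marked.png", "rb") as f:  # hypothetical output filename
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable GPT-4 model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Which numbered region contains the traffic light? "
                         "Answer with the mark number only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)

The reply can then be parsed for the mark number and mapped back to the corresponding mask.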

Highlighted Details

  • Enables interleaved text-and-visual prompts for precise region referencing.
  • Demonstrates significant improvements in tasks like GUI navigation, anomaly detection, and CAPTCHA solving.
  • Achieves performance comparable to specialized models on COCO panoptic segmentation.
  • Offers a toolbox for generating automatic or interactive image masks (see the SAM-based sketch below).
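
As one example of the mask-generation step, here is a minimal sketch using Segment Anything's automatic mask generator, one of the segmentation backends the toolbox supports. The checkpoint name matches the public SAM ViT-H release (download_ckpt.sh fetches the weights this project actually uses), and the input filename is an assumption:

    import cv2
    import torch
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    # Load SAM with the public ViT-H checkpoint.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    sam.to("cuda" if torch.cuda.is_available() else "cpu")

    generator = SamAutomaticMaskGenerator(sam)
    image = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2RGB)  # RGB uint8
    masks = generator.generate(image)  # list of dicts with a 'segmentation' key

    # Boolean masks, ready for the numbered-mark overlay sketched above.
    binary_masks = [m["segmentation"] for m in masks]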

Maintenance & Community

The project is led by Jianwei Yang and includes core contributors from Microsoft. A related project, SoM-LLaVA, has been released to extend SoM prompting to open-source MLLMs.

Licensing & Compatibility

The repository appears to be under a permissive license, but specific details are not explicitly stated in the README. Compatibility with commercial or closed-source applications would require verification of the exact license terms.

Limitations & Caveats

The primary demo relies on the GPT-4V API, which may have associated costs and usage restrictions. The setup involves installing multiple external segmentation models, which could lead to dependency conflicts or complex environment management.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 64 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Omar Sanseviero (DevRel at Google DeepMind), and 1 more.

EditAnything by sail-sg

Image editing research paper using segmentation and diffusion
3k stars · created 2 years ago · updated 5 months ago