Visual prompting method for GPT-4V and LMMs
Top 29.1% on sourcepulse
This project introduces Set-of-Mark (SoM) prompting, a technique designed to enhance the visual grounding capabilities of Large Multimodal Models (LMMs), particularly GPT-4V. It targets researchers and developers who need LMMs to understand and reason about specific regions within images, offering a simple way to obtain more precise, grounded visual answers.
How It Works
SoM prompting involves overlaying spatially defined, numbered marks directly onto images. These marks act as explicit references, allowing users to query specific image regions through interleaved text and visual prompts. The system leverages state-of-the-art segmentation models like Mask DINO, OpenSeeD, GroundingDINO, SEEM, Semantic-SAM, and Segment Anything to generate these marks at various granularities, enabling fine-grained control over the visual grounding process.
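The core mechanism can be illustrated with a short, self-contained sketch (this is not the repository's implementation): given segmentation masks from any of the models above, draw a numbered mark at each mask's centroid, then hand the marked image to the LMM together with a text prompt that refers to the numbers. The `overlay_marks` helper below is hypothetical and assumes masks arrive as boolean NumPy arrays.

```python
# Illustrative sketch of the SoM idea (not the repository's code):
# place a numbered mark at the centroid of each segmentation mask so that
# image regions can be referenced by number in a text prompt.
import numpy as np
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, masks: list) -> Image.Image:
    """Return a copy of `image` with a numbered dot drawn on each mask."""
    marked = image.convert("RGB").copy()
    draw = ImageDraw.Draw(marked)
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)          # pixel coordinates covered by the mask
        if len(xs) == 0:
            continue
        cx, cy = int(xs.mean()), int(ys.mean())  # mark at the mask centroid
        r = 12
        draw.ellipse((cx - r, cy - r, cx + r, cy + r), fill="red")
        draw.text((cx - 4, cy - 7), str(idx), fill="white")
    return marked

# The marked image is then paired with an interleaved prompt such as:
# "What is the object labeled 3, and how does it relate to the object labeled 5?"
```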
Quick Start & Requirements
Install the SEEM segmentation package from source (pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git). A make.sh script within the ops directory needs to be executed to build the Deformable Convolution extension, and pretrained checkpoints are fetched with sh download_ckpt.sh. The GPT-4V demo is launched with python demo_gpt4v_som.py after setting OPENAI_API_KEY.
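Once a marked image exists, querying a GPT-4V-class model about specific numbered regions is an ordinary multimodal chat call. The sketch below uses the OpenAI Python SDK; the model name and the som_marked.png filename are placeholders, not values taken from this repository.

```python
# Hypothetical sketch: ask a GPT-4V-class model about numbered regions in a
# SoM-marked image. Requires OPENAI_API_KEY in the environment;
# "som_marked.png" is a placeholder for an image with marks already overlaid.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("som_marked.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model your API key can access
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "The image has numbered marks on its regions. "
                     "What is the object at mark 2, and is it touching the object at mark 5?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```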
Highlighted Details
Maintenance & Community
The project is led by Jianwei Yang and includes core contributors from Microsoft. A related project, SoM-LLaVA, has been released to extend SoM prompting to open-source MLLMs.
Licensing & Compatibility
The repository appears to be under a permissive license, but specific details are not explicitly stated in the README. Compatibility with commercial or closed-source applications would require verification of the exact license terms.
Limitations & Caveats
The primary demo relies on the GPT-4V API, which may have associated costs and usage restrictions. The setup involves installing multiple external segmentation models, which could lead to dependency conflicts or complex environment management.
Last updated 11 months ago; currently inactive.