Multi-modal segmentation research paper
SEEM (Segment Everything Everywhere All at Once) is the official implementation of a NeurIPS 2023 paper on versatile image segmentation with multi-modal prompts. It targets researchers and developers who want a single interactive segmentation model that accepts diverse prompt types and produces semantically labeled masks.
How It Works
SEEM uses a unified prompt encoder to project diverse inputs (points, boxes, scribbles, text, and audio) into a joint representation space. This joint space enables compositional and interactive segmentation: users can refine masks over multiple rounds of interaction, with the model retaining a memory of the session's prompt history. The architecture builds on X-Decoder, a generalist decoder that supports multiple segmentation and vision-language tasks.
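As a rough illustration of the joint prompt space (a minimal sketch, not the repository's code; the class name JointPromptEncoder, the 256-dimensional embedding size, and the toy token ids are assumptions made for this example), different prompt types can be projected into a shared embedding dimension and concatenated into one sequence for the decoder to attend to:

    import torch
    import torch.nn as nn

    class JointPromptEncoder(nn.Module):
        """Toy encoder that maps points, boxes, and text tokens into one space."""
        def __init__(self, dim=256, vocab_size=30522):
            super().__init__()
            self.point_proj = nn.Linear(2, dim)              # (x, y) click -> dim
            self.box_proj = nn.Linear(4, dim)                # (x1, y1, x2, y2) -> dim
            self.text_embed = nn.Embedding(vocab_size, dim)  # token ids -> dim
            self.type_embed = nn.Embedding(3, dim)           # marks the prompt type

        def forward(self, points=None, boxes=None, text_ids=None):
            tokens = []
            if points is not None:    # (N, 2) normalized coordinates
                tokens.append(self.point_proj(points) + self.type_embed.weight[0])
            if boxes is not None:     # (M, 4) normalized coordinates
                tokens.append(self.box_proj(boxes) + self.type_embed.weight[1])
            if text_ids is not None:  # (T,) token ids from any tokenizer
                tokens.append(self.text_embed(text_ids) + self.type_embed.weight[2])
            # One sequence of prompt tokens the mask decoder can cross-attend to.
            return torch.cat(tokens, dim=0)

    encoder = JointPromptEncoder()
    prompts = encoder(points=torch.tensor([[0.42, 0.58]]),        # one click
                      text_ids=torch.tensor([1012, 2158, 1013]))  # illustrative ids
    print(prompts.shape)  # torch.Size([4, 256])

Because every prompt lands in the same token sequence, a click can be combined with a text phrase or a scribble in a single forward pass, which is what makes the compositional behavior possible.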
Quick Start & Requirements
git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && cd Segment-Everything-Everywhere-All-At-Once && sh assets/scripts/run_demo.sh
Maintenance & Community
The project is associated with NeurIPS 2023 and has seen follow-up updates, including integration into LLaVA-Interactive and use in Set-of-Mark Prompting. Related projects include FocalNet, DaViT, UniCL, Semantic-SAM, OpenSeeD, Grounding SAM, Grounding DINO, and X-GPT.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The model's default vocabulary is limited to the 80 COCO categories; objects outside it may be misclassified or labeled as 'others'. Open-vocabulary segmentation requires supplying text labels alongside the visual prompts.
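Conceptually, open-vocabulary labeling amounts to matching each predicted mask embedding against embeddings of the user-supplied text labels and falling back to 'others' when nothing matches well. The sketch below illustrates only that matching step; the function name label_masks, the 512-dimensional embeddings, and the similarity threshold are assumptions, not the repository's interface:

    import torch
    import torch.nn.functional as F

    def label_masks(mask_emb, text_emb, labels, threshold=0.2):
        """Assign each mask the closest user-provided label, else 'others'."""
        sim = F.normalize(mask_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
        scores, idx = sim.max(dim=-1)
        return [labels[i] if s >= threshold else "others"
                for s, i in zip(scores.tolist(), idx.tolist())]

    # Example: two predicted masks scored against a custom three-word vocabulary.
    mask_emb = torch.randn(2, 512)   # stand-ins for decoder mask embeddings
    text_emb = torch.randn(3, 512)   # stand-ins for encoded text labels
    print(label_masks(mask_emb, text_emb, ["zebra", "traffic cone", "kayak"]))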