grounded-segment-any-parts by Saiyan-World

Part-level segmentation with language prompts, extending SAM

created 2 years ago
411 stars

Top 72.2% on sourcepulse

Project Summary

This repository extends the Segment Anything Model (SAM) to enable segmentation based on text prompts, supporting both object-level (e.g., "dog") and fine-grained part-level (e.g., "dog head") queries. It also integrates with a Visual ChatGPT-based dialogue system for natural language interaction with various segmentation models, offering a more intuitive and flexible approach to image editing and analysis.

How It Works

The project leverages a combination of foundation models: Segment Anything (SAM) for general-purpose segmentation, GLIP or VLPart for grounded language-image understanding, and Visual ChatGPT to orchestrate interactions. GLIP/VLPart provides open-vocabulary detection, mapping text prompts to image regions; the resulting boxes are passed to SAM as prompts for precise mask generation. VLPart specifically targets denser, open-vocabulary part-level segmentation, which is what enables queries such as "dog head" rather than just "dog".
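A minimal sketch of that detect-then-segment flow in Python. The `SamPredictor` calls are from the official `segment_anything` package and the checkpoint name comes from the prerequisites below; `detect_parts` is a hypothetical stand-in for the VLPart/GLIP grounding step, not the repository's actual API.

```python
# Sketch only: detect_parts is a hypothetical wrapper around the VLPart/GLIP
# detector; SamPredictor is the official segment_anything API.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def detect_parts(image_bgr: np.ndarray, text_prompt: str) -> np.ndarray:
    """Hypothetical grounding step: return XYXY boxes for regions
    matching the text prompt (e.g. "dog head")."""
    raise NotImplementedError

# Load SAM with the ViT-H checkpoint listed in the prerequisites.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.imread("assets/twodogs.jpeg")
predictor.set_image(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

# Text prompt -> boxes (VLPart/GLIP), then boxes -> masks (SAM).
boxes = detect_parts(image, "dog head")            # shape (N, 4), XYXY
masks = []
for box in boxes:
    mask, score, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])                          # (H, W) boolean mask
```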

Quick Start & Requirements

  • Install: Follow installation instructions in the repository.
  • Prerequisites: Python, PyTorch, CUDA (for GPU acceleration), OpenAI API key for Visual ChatGPT integration. Specific model checkpoints (e.g., swinbase_part_0a0000.pth, sam_vit_h_4b8939.pth, glip_large.pth) need to be downloaded.
  • Usage (see the consolidated shell sketch after this list):
    • Part-level segmentation: python demo_vlpart_sam.py --input_image assets/twodogs.jpeg --output_dir outputs_demo --text_prompt "dog head"
    • Object-level segmentation: python demo_glip_sam.py --input_image assets/demo2.jpeg --output_dir outputs_demo --text_prompt "frog"
    • Visual ChatGPT integration: python chatbot.py --load "ImageCaptioning_cuda:0, SegmentAnything_cuda:1, PartPromptSegmentAnything_cuda:1, ObjectPromptSegmentAnything_cuda:0" (requires export OPENAI_API_KEY={Your_Private_Openai_Key})
  • Links: Blog, Chinese Blog
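The same commands collected into a single shell session for reference; installation and checkpoint downloads (per the repository's instructions) are assumed to be done already.

```bash
# Assumes the repo is installed and the checkpoints listed above
# (swinbase_part_0a0000.pth, sam_vit_h_4b8939.pth, glip_large.pth) are in place.

# Part-level segmentation with VLPart + SAM
python demo_vlpart_sam.py --input_image assets/twodogs.jpeg \
    --output_dir outputs_demo --text_prompt "dog head"

# Object-level segmentation with GLIP + SAM
python demo_glip_sam.py --input_image assets/demo2.jpeg \
    --output_dir outputs_demo --text_prompt "frog"

# Visual ChatGPT integration (requires an OpenAI API key)
export OPENAI_API_KEY={Your_Private_Openai_Key}
python chatbot.py --load "ImageCaptioning_cuda:0, SegmentAnything_cuda:1, PartPromptSegmentAnything_cuda:1, ObjectPromptSegmentAnything_cuda:0"
```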

Highlighted Details

  • Supports both object-level and part-level text-prompted segmentation.
  • Integrates with Visual ChatGPT for natural language-driven image editing.
  • Utilizes GLIP for grounded language-image pre-training and open-vocabulary object detection.
  • VLPart enables denser, open-vocabulary part segmentation.

Maintenance & Community

The project acknowledges significant contributions from the Segment Anything, EditAnything, CLIP, GLIP, Grounded-Segment-Anything, and Visual ChatGPT projects. Citation information is provided for research use.

Licensing & Compatibility

Licensed under CC-BY-NC 4.0, which permits sharing and adaptation with attribution but prohibits commercial use.

Limitations & Caveats

The CC-BY-NC 4.0 license prohibits commercial use, which may limit adoption in commercial products or services. The project relies on external models and may inherit their limitations or require specific versions for compatibility.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days
