Framework for open-world visual tasks, combining multiple models
This project provides a powerful pipeline for open-world object detection and segmentation, combining state-of-the-art models such as Grounding DINO and Segment Anything (SAM). It targets researchers and developers who need flexible building blocks for visual tasks, enabling text-guided detection, segmentation, and even image generation and editing.
How It Works
The core approach chains Grounding DINO for text-based object detection (outputting bounding boxes and class labels) with SAM for precise segmentation of detected objects. This modular design allows for easy integration of other models, such as Stable Diffusion for inpainting or RAM/Tag2Text for automatic image labeling, creating versatile workflows for complex visual tasks.
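For illustration, here is a minimal sketch of that two-stage chain using the groundingdino and segment_anything Python APIs. The config/checkpoint paths, demo image, caption, and thresholds below are placeholder assumptions to adapt to your own setup:

    import torch
    from groundingdino.util.inference import load_model, load_image, predict
    from segment_anything import sam_model_registry, SamPredictor

    # Assumed local paths: point these at your cloned configs and downloaded weights.
    dino = load_model("GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
                      "groundingdino_swint_ogc.pth")
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # Stage 1: text-guided detection. Grounding DINO returns boxes in
    # normalized (cx, cy, w, h) form plus a matched phrase per box.
    image_source, image = load_image("demo.jpg")
    boxes, logits, phrases = predict(model=dino, image=image,
                                     caption="dog . chair .",
                                     box_threshold=0.35, text_threshold=0.25,
                                     device="cpu")  # use "cuda" on a GPU machine

    # Stage 2: convert the first box to pixel-space xyxy and prompt SAM with it.
    h, w = image_source.shape[:2]
    cx, cy, bw, bh = (boxes[0] * torch.tensor([w, h, w, h])).tolist()
    box_xyxy = torch.tensor([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])
    predictor.set_image(image_source)
    masks, scores, _ = predictor.predict(box=box_xyxy.numpy(), multimask_output=False)

The resulting mask can then feed downstream models, for example as the mask input of a Stable Diffusion inpainting pipeline.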
Quick Start & Requirements
Clone the repository and set up its dependencies, then install with

    pip install -e .

A Docker installation is also provided. Pretrained model checkpoints must be downloaded separately (e.g., sam_vit_h_4b8939.pth).
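As a post-install sanity check, a minimal sketch (assuming the segment_anything package is installed and the ViT-H checkpoint sits in the working directory) that simply loads the SAM weights:

    from segment_anything import sam_model_registry

    # The registry key ("vit_h") must match the downloaded checkpoint variant.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    print(f"SAM loaded with {sum(p.numel() for p in sam.parameters()):,} parameters")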
Highlighted Details
Maintenance & Community
The project is actively developed by IDEA Research, with frequent updates, and numerous community extensions are highlighted in the repository. Links to Hugging Face demos and a technical report on arXiv are available.
Licensing & Compatibility
The project's components are derived from other open-source projects, each with its own license. Grounding DINO and SAM are released under permissive licenses (Apache 2.0), allowing commercial use and integration into closed-source projects; other integrated models (e.g., Stable Diffusion) ship with their own terms and should be checked individually.
Limitations & Caveats
While highly versatile, the setup involves managing multiple large model checkpoints. Some advanced features, like those involving ChatGPT, require API keys and may incur costs. The project is a research endeavor, and stability for production use may vary.