X-SAM by wanghao9610

Unified MLLM for advanced segmentation tasks

Created 3 months ago
304 stars

Top 87.9% on SourcePulse

View on GitHub
Project Summary

X-SAM addresses the limitations of existing segmentation models like SAM by extending capabilities to "any segmentation" through a unified multimodal large language model (MLLM) framework. It targets researchers and developers seeking advanced pixel-level perceptual understanding in MLLMs, offering state-of-the-art performance and a novel Visual GrounDed (VGD) segmentation task.

How It Works

X-SAM employs a unified MLLM framework to overcome LLMs' inherent deficiency in pixel-level understanding and SAM's limitations in multi-mask and category-specific segmentation. It introduces the Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, grounding MLLMs with pixel-wise interpretative capabilities. A unified training strategy enables co-training across diverse datasets, leading to state-of-the-art results on segmentation benchmarks.
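
As an illustration only, the sketch below shows the rough shape of a VGD-style call: an image plus interactive visual prompts (points or boxes) in, a set of labeled instance masks out. All names here (`VGDSegmenter`-style types, `segment_vgd`, the prompt dict keys) are hypothetical placeholders introduced for this sketch, not X-SAM's actual API; consult the repository for the real interface.

```python
# Hypothetical sketch of a Visual GrounDed (VGD) segmentation call.
# None of these names are X-SAM's real API; they only illustrate the
# input/output shape described above: image + visual prompts -> instance masks.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class InstanceMask:
    mask: np.ndarray   # H x W boolean mask for one instance
    category: str      # category name predicted by the MLLM
    score: float       # confidence score


def segment_vgd(image: np.ndarray, prompts: List[dict]) -> List[InstanceMask]:
    """Placeholder: in X-SAM this step would run the unified MLLM and return
    one mask per instance referenced by the interactive visual prompts."""
    raise NotImplementedError("Illustrative interface only")


# Example visual prompts: a point and a box, in pixel coordinates.
prompts = [
    {"type": "point", "xy": (320, 240)},
    {"type": "box", "xyxy": (50, 60, 400, 380)},
]
# masks = segment_vgd(image, prompts)  # -> list of InstanceMask
```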

Quick Start & Requirements

  • Primary install: git clone --depth=1 https://github.com/wanghao9610/X-SAM.git, followed by environment setup and pip/conda installs.
  • Prerequisites: Python 3.10, PyTorch 2.5.1 with CUDA 12.4, xtuner v0.2.0, deepspeed, flash-attention v2.7.3, and GCC 11 (recommended).
  • Setup: Requires careful environment configuration, including CUDA path setup and specific library installations; a quick environment check is sketched after this list.
  • Links: Online demos available; Technical Report and Model Weights on HuggingFace.
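
The version pins above are easy to get wrong, so a minimal sanity check like the sketch below can confirm the interpreter, PyTorch/CUDA build, and flash-attention version before training or inference. It assumes PyTorch and flash-attention are importable in the active environment; the expected version strings are taken from this summary and may change upstream.

```python
# Minimal environment sanity check against the prerequisites listed above
# (Python 3.10, PyTorch 2.5.1 built with CUDA 12.4, flash-attention 2.7.3).
# Expected versions come from this summary; adjust if the upstream README changes.
import sys

import torch

print("Python:", sys.version.split()[0])           # expect 3.10.x
print("PyTorch:", torch.__version__)               # expect 2.5.1
print("CUDA available:", torch.cuda.is_available())
print("CUDA build version:", torch.version.cuda)   # expect 12.4

try:
    import flash_attn
    print("flash-attention:", flash_attn.__version__)  # expect 2.7.3
except ImportError:
    print("flash-attention not installed")
```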

Highlighted Details

  • Unified MLLM framework for advanced pixel-level perceptual understanding.
  • Introduces Visual GrounDed (VGD) segmentation task for instance segmentation via visual prompts.
  • Unified training strategy supports co-training across multiple datasets.
  • Achieves state-of-the-art performance on various image segmentation benchmarks.

Maintenance & Community

The project is under active development with ongoing updates planned. Communication is encouraged in English via GitHub issues. No specific community channels or sponsorship details are provided.

Licensing & Compatibility

The repository's license is not explicitly stated in the README; clarify licensing terms before any commercial or closed-source integration.

Limitations & Caveats

X-SAM is actively under development, with key features such as mixed fine-tuning training code and transformer-compatible inference/demo code still marked as TODOs. Evaluation support currently covers only generic and VGD segmentation. The project pins specific library versions (e.g., PyTorch 2.5.1, CUDA 12.4), which may complicate installation on systems with other CUDA or PyTorch setups.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 22 stars in the last 30 days
