Sa2VA by magic-research

Multimodal model for dense grounded image/video understanding

created 6 months ago
1,201 stars

Top 33.3% on sourcepulse

View on GitHub
Project Summary

Sa2VA is a unified multimodal model for dense grounded understanding of images and videos, enabling tasks like referring segmentation and conversational analysis. It targets researchers and developers working with vision-language models and video understanding, offering a versatile solution that integrates segmentation and conversational capabilities with minimal fine-tuning.

How It Works

Sa2VA merges SAM-2, a foundation model for video segmentation, with LLaVA, an advanced vision-language model. It does so by unifying text, image, and video inputs into a shared Large Language Model (LLM) token space, which lets a single model handle diverse tasks across modalities while leveraging the strengths of both segmentation and conversational AI.
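A minimal sketch of that idea follows. It is illustrative only: the module names and tensor shapes are hypothetical, not Sa2VA's actual code. The mechanism it depicts, visual features projected into the LLM's token space and the hidden state of a special segmentation token handed to SAM-2's decoder as a mask prompt, follows the design described in the paper.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: names and shapes are hypothetical, not Sa2VA's code.
# Visual features are projected into the LLM's token space; the hidden state of
# a special "[SEG]" output token is passed to SAM-2's decoder as the mask prompt.

class UnifiedTokenSpaceSketch(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Projects vision-encoder features into the LLM embedding space.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_feats, text_embeds, llm, sam2_decoder):
        # frame_feats: (n_visual_tokens, vision_dim) from image or video frames
        # text_embeds: (n_text_tokens, llm_dim) embedded instruction text
        visual_tokens = self.vision_proj(frame_feats)       # now in LLM token space
        sequence = torch.cat([visual_tokens, text_embeds])  # one shared sequence
        hidden = llm(sequence)                              # single LLM over all modalities
        seg_state = hidden[-1]                              # hidden state at the [SEG] token
        return sam2_decoder(seg_state)                      # SAM-2 turns the prompt into masks
```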

Quick Start & Requirements

  • Install: pip install gradio==4.42.0 (for demo)
  • Prerequisites: PyTorch 2.3.1 with CUDA 12.1, mmcv 2.1.0, transformers, peft. Requires downloading SAM-2 and InternVL2.5 pretrained models.
  • Data: Training datasets need to be downloaded and placed in the data directory.
  • Demo: Run PYTHONPATH=. python projects/llava_sam2/gradio/app.py ByteDance/Sa2VA-4B or use the provided script scripts/demo/demo.py; a minimal HuggingFace loading sketch follows this list.
  • Links: Sa2VA, arXiv, HuggingFace, Gradio Demo, Replicate Demo
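For programmatic use, the released checkpoints can be loaded directly from HuggingFace. The sketch below is hedged: the `trust_remote_code` loading pattern is standard, but the `predict_forward` call and its argument names are assumptions based on the ByteDance/Sa2VA-4B model card and should be verified against the card for the checkpoint you download.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Hedged sketch: predict_forward and its arguments are assumed from the
# ByteDance/Sa2VA-4B model card; check the card for your checkpoint.
path = "ByteDance/Sa2VA-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Sa2VA ships custom modeling code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
result = model.predict_forward(
    image=image,
    text="<image>Please describe the image and segment the person on the left.",
    tokenizer=tokenizer,
)
print(result["prediction"])  # text answer; masks accompany [SEG] outputs
```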

Highlighted Details

  • First unified model for dense grounded understanding of both images and videos.
  • Supports referring segmentation and conversation with minimal one-shot instruction tuning.
  • Offers model variants from 1B to 26B parameters.
  • Achieved first and third place in the 4th PVUW Workshop@CVPR 2025.

Maintenance & Community

The project is associated with researchers from UC Merced, ByteDance, WHU, and PKU. Open-source progress includes released training datasets, evaluation code, and model weights (1B, 4B, 8B, 26B). Releases of Qwen-VL-based models and Pixel-SAIL are planned.

Licensing & Compatibility

The repository does not explicitly state a license in the README. The project is presented as open-source with released code and models.

Limitations & Caveats

The README indicates that releases for the Pixel-SAIL models and the Qwen-VL-based models are still marked "To be done." The sam_v_full dataset is not included in the primary download link and must be downloaded separately from Meta. The suggested training setup uses 8 A100 GPUs.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 145 stars in the last 90 days
