Sa2VA by magic-research

Multimodal model for dense grounded image/video understanding

Created 8 months ago
1,252 stars

Top 31.6% on SourcePulse

View on GitHub
Project Summary

Sa2VA is a unified multimodal model for dense grounded understanding of images and videos, enabling tasks like referring segmentation and conversational analysis. It targets researchers and developers working with vision-language models and video understanding, offering a versatile solution that integrates segmentation and conversational capabilities with minimal fine-tuning.

How It Works

Sa2VA merges the capabilities of SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model. It achieves this by unifying text, image, and video data into a shared Large Language Model (LLM) token space. This approach allows for a single model to handle diverse tasks across different modalities, leveraging the strengths of both segmentation and conversational AI.
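
To make the text-to-mask link concrete: the paper describes a special "[SEG]" token, where the LLM answers in text and, when grounding is requested, the hidden state of the emitted "[SEG]" token is used as a prompt for SAM-2's mask decoder. The sketch below illustrates that flow with hypothetical callables (vision_encoder, llm, sam2_decoder, tokenizer); it is a conceptual outline, not the repository's actual API.

    def grounded_answer(frames, question, vision_encoder, llm, sam2_decoder, tokenizer):
        # Project video frames (or a single image) and the question into the
        # shared LLM token space.
        visual_tokens = vision_encoder(frames)
        text_tokens = tokenizer.encode(question)

        # The LLM generates a text answer; hidden states are kept so that the
        # "[SEG]" token's embedding can later prompt SAM-2.
        output_ids, hidden_states = llm.generate([*visual_tokens, *text_tokens])
        answer = tokenizer.decode(output_ids)

        masks = None
        seg_id = tokenizer.encode("[SEG]")[0]
        if seg_id in output_ids:
            # The "[SEG]" hidden state acts as a prompt embedding for SAM-2,
            # which decodes the mask and propagates it across the frames.
            seg_embedding = hidden_states[output_ids.index(seg_id)]
            masks = sam2_decoder(frames, prompt=seg_embedding)
        return answer, masks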

Quick Start & Requirements

  • Install: pip install gradio==4.42.0 (for demo)
  • Prerequisites: PyTorch 2.3.1 with CUDA 12.1, mmcv 2.1.0, transformers, peft. Requires downloading SAM-2 and InternVL2.5 pretrained models.
  • Data: Training datasets need to be downloaded and placed in the data directory.
  • Demo: Run PYTHONPATH=. python projects/llava_sam2/gradio/app.py ByteDance/Sa2VA-4B or use the provided script scripts/demo/demo.py; a programmatic inference sketch follows this list.
  • Links: Sa2VA, arXiv, HuggingFace, Gradio Demo, Replicate Demo
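
Beyond the Gradio demo, the HuggingFace checkpoints can be loaded directly with transformers (the models ship custom code, so trust_remote_code=True is required). The sketch below follows the pattern on the ByteDance/Sa2VA-4B model card; the predict_forward call and its return keys are assumptions and should be verified against the card.

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    path = "ByteDance/Sa2VA-4B"
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

    # Grounded question about a single image; "<image>" marks where the image
    # is inserted into the prompt.
    image = Image.open("example.jpg").convert("RGB")
    result = model.predict_forward(
        image=image,
        text="<image>Please segment the person on the left.",
        tokenizer=tokenizer,
    )
    print(result["prediction"])               # text answer
    masks = result.get("prediction_masks")    # segmentation masks, if any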

Highlighted Details

  • First unified model for dense grounded understanding of both images and videos.
  • Supports referring segmentation and conversation with minimal one-shot instruction tuning.
  • Offers model variants from 1B to 26B parameters.
  • Achieved first and third place in the 4th PVUW Workshop@CVPR 2025.

Maintenance & Community

The project is associated with researchers from UC Merced, ByteDance, WHU, and PKU. Open-source progress includes released training datasets, evaluation code, and model weights (1B, 4B, 8B, 26B). Planned releases include Qwen-VL-based models and Pixel-SAIL.

Licensing & Compatibility

The repository does not explicitly state a license in the README. The project is presented as open-source with released code and models.

Limitations & Caveats

The README marks releases for Pixel-SAIL models and Qwen-VL-related models as "To be done." The sam_v_full dataset is not included in the primary download link and must be downloaded separately from Meta. The suggested training setup uses 8 A100 GPUs.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 27
  • Star History: 34 stars in the last 30 days

Explore Similar Projects

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel

Multimodal autoregressive model for long-context video/text

7k stars
Top 0.1% on SourcePulse
Created 1 year ago
Updated 11 months ago