Multimodal model for dense grounded image/video understanding
Top 33.3% on sourcepulse
Sa2VA is a unified multimodal model for dense grounded understanding of images and videos, enabling tasks like referring segmentation and conversational analysis. It targets researchers and developers working with vision-language models and video understanding, offering a versatile solution that integrates segmentation and conversational capabilities with minimal fine-tuning.
How It Works
Sa2VA merges the capabilities of SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model. It achieves this by unifying text, image, and video data into a shared Large Language Model (LLM) token space. This approach allows for a single model to handle diverse tasks across different modalities, leveraging the strengths of both segmentation and conversational AI.
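The released checkpoints on Hugging Face expose this unified interface via trust_remote_code. Below is a minimal sketch of image-level inference; the predict_forward entry point, its argument names, and the output keys are assumptions drawn from the published Sa2VA model cards and may differ across checkpoint versions.

```python
# Sketch: image question answering / referring segmentation with a released
# Sa2VA checkpoint. predict_forward and its argument names are assumptions
# based on the Hugging Face model cards, not a guaranteed stable API.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "ByteDance/Sa2VA-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
result = model.predict_forward(
    image=image,
    text="<image>Please segment the person on the left.",
    past_text="",
    mask_prompts=None,
    tokenizer=tokenizer,
)
print(result["prediction"])              # text answer produced by the LLM
masks = result.get("prediction_masks")   # segmentation masks decoded via SAM-2, if any
```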
Quick Start & Requirements
For the demo, install Gradio (pip install gradio==4.42.0) and place the required model files under the data directory. Launch the Gradio demo with PYTHONPATH=. python projects/llava_sam2/gradio/app.py ByteDance/Sa2VA-4B, or use the provided script scripts/demo/demo.py.
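For video referring segmentation, the same entry point reportedly accepts a list of frames in place of a single image. The sketch below continues from the image example above (model and tokenizer already loaded); the video keyword and the frame-directory layout are assumptions based on the model cards.

```python
# Sketch: video referring segmentation, reusing `model` and `tokenizer` from
# the image example above. The 'video' keyword (a list of PIL frames) is an
# assumption based on the published model cards.
import os
from PIL import Image

frame_dir = "data/demo_frames"  # hypothetical directory of extracted frames
frames = [
    Image.open(os.path.join(frame_dir, name)).convert("RGB")
    for name in sorted(os.listdir(frame_dir))
]

result = model.predict_forward(
    video=frames,
    text="<image>Please segment the car driving away from the camera.",
    past_text="",
    mask_prompts=None,
    tokenizer=tokenizer,
)
print(result["prediction"])              # text answer referring to the target object
masks = result.get("prediction_masks")   # per-frame masks tracked by SAM-2, if any
```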
Highlighted Details
Maintenance & Community
The project is associated with researchers from UC Merced, ByteDance, WHU, and PKU. Open-source progress includes released training datasets, evaluation code, and model weights (1B, 4B, 8B, 26B). Releases of Qwen-VL-based models and Pixel-SAIL are planned.
Licensing & Compatibility
The repository does not explicitly state a license in the README. The project is presented as open-source with released code and models.
Limitations & Caveats
The README indicates that releases for Pixel-SAIL models and Qwen-VL related models are "To be done." The sam_v_full dataset is not included in the primary download link and requires a separate download from Meta. Training is suggested on 8 A100 GPUs.