Multimodal model for dense grounded image/video understanding
Top 33.3% on sourcepulse
Sa2VA is a unified multimodal model for dense grounded understanding of images and videos, enabling tasks like referring segmentation and conversational analysis. It targets researchers and developers working with vision-language models and video understanding, offering a versatile solution that integrates segmentation and conversational capabilities with minimal fine-tuning.
How It Works
Sa2VA merges the capabilities of SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model. It achieves this by unifying text, image, and video data into a shared Large Language Model (LLM) token space. This approach allows for a single model to handle diverse tasks across different modalities, leveraging the strengths of both segmentation and conversational AI.
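The released checkpoints on Hugging Face expose this unified interface via trust_remote_code. Below is a minimal sketch of image-level inference; the predict_forward entry point, its argument names, and the output keys are assumptions drawn from the published Sa2VA model cards and may differ across checkpoint versions.

```python
# Sketch: image question answering / referring segmentation with a released
# Sa2VA checkpoint. predict_forward and its argument names are assumptions
# based on the Hugging Face model cards, not a guaranteed stable API.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "ByteDance/Sa2VA-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
result = model.predict_forward(
    image=image,
    text="<image>Please segment the person on the left.",
    past_text="",
    mask_prompts=None,
    tokenizer=tokenizer,
)
print(result["prediction"])              # text answer produced by the LLM
masks = result.get("prediction_masks")   # segmentation masks decoded via SAM-2, if any
```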
Quick Start & Requirements
For the demo, install Gradio (pip install gradio==4.42.0) and place the required model files under the data directory. Launch the Gradio demo with PYTHONPATH=. python projects/llava_sam2/gradio/app.py ByteDance/Sa2VA-4B, or use the provided script scripts/demo/demo.py.
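For video referring segmentation, the same entry point reportedly accepts a list of frames in place of a single image. The sketch below continues from the image example above (model and tokenizer already loaded); the video keyword and the frame-directory layout are assumptions based on the model cards.

```python
# Sketch: video referring segmentation, reusing `model` and `tokenizer` from
# the image example above. The 'video' keyword (a list of PIL frames) is an
# assumption based on the published model cards.
import os
from PIL import Image

frame_dir = "data/demo_frames"  # hypothetical directory of extracted frames
frames = [
    Image.open(os.path.join(frame_dir, name)).convert("RGB")
    for name in sorted(os.listdir(frame_dir))
]

result = model.predict_forward(
    video=frames,
    text="<image>Please segment the car driving away from the camera.",
    past_text="",
    mask_prompts=None,
    tokenizer=tokenizer,
)
print(result["prediction"])              # text answer referring to the target object
masks = result.get("prediction_masks")   # per-frame masks tracked by SAM-2, if any
```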
Highlighted Details
Maintenance & Community
The project is associated with researchers from UC Merced, ByteDance, WHU, and PKU. Open-source progress includes released training datasets, evaluation code, and model weights (1B, 4B, 8B, 26B). Releases of Qwen-VL-based models and Pixel-SAIL are planned.
Licensing & Compatibility
The repository does not explicitly state a license in the README. The project is presented as open-source with released code and models.
Limitations & Caveats
The README indicates that releases for Pixel-SAIL models and Qwen-VL related models are "To be done." The sam_v_full dataset is not included in the primary download link and requires a separate download from Meta. Training is suggested on 8 A100 GPUs.