Video LLMs for fine-grained spatial-temporal understanding
VideoRefer Suite enhances Video Large Language Models (Video LLMs) with fine-grained spatial-temporal object understanding. It provides a model (VideoRefer), a large-scale object-level video instruction dataset (VideoRefer-700K), and a benchmark (VideoRefer-Bench) for evaluating these capabilities, targeting researchers and developers in video AI.
How It Works
VideoRefer integrates a visual encoder (e.g., SigLIP) with a language decoder (e.g., Qwen) to process video inputs. It supports fine-grained perception and reasoning on user-defined regions across single or multiple frames. This approach allows for detailed object understanding and retrieval within videos, addressing limitations in current Video LLMs' ability to precisely locate and describe specific objects.
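To make the region-level idea concrete, here is a minimal conceptual sketch (not the repository's actual API): patch features from the visual encoder are average-pooled inside a user-provided object mask to form a single object token, which the full model interleaves with text tokens for the language decoder. The function name, tensor shapes, and the 1152-dim feature size are illustrative assumptions.

```python
import torch

def pool_object_tokens(frame_features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average visual-encoder patch features inside a binary object mask.

    frame_features: (H, W, D) patch features from the visual encoder (e.g., SigLIP).
    mask: (H, W) binary mask marking the user-defined region.
    Returns a single (D,) object token summarizing the region.
    """
    weights = mask.float().unsqueeze(-1)                 # (H, W, 1)
    summed = (frame_features * weights).sum(dim=(0, 1))  # (D,)
    return summed / weights.sum().clamp(min=1.0)

# Dummy data standing in for one frame's patch grid and a user-selected region.
frame_features = torch.randn(24, 24, 1152)  # assumed SigLIP-style feature width
mask = torch.zeros(24, 24)
mask[8:16, 8:16] = 1.0                      # the user-defined object region

object_token = pool_object_tokens(frame_features, mask)
print(object_token.shape)  # torch.Size([1152])
# In the full model, object tokens from one or more frames are interleaved with
# text tokens and passed to the language decoder (e.g., Qwen) for reasoning.
```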
Quick Start & Requirements
cd VideoRefer
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Pinned dependencies: transformers == 4.40.0, tokenizers == 0.19.1.
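A quick sanity check (illustrative, not part of the repository) that the pinned versions are what Python actually imports:

```python
# Verify the environment matches the pinned versions from the quick start.
import tokenizers
import transformers

assert transformers.__version__ == "4.40.0", transformers.__version__
assert tokenizers.__version__ == "0.19.1", tokenizers.__version__
print("Environment matches VideoRefer's pinned dependencies.")
```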
Integration with SAM2 requires separate installation and model download.
Highlighted Details
Maintenance & Community
The project is associated with DAMO-NLP-SG and has recent activity, including acceptance at CVPR 2025 and model/dataset releases in early 2025. It builds on prior work such as VideoLLaMA 2 and VideoLLaMA 3.
Licensing & Compatibility
The repository does not explicitly state a license; the project is presented as open-source code for research purposes. Suitability for commercial use would require clarification.
Limitations & Caveats
The project is research code, with the latest models and datasets released in early 2025. Specific performance benchmarks are mentioned but not detailed in the README. SAM2 integration requires additional setup.