VideoRefer by DAMO-NLP-SG

Video LLMs for fine-grained spatial-temporal understanding

created 7 months ago
252 stars

Top 99.6% on SourcePulse

Project Summary

VideoRefer Suite enhances Video Large Language Models (Video LLMs) with fine-grained spatial-temporal object understanding. It provides a model (VideoRefer), a large-scale object-level video instruction dataset (VideoRefer-700K), and a benchmark (VideoRefer-Bench) for evaluating these capabilities, targeting researchers and developers in video AI.

How It Works

VideoRefer integrates a visual encoder (e.g., SigLIP) with a language decoder (e.g., Qwen) to process video inputs. It supports fine-grained perception and reasoning on user-defined regions across single or multiple frames. This approach allows for detailed object understanding and retrieval within videos, addressing limitations in current Video LLMs' ability to precisely locate and describe specific objects.
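The repository's exact inference API is not reproduced here, but the core idea described above, pooling visual-encoder patch features inside a user-specified region mask so the language decoder can reason about that object, can be sketched with off-the-shelf components. This is a minimal illustration only: the SigLIP checkpoint name and the mask-pooling step are assumptions for demonstration, not the project's actual code.

```python
# Illustrative sketch of region-level feature pooling (not VideoRefer's implementation).
# Assumptions: SigLIP checkpoint name and the simple mean-pooling over mask are placeholders.
import numpy as np
import torch
from transformers import SiglipVisionModel, SiglipImageProcessor

ckpt = "google/siglip-so400m-patch14-384"  # assumed encoder checkpoint
encoder = SiglipVisionModel.from_pretrained(ckpt)
processor = SiglipImageProcessor.from_pretrained(ckpt)

# Stand-in for a sampled video frame (H, W, C uint8).
frame = np.random.randint(0, 256, (384, 384, 3), dtype=np.uint8)
inputs = processor(images=[frame], return_tensors="pt")

with torch.no_grad():
    # (num_frames, num_patches, hidden_dim) patch features from the visual encoder.
    patch_features = encoder(**inputs).last_hidden_state

# Pool the patches that fall inside a user-defined region mask to get a single
# object-level embedding the language decoder could attend to.
num_patches = patch_features.shape[1]
side = int(num_patches ** 0.5)               # patch grid is square (27x27 here)
region_mask = torch.zeros(side, side, dtype=torch.bool)
region_mask[10:16, 10:16] = True             # hypothetical object region
object_embedding = patch_features[0][region_mask.flatten()].mean(dim=0)

print(object_embedding.shape)                # one pooled vector per region
```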

Quick Start & Requirements

  • Install: Clone the repository, cd VideoRefer, pip install -r requirements.txt, pip install flash-attn --no-build-isolation.
  • Prerequisites: Python >= 3.8, PyTorch >= 2.2.0, CUDA >= 11.8, transformers == 4.40.0, tokenizers == 0.19.1. Integration with SAM2 requires a separate installation and model checkpoint download. (A quick sanity check for these pins is sketched after this list.)
  • Resources: Links to Hugging Face for datasets, models, and demos are provided. Notebooks offer detailed inference examples.
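Before installing flash-attn, it can help to verify the pinned versions listed above. The check below only uses the version bounds from the prerequisite list; it is not a script shipped with the repository.

```python
# Sanity-check the environment against the prerequisites listed above.
import sys
import torch
import transformers
import tokenizers
from packaging import version

assert sys.version_info >= (3, 8), "Python >= 3.8 required"
assert version.parse(torch.__version__.split("+")[0]) >= version.parse("2.2.0"), \
    "PyTorch >= 2.2.0 required"
assert transformers.__version__ == "4.40.0", "transformers == 4.40.0 expected"
assert tokenizers.__version__ == "0.19.1", "tokenizers == 0.19.1 expected"
assert torch.cuda.is_available(), "CUDA-enabled PyTorch build required (CUDA >= 11.8)"
print("CUDA runtime version:", torch.version.cuda)
```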

Highlighted Details

  • Offers multiple model variants, including VideoRefer-7B and VideoRefer-VideoLLaMA3 (2B and 7B parameters), built on different visual encoders and language decoders (a download sketch follows this list).
  • The VideoRefer-700K dataset is designed for object-level instruction tuning, sourced from datasets like Panda-70M, MeViS, A2D, and Youtube-VOS.
  • VideoRefer-Bench includes sub-benchmarks for description generation (VideoRefer-Bench-D) and question-answering (VideoRefer-Bench-Q).
  • A live demo of VideoRefer-VideoLLaMA3 is available on Hugging Face Spaces.
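The checkpoints and datasets above are distributed through Hugging Face. A minimal download sketch using huggingface_hub is shown below; the repo id is an assumption inferred from the model names, so confirm the exact id on the project's Hugging Face page before use.

```python
# Fetch a VideoRefer checkpoint from the Hugging Face Hub.
# NOTE: the repo id below is assumed from the model name listed above;
# verify it against the project's model cards.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="DAMO-NLP-SG/VideoRefer-7B",       # assumed repo id
    local_dir="checkpoints/VideoRefer-7B",     # where to place the files locally
)
print("Model files downloaded to:", local_dir)
```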

Maintenance & Community

The project is maintained by DAMO-NLP-SG and shows recent activity, including acceptance at CVPR 2025 and model/dataset releases in early 2025. It builds on the group's earlier VideoLLaMA2 and VideoLLaMA3 work.

Licensing & Compatibility

The repository does not explicitly state a license, though the project is presented as open-source research code. Commercial use would require clarification from the maintainers.

Limitations & Caveats

The project is presented as research code, with the latest models and datasets released in early 2025. Specific performance benchmarks are mentioned but not detailed in the README. The integration with SAM2 requires additional setup.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

14 stars in the last 30 days
