Video LLMs for fine-grained spatial-temporal understanding
VideoRefer Suite enhances Video Large Language Models (Video LLMs) with fine-grained spatial-temporal object understanding. It provides a model (VideoRefer), a large-scale object-level video instruction dataset (VideoRefer-700K), and a benchmark (VideoRefer-Bench) for evaluating these capabilities, targeting researchers and developers in video AI.
How It Works
VideoRefer integrates a visual encoder (e.g., SigLIP) with a language decoder (e.g., Qwen) to process video inputs. It supports fine-grained perception and reasoning on user-defined regions across single or multiple frames. This approach allows for detailed object understanding and retrieval within videos, addressing limitations in current Video LLMs' ability to precisely locate and describe specific objects.
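To make the region-level idea concrete, here is a minimal conceptual sketch (not the repository's actual API): patch features from the visual encoder are average-pooled inside a user-provided object mask to form a single object token, which the full model interleaves with text tokens for the language decoder. The function name, tensor shapes, and the 1152-dim feature size are illustrative assumptions.

```python
import torch

def pool_object_tokens(frame_features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average visual-encoder patch features inside a binary object mask.

    frame_features: (H, W, D) patch features from the visual encoder (e.g., SigLIP).
    mask: (H, W) binary mask marking the user-defined region.
    Returns a single (D,) object token summarizing the region.
    """
    weights = mask.float().unsqueeze(-1)                 # (H, W, 1)
    summed = (frame_features * weights).sum(dim=(0, 1))  # (D,)
    return summed / weights.sum().clamp(min=1.0)

# Dummy data standing in for one frame's patch grid and a user-selected region.
frame_features = torch.randn(24, 24, 1152)  # assumed SigLIP-style feature width
mask = torch.zeros(24, 24)
mask[8:16, 8:16] = 1.0                      # the user-defined object region

object_token = pool_object_tokens(frame_features, mask)
print(object_token.shape)  # torch.Size([1152])
# In the full model, object tokens from one or more frames are interleaved with
# text tokens and passed to the language decoder (e.g., Qwen) for reasoning.
```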
Quick Start & Requirements
cd VideoRefer
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Pinned dependencies: transformers == 4.40.0, tokenizers == 0.19.1.
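A quick sanity check (illustrative, not part of the repository) that the pinned versions are what Python actually imports:

```python
# Verify the environment matches the pinned versions from the quick start.
import tokenizers
import transformers

assert transformers.__version__ == "4.40.0", transformers.__version__
assert tokenizers.__version__ == "0.19.1", tokenizers.__version__
print("Environment matches VideoRefer's pinned dependencies.")
```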
Integration with SAM2 requires separate installation and model download.
Highlighted Details
Maintenance & Community
The project is associated with DAMO-NLP-SG and has recent activity, including acceptance at CVPR 2025 and model/dataset releases in early 2025. It builds on prior work such as VideoLLaMA 2 and VideoLLaMA 3.
Licensing & Compatibility
The repository does not explicitly state a license; the project is presented as open-source code for research purposes. Suitability for commercial use would require clarification.
Limitations & Caveats
The project is research code, with the latest models and datasets released in early 2025. Specific performance benchmarks are mentioned but not detailed in the README. SAM2 integration requires additional setup.