Video LLM for fine-grained video moment understanding
VTimeLLM is a PyTorch implementation for fine-grained video moment understanding and temporal reasoning, targeting researchers and developers in video-language modeling. It offers enhanced temporal awareness and intent alignment for LLMs processing video content.
How It Works
VTimeLLM employs a novel boundary-aware three-stage training strategy. It first aligns features using image-text pairs, then enhances temporal-boundary awareness with multi-event videos and temporal QA, and finally refines temporal understanding and human intent alignment through instruction tuning on high-quality dialogue datasets. This approach aims to outperform existing Video LLMs in fine-grained temporal tasks.
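The staged curriculum can be pictured as three sequential fine-tuning passes, each over a different data source and with a different goal. The sketch below is purely illustrative: the Stage dataclass, dataset labels, and the fine_tune call are hypothetical stand-ins for exposition, not VTimeLLM's actual training code (see train.md for that).

```python
# Illustrative sketch of a boundary-aware three-stage training schedule.
# All names here (Stage, SCHEDULE, fine_tune, dataset labels) are hypothetical
# stand-ins, not VTimeLLM's actual training API.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str    # stage identifier
    data: str    # data source driving this stage
    goal: str    # what the stage is meant to teach the model

SCHEDULE = [
    Stage("feature_alignment", "image_text_pairs",
          "map visual features into the LLM's token space"),
    Stage("boundary_awareness", "multi_event_videos_with_temporal_qa",
          "localize event boundaries and answer when-questions"),
    Stage("instruction_tuning", "high_quality_dialogue",
          "follow human intent in temporal dialogue"),
]

for stage in SCHEDULE:
    # fine_tune(model, dataset=stage.data)  # hypothetical trainer call
    print(f"{stage.name}: train on {stage.data} -> {stage.goal}")
```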
Quick Start & Requirements
Within a conda environment (python=3.10), install the dependencies:

```
pip install -r requirements.txt
pip install ninja flash-attn --no-build-isolation
```

See offline_demo.md for running the demo and train.md for training instructions.
Maintenance & Community
Development appears inactive; the last update was about a year ago.
Licensing & Compatibility
The project is released under a non-commercial license, restricting its use in commercial products.
Limitations & Caveats
Specific hardware requirements for training or running the models are not detailed in the README.