Video-language temporal grounding model
UniVTG is a novel video-language temporal grounding pretraining model designed to unify diverse temporal annotations. It addresses moment retrieval, highlight detection, and video summarization, targeting researchers and practitioners in video understanding and multimodal AI. The primary benefit is a unified framework that enhances performance across various temporal grounding tasks.
How It Works
UniVTG employs a unified pretraining strategy that leverages diverse temporal annotations (interval, curve, point) to build a robust video-language understanding model. This approach allows the model to learn a generalized representation of temporal relationships within videos, enabling it to adapt to different grounding granularities without task-specific architectural changes.
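To make the unification concrete, here is a minimal sketch of how the three annotation types might map onto one shared per-clip label format (foreground indicator, boundary offsets, saliency score). The `ClipLabel` fields and helper names below are hypothetical illustrations under that assumption, not the repository's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ClipLabel:
    """Hypothetical unified per-clip label."""
    foreground: float                        # 1.0 if the clip lies inside a target moment
    offsets: Optional[Tuple[float, float]]   # distances (s) to moment start/end, if known
    saliency: float                          # relevance score in [0, 1]

def labels_from_interval(num_clips: int, clip_len: float,
                         start: float, end: float) -> List[ClipLabel]:
    """Interval annotation (e.g. moment retrieval): clips inside
    [start, end] are foreground, with boundary regression targets."""
    labels = []
    for i in range(num_clips):
        t = (i + 0.5) * clip_len             # clip center timestamp
        inside = start <= t <= end
        labels.append(ClipLabel(
            foreground=1.0 if inside else 0.0,
            offsets=(t - start, end - t) if inside else None,
            saliency=1.0 if inside else 0.0,
        ))
    return labels

def labels_from_curve(saliency_curve: List[float]) -> List[ClipLabel]:
    """Curve annotation (e.g. highlight detection): per-clip saliency
    is given directly; clips above a threshold count as foreground."""
    return [ClipLabel(foreground=float(s > 0.5), offsets=None, saliency=s)
            for s in saliency_curve]

def labels_from_point(num_clips: int, clip_len: float,
                      timestamp: float) -> List[ClipLabel]:
    """Point annotation (e.g. a single narrated timestamp): only the
    clip containing the timestamp is labeled foreground."""
    hit = int(timestamp // clip_len)
    return [ClipLabel(foreground=float(i == hit), offsets=None,
                      saliency=float(i == hit))
            for i in range(num_clips)]
```

Under this framing, all three annotation sources feed the same prediction heads, which is what lets one pretrained model serve moment retrieval, highlight detection, and summarization without task-specific architectural changes.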
Quick Start & Requirements
Setup instructions are provided in install.md.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README lists a "Todo" item to connect UniVTG with LLMs such as ChatGPT, indicating that this integration is not yet implemented. Training instructions are written for Slurm clusters and may require adaptation for non-Slurm environments.