Video-language temporal grounding model
UniVTG is a novel video-language temporal grounding pretraining model designed to unify diverse temporal annotations. It addresses moment retrieval, highlight detection, and video summarization, targeting researchers and practitioners in video understanding and multimodal AI. The primary benefit is a unified framework that enhances performance across various temporal grounding tasks.
How It Works
UniVTG employs a unified pretraining strategy that leverages diverse temporal annotations (interval, curve, point) to build a robust video-language understanding model. This approach allows the model to learn a generalized representation of temporal relationships within videos, enabling it to adapt to different grounding granularities without task-specific architectural changes.
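To make the unification concrete, here is a minimal sketch of how the three annotation types might map onto one shared per-clip label format (foreground indicator, boundary offsets, saliency score). The `ClipLabel` fields and helper names below are hypothetical illustrations under that assumption, not the repository's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ClipLabel:
    """Hypothetical unified per-clip label."""
    foreground: float                        # 1.0 if the clip lies inside a target moment
    offsets: Optional[Tuple[float, float]]   # distances (s) to moment start/end, if known
    saliency: float                          # relevance score in [0, 1]

def labels_from_interval(num_clips: int, clip_len: float,
                         start: float, end: float) -> List[ClipLabel]:
    """Interval annotation (e.g. moment retrieval): clips inside
    [start, end] are foreground, with boundary regression targets."""
    labels = []
    for i in range(num_clips):
        t = (i + 0.5) * clip_len             # clip center timestamp
        inside = start <= t <= end
        labels.append(ClipLabel(
            foreground=1.0 if inside else 0.0,
            offsets=(t - start, end - t) if inside else None,
            saliency=1.0 if inside else 0.0,
        ))
    return labels

def labels_from_curve(saliency_curve: List[float]) -> List[ClipLabel]:
    """Curve annotation (e.g. highlight detection): per-clip saliency
    is given directly; clips above a threshold count as foreground."""
    return [ClipLabel(foreground=float(s > 0.5), offsets=None, saliency=s)
            for s in saliency_curve]

def labels_from_point(num_clips: int, clip_len: float,
                      timestamp: float) -> List[ClipLabel]:
    """Point annotation (e.g. a single narrated timestamp): only the
    clip containing the timestamp is labeled foreground."""
    hit = int(timestamp // clip_len)
    return [ClipLabel(foreground=float(i == hit), offsets=None,
                      saliency=float(i == hit))
            for i in range(num_clips)]
```

Under this framing, all three annotation sources feed the same prediction heads, which is what lets one pretrained model serve moment retrieval, highlight detection, and summarization without task-specific architectural changes.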
Quick Start & Requirements
Setup instructions are provided in install.md.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README lists a "Todo" item to connect UniVTG with LLMs such as ChatGPT, indicating that this integration is not yet implemented. Training instructions are written for Slurm clusters and may require adaptation for non-Slurm environments.