Moment-DETR: video moment retrieval via natural language queries (NeurIPS 2021 paper)
Moment-DETR provides an end-to-end solution for video moment retrieval and highlight detection based on natural language queries. It is designed for researchers and developers working on video understanding, temporal localization, and multimodal retrieval tasks. The project offers a novel model that predicts moment coordinates and saliency scores, facilitating accurate video content analysis.
How It Works
Moment-DETR employs a transformer-based architecture, building upon DETR, to process video features and text queries. It predicts moment coordinates and saliency scores simultaneously, enabling a unified approach to both tasks. The model leverages pre-trained features from models like SlowFast and OpenAI CLIP, enhancing its understanding of video content and language semantics. This integrated approach allows for efficient and accurate temporal localization of relevant video segments.
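As an illustrative sketch only (not the repository's code), the span-plus-saliency output described above can be post-processed as follows. It assumes DETR-style spans regressed as normalized (center, width) pairs, which are converted to start/end times and ranked by saliency score; the function names and clamping behavior are assumptions of this sketch.

```python
def spans_cw_to_se(spans, duration):
    """Convert normalized (center, width) spans to (start, end) seconds.

    Assumption of this sketch: spans are normalized to [0, 1] relative
    to the video duration, and outputs are clamped to the video extent.
    """
    converted = []
    for center, width in spans:
        start = max(0.0, (center - width / 2) * duration)
        end = min(duration, (center + width / 2) * duration)
        converted.append((start, end))
    return converted


def rank_moments(spans, scores, duration, top_k=2):
    """Return the top_k spans (in seconds) ordered by saliency score."""
    se = spans_cw_to_se(spans, duration)
    order = sorted(range(len(se)), key=lambda i: scores[i], reverse=True)
    return [(se[i], scores[i]) for i in order[:top_k]]


# Hypothetical model outputs for a 150-second video:
preds = [(0.5, 0.2), (0.1, 0.1), (0.8, 0.3)]   # (center, width), normalized
scores = [0.9, 0.2, 0.6]                        # saliency per predicted span
print(rank_moments(preds, scores, duration=150.0))
```

Ranking by saliency mirrors the unified design: the same forward pass yields both where a moment is and how relevant it is, so retrieval and highlight detection share one score ordering.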
Quick Start & Requirements
Create a conda environment (conda create --name moment_detr python=3.7) and activate it. Install PyTorch with CUDA 11.0 (conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch), then the remaining dependencies (pip install tqdm ipython easydict tensorboard tabulate scikit-learn pandas). Download the preprocessed features archive moment_detr_features.tar.gz (8GB).
Highlighted Details
Maintenance & Community
The project is associated with Jie Lei, Tamara L. Berg, and Mohit Bansal. It builds upon existing frameworks like DETR and TVRetrieval XML, incorporating resources from mdetr, MMAction2, CLIP, SlowFast, and HERO_Video_Feature_Extractor.
Licensing & Compatibility
Limitations & Caveats
The project requires specific versions of Python (3.7) and PyTorch, along with CUDA 11.0. The preprocessed dataset features are substantial (8GB), and extracting features for new videos relies on external tools. The annotation license restricts commercial use.
Last updated about 1 year ago; the project appears inactive.