Moment-DETR: video moment retrieval via natural language queries (NeurIPS 2021 paper)
Moment-DETR provides an end-to-end solution for video moment retrieval and highlight detection based on natural language queries. It is designed for researchers and developers working on video understanding, temporal localization, and multimodal retrieval tasks. The project offers a novel model that predicts moment coordinates and saliency scores, facilitating accurate video content analysis.
How It Works
Moment-DETR employs a transformer-based architecture, building upon DETR, to process video features and text queries. It predicts moment coordinates and saliency scores simultaneously, enabling a unified approach to both tasks. The model leverages pre-trained features from models like SlowFast and OpenAI CLIP, enhancing its understanding of video content and language semantics. This integrated approach allows for efficient and accurate temporal localization of relevant video segments.
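As an illustrative sketch only (not the repository's code), the span-plus-saliency output described above can be post-processed as follows. It assumes DETR-style spans regressed as normalized (center, width) pairs, which are converted to start/end times and ranked by saliency score; the function names and clamping behavior are assumptions of this sketch.

```python
def spans_cw_to_se(spans, duration):
    """Convert normalized (center, width) spans to (start, end) seconds.

    Assumption of this sketch: spans are normalized to [0, 1] relative
    to the video duration, and outputs are clamped to the video extent.
    """
    converted = []
    for center, width in spans:
        start = max(0.0, (center - width / 2) * duration)
        end = min(duration, (center + width / 2) * duration)
        converted.append((start, end))
    return converted


def rank_moments(spans, scores, duration, top_k=2):
    """Return the top_k spans (in seconds) ordered by saliency score."""
    se = spans_cw_to_se(spans, duration)
    order = sorted(range(len(se)), key=lambda i: scores[i], reverse=True)
    return [(se[i], scores[i]) for i in order[:top_k]]


# Hypothetical model outputs for a 150-second video:
preds = [(0.5, 0.2), (0.1, 0.1), (0.8, 0.3)]   # (center, width), normalized
scores = [0.9, 0.2, 0.6]                        # saliency per predicted span
print(rank_moments(preds, scores, duration=150.0))
```

Ranking by saliency mirrors the unified design: the same forward pass yields both where a moment is and how relevant it is, so retrieval and highlight detection share one score ordering.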
Quick Start & Requirements
Create a conda environment (conda create --name moment_detr python=3.7) and activate it. Install PyTorch with CUDA 11.0 (conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch), then the remaining dependencies (pip install tqdm ipython easydict tensorboard tabulate scikit-learn pandas). Download the preprocessed features archive moment_detr_features.tar.gz (8GB).
Highlighted Details
Maintenance & Community
The project is associated with Jie Lei, Tamara L. Berg, and Mohit Bansal. It builds upon existing frameworks like DETR and TVRetrieval XML, incorporating resources from mdetr, MMAction2, CLIP, SlowFast, and HERO_Video_Feature_Extractor.
Licensing & Compatibility
Limitations & Caveats
The project requires specific versions of Python (3.7) and PyTorch, along with CUDA 11.0. The preprocessed dataset features are substantial (8GB), and extracting features for new videos relies on external tools. The annotation license restricts commercial use.
Last updated about 1 year ago; the project appears inactive.