moment_detr  by jayleicn

Video moment retrieval via natural language queries (NeurIPS 2021 paper)

created 4 years ago
321 stars

Top 85.7% on sourcepulse

Project Summary

Moment-DETR is an end-to-end model for video moment retrieval and highlight detection from natural language queries. It is aimed at researchers and developers working on video understanding, temporal localization, and multimodal retrieval. Given a video and a text query, the model predicts moment coordinates and saliency scores.

How It Works

Moment-DETR builds on DETR, using a transformer-based architecture to process video features and text queries. It predicts moment coordinates and saliency scores simultaneously, so a single model handles both moment retrieval and highlight detection. It relies on pre-trained features from models such as SlowFast and OpenAI CLIP to represent video content and query text.
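As a rough illustration of what predicted moment coordinates might look like downstream, the sketch below converts normalized (center, width) spans into start and end times in seconds and ranks them by confidence. The (center, width) parameterization is an assumption borrowed from DETR-style box prediction, and the function name `decode_moments`, the input layout, and the 150-second duration are hypothetical, not taken from the repository.

```python
from typing import List, Tuple


def decode_moments(
    spans_cw: List[Tuple[float, float]],
    scores: List[float],
    duration: float,
) -> List[Tuple[float, float, float]]:
    """Turn normalized (center, width) spans into (start, end, score) in seconds.

    Assumption: center and width are normalized to [0, 1] by the video length.
    """
    moments = []
    for (center, width), score in zip(spans_cw, scores):
        # Clamp to the video boundaries in case a span overshoots.
        start = max(0.0, (center - width / 2) * duration)
        end = min(duration, (center + width / 2) * duration)
        moments.append((start, end, score))
    # Highest-confidence moment first.
    return sorted(moments, key=lambda m: m[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical predictions for a 150-second video.
    preds = decode_moments([(0.1, 0.4), (0.5, 0.2)], [0.3, 0.9], duration=150.0)
    for start, end, score in preds:
        print(f"{start:.1f}s to {end:.1f}s (score {score:.2f})")
```

Saliency scores would be handled separately, for example by ranking clips by their per-clip score to pick highlights.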

Quick Start & Requirements

  • Install: Create a conda environment (conda create --name moment_detr python=3.7) and activate it. Install PyTorch with CUDA 11.0 (conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch), then other dependencies (pip install tqdm ipython easydict tensorboard tabulate scikit-learn pandas).
  • Features: Download and extract moment_detr_features.tar.gz (8GB).
  • Prerequisites: Python 3.7, PyTorch 1.9.0, CUDA 11.0, ffmpeg-python, ftfy, regex.
  • Training Time: Training takes ~4 hours on a single RTX 2080Ti.
  • Links: QVHighlights Dataset, HERO_Video_Feature_Extractor, CLIP.
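After following the install steps above, a small script like the following can confirm that the environment matches the listed requirements. It is a minimal sketch: the helper names are hypothetical, and the import names (for example `sklearn` for scikit-learn and `ffmpeg` for ffmpeg-python) are the usual ones for these packages rather than anything the project specifies.

```python
import importlib.util
import sys

# Package name as installed -> module name used for import.
REQUIRED_MODULES = {
    "pytorch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "tqdm": "tqdm",
    "ipython": "IPython",
    "easydict": "easydict",
    "tensorboard": "tensorboard",
    "tabulate": "tabulate",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "ffmpeg-python": "ffmpeg",
    "ftfy": "ftfy",
    "regex": "regex",
}


def missing_modules(modules):
    """Return the package names whose import module cannot be found."""
    return sorted(
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    )


def python_matches(expected=(3, 7), actual=None):
    """Check the (major, minor) Python version against the expected one."""
    actual = tuple(sys.version_info[:2]) if actual is None else tuple(actual)
    return actual == tuple(expected)


if __name__ == "__main__":
    if not python_matches():
        print(f"Warning: expected Python 3.7, found {sys.version.split()[0]}")
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The script only checks that modules can be located; it does not verify the PyTorch or CUDA versions.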

Highlighted Details

  • NeurIPS 2021 paper.
  • Supports pre-training, fine-tuning, and evaluation on QVHighlights dataset.
  • Enables prediction on custom videos and text queries.
  • Achieves accurate temporal localization and saliency scoring.
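The summary does not state how localization accuracy is measured. A common choice for moment retrieval is temporal intersection over union (IoU) between a predicted window and a ground-truth window, sketched below as an assumption rather than as the project's evaluation code. The function name `temporal_iou` is hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) windows given in seconds."""
    # Overlap is zero when the windows do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    # Guard against two zero-length windows.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical prediction and ground truth that overlap for 5 seconds.
    print(f"IoU: {temporal_iou((10.0, 20.0), (15.0, 25.0)):.3f}")
```

A prediction is then typically counted as correct when its IoU with the ground truth exceeds a chosen threshold.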

Maintenance & Community

The project is by Jie Lei, Tamara L. Berg, and Mohit Bansal. It builds on DETR and TVRetrieval XML, and uses resources from mdetr, MMAction2, CLIP, SlowFast, and HERO_Video_Feature_Extractor.

Licensing & Compatibility

  • Code: MIT License.
  • Annotations: CC BY-NC-SA 4.0 License.
  • Compatibility: The code may be used commercially under MIT terms. The annotations are limited to non-commercial use under CC BY-NC-SA 4.0.

Limitations & Caveats

The documented setup pins Python 3.7, PyTorch 1.9.0, and CUDA 11.0. The pre-extracted features are a large download (8GB), and feature extraction relies on external tools. The annotation license (CC BY-NC-SA 4.0) does not permit commercial use.

Health Check

  • Last commit: 1 year ago.
  • Responsiveness: Inactive.
  • Pull Requests (30d): 1.
  • Issues (30d): 2.
  • Star History: 17 stars in the last 90 days.
