grounded-video-description by facebookresearch

Code for the "Grounded Video Description" research paper on video grounding and captioning

created 6 years ago
326 stars

Top 84.8% on sourcepulse

Project Summary

This repository provides the source code for "Grounded Video Description," a paper on video grounding and captioning built around the ActivityNet-Entities dataset. The model localizes objects within videos while generating descriptive captions, targeting researchers and practitioners in video understanding and description generation.

How It Works

The project implements a Masked Transformer architecture for video description and object grounding. It processes region-level and frame-level video features and uses a language model to generate captions while simultaneously localizing the objects mentioned in the text, enabling joint reasoning over video content and its linguistic description.
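
The repository's exact interfaces differ, but one common way to realize this idea, in the spirit of the paper's approach, is to let cross-attention weights over region proposals do double duty as grounding scores. The sketch below uses modern PyTorch (1.9+, newer than the project's pinned 1.1.0); the class name, shapes, and single-head attention are illustrative assumptions, not the project's API:

```python
import torch
import torch.nn as nn

class GroundedDecoderStep(nn.Module):
    """One decoding step: cross-attention over region features yields both a
    context vector for next-word prediction and per-region grounding scores."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prev_word: torch.Tensor, regions: torch.Tensor):
        # prev_word: (B,) token ids; regions: (B, R, d_model) region features.
        query = self.word_emb(prev_word).unsqueeze(1)     # (B, 1, d_model)
        ctx, attn_w = self.attn(query, regions, regions)  # attn_w: (B, 1, R)
        logits = self.out(ctx.squeeze(1))                 # next-word scores
        grounding = attn_w.squeeze(1)                     # region "explaining" the word
        return logits, grounding

# Toy usage: 2 videos, 5 candidate regions, 512-dim features, 1000-word vocab.
step = GroundedDecoderStep(d_model=512, vocab_size=1000)
logits, grounding = step(torch.tensor([3, 7]), torch.randn(2, 5, 512))
print(logits.shape, grounding.shape)  # torch.Size([2, 1000]) torch.Size([2, 5])
```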

Quick Start & Requirements

  • Install: Clone recursively (git clone --recursive), then create the conda environment with conda env create -f cfgs/conda_env_gvd_py3.yml.
  • Prerequisites: CUDA 9.0+, cuDNN 7.1+, Miniconda, Python 3.7 (or 2.7). PyTorch 1.1.0 is recommended; compatibility with 1.2+ may require changes. Java and Stanford CoreNLP are needed for SPICE evaluation.
  • Data: Download all data and pre-trained models (216GB) via bash tools/download_all.sh.
  • Demo: Run python main.py --inference_only --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 --language_eval --eval_obj_grounding.
  • Links: ActivityNet Challenge, Stanford CoreNLP, Pre-trained Models.

Highlighted Details

  • Supports ActivityNet-Entities and Flickr30k-Entities datasets.
  • Includes code for object localization and language evaluation.
  • Offers both supervised and unsupervised training modes; a sketch of the difference follows this list.
  • Requires significant disk space (216GB) for full data download.
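
The supervised/unsupervised split refers to whether grounding (bounding-box) annotations supervise the attention during training. Below is a minimal sketch of that idea, assuming a negative log-likelihood attention-supervision term and an arbitrary 0.1 weight; the repo's actual loss terms, names, and flags differ:

```python
import torch
import torch.nn.functional as F

def grounding_loss(attn_weights: torch.Tensor, gt_region: torch.Tensor):
    # Supervised grounding as classification over candidate regions: the
    # attention distribution for an annotated noun phrase should peak on
    # its ground-truth box. attn_weights: (N, R), rows sum to 1.
    return F.nll_loss(torch.log(attn_weights + 1e-8), gt_region)

def training_loss(word_logits, target_words, attn_weights, gt_region,
                  supervised: bool, grounding_weight: float = 0.1):
    # The captioning term (cross-entropy over the vocabulary) is used in
    # both regimes; supervised mode adds the attention-supervision term.
    loss = F.cross_entropy(word_logits, target_words)
    if supervised:
        # Only possible when box annotations (e.g. ActivityNet-Entities) exist.
        loss = loss + grounding_weight * grounding_loss(attn_weights, gt_region)
    return loss

# Toy usage: 4 annotated words, 1000-word vocab, 5 candidate regions per word.
attn = torch.softmax(torch.randn(4, 5), dim=-1)
loss = training_loss(torch.randn(4, 1000), torch.randint(0, 1000, (4,)),
                     attn, torch.randint(0, 5, (4,)), supervised=True)
```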

Maintenance & Community

The project is from Facebook AI Research. No specific community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The project is licensed under a custom license (see LICENSE file). Portions are based on the Neural Baby Talk project. Compatibility with commercial or closed-source applications depends on the specific terms of the project's license.

Limitations & Caveats

The code is primarily tested with PyTorch 1.1.0; newer versions may require modifications. Evaluation requires at least 9GB of GPU memory, and the full setup involves a 216GB data download.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days
