Code for video grounding and captioning research paper
This repository provides the source code for "Grounded Video Description," a paper on joint video captioning and object grounding evaluated on the ActivityNet Entities dataset. It localizes the objects mentioned in a video's description while generating the caption, and targets researchers and practitioners in video understanding and generation.
How It Works
The project implements a Masked Transformer architecture for video description and object grounding. It processes video features (region and frame-wise) and uses a language model to generate captions while simultaneously localizing objects mentioned in the text. This approach allows for joint understanding of video content and linguistic description.
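To make the idea concrete, here is a minimal, hypothetical PyTorch sketch; the module name, feature dimensions, and layer choices are assumptions for illustration, not the repository's actual architecture. A Transformer decoder attends over projected region and frame features to predict the next caption word, and its per-word attention over regions serves as the grounding signal.

```python
# Hypothetical, heavily simplified sketch of joint captioning + grounding.
import torch
import torch.nn as nn

class CaptionGroundingSketch(nn.Module):
    def __init__(self, vocab_size, region_dim=2048, frame_dim=3072, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)  # detected-region features
        self.frame_proj = nn.Linear(frame_dim, hidden)    # frame-wise features
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.word_head = nn.Linear(hidden, vocab_size)
        self.ground_attn = nn.MultiheadAttention(hidden, num_heads=1)

    def forward(self, words, regions, frames):
        # words:   (T, B)   token ids of the caption so far
        # regions: (R, B, region_dim); frames: (F, B, frame_dim)
        reg = self.region_proj(regions)
        mem = torch.cat([reg, self.frame_proj(frames)], dim=0)
        # Causal masking of the decoder is omitted for brevity.
        h = self.decoder(self.embed(words), mem)       # (T, B, hidden)
        logits = self.word_head(h)                     # next-word scores
        # Per-word attention over regions doubles as a grounding score:
        # the argmax region localizes the object a word refers to.
        _, attn = self.ground_attn(h, reg, reg)        # attn: (B, T, R)
        return logits, attn
```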
Quick Start & Requirements
Clone the repository with its submodules (git clone --recursive) and create a conda environment from cfgs/conda_env_gvd_py3.yml. Download the required data and pre-trained models:

bash tools/download_all.sh

To run inference and evaluation with a pre-trained model:

python main.py --inference_only --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 --language_eval --eval_obj_grounding
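As a rough illustration of what the --eval_obj_grounding flag measures, the sketch below computes a grounding (localization) accuracy: an object word counts as correctly grounded if its predicted box overlaps the annotated box with IoU at or above a threshold. The function names, data layout, and the 0.5 threshold are assumptions for illustration, not the repository's evaluation code.

```python
# Hypothetical sketch of attention-based grounding evaluation.

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grounding_accuracy(predictions, annotations, iou_thresh=0.5):
    """Fraction of annotated object words whose predicted box
    overlaps the ground-truth box with IoU >= iou_thresh."""
    correct, total = 0, 0
    for word_id, gt_box in annotations.items():
        total += 1
        pred_box = predictions.get(word_id)
        if pred_box is not None and box_iou(pred_box, gt_box) >= iou_thresh:
            correct += 1
    return correct / max(total, 1)
```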
Highlighted Details
Maintenance & Community
The project is from Facebook AI Research. No specific community channels (Discord/Slack) or roadmap are mentioned in the README.
Licensing & Compatibility
The project is licensed under a custom license (see LICENSE file). Portions are based on the Neural Baby Talk project. Compatibility with commercial or closed-source applications depends on the specific terms of the project's license.
Limitations & Caveats
The code is primarily tested with PyTorch 1.1.0; newer versions may require modifications. Evaluation requires at least 9GB of GPU memory. The setup involves downloading a substantial amount of data.