Code for video grounding and captioning research paper
This repository provides the source code for "Grounded Video Description," a paper on joint video captioning and object grounding evaluated on the ActivityNet Entities dataset. It localizes the objects mentioned in a video's description while generating the caption, and targets researchers and practitioners in video understanding and generation.
How It Works
The project implements a Masked Transformer architecture for video description and object grounding. It processes video features (region and frame-wise) and uses a language model to generate captions while simultaneously localizing objects mentioned in the text. This approach allows for joint understanding of video content and linguistic description.
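To make the idea concrete, here is a minimal, hypothetical PyTorch sketch; the module name, feature dimensions, and layer choices are assumptions for illustration, not the repository's actual architecture. A Transformer decoder attends over projected region and frame features to predict the next caption word, and its per-word attention over regions serves as the grounding signal.

```python
# Hypothetical, heavily simplified sketch of joint captioning + grounding.
import torch
import torch.nn as nn

class CaptionGroundingSketch(nn.Module):
    def __init__(self, vocab_size, region_dim=2048, frame_dim=3072, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)  # detected-region features
        self.frame_proj = nn.Linear(frame_dim, hidden)    # frame-wise features
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.word_head = nn.Linear(hidden, vocab_size)
        self.ground_attn = nn.MultiheadAttention(hidden, num_heads=1)

    def forward(self, words, regions, frames):
        # words:   (T, B)   token ids of the caption so far
        # regions: (R, B, region_dim); frames: (F, B, frame_dim)
        reg = self.region_proj(regions)
        mem = torch.cat([reg, self.frame_proj(frames)], dim=0)
        # Causal masking of the decoder is omitted for brevity.
        h = self.decoder(self.embed(words), mem)       # (T, B, hidden)
        logits = self.word_head(h)                     # next-word scores
        # Per-word attention over regions doubles as a grounding score:
        # the argmax region localizes the object a word refers to.
        _, attn = self.ground_attn(h, reg, reg)        # attn: (B, T, R)
        return logits, attn
```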
Quick Start & Requirements
Clone the repository with its submodules (git clone --recursive) and create a conda environment from cfgs/conda_env_gvd_py3.yml. Download the required data and pre-trained models:

bash tools/download_all.sh

To run inference and evaluation with a pre-trained model:

python main.py --inference_only --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 --language_eval --eval_obj_grounding
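As a rough illustration of what the --eval_obj_grounding flag measures, the sketch below computes a grounding (localization) accuracy: an object word counts as correctly grounded if its predicted box overlaps the annotated box with IoU at or above a threshold. The function names, data layout, and the 0.5 threshold are assumptions for illustration, not the repository's evaluation code.

```python
# Hypothetical sketch of attention-based grounding evaluation.

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grounding_accuracy(predictions, annotations, iou_thresh=0.5):
    """Fraction of annotated object words whose predicted box
    overlaps the ground-truth box with IoU >= iou_thresh."""
    correct, total = 0, 0
    for word_id, gt_box in annotations.items():
        total += 1
        pred_box = predictions.get(word_id)
        if pred_box is not None and box_iou(pred_box, gt_box) >= iou_thresh:
            correct += 1
    return correct / max(total, 1)
```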
Highlighted Details
Maintenance & Community
The project is from Facebook AI Research. No specific community channels (Discord/Slack) or roadmap are mentioned in the README.
Licensing & Compatibility
The project is licensed under a custom license (see LICENSE file). Portions are based on the Neural Baby Talk project. Compatibility with commercial or closed-source applications depends on the specific terms of the project's license.
Limitations & Caveats
The code is primarily tested with PyTorch 1.1.0; newer versions may require modifications. Evaluation requires at least 9GB of GPU memory. The setup involves downloading a substantial amount of data.