LaViLa by facebookresearch

Video-language pretraining research using LLMs

created 2 years ago
528 stars

Top 60.7% on sourcepulse

View on GitHub
Project Summary

LaViLa (Language Augmented Video Language Pretraining) is a novel approach for learning video representations by leveraging Large Language Models (LLMs) as "Narrators" to generate descriptive text for videos. This method targets researchers and practitioners in video understanding and multimodal AI, enabling the creation of high-quality video-language paired data and achieving state-of-the-art performance on various video tasks.

How It Works

LaViLa repurposes LLMs to act as visually conditioned "Narrators" that generate dense textual descriptions for video clips. These generated narrations, along with human annotations, are used to train a dual-encoder model (video encoder and text encoder) via a contrastive loss, similar to CLIP. This approach allows for automatic data augmentation and improved video-language representation learning.
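
To make the training objective concrete, here is a minimal, illustrative sketch of the symmetric CLIP-style contrastive (InfoNCE) loss over paired video/text embeddings. It is not LaViLa's actual training code: the encoders that would produce video_emb and text_emb (a video backbone and a text Transformer) are omitted, and the batch here is random data.

```python
# Minimal sketch of a symmetric CLIP-style contrastive (InfoNCE) loss over
# paired video/text embeddings. Illustrative only: LaViLa's real dual encoder
# and data pipeline are not shown here.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull matched video/text pairs together, push mismatched pairs apart."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)            # video -> text
    loss_t2v = F.cross_entropy(logits.t(), targets)        # text -> video
    return (loss_v2t + loss_t2v) / 2

# Toy usage with random tensors standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

In LaViLa, the text side of each pair can be either a human narration or a pseudo-caption produced by the LLM "Narrator", which is what enables the automatic data augmentation described above.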

Quick Start & Requirements

  • Installation: Refer to INSTALL.md.
  • Narrator Demo (a small Python wrapper sketch follows this list):
    • CPU (default): python demo_narrator.py [--video-path $TEST_VIDEO]
    • GPU: python demo_narrator.py --cuda
    • 3rd-Person Demo: python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
  • Prerequisites: Python, PyTorch, and CUDA (for GPU mode).
  • Resources: Colab demo available, local execution may require GPU.
  • Links: Colab Demo, 🤗 Spaces Demo, Website, Paper
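
For convenience, the demo commands above can be wrapped in a small Python helper. This is a hedged sketch rather than part of the repository: it assumes it is run from a local checkout where the demo scripts live, and the example video path is a placeholder.

```python
# Hedged sketch: invoking the narrator demos listed above via subprocess.
# Assumes a local checkout of the repo with the demo scripts in the
# working directory; the video path below is a placeholder.
import subprocess
import sys
from typing import Optional

def run_narrator(video_path: Optional[str] = None,
                 third_person: bool = False,
                 use_cuda: bool = False) -> None:
    """Build and run one of the demo commands from the quick-start list."""
    script = "demo_narrator_3rd_person.py" if third_person else "demo_narrator.py"
    cmd = [sys.executable, script]
    if video_path:
        cmd += ["--video-path", video_path]
    if use_cuda:
        cmd.append("--cuda")
    subprocess.run(cmd, check=True)

# Example: first-person narration on GPU (the video path is illustrative).
run_narrator(video_path="my_clip.mp4", use_cuda=True)
```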

Highlighted Details

  • Achieves state-of-the-art zero-shot performance on egocentric video benchmarks such as EK-100 MIR and Charades-Ego (see the scoring sketch after this list).
  • Demonstrates significant improvements over previous methods, e.g., +7.6 mAP on EK-100 MIR avg. with TSF-B backbone.
  • Offers both a "Narrator" for generating video descriptions and a pre-trained dual-encoder model.
  • Fine-tuning the dual-encoder further boosts performance on downstream tasks.
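
As a rough picture of how the dual encoder is applied zero-shot, the sketch below ranks candidate text narrations against a video embedding by cosine similarity. The tensors are random stand-ins for the outputs of LaViLa's pre-trained video and text encoders, and the function name is illustrative, not part of the repository's API.

```python
# Hedged sketch of zero-shot scoring with a dual encoder: embed the video
# and each candidate narration, then rank candidates by cosine similarity.
# The embeddings here are random stand-ins, not real encoder outputs.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_candidates(video_emb: torch.Tensor,
                    text_embs: torch.Tensor) -> torch.Tensor:
    """Return candidate indices sorted from most to least similar per video."""
    sims = F.normalize(video_emb, dim=-1) @ F.normalize(text_embs, dim=-1).t()
    return sims.argsort(dim=-1, descending=True)

# Toy usage: one video embedding scored against four candidate captions.
video_emb = torch.randn(1, 256)   # stand-in for the video encoder output
text_embs = torch.randn(4, 256)   # stand-ins for encoded candidate narrations
print(rank_candidates(video_emb, text_embs))
```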

Maintenance & Community

  • Developed by Facebook Research.
  • Associated with CVPR 2023 (Highlight paper).
  • No explicit community links (Discord/Slack) mentioned in the README.

Licensing & Compatibility

  • Primarily licensed under MIT License.
  • Some components may have separate licenses (e.g., episodic-memory under MIT).
  • Sample videos used are under the Mixkit Stock Video Free License.
  • Generally compatible with commercial use, but check specific component licenses.

Limitations & Caveats

  • The narrator's output style may differ from ground-truth captions because part of its pre-training data consists of ASR transcriptions.
  • Colab demo has limited RAM; local execution recommended for larger models.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days
