Video-language pretraining research using LLMs
LaViLa (Language-Augmented Video-Language Pretraining) learns video representations by leveraging Large Language Models (LLMs) as "Narrators" that generate descriptive text for videos. It is aimed at researchers and practitioners in video understanding and multimodal AI: the automatically generated narrations yield large amounts of high-quality video-language paired data, and models trained on them achieve state-of-the-art performance on a range of video tasks.
How It Works
LaViLa repurposes LLMs as visually conditioned "Narrators" that generate dense textual descriptions for video clips. The generated narrations, together with existing human annotations, are used to train a dual-encoder model (a video encoder and a text encoder) with a CLIP-style contrastive loss. The LLM-generated text acts as automatic data augmentation, densifying supervision and improving the learned video-language representations.
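The dual-encoder training objective can be summarized as a symmetric contrastive (InfoNCE) loss over matched video-narration pairs. The sketch below illustrates that objective only; the encoders, embedding dimension, and temperature value are hypothetical stand-ins, not LaViLa's actual implementation.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss, assuming the
# video and text encoders have already produced fixed-size embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the dot product is cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Entry (i, j) compares video i with narration j; the diagonal holds the
    # matched pairs (positives), everything else serves as negatives.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Random features standing in for hypothetical encoder outputs.
video_features = torch.randn(8, 256)
text_features = torch.randn(8, 256)
loss = contrastive_loss(video_features, text_features)
```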
Quick Start & Requirements
See INSTALL.md for installation instructions.
python demo_narrator.py [--video-path $TEST_VIDEO]
python demo_narrator.py --cuda
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
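To narrate many clips, the demo script shown above can be invoked once per video. The snippet below is a minimal sketch under that assumption; the clips/ directory and .mp4 filter are placeholders, and it relies only on the --video-path and --cuda flags documented above.

```python
# Sketch: run the narrator demo over a folder of clips, one process per video.
import subprocess
from pathlib import Path

CLIP_DIR = Path("clips")  # hypothetical folder of .mp4 files
for clip in sorted(CLIP_DIR.glob("*.mp4")):
    # Output handling depends on what demo_narrator.py prints, so the
    # narrations are simply streamed to the console here.
    subprocess.run(
        ["python", "demo_narrator.py", "--video-path", str(clip), "--cuda"],
        check=True,
    )
```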
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats