showlab/EgoVLP: Egocentric video-language pretraining for understanding first-person perspectives
Summary
EgoVLP pioneers Egocentric Video-Language Pretraining, offering a robust framework and pretrained models for understanding first-person video content. It addresses the need for specialized models in egocentric domains, benefiting researchers and practitioners by significantly boosting performance on diverse downstream tasks like video question answering and retrieval.
How It Works
The project implements Egocentric Video-Language Pretraining (EgoVLP), leveraging EgoClip, a large-scale clip-narration dataset curated from Ego4D. Its EgoNCE pretraining objective extends standard contrastive learning with action-aware positives (clips whose narrations share verbs and nouns) and scene-aware negatives, aiming to capture richer video-language relationships than vanilla InfoNCE. The framework is built on PyTorch with DistributedDataParallel (DDP) for efficient large-scale training.
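To make the objective concrete, below is a minimal sketch of an EgoNCE-style loss in PyTorch. This is an illustration rather than the repository's implementation: the function name egonce_style_loss, the temperature default, and the precomputed pos_mask marking narration pairs that share a verb and a noun are all assumptions, and the paper's scene-aware negative sampling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def egonce_style_loss(video_emb, text_emb, pos_mask, temperature=0.05):
    """EgoNCE-style contrastive loss (illustrative sketch, not the official code).

    video_emb, text_emb: (N, D) embeddings for N paired clips and narrations.
    pos_mask: (N, N) bool tensor; pos_mask[i, j] is True when narration j is a
              valid positive for clip i (the true pair, or a pair whose
              narrations share a verb and a noun). The diagonal must be True.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (N, N) similarity logits

    # Video-to-text: log-softmax over all narrations, then average the
    # log-probability over each clip's positive set.
    v2t = logits.log_softmax(dim=1)
    v2t_loss = -((v2t * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)).mean()

    # Symmetric text-to-video term.
    t2v = logits.T.log_softmax(dim=1)
    t2v_loss = -((t2v * pos_mask.T).sum(1) / pos_mask.T.sum(1).clamp(min=1)).mean()

    return (v2t_loss + t2v_loss) / 2

# Toy usage: batch of 4 pairs, only the diagonal (true) pairs are positive.
v = torch.randn(4, 8)
s = torch.randn(4, 8)
mask = torch.eye(4, dtype=torch.bool)
loss = egonce_style_loss(v, s, mask)
```

With only the diagonal positive, this reduces to a symmetric InfoNCE; the extra off-diagonal positives are what distinguish the EgoNCE formulation.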
Quick Start & Requirements
Installation involves creating a Conda environment from environment.yml and activating it. Significant data preparation is required, including downloading ~7TB of Ego4D source videos and associated metadata, or EgoClip metadata. Video preprocessing scripts (utils/video_resize.py, utils/video_chunk.py) are provided for optimizing pretraining. Pretraining is designed for distributed setups, exemplified by training on 4 nodes with 8 A100 GPUs each.
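A plausible version of that setup flow is sketched below. The Conda environment name and the argument-free script invocations are assumptions, so treat the repository's README as the authoritative reference.

```bash
# Create and activate the environment from the provided spec
conda env create -f environment.yml
conda activate egovlp   # env name assumed from the project; see environment.yml

# Resize and chunk the Ego4D source videos ahead of pretraining
# (exact arguments and paths are defined in the scripts themselves)
python utils/video_resize.py
python utils/video_chunk.py
```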
Maintenance & Community
The project is maintained by Kevin Qinghong Lin (kevin.qh.lin@gmail.com). The authors welcome contributions that extend EgoVLP to other egocentric tasks or datasets. The codebase is based on the Frozen-in-Time framework.
Licensing & Compatibility
The project is released under the MIT license, which permits broad use, including commercial applications and integration into closed-source projects.
Limitations & Caveats
The README notes that performance on the Charades-Ego benchmark degrades significantly after the first pretraining epoch under standard VLP setups (CC3M+WebVid2M as well as EgoClip), so evaluation on this task requires weights from an early epoch rather than the final checkpoint. Pretraining also demands substantial compute and storage.