EgoVLP by showlab

Egocentric video-language pretraining for understanding first-person perspectives

Created 3 years ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
Project Summary

EgoVLP pioneers Egocentric Video-Language Pretraining, offering a robust framework and pretrained models for understanding first-person video content. It addresses the need for specialized models in egocentric domains, benefiting researchers and practitioners by significantly boosting performance on diverse downstream tasks like video question answering and retrieval.

How It Works

The project implements Egocentric Video-Language Pretraining (EgoVLP) on EgoClip, a large-scale pretraining dataset curated from Ego4D narrations. Training uses the EgoNCE objective, a contrastive loss that treats clips whose narrations share verb or noun classes as additional positives, aiming to capture richer video-language relationships than plain InfoNCE. The framework is built on PyTorch with DistributedDataParallel (DDP) for efficient large-scale training; a minimal sketch of the loss follows.
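
Below is a minimal sketch of an EgoNCE-style symmetric contrastive loss, assuming a batch of paired, L2-normalized clip/text embeddings plus a precomputed mask marking which pairs share verb or noun classes. Function and argument names are illustrative, not EgoVLP's API, and the paper's scene-aware negative augmentation (adjacent clips from the same video) is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def egonce_loss(video_emb, text_emb, pos_mask, temperature=0.05):
    """video_emb, text_emb: (B, D), L2-normalized.
    pos_mask: (B, B) bool; True where clip i and narration j count as a
    positive pair (the diagonal plus pairs sharing a verb/noun class)."""
    sim = video_emb @ text_emb.t() / temperature      # (B, B) scaled cosine
    v2t = F.log_softmax(sim, dim=1)                   # video -> text
    t2v = F.log_softmax(sim.t(), dim=1)               # text -> video
    # Average the log-likelihood over all positives for each anchor.
    loss_v = -(v2t * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    loss_t = -(t2v * pos_mask.t()).sum(1) / pos_mask.t().sum(1).clamp(min=1)
    return 0.5 * (loss_v.mean() + loss_t.mean())
```

With an identity `pos_mask` this reduces to the standard symmetric InfoNCE used in video-text retrieval; the verb/noun mask is what makes the objective "ego-aware".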

Quick Start & Requirements

Installation involves creating a Conda environment from environment.yml and activating it. Significant data preparation is required: either downloading the ~7TB of Ego4D source videos plus associated metadata, or using the EgoClip metadata. Preprocessing scripts (utils/video_resize.py, utils/video_chunk.py) are provided to resize and chunk the source videos so pretraining I/O stays manageable. Pretraining is designed for distributed setups, exemplified by training on 4 nodes with 8 A100 GPUs each; a generic launch sketch follows.
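
As a rough illustration of that multi-node setup, the standard PyTorch DDP launch pattern looks like the following. This is generic PyTorch, not EgoVLP's actual entry point; build_model and build_loader are hypothetical placeholders, and each node would run something like `torchrun --nnodes=4 --nproc_per_node=8 <pretrain script>`:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; NCCL backend for multi-GPU.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(build_model().cuda(local_rank),   # build_model: placeholder
                device_ids=[local_rank])
    for videos, texts in build_loader():          # build_loader: placeholder
        loss = model(videos.cuda(local_rank, non_blocking=True), texts)
        loss.backward()                           # DDP syncs grads across ranks
        # ... optimizer step, LR schedule, checkpointing ...

if __name__ == "__main__":
    main()
```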

Highlighted Details

  • EgoVLP achieved top rankings in multiple Ego4D challenges (OSCC, NLQ, PNR) and EPIC-Kitchens Multi-Instance Retrieval in 2022.
  • The pretrained EgoVLP model (EgoClip w/ EgoNCE) demonstrates strong performance on the EgoMCQ benchmark, reaching 90.7% (inter-video) and 57.2% (intra-video) accuracy; see the evaluation sketch after this list.
  • EgoVLPv2, released in July 2023 and accepted to ICCV 2023, offers improved performance and efficiency.
  • Pretrained weights are available for direct use or fine-tuning on various downstream tasks.
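
For context, EgoMCQ is a multiple-choice task: given a narration, the model must pick the matching clip from five candidates, where the inter-video setting draws candidates from different videos and the harder intra-video setting draws them from the same video. A minimal sketch of this kind of evaluation, assuming L2-normalized embeddings (illustrative names, not EgoVLP's evaluation code):

```python
import torch

def mcq_accuracy(text_emb, cand_emb, answer):
    """text_emb: (N, D) query narrations; cand_emb: (N, 5, D) candidate clips;
    answer: (N,) index of the correct clip. Embeddings are L2-normalized."""
    scores = torch.einsum("nd,ncd->nc", text_emb, cand_emb)  # cosine scores
    return (scores.argmax(dim=1) == answer).float().mean().item()
```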

Maintenance & Community

The project is maintained by Kevin Qinghong Lin (kevin.qh.lin@gmail.com). The authors express willingness to integrate contributions for extending EgoVLP to other egocentric tasks or datasets. The codebase is based on the Frozen framework.

Licensing & Compatibility

The project is released under the MIT license, which permits broad use, including commercial applications and integration into closed-source projects.

Limitations & Caveats

The README notes that standard VLP approaches (CC3M+WebVid2M, EgoClip) can degrade significantly on the Charades-Ego dataset after the first pretraining epoch, necessitating the use of specific early-epoch weights for evaluation on this task. Pretraining requires substantial computational resources and storage.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Simon Willison (co-creator of Django), and 10 more.

LAVIS by salesforce: 0.1%, 11k stars. Library for language-vision AI research. Created 3 years ago; updated 1 year ago.