unmasked_teacher by OpenGVLab

Research paper for training-efficient video foundation models

created 2 years ago
336 stars

Top 83.0% on sourcepulse

View on GitHub
Project Summary

This repository provides the official implementation for "Unmasked Teacher: Towards Training-Efficient Video Foundation Models," a method designed to accelerate the training of video foundation models (VFMs). It addresses the high computational costs and data scarcity challenges in VFM development, offering a more efficient approach for researchers and practitioners working with video understanding tasks.

How It Works

Unmasked Teacher (UMT) tackles VFM training inefficiency by masking most low-semantic video tokens and selectively aligning the unmasked tokens with an Image Foundation Model (IFM) acting as a "teacher." This semantic guidance from the IFM facilitates faster convergence and better multimodal alignment compared to low-level reconstruction methods. A progressive pre-training framework enables UMT to handle diverse video tasks, from scene and temporal understanding to complex video-language tasks.
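The core idea above can be sketched in a few lines: score video tokens for semantic content, drop the low-scoring majority, and align only the kept tokens with the frozen teacher's features. This is an illustrative numpy sketch, not the official implementation; in particular, scoring tokens by teacher feature norm is a stand-in assumption for the paper's actual selection scheme, and `umt_align_loss` is a hypothetical name.

```python
import numpy as np

def umt_align_loss(student_tokens, teacher_tokens, keep_ratio=0.2):
    """Illustrative UMT-style loss: mask most low-semantic tokens and
    align only the unmasked student tokens with the teacher (IFM).

    student_tokens, teacher_tokens: (n_tokens, dim) feature arrays.
    keep_ratio: fraction of high-semantic tokens kept unmasked.
    """
    n_tokens = teacher_tokens.shape[0]
    n_keep = max(1, int(n_tokens * keep_ratio))
    # Proxy semantic score: teacher feature norm (assumption for this sketch).
    scores = np.linalg.norm(teacher_tokens, axis=-1)
    keep_idx = np.argsort(scores)[-n_keep:]  # keep the high-semantic tokens
    s = student_tokens[keep_idx]
    t = teacher_tokens[keep_idx]
    # L2-normalize, then align kept tokens with a mean-squared error.
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))
```

Because the loss is computed only on the small unmasked subset, each training step touches far fewer tokens than full reconstruction, which is the source of the training efficiency the paper claims.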

Quick Start & Requirements

  • Installation: Code is available via GitHub. Specific installation instructions are not detailed in the README.
  • Prerequisites: Specific software dependencies (e.g., Python version, deep learning framework) are not explicitly listed in the README.
  • Resources: Pre-training took 6 days on 32 A100 GPUs, so significant computational resources are required.
  • Links: Model Zoo

Highlighted Details

  • Achieved state-of-the-art performance on various video tasks with a ViT-L/16 trained from scratch in 6 days of pre-training on 32 A100 GPUs.
  • Won the Perception Test Challenge at ICCV 2023.
  • UMTScore shows high consistency with human judgment in video-text alignment.
  • Supports single-modality (Action Classification, Action Detection) and multi-modality (Video-Text Retrieval, Video Question Answering) tasks.

Maintenance & Community

  • The project is associated with OpenGVLab and Shanghai AI Lab.
  • A WeChat group is available for discussion and suggestions.
  • The project is actively updated, with recent bug fixes and performance improvements (e.g., halved pretraining time with autocast).

Licensing & Compatibility

  • The repository is released under an unspecified license. The README does not detail licensing terms or restrictions for commercial use.

Limitations & Caveats

The README does not explicitly detail limitations, but the significant hardware requirements (32 A100 GPUs) suggest a high barrier to entry for training or fine-tuning without substantial resources. Specific software dependencies are also not clearly outlined.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 11 stars in the last 90 days
