Research paper for training-efficient video foundation models
This repository provides the official implementation for "Unmasked Teacher: Towards Training-Efficient Video Foundation Models," a method designed to accelerate the training of video foundation models (VFMs). It addresses the high computational costs and data scarcity challenges in VFM development, offering a more efficient approach for researchers and practitioners working with video understanding tasks.
How It Works
Unmasked Teacher (UMT) tackles VFM training inefficiency by masking most low-semantic video tokens and selectively aligning the unmasked tokens with an Image Foundation Model (IFM) acting as a "teacher." This semantic guidance from the IFM facilitates faster convergence and better multimodal alignment compared to low-level reconstruction methods. A progressive pre-training framework enables UMT to handle diverse video tasks, from scene and temporal understanding to complex video-language tasks.
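The masking-and-alignment idea above can be illustrated with a minimal sketch. This is a hypothetical toy, not the official implementation: the real UMT derives token saliency from the teacher's attention and masks roughly 80% of tokens, whereas this sketch uses feature norms as a stand-in saliency proxy, and the names `semantic_mask` and `alignment_loss` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_mask(teacher_tokens, keep_ratio=0.2):
    """Keep only the tokens the teacher deems most informative.

    NOTE: UMT scores tokens via teacher attention; the feature norm
    used here is just a cheap stand-in for this sketch."""
    scores = np.linalg.norm(teacher_tokens, axis=-1)   # (N,) saliency proxy
    k = max(1, int(len(scores) * keep_ratio))
    keep_idx = np.argsort(scores)[-k:]                 # top-k "semantic" tokens
    return np.sort(keep_idx)

def alignment_loss(student_tokens, teacher_tokens, keep_idx):
    """Align only the unmasked student tokens with the frozen teacher
    (mean-squared error on L2-normalized features)."""
    s = student_tokens[keep_idx]
    t = teacher_tokens[keep_idx]
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))

# Toy example: 196 spatio-temporal tokens with 768-dim features.
teacher = rng.normal(size=(196, 768))  # frozen IFM (teacher) features
student = rng.normal(size=(196, 768))  # video model (student) features

idx = semantic_mask(teacher, keep_ratio=0.2)  # ~80% of tokens masked out
loss = alignment_loss(student, teacher, idx)
print(len(idx), loss)
```

Because the loss is computed only on the small unmasked subset, the student processes far fewer tokens per step than full-frame reconstruction would require, which is the source of the training-efficiency claim.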
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not explicitly list limitations, but the stated hardware requirements (32 A100 GPUs) imply a high barrier to entry for pre-training or fine-tuning without substantial resources. Specific software dependencies are also not clearly outlined.