Research paper for training-efficient video foundation models
This repository provides the official implementation for "Unmasked Teacher: Towards Training-Efficient Video Foundation Models," a method designed to accelerate the training of video foundation models (VFMs). It addresses the high computational costs and data scarcity challenges in VFM development, offering a more efficient approach for researchers and practitioners working with video understanding tasks.
How It Works
Unmasked Teacher (UMT) tackles VFM training inefficiency by masking most low-semantic video tokens and selectively aligning the unmasked tokens with an Image Foundation Model (IFM) acting as a "teacher." This semantic guidance from the IFM facilitates faster convergence and better multimodal alignment compared to low-level reconstruction methods. A progressive pre-training framework enables UMT to handle diverse video tasks, from scene and temporal understanding to complex video-language tasks.
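The masking-and-alignment idea above can be illustrated with a minimal sketch. This is a hypothetical toy, not the official implementation: the real UMT derives token saliency from the teacher's attention and masks roughly 80% of tokens, whereas this sketch uses feature norms as a stand-in saliency proxy, and the names `semantic_mask` and `alignment_loss` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_mask(teacher_tokens, keep_ratio=0.2):
    """Keep only the tokens the teacher deems most informative.

    NOTE: UMT scores tokens via teacher attention; the feature norm
    used here is just a cheap stand-in for this sketch."""
    scores = np.linalg.norm(teacher_tokens, axis=-1)   # (N,) saliency proxy
    k = max(1, int(len(scores) * keep_ratio))
    keep_idx = np.argsort(scores)[-k:]                 # top-k "semantic" tokens
    return np.sort(keep_idx)

def alignment_loss(student_tokens, teacher_tokens, keep_idx):
    """Align only the unmasked student tokens with the frozen teacher
    (mean-squared error on L2-normalized features)."""
    s = student_tokens[keep_idx]
    t = teacher_tokens[keep_idx]
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))

# Toy example: 196 spatio-temporal tokens with 768-dim features.
teacher = rng.normal(size=(196, 768))  # frozen IFM (teacher) features
student = rng.normal(size=(196, 768))  # video model (student) features

idx = semantic_mask(teacher, keep_ratio=0.2)  # ~80% of tokens masked out
loss = alignment_loss(student, teacher, idx)
print(len(idx), loss)
```

Because the loss is computed only on the small unmasked subset, the student processes far fewer tokens per step than full-frame reconstruction would require, which is the source of the training-efficiency claim.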
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not explicitly list limitations, but the stated hardware requirements (32 A100 GPUs) imply a high barrier to entry for pre-training or fine-tuning without substantial resources. Specific software dependencies are also not clearly outlined.