Generative model for knowledge learning from unlabeled videos (CVPR 2025 paper)
VideoWorld is a generative model that learns complex knowledge and skills purely from unlabeled video data, targeting researchers in computer vision and AI. It aims to demonstrate that visual observation alone is sufficient for learning tasks, rivaling traditional reinforcement learning approaches without explicit search or reward mechanisms.
How It Works
VideoWorld employs a latent dynamics model (LDM) to compress multi-frame visual changes into compact latent codes. An autoregressive transformer then processes these codes, enabling the model to predict future states and learn sequential dependencies. This approach enhances knowledge acquisition efficiency and effectiveness by focusing on salient visual transitions.
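The two-stage pipeline described above can be sketched in PyTorch. This is a minimal illustrative mock-up, not the paper's implementation: the module names, dimensions, and the frame-difference pooling used here are all assumptions standing in for the actual LDM and transformer architectures.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the VideoWorld pipeline: an LDM compresses
# multi-frame visual changes into compact codes, and an autoregressive
# transformer predicts the next code. Sizes are illustrative only.

class LatentDynamicsModel(nn.Module):
    """Compresses visual change across a clip into a few latent codes."""
    def __init__(self, frame_dim=64, code_dim=8, codes_per_step=3):
        super().__init__()
        self.proj = nn.Linear(frame_dim, code_dim * codes_per_step)
        self.code_dim = code_dim
        self.codes_per_step = codes_per_step

    def forward(self, frames):                    # frames: (B, T, frame_dim)
        deltas = frames[:, 1:] - frames[:, :-1]   # salient visual transitions
        pooled = deltas.mean(dim=1)               # (B, frame_dim)
        codes = self.proj(pooled)
        return codes.view(-1, self.codes_per_step, self.code_dim)

class AutoregressiveHead(nn.Module):
    """Predicts the next latent code from the preceding code sequence."""
    def __init__(self, code_dim=8, hidden=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=code_dim, nhead=2,
            dim_feedforward=hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.out = nn.Linear(code_dim, code_dim)

    def forward(self, codes):                     # codes: (B, S, code_dim)
        mask = nn.Transformer.generate_square_subsequent_mask(codes.size(1))
        h = self.encoder(codes, mask=mask)        # causal self-attention
        return self.out(h[:, -1])                 # next-code prediction

frames = torch.randn(2, 5, 64)            # batch of 2 five-frame clips
codes = LatentDynamicsModel()(frames)     # (2, 3, 8) latent codes
next_code = AutoregressiveHead()(codes)   # (2, 8) predicted next code
```

The causal mask is what makes the head autoregressive: each position can only attend to earlier codes, so the model learns sequential dependencies from the compressed visual transitions.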
Quick Start & Requirements
cd VideoWorld
bash install.sh
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The transformers library might require a patch for inference due to bos_token_id issues.
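One possible shape for such a patch is shown below. This is a hedged sketch, not the project's official fix: the helper name and the fallback token id are assumptions, and the dummy config merely stands in for an object like transformers.GenerationConfig.

```python
# Hypothetical workaround for the bos_token_id caveat above. Works on any
# config-like object exposing a bos_token_id attribute (e.g. a transformers
# GenerationConfig); the fallback id of 0 is an illustrative assumption.
def ensure_bos_token_id(config, fallback_bos_id=0):
    """Set a usable bos_token_id before calling model.generate()."""
    if getattr(config, "bos_token_id", None) is None:
        config.bos_token_id = fallback_bos_id
    return config

class _DummyConfig:
    """Stand-in for a generation config missing its bos_token_id."""
    bos_token_id = None

cfg = ensure_bos_token_id(_DummyConfig())
```

In practice the correct token id should come from the checkpoint's tokenizer rather than a hard-coded fallback; check the repository's issue tracker for the exact patch it expects.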