Generative model for knowledge learning from unlabeled videos (CVPR 2025 paper)
VideoWorld is a generative model that learns complex knowledge and skills purely from unlabeled video data, targeting researchers in computer vision and AI. It aims to demonstrate that visual observation alone is sufficient for learning tasks, rivaling traditional reinforcement learning approaches without explicit search or reward mechanisms.
How It Works
VideoWorld employs a latent dynamics model (LDM) to compress multi-frame visual changes into compact latent codes. An autoregressive transformer then processes these codes, enabling the model to predict future states and learn sequential dependencies. This approach enhances knowledge acquisition efficiency and effectiveness by focusing on salient visual transitions.
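The two-stage pipeline described above can be sketched in PyTorch. This is a minimal illustrative mock-up, not the paper's implementation: the module names, dimensions, and the frame-difference pooling used here are all assumptions standing in for the actual LDM and transformer architectures.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the VideoWorld pipeline: an LDM compresses
# multi-frame visual changes into compact codes, and an autoregressive
# transformer predicts the next code. Sizes are illustrative only.

class LatentDynamicsModel(nn.Module):
    """Compresses visual change across a clip into a few latent codes."""
    def __init__(self, frame_dim=64, code_dim=8, codes_per_step=3):
        super().__init__()
        self.proj = nn.Linear(frame_dim, code_dim * codes_per_step)
        self.code_dim = code_dim
        self.codes_per_step = codes_per_step

    def forward(self, frames):                    # frames: (B, T, frame_dim)
        deltas = frames[:, 1:] - frames[:, :-1]   # salient visual transitions
        pooled = deltas.mean(dim=1)               # (B, frame_dim)
        codes = self.proj(pooled)
        return codes.view(-1, self.codes_per_step, self.code_dim)

class AutoregressiveHead(nn.Module):
    """Predicts the next latent code from the preceding code sequence."""
    def __init__(self, code_dim=8, hidden=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=code_dim, nhead=2,
            dim_feedforward=hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.out = nn.Linear(code_dim, code_dim)

    def forward(self, codes):                     # codes: (B, S, code_dim)
        mask = nn.Transformer.generate_square_subsequent_mask(codes.size(1))
        h = self.encoder(codes, mask=mask)        # causal self-attention
        return self.out(h[:, -1])                 # next-code prediction

frames = torch.randn(2, 5, 64)            # batch of 2 five-frame clips
codes = LatentDynamicsModel()(frames)     # (2, 3, 8) latent codes
next_code = AutoregressiveHead()(codes)   # (2, 8) predicted next code
```

The causal mask is what makes the head autoregressive: each position can only attend to earlier codes, so the model learns sequential dependencies from the compressed visual transitions.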
Quick Start & Requirements
cd VideoWorld
bash install.sh
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The transformers library might require a patch for inference due to bos_token_id issues.
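One possible shape for such a patch is shown below. This is a hedged sketch, not the project's official fix: the helper name and the fallback token id are assumptions, and the dummy config merely stands in for an object like transformers.GenerationConfig.

```python
# Hypothetical workaround for the bos_token_id caveat above. Works on any
# config-like object exposing a bos_token_id attribute (e.g. a transformers
# GenerationConfig); the fallback id of 0 is an illustrative assumption.
def ensure_bos_token_id(config, fallback_bos_id=0):
    """Set a usable bos_token_id before calling model.generate()."""
    if getattr(config, "bos_token_id", None) is None:
        config.bos_token_id = fallback_bos_id
    return config

class _DummyConfig:
    """Stand-in for a generation config missing its bos_token_id."""
    bos_token_id = None

cfg = ensure_bos_token_id(_DummyConfig())
```

In practice the correct token id should come from the checkpoint's tokenizer rather than a hard-coded fallback; check the repository's issue tracker for the exact patch it expects.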