tinyworlds by AlmondGod

Minimal world model for generating interactive video

Created 9 months ago

1,166 stars

Top 33.0% on SourcePulse

Project Summary

TinyWorlds is a minimal, educational implementation of DeepMind's Genie world model architecture. It addresses the challenge of training world models on action-less internet video by inferring the actions that occur between frames. Aimed at engineers and researchers, it provides a small, readable codebase for exploring the autoregressive, unsupervised methods likely used by DeepMind to build scalable world models.

How It Works

The project employs an autoregressive transformer over discrete tokens, significantly simplifying the prediction task. Core components include a Video Tokenizer (an FSQ VAE) that compresses video frames into a small set of discrete tokens, and an Action Tokenizer that infers action tokens between frames without explicit labels. A Dynamics Model, inspired by MaskGIT and BERT, then predicts future frame tokens conditioned on past video and inferred action tokens. This approach allows for scalable world model training from unlabeled video data by learning the underlying dynamics and actions.
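To make the Video Tokenizer's discretization step concrete, the following is a minimal sketch of Finite Scalar Quantization in plain Python. It is not taken from the TinyWorlds codebase: the function names, the level counts, and the simplified "shift, then round" bounding are assumptions chosen for illustration. Each latent dimension is squashed into a fixed range, rounded to one of a few levels, and the per-dimension codes are combined into a single token id.

```python
import math

def fsq_quantize(z, levels):
    """Quantize one latent vector into per-dimension integer codes.

    z      -- list of floats (a continuous latent; hypothetical input)
    levels -- list of ints, number of quantization levels per dimension
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        # tanh bounds the value to (-1, 1); scaling and shifting maps it
        # into (0, num_levels - 1) so rounding yields a valid code.
        bounded = math.tanh(value) * half + half
        codes.append(round(bounded))
    return codes

def codes_to_token(codes, levels):
    """Combine per-dimension codes into one token id (mixed-radix encoding)."""
    token, base = 0, 1
    for code, num_levels in zip(codes, levels):
        token += code * base
        base *= num_levels
    return token

# Illustrative usage: 3 latent dimensions with 5 levels each give a
# vocabulary of 5 * 5 * 5 = 125 possible tokens.
levels = [5, 5, 5]
codes = fsq_quantize([0.0, 10.0, -10.0], levels)  # [2, 4, 0]
token = codes_to_token(codes, levels)             # 22
```

In a trained tokenizer the rounding step would need a straight-through gradient estimator so the encoder can still learn; that part is omitted here because it requires an autodiff framework.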

Quick Start & Requirements

Install by cloning the repository and running pip install -r requirements.txt. A WANDB_API_KEY is required. Datasets such as zelda_frames.h5 or sonic_frames.h5 must be downloaded from Hugging Face using the provided scripts. Start training with python scripts/full_train.py; inference can be run after pulling pre-trained checkpoints. The project supports acceleration through torch.compile, Distributed Data Parallel (DDP), Automatic Mixed Precision (AMP), and TF32 training.
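Before launching a training run, it can help to confirm that the two prerequisites named above are in place. The helper below is a hypothetical pre-flight check, not part of the repository: the function name and the dataset path data/zelda_frames.h5 are assumptions, so adjust the path to wherever the download script actually places the file.

```python
import os
from pathlib import Path

def missing_prerequisites(env, dataset_path):
    """Return a list of human-readable problems; an empty list means ready.

    env          -- mapping of environment variables (e.g. os.environ)
    dataset_path -- where the .h5 dataset is expected (hypothetical location)
    """
    problems = []
    if not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set")
    if not Path(dataset_path).is_file():
        problems.append(f"dataset not found: {dataset_path}")
    return problems

if __name__ == "__main__":
    # Assumed path for illustration only; the README does not state it here.
    for problem in missing_prerequisites(os.environ, "data/zelda_frames.h5"):
        print(problem)
```

If the check prints nothing, python scripts/full_train.py should at least get past the logging and data-loading setup.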

Highlighted Details

Features a Space-Time Transformer (STT) with spatial and temporal attention mechanisms.
Utilizes Finite Scalar Quantization (FSQ) VAEs to learn structured discrete token vocabularies.
The Action Tokenizer is trained adversarially to infer actions from frame sequences, crucial for unsupervised learning.
Supports inference on various retro game environments, including Zelda, Sonic, and Pong.
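The split between spatial and temporal attention in a Space-Time Transformer can be illustrated by listing which tokens are allowed to attend to each other. The sketch below follows the factorization described for Genie (spatial attention within a frame, causal temporal attention across frames at the same patch position); whether TinyWorlds uses exactly this masking is an assumption, and all function names are hypothetical.

```python
def spatial_groups(num_frames, patches_per_frame):
    """Tokens that attend to each other in a spatial layer: same frame."""
    return [
        [frame * patches_per_frame + patch for patch in range(patches_per_frame)]
        for frame in range(num_frames)
    ]

def temporal_groups(num_frames, patches_per_frame):
    """Tokens that attend to each other in a temporal layer: same patch position."""
    return [
        [frame * patches_per_frame + patch for frame in range(num_frames)]
        for patch in range(patches_per_frame)
    ]

def causal_allowed(group):
    """Within a time-ordered group, each token sees itself and earlier tokens."""
    return {token: group[: i + 1] for i, token in enumerate(group)}

# Illustrative usage: 2 frames of 3 patches, flattened as frame-major token ids.
print(spatial_groups(2, 3))    # [[0, 1, 2], [3, 4, 5]]
print(temporal_groups(2, 3))   # [[0, 3], [1, 4], [2, 5]]
print(causal_allowed([1, 4]))  # {1: [1], 4: [1, 4]}
```

Factorizing attention this way keeps each attention operation small: a spatial layer works over patches_per_frame tokens and a temporal layer over num_frames tokens, instead of every token attending to all num_frames * patches_per_frame tokens at once.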

Maintenance & Community

The project appears open for contributions, with a "Next Steps" section detailing numerous planned enhancements and areas for improvement. No specific community channels (e.g., Discord, Slack) or formal maintenance structures are detailed in the provided README.

Licensing & Compatibility

The provided README does not specify a software license. Users should verify licensing terms before adoption, especially for commercial use.

Limitations & Caveats

Described as a "minimal implementation," TinyWorlds is intended for understanding and extension rather than production deployment. Features such as Mixture of Experts, advanced positional embeddings, and Fully Sharded Data Parallel (FSDP) training are listed as future work, so the project is still under active development.

