HY-WorldPlay by Tencent-Hunyuan

Systematic framework for real-time interactive world modeling

Created 1 month ago
929 stars

Top 39.3% on SourcePulse

Project Summary

Summary

HY-World 1.5, also known as WorldPlay, is an open-source framework for real-time interactive world modeling that prioritizes long-term geometric consistency. It addresses the limitations of previous methods that required lengthy offline generation processes and lacked interactivity. This project targets researchers and developers seeking to create dynamic, consistent 3D environments with low latency, enabling applications like 3D reconstruction, promptable events, and infinite world extension. The primary benefit is achieving real-time streaming video generation (24 FPS) while maintaining high visual quality and temporal coherence.

How It Works

WorldPlay is a streaming video diffusion model built on four key innovations:

  • Dual Action Representation: enables robust control via keyboard and mouse inputs.
  • Reconstituted Context Memory: dynamically rebuilds the context of past frames, using temporal reframing to retain geometrically important, distant frames and counter memory attenuation.
  • WorldCompass: a novel Reinforcement Learning (RL) post-training framework that directly enhances action-following and visual quality over long horizons.
  • Context Forcing: a memory-aware distillation technique that aligns teacher and student model contexts to preserve long-range information capacity and prevent error drift, enabling real-time performance.
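The loop below is a minimal, illustrative sketch of how an action-conditioned streaming world model of this kind can be organized. All names (Action, ContextMemory, denoise_next_frame, is_keyframe) are hypothetical stand-ins, not HY-WorldPlay's actual API.

```python
# Illustrative sketch only: Action, ContextMemory and the model methods used
# here are hypothetical stand-ins, not HY-WorldPlay's actual classes or API.
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Action:
    keys: Tuple[str, ...]             # pressed keyboard keys, e.g. ("W",)
    mouse_delta: Tuple[float, float]  # (dx, dy) camera rotation


@dataclass
class ContextMemory:
    """Recent frames plus temporally reframed, geometrically important keyframes."""
    recent: List[Any] = field(default_factory=list)
    keyframes: List[Any] = field(default_factory=list)
    max_recent: int = 16

    def update(self, frame: Any, is_keyframe: bool) -> None:
        self.recent.append(frame)
        if len(self.recent) > self.max_recent:
            evicted = self.recent.pop(0)
            if is_keyframe:
                self.keyframes.append(evicted)  # retain distant but important context

    def rebuild(self) -> List[Any]:
        # "Reconstituted" context: distant keyframes and fresh frames in one window.
        return self.keyframes[-8:] + self.recent


def run_stream(model: Any, first_frame: Any, get_user_action, num_frames: int = 240) -> List[Any]:
    """Generate frames one at a time, re-conditioning on rebuilt context each step."""
    memory = ContextMemory(recent=[first_frame])
    frames = [first_frame]
    for _ in range(num_frames):
        action = get_user_action()                         # keyboard + mouse ("dual" action)
        context = memory.rebuild()                         # reconstituted context window
        frame = model.denoise_next_frame(context, action)  # few-step distilled student
        memory.update(frame, is_keyframe=model.is_keyframe(frame))
        frames.append(frame)
    return frames
```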

Quick Start & Requirements

  • Primary Install: Setup involves creating a conda environment with Python 3.10, activating it, and running pip install -r requirements.txt.
  • Prerequisites: NVIDIA GPU with CUDA support is required. Minimum GPU memory is 14 GB (with model offloading enabled). Access to a gated Hugging Face model (black-forest-labs/FLUX.1-Redux-dev) is necessary for the vision encoder; access must be requested and approved.
  • Dependencies: Flash Attention is recommended for faster inference and reduced memory usage. Model weights require downloading (~60 GB+ in total), including action models, base video models, text encoders, and vision encoders. A minimal prerequisite-check sketch follows this list.
  • Links: Online demo available at https://3d.hunyuan.tencent.com/sceneTo3D. Technical report and research paper details are linked within the README.
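Before downloading the ~60 GB of weights, it can be worth verifying the prerequisites above. The helper below is a minimal sketch using PyTorch and huggingface_hub (assumed to be installed); it is not part of the HY-WorldPlay codebase.

```python
# Environment sanity checks before downloading weights. Illustrative helper
# only, not part of the HY-WorldPlay codebase.
import importlib.util

import torch
from huggingface_hub import HfApi

MIN_VRAM_GB = 14  # README minimum, with model offloading enabled


def check_gpu() -> None:
    assert torch.cuda.is_available(), "An NVIDIA GPU with CUDA support is required."
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.1f} GB VRAM)")
    if vram_gb < MIN_VRAM_GB:
        print(f"Warning: less than {MIN_VRAM_GB} GB VRAM; enable model offloading.")


def check_flash_attention() -> None:
    # Optional, but recommended for faster inference and lower memory usage.
    if importlib.util.find_spec("flash_attn") is None:
        print("flash-attn not installed; inference will fall back to slower attention.")


def check_gated_vision_encoder() -> None:
    # Access to the gated FLUX.1-Redux-dev repo must be requested and approved
    # on Hugging Face before the vision encoder can be downloaded.
    try:
        HfApi().model_info("black-forest-labs/FLUX.1-Redux-dev")
        print("Gated vision-encoder repo is accessible.")
    except Exception as err:
        print(f"Cannot access gated repo (request access, then `huggingface-cli login`): {err}")


if __name__ == "__main__":
    check_gpu()
    check_flash_attention()
    check_gated_vision_encoder()
```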

Highlighted Details

  • Achieves real-time streaming inference at 24 FPS.
  • Demonstrates superior performance on quantitative metrics (PSNR, SSIM, LPIPS, $R_{dist}$, $T_{dist}$) compared to existing methods for both short-term and long-term consistency; a sketch of how the frame-level metrics are typically computed follows this list.
  • Supports versatile applications, including first-person and third-person perspectives across real-world and stylized environments.
  • Provides a comprehensive, open-sourced training framework covering data, training, and inference deployment stages.
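For readers unfamiliar with the fidelity metrics listed above, the snippet below shows how PSNR, SSIM, and LPIPS are commonly computed for a generated/reference frame pair. It is not HY-WorldPlay's evaluation script and assumes scikit-image and the lpips package are installed; $R_{dist}$ and $T_{dist}$ measure camera rotation and translation drift, require estimated poses, and are omitted here.

```python
# Common per-frame fidelity metrics; illustrative only, not HY-WorldPlay's
# evaluation code. Requires scikit-image (>=0.19) and the lpips package.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_model = lpips.LPIPS(net="alex")  # learned perceptual image patch similarity


def frame_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: HxWx3 uint8 frames (generated vs. reference)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)

    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_model(to_tensor(pred), to_tensor(gt)).item()

    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```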

Maintenance & Community

The project actively encourages community discussion via WeChat and Discord groups. Specific details on maintainers, sponsorships, or a public roadmap are not provided in the README.

Licensing & Compatibility

The provided README text does not explicitly state a license; verify the licensing terms before commercial use or closed-source integration.

Limitations & Caveats

The project's TODO list marks acceleration, quantization, and open-source training code as planned, so these features may not yet be available. Access to a gated Hugging Face model is a prerequisite for full functionality, which may pose an adoption hurdle.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 20
  • Star History: 937 stars in the last 30 days
