Interactive world generation from text, image, or video
Top 92.0% on SourcePulse
Yume is an interactive world generation model designed for creating realistic and dynamic visual content from text, image, or video inputs. It targets researchers and developers in AI-driven content creation, offering a framework for long-form video generation with fine-grained control over camera and character actions.
How It Works
Yume builds on a distillation recipe for video Diffusion Transformer (DiT) models and provides FramePack-like training code. It supports long-video generation, distributed training (DDP/FSDP), and efficient sampling. Interactive control is exposed through text prompts that specify camera movement and character actions, enabling dynamic scene generation.
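The chunk-by-chunk, action-conditioned rollout described above can be sketched as follows. This is a minimal illustration only: the function names, chunk size, and frame representation are hypothetical and do not reflect the repository's actual API, and the stub below stands in for the real DiT denoising step.

```python
# Hypothetical sketch of a Yume-style interactive rollout.
# `generate_chunk` is a stand-in for the real DiT sampling call;
# names and structure here are illustrative assumptions only.

def generate_chunk(context_frames, action_prompt, chunk_len=16):
    """Stub for a denoising call: returns `chunk_len` new 'frames'
    conditioned on prior context and a text action prompt."""
    # The real model runs iterative denoising; here each frame is
    # just a (prompt, index) record so the control flow is visible.
    return [(action_prompt, i) for i in range(chunk_len)]

def interactive_rollout(actions, chunk_len=16):
    """FramePack-style long-video generation: produce the video one
    chunk at a time, feeding recent frames back in as context."""
    video = []
    for action in actions:
        context = video[-chunk_len:]  # most recent frames as context
        video.extend(generate_chunk(context, action, chunk_len))
    return video

video = interactive_rollout(["camera: pan left", "character: walk forward"])
print(len(video))  # 2 actions x 16 frames per chunk = 32
```

The key design point is that each user action only conditions the next chunk, so control can be interleaved with generation rather than fixed up front.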
Quick Start & Requirements
Install dependencies with:

```shell
pip install -r requirements.txt
pip install .   # re-run after code modifications
```

Inference scripts are provided: `scripts/inference/sample_jpg.sh` (image-to-video) and `scripts/inference/sample.sh` (general video).
Maintenance & Community
The project is associated with an arXiv paper (2507.17744) and a Hugging Face model repository. Contributions are welcomed.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Training is resource-intensive, requiring a minimum of 16 A100 GPUs. The project is under active development, with FP8 support and quantized models on the stated roadmap.