Research paper for 3D/4D scene generation from a single image using video diffusion
Top 31.8% on sourcepulse
DimensionX is a framework for generating 3D and 4D scenes from single images using controllable video diffusion. It targets researchers and developers in computer vision and graphics who need to create complex spatial and temporal scene representations from limited input. The primary benefit is enabling precise control over scene structure and motion, bridging the gap between generated videos and real-world scene reconstruction.
How It Works
DimensionX employs a novel ST-Director module to decouple spatial and temporal factors in video diffusion models. It achieves this by learning dimension-aware LoRAs from dimension-variant datasets. This approach allows for fine-grained manipulation of spatial layout and temporal dynamics, facilitating the reconstruction of 3D and 4D scene representations from sequential frames. For 3D generation, a trajectory-aware mechanism is used, while 4D generation incorporates an identity-preserving denoising strategy.
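The core idea of dimension-aware LoRAs can be illustrated with a toy sketch: a frozen base weight is composed with two independent low-rank updates, one spatial and one temporal, and zeroing one scale switches that dimension off. The shapes, names, and random factors below are illustrative assumptions, not DimensionX's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (toy values)

W = rng.normal(size=(d, d))  # frozen base weight of the video diffusion model

# Two hypothetical dimension-aware LoRA factor pairs (S-Director / T-Director)
A_s, B_s = rng.normal(size=(r, d)), rng.normal(size=(d, r))  # spatial
A_t, B_t = rng.normal(size=(r, d)), rng.normal(size=(d, r))  # temporal

def effective_weight(alpha_s: float, alpha_t: float) -> np.ndarray:
    """Compose the base weight with scaled spatial/temporal low-rank updates."""
    return W + alpha_s * (B_s @ A_s) + alpha_t * (B_t @ A_t)

# Spatial-only control: temporal director off (e.g., camera orbit, frozen scene)
W_spatial = effective_weight(1.0, 0.0)
# Temporal-only control: spatial director off (fixed viewpoint, scene dynamics)
W_temporal = effective_weight(0.0, 1.0)
```

Because each update is rank-`r`, the two directors can be trained separately on dimension-variant data and then mixed at inference time by choosing the scales.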
Quick Start & Requirements
Navigate to `src/gradio_demo/` and run `pip install -r requirements.txt`. Set `OPENAI_API_KEY` and `OPENAI_BASE_URL`, then execute `python app.py`.
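Collected as a shell session, the setup steps look roughly like this; the key value and base URL are placeholders you must replace with your own:

```shell
cd src/gradio_demo
pip install -r requirements.txt

# Credentials for the VLM-based image captioning (values are placeholders)
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://your-provider/v1"

python app.py
```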
Requires the `diffusers` library and, potentially, a VLM for image captioning. GPU acceleration is highly recommended for inference. A `CogVideoXImageToVideoPipeline` is provided.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Compatible with the `diffusers` library.
Limitations & Caveats
A `ValueError` has been reported when calling `fuse_lora`, related to `text_encoder` not being found; a workaround has been suggested.
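One commonly suggested workaround for this class of error, sketched below and untested here, is to restrict fusion to the components that actually carry LoRA weights (CogVideoX pipelines load LoRAs into the transformer, not a text encoder). The LoRA path is a placeholder:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("path/to/director_lora.safetensors")  # placeholder path

# Fuse only into the transformer so fuse_lora() does not look for a
# (non-existent) text_encoder adapter:
pipe.fuse_lora(components=["transformer"], lora_scale=1.0)
```

Whether this resolves the reported issue depends on the `diffusers` version in use; the `components` argument is only available in recent releases.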