DimensionX by wenqsun

Research paper for 3D/4D scene generation from a single image using video diffusion

created 8 months ago
1,274 stars

Top 31.8% on sourcepulse

View on GitHub
Project Summary

DimensionX is a framework for generating 3D and 4D scenes from single images using controllable video diffusion. It targets researchers and developers in computer vision and graphics who need to create complex spatial and temporal scene representations from limited input. The primary benefit is enabling precise control over scene structure and motion, bridging the gap between generated videos and real-world scene reconstruction.

How It Works

DimensionX employs a novel ST-Director module to decouple spatial and temporal factors in video diffusion models. It achieves this by learning dimension-aware LoRAs from dimension-variant datasets. This approach allows for fine-grained manipulation of spatial layout and temporal dynamics, facilitating the reconstruction of 3D and 4D scene representations from sequential frames. For 3D generation, a trajectory-aware mechanism is used, while 4D generation incorporates an identity-preserving denoising strategy.
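
The decoupling described above can be pictured with the diffusers LoRA-adapter API. The checkpoint paths and adapter names below are placeholders, and treating the two directors as switchable LoRA adapters is an illustrative sketch of the idea, not the project's exact training or inference code:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# Base video diffusion model (DimensionX builds on CogVideoX).
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)

# Each "director" is a dimension-aware LoRA learned from a
# dimension-variant dataset (paths/names here are hypothetical).
pipe.load_lora_weights("path/to/s_director_lora", adapter_name="s_director")
pipe.load_lora_weights("path/to/t_director_lora", adapter_name="t_director")

# Spatial control: hold time fixed, vary camera/layout.
pipe.set_adapters(["s_director"], adapter_weights=[1.0])

# Temporal control: hold viewpoint fixed, vary scene dynamics.
pipe.set_adapters(["t_director"], adapter_weights=[1.0])
```

Selecting one adapter at a time is what makes the generated frame sequence vary along a single dimension, which is what the downstream 3D/4D reconstruction relies on.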

Quick Start & Requirements

  • Install: Navigate to src/gradio_demo/ and run pip install -r requirements.txt.
  • Run: Set OPENAI_API_KEY and OPENAI_BASE_URL, then execute python app.py.
  • Prerequisites: Requires Python, the diffusers library, and optionally a vision-language model (VLM) for image captioning. GPU acceleration is strongly recommended for inference.
  • Demo: An online Hugging Face demo is available.
  • Inference Code: Example inference code using CogVideoXImageToVideoPipeline is provided.
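
The install and run steps above, collected as shell commands (the environment-variable values are placeholders you must supply):

```shell
# From the repository root
cd src/gradio_demo
pip install -r requirements.txt

# The demo uses a VLM for image captioning via an OpenAI-compatible API
export OPENAI_API_KEY=sk-...        # your API key
export OPENAI_BASE_URL=https://...  # your API endpoint

python app.py
```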

Highlighted Details

  • Generates photorealistic 3D and 4D scenes from single images.
  • Utilizes ST-Director for controllable spatial and temporal video diffusion.
  • Includes trajectory-aware mechanism for 3D and identity-preserving denoising for 4D.
  • Released partial model checkpoints (Orbit Left, Orbit Up) on Google Drive and Hugging Face.

Maintenance & Community

  • The project is actively under development with a roadmap including releasing more checkpoints, T-Director, long video generation, video interpolation, and 3DGS optimization code.
  • A Hugging Face demo is available.

Licensing & Compatibility

  • The project is released under the Apache 2.0 license.
  • The code is based on CogVideoX and uses the diffusers library.

Limitations & Caveats

  • Currently, only partial model checkpoints are released.
  • The provided inference code notes a potential ValueError from fuse_lora when a text_encoder component is not found, and suggests a workaround.
  • The full suite of features, including T-Director and 4D generation code, is still under development.
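
For the fuse_lora caveat noted above, one common pattern with diffusers is to restrict fusion to the components the pipeline actually exposes. This is an assumed workaround sketch, not necessarily the repository's exact fix, and the checkpoint path is a placeholder:

```python
# Load a DimensionX LoRA checkpoint (hypothetical path).
pipe.load_lora_weights("path/to/orbit_left_lora")

# Fuse only the transformer LoRA layers, skipping the
# text_encoder component that triggers the ValueError.
pipe.fuse_lora(components=["transformer"], lora_scale=1.0)
```
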

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 48 stars in the last 90 days
