gcd by basilevh

Generative model for extreme monocular dynamic novel view synthesis

created 1 year ago
261 stars

Top 98.0% on sourcepulse

Project Summary

This repository provides the official implementation for "Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis," a method for generating novel views of dynamic scenes from a single camera input. It is targeted at researchers and practitioners in computer vision and graphics interested in monocular video understanding and synthesis. The project offers pretrained models, inference, training, and evaluation code, along with dataset generation tools.

How It Works

The Generative Camera Dolly (GCD) approach builds on Stable Video Diffusion (SVD) to perform dynamic novel view synthesis. Input videos are first converted into point cloud representations, which are then used to train a diffusion model capable of generating new views. At inference time, the camera can either move gradually, following an interpolated trajectory from the source pose to the target pose, or jump directly to the target viewpoint.
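To make the camera-trajectory idea concrete, the sketch below sweeps a camera horizontally around a look-at point, one pose per output frame, covering an arbitrary azimuth (e.g., the 180-degree maximum noted later). It is illustrative only: the function names and conventions (z-up world, world-to-camera rotation matrices) are assumptions, not the repository's actual camera utilities.

```python
# Illustrative camera-orbit interpolation (hypothetical helpers, not GCD's API).
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """World-to-camera rotation whose -z axis points from `eye` at `target`."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, -forward])  # 3x3 rotation matrix

def orbit_trajectory(center, radius, height, azimuth_deg, num_frames=14):
    """Poses sweeping `azimuth_deg` horizontally around `center`, one per frame."""
    poses = []
    for t in np.linspace(0.0, np.radians(azimuth_deg), num_frames):
        eye = center + np.array([radius * np.cos(t), radius * np.sin(t), height])
        poses.append((look_at(eye, center), eye))  # (rotation, camera position)
    return poses

# Example: a full 180-degree horizontal sweep over 14 output frames.
poses = orbit_trajectory(center=np.zeros(3), radius=4.0, height=2.0,
                         azimuth_deg=180.0)
print(len(poses), poses[0][1], poses[-1][1])
```

In the gradual mode, each generated frame would be conditioned on the corresponding pose along such a path; in the direct mode, every frame would use the final pose.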

Quick Start & Requirements

  • Installation: Create a Conda environment (conda create -n gcd python=3.10), activate it (conda activate gcd), install PyTorch with CUDA 12.1, then install the project dependencies: pip install git+https://github.com/OpenAI/CLIP.git, pip install git+https://github.com/Stability-AI/datapipelines.git, and pip install -r requirements.txt.
  • Prerequisites: Python 3.10+, PyTorch 2.0.1+ with CUDA 12.1, and significant disk space (7 TB for Kubric-4D, 4.4 TB for ParallelDomain-4D processed data).
  • Resources: Training requires multiple GPUs (e.g., 8x NVIDIA A100 or A6000) with substantial VRAM (around 50 GB per GPU); dataset processing also relies heavily on GPUs. A quick environment check is sketched after this list.
  • Links: Paper, Website, Results, Datasets, Models.
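With the environment set up, a short sanity check (not part of the repository) can confirm that PyTorch sees CUDA and that each GPU has roughly the VRAM the training setup expects:

```python
# Post-install sanity check for the PyTorch + CUDA setup (illustrative).
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gb = props.total_memory / 1024**3
    # Training is reported to need around 50 GB of VRAM per GPU.
    print(f"GPU {i}: {props.name}, {vram_gb:.0f} GB VRAM")
```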

Highlighted Details

  • Achieves up to 23.47 dB PSNR on ParallelDomain-4D (RGB output) and 39.0% mIoU for semantic segmentation; the standard PSNR formula is sketched after this list.
  • Supports novel view synthesis with up to 180 degrees of horizontal camera displacement.
  • Includes tools for dataset generation using Kubric and processing for ParallelDomain-4D.
  • Offers Gradio-based inference for quick experimentation.
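The PSNR figure above follows the standard peak signal-to-noise-ratio definition for images in a fixed dynamic range. A minimal version is sketched below; this is the textbook formula, not necessarily the repository's exact evaluation code, which may differ in details such as masking or per-frame averaging.

```python
# Standard PSNR for float images in [0, max_val] (illustrative, not the
# repo's evaluation code).
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)

# Example: a frame compared against a darkened copy of itself.
frame = np.random.rand(256, 256, 3)
print(f"{psnr(frame, frame * 0.9):.2f} dB")
```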

Maintenance & Community

The project is maintained by Basile Van Hoorick and collaborators from Columbia University and Toyota Research Institute. The codebase has been refactored for public release, with a note that thorough vetting is ongoing. Users are encouraged to report issues and suggest fixes.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, it depends on Stable Video Diffusion, which has its own licensing terms. Compatibility for commercial use or closed-source linking would require checking the specific licenses of all dependencies, including SVD.

Limitations & Caveats

The codebase has undergone refactoring and may contain undiscovered issues. The project primarily targets synthetic datasets (Kubric-4D, ParallelDomain-4D) and may perform best on similar data; the authors note that the models may not perform well on videos containing humans. Some ParallelDomain-4D dataset folders may be missing frames.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 13 stars in the last 90 days
