EnVision-Research: Deterministic video depth estimation framework
Deterministic Video Depth (DVD) addresses the trade-off between semantic ambiguity in discriminative video depth estimation models and temporal flickering/hallucinations in generative models. It offers a deterministic framework by adapting pre-trained video diffusion models into single-pass depth regressors. This approach benefits researchers and practitioners seeking highly detailed, temporally stable, and data-efficient video depth estimation solutions.
How It Works
DVD breaks the ambiguity-hallucination dilemma by adapting generative video diffusion models into deterministic depth regressors. It strips away generative stochasticity, uniting the semantic priors of diffusion models with the structural stability of discriminative methods. Key innovations include Latent Manifold Rectification (LMR) for improved structural fidelity and boundary precision, and a training-free Global Affine Coherence (GAC) module that enables seamless long-video inference with minimal scale drift.
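The core idea behind training-free affine alignment across video windows can be illustrated with a minimal sketch: depth is predicted per overlapping window, each window is fit to the running result with a least-squares scale/shift on the overlap, and the overlap is blended. This is an illustrative reconstruction of the general technique, not DVD's actual GAC implementation; the function names and blending scheme below are assumptions.

```python
import numpy as np

def fit_affine(src, ref):
    # Least-squares scale a and shift b so that a*src + b ≈ ref
    # on the overlapping frames (closed-form via degree-1 polyfit).
    a, b = np.polyfit(src.ravel(), ref.ravel(), 1)
    return a, b

def stitch_windows(windows, overlap):
    # windows: list of (T, H, W) per-window depth predictions, where
    # consecutive windows share `overlap` frames. Each new window is
    # affinely aligned to the running result, then the overlap is blended.
    result = windows[0].astype(np.float64)
    for clip in windows[1:]:
        clip = clip.astype(np.float64)
        a, b = fit_affine(clip[:overlap], result[-overlap:])
        clip = a * clip + b
        blended = 0.5 * (result[-overlap:] + clip[:overlap])
        result = np.concatenate([result[:-overlap], blended, clip[overlap:]])
    return result
```

With exact affine distortions the second window is recovered perfectly; in practice the fit is only approximate, which is why a global, training-free alignment keeps scale drift small over long videos.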
Quick Start & Requirements
Installation involves cloning the repository, creating a Conda environment with Python 3.10, and installing the package in editable mode (`pip install -e .`). For faster inference, `pip install sageattention` is recommended, though it should not be used for training. Pre-trained weights must be downloaded from Hugging Face via `huggingface-cli login` and `huggingface-cli download`. Links to the project page, arXiv paper, and Hugging Face model/demo are provided.
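The setup steps above can be sketched as a shell session. The repository URL, environment name, and weights repository are placeholders — consult the project README for the exact values.

```shell
# Illustrative setup, following the steps described above.
git clone https://github.com/EnVision-Research/DVD.git   # repo URL assumed
cd DVD
conda create -n dvd python=3.10 -y
conda activate dvd
pip install -e .
pip install sageattention        # optional inference speedup; skip for training

# Fetch the pre-trained weights (repo name given on the project page).
huggingface-cli login
huggingface-cli download <weights-repo>
```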
Highlighted Details
Maintenance & Community
The project has seen recent activity with the release of its paper, project page, pre-trained weights, and inference code. A ComfyUI node has been contributed by the community, indicating active development and integration efforts.
Licensing & Compatibility
DVD employs a split-license strategy: the source code is released under the permissive Apache 2.0 License. However, the pre-trained model weights are distributed under the CC BY-NC 4.0 License, which strictly restricts usage to non-commercial, academic, and research purposes.
Limitations & Caveats
The primary limitation is the non-commercial restriction imposed by the CC BY-NC 4.0 license on the pre-trained model weights, precluding their use in commercial applications. The online Gradio demo is limited to processing videos up to 5 seconds due to GPU resource constraints. Potential installation issues may arise from dependencies like PyTorch, sentencepiece, cmake, and cupy.