SpatialVID by NJU-3DV

Large-scale video dataset for 3D vision and spatial intelligence

Created 4 months ago
469 stars

Top 64.8% on SourcePulse

Project Summary

SpatialVID addresses the critical need for large-scale, high-quality training data in spatial intelligence, particularly for dynamic real-world scenes that require accurate camera motion annotations. It gives researchers in 3D vision and related fields a comprehensive dataset with dense 3D annotations, enabling significant improvements in model generalization and performance on tasks such as spatial reconstruction and world exploration.

How It Works

The SpatialVID dataset comprises 7,089 hours of dynamic video content, distilled from an initial collection of more than 21,000 hours into 2.7 million clips via a hierarchical filtering pipeline. Each clip is then enriched by an annotation pipeline that produces per-frame camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions, leveraging state-of-the-art models for each task.
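
To make the annotation structure concrete, here is a minimal sketch of how a single clip's annotations might be organized in Python. All field names and array shapes are illustrative assumptions; the dataset's actual schema is defined by the project itself.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ClipAnnotation:
    """Hypothetical per-clip record; names and shapes are illustrative
    assumptions, not SpatialVID's published schema."""
    clip_id: str
    camera_poses: np.ndarray        # (num_frames, 4, 4) per-frame camera extrinsics
    intrinsics: np.ndarray          # (3, 3) camera intrinsics
    depth_maps: np.ndarray          # (num_frames, H, W) per-frame depth
    dynamic_masks: np.ndarray       # (num_frames, H, W) masks of moving pixels
    caption: str                    # structured scene caption
    motion_instructions: list[str]  # serialized camera-motion commands
```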

Quick Start & Requirements

  • Primary Install: Clone the repository recursively, create and activate a Conda environment (conda create -n SpatialVID python=3.10.13, conda activate SpatialVID), and install the requirements (pip install -r requirements/requirements.txt); see the consolidated sketch after this list.
  • Prerequisites: Python 3.10.13, an NVIDIA GPU with CUDA support (CUDA 12.6 is mentioned for PaddlePaddle), and FFmpeg (the README notes that building it with CUDA acceleration and VMAF support is necessary; see Limitations & Caveats). Scripts for downloading model weights and the dataset are provided.
  • Links: GitHub repository (implied by the clone URL), ModelScope (hosting SpatialVID-HQ).
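
Consolidated, the quick-start steps look like the following. The clone URL is an assumption based on the NJU-3DV/SpatialVID naming above, since the README's exact URL is not reproduced here.

```bash
# Clone with submodules (URL assumed from the NJU-3DV/SpatialVID naming above)
git clone --recursive https://github.com/NJU-3DV/SpatialVID.git
cd SpatialVID

# Create and activate the Conda environment pinned to Python 3.10.13
conda create -n SpatialVID python=3.10.13
conda activate SpatialVID

# Install Python dependencies
pip install -r requirements/requirements.txt
```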

Highlighted Details

  • Scale: Features 7,089 hours of dynamic video content across 2.7 million clips.
  • Annotation Richness: Includes per-frame camera poses, depth maps, dynamic masks, structured captions, and motion instructions.
  • Data Diversity: Utilizes "in-the-wild" videos, offering diverse scenes and camera movements.
  • Model Integration: Leverages advanced models such as MegaSaM, Depth Anything V2, UniDepthV2, and SAM2.

Maintenance & Community

The project is authored by researchers from Nanjing University and the Institute of Automation, Chinese Academy of Sciences. No community channels (such as Discord or Slack), social handles, or roadmap are detailed in the provided README.

Licensing & Compatibility

The repository itself is licensed under the Apache License 2.0. However, users must adhere to the individual licenses of integrated third-party models such as MegaSaM, which may restrict commercial or closed-source usage.

Limitations & Caveats

Users may encounter dependency version warnings (e.g., for nvidia-nccl-cu12 and numpy) that require careful management. Building FFmpeg with CUDA acceleration and VMAF support requires specific configuration, as sketched below. Crucially, the licensing terms of the dependent models must be thoroughly reviewed and respected, especially for any commercial applications.
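
As one illustration of the FFmpeg point, a source build with NVIDIA hardware acceleration and VMAF scoring typically passes configure flags along these lines. These are standard upstream FFmpeg options, not flags copied from the SpatialVID README, and the CUDA paths are assumptions that depend on your system.

```bash
# Illustrative FFmpeg source-build configuration for CUDA + VMAF
# (standard upstream options; not taken from the SpatialVID README)
./configure \
  --enable-nonfree \
  --enable-cuda-nvcc \
  --enable-libnpp \
  --enable-libvmaf \
  --extra-cflags=-I/usr/local/cuda/include \
  --extra-ldflags=-L/usr/local/cuda/lib64
make -j"$(nproc)"
```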

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 28 stars in the last 30 days
