SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
SceneVerse provides the first million-scale 3D vision-language dataset, along with the official implementation, for grounded scene understanding. It targets researchers and practitioners in 3D computer vision and natural language processing, delivering state-of-the-art performance on 3D visual grounding benchmarks and supporting zero-shot transfer to unseen scenes.
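As a concrete illustration of the 3D visual grounding task, the sketch below selects the candidate object whose embedding best matches a text query embedding by cosine similarity. The function and embeddings are hypothetical stand-ins for what a grounded vision-language model produces; this is not SceneVerse's actual API.

```python
import numpy as np

def ground_query(object_embeddings, query_embedding):
    """Return the index of the scene object best matching a text query.

    Illustrative only: the embeddings here stand in for the object and
    text features a grounded vision-language model would produce.
    """
    # L2-normalize so dot products become cosine similarities
    objs = object_embeddings / np.linalg.norm(object_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = objs @ query  # cosine similarity of each object to the query
    return int(np.argmax(scores))
```

Given embeddings for, say, a chair, a table, and a lamp, the embedding of the query "the lamp next to the bed" should score highest against the lamp's embedding, and its index is returned.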
How It Works
SceneVerse leverages a large-scale dataset comprising 68K 3D indoor scenes and 2.5M vision-language pairs. The core approach involves a GPS (Grounded Pre-training for Scenes) model, which is pre-trained on this extensive dataset. This pre-training strategy is designed to capture rich semantic relationships between 3D environments and textual descriptions, enabling robust generalization and zero-shot transfer to downstream tasks.
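To make the pre-training idea concrete, the sketch below implements an InfoNCE-style contrastive loss that pulls matched scene-text embedding pairs together and pushes mismatched pairs apart. This is a generic illustration of scene-text contrastive alignment, not the GPS model's exact objective; consult the SceneVerse paper for the actual loss formulation.

```python
import numpy as np

def contrastive_alignment_loss(scene_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over matched scene/text embedding pairs.

    Illustrative sketch of scene-text contrastive alignment; the GPS
    model's actual pre-training objective is defined in the paper.
    """
    # L2-normalize so dot products are cosine similarities
    scene_emb = scene_emb / np.linalg.norm(scene_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = scene_emb @ text_emb.T / temperature  # (N, N); matches on diagonal
    labels = np.arange(len(scene_emb))

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the scene-to-text and text-to-scene directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

At million-pair scale, an objective of this shape is what lets the pre-trained model transfer zero-shot: any new scene or description can be embedded and compared in the shared space.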
Quick Start & Requirements
See DATA.md for dataset preparation and TRAIN.md for detailed instructions on training and inference.
Maintenance & Community
The project is associated with the ECCV 2024 paper "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding." Pre-trained checkpoints and training/inference code were released in late 2024.
Licensing & Compatibility
The README does not explicitly state a license. Because the project heavily adapts code and data from other open-source datasets and projects, users should review those upstream licenses for compatibility before redistribution or commercial use.
Limitations & Caveats
The dataset includes "template" entries for HM3D and Structured3D, indicating that not all data modalities or annotations might be fully available or processed for these specific datasets. The project is relatively new, with code and checkpoints released in mid-to-late 2024.