Scaling indoor 3D object detection and spatial understanding
Top 82.2% on SourcePulse
Summary
This repository provides the public implementation of the Cubify Transformer (CuTR) model and the associated CA-1M dataset for indoor 3D object detection. It also includes CA-VQA annotations for multimodal LLM tasks. Aimed at researchers and developers in 3D computer vision and AI, it offers a scalable approach to understanding indoor spatial environments.
How It Works
The project centers on the Cubify Transformer (CuTR), a model designed for 3D object detection. It supports both RGB and RGB-Depth inputs. The core innovation lies in the CA-1M dataset, which builds upon ARKitScenes captures but features extensive, class-agnostic 3D bounding box annotations, registered ground-truth poses, and rendered depth maps. The CA-VQA dataset extends this by providing annotations for various multimodal question-answering tasks.
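To make the idea of a 3D bounding box annotation concrete, the sketch below computes the eight corners of an oriented box. The parameterization (center, size, and a single yaw angle about a vertical z axis) and the function name box_corners are assumptions for illustration only; the summary does not describe the actual CA-1M annotation schema, which may use a different rotation representation and axis convention.

```python
import math
from itertools import product


def box_corners(center, size, yaw):
    """Return the 8 corners of a box rotated by `yaw` radians about the z axis.

    Hypothetical parameterization: center=(cx, cy, cz), size=(sx, sy, sz).
    """
    cx, cy, cz = center
    sx, sy, sz = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)

    corners = []
    # Each corner sits half a side length from the center along every axis.
    for dx, dy, dz in product((-0.5, 0.5), repeat=3):
        lx, ly, lz = dx * sx, dy * sy, dz * sz
        # Rotate the local offset about z, then translate to the box center.
        x = cx + cos_y * lx - sin_y * ly
        y = cy + sin_y * lx + cos_y * ly
        z = cz + lz
        corners.append((x, y, z))
    return corners
```

A class-agnostic annotation in this sense carries only the box geometry, with no category label attached.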
Quick Start & Requirements
Installation requires Python 3.10 and PyTorch 2.x. After installing PyTorch (pip install torch torchvision), run pip install -r requirements.txt and then pip install -e . to install the package in editable mode. The system supports MPS (Apple Silicon) and CUDA-enabled GPUs for accelerated inference, with CPU fallback. Data download links are provided in data/train.txt and data/val.txt. A demo script (tools/demo.py) facilitates running inference and visualization.
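The accelerator fallback described above can be sketched as follows. This is a minimal illustration, not the repository's own device-selection logic: the helper names pick_device and detect_device are hypothetical, and the priority order (CUDA, then MPS, then CPU) is an assumption. The only PyTorch calls used are torch.cuda.is_available() and torch.backends.mps.is_available().

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device name; the CUDA > MPS > CPU order is an assumption."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends, falling back to CPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; report CPU.
        return "cpu"
    return pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )


if __name__ == "__main__":
    print(f"Selected device: {detect_device()}")
```

Keeping the decision in a pure function makes the fallback order easy to check without any particular hardware present.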
Highlighted Details
Uses rerun for interactive visualization of data and predictions, with options such as --viz-only and --viz-on-gt-points.
Maintenance & Community
The README does not specify details regarding maintainers, community channels (e.g., Discord, Slack), or a public roadmap.
Licensing & Compatibility
The data is released under a CC BY-NC-ND license, which prohibits commercial use and derivative works. The sample code is under the Apple Sample Code License, and the models follow the Apple ML Research Model Terms of Use. Together, these licenses significantly restrict commercial deployment and redistribution.
Limitations & Caveats
The dataset's CC BY-NC-ND license is the primary limitation for commercial applications. CA-1M is a subset of ARKitScenes, containing only the captures that were successfully registered to a laser scanner. Support for custom device captures is described as "basic."