ml-cubifyanything by apple

Scaling indoor 3D object detection and spatial understanding

Created 1 year ago

424 stars

Top 69.0% on SourcePulse

Project Summary

Summary

This repository provides the public implementation of the Cubify Transformer (CuTR) model and the associated CA-1M dataset for indoor 3D object detection. It also includes CA-VQA annotations for multimodal LLM tasks. Aimed at researchers and developers in 3D computer vision and AI, it offers a scalable approach to understanding indoor spatial environments.

How It Works

The project centers on the Cubify Transformer (CuTR), a model designed for 3D object detection. It supports both RGB and RGB-Depth inputs. The core innovation lies in the CA-1M dataset, which builds upon ARKitScenes captures but features extensive, class-agnostic 3D bounding box annotations, registered ground-truth poses, and rendered depth maps. The CA-VQA dataset extends this by providing annotations for various multimodal question-answering tasks.

Quick Start & Requirements

Installation requires Python 3.10 and PyTorch 2.x. Install PyTorch first (pip install torch torchvision), then run pip install -r requirements.txt, and finally install the package in editable mode with pip install -e . (the trailing dot is part of the command). Inference can be accelerated on MPS (Apple Silicon) or CUDA-enabled GPUs, with CPU as a fallback. Data download links are listed in data/train.txt and data/val.txt. A demo script (tools/demo.py) runs inference and visualization.
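The device fallback described above can be sketched as follows. This is a minimal illustration, not the repository's own selection logic: the helper names pick_device and detect_device are hypothetical, and the preference order (CUDA, then MPS, then CPU) is an assumption. Only torch.cuda.is_available() and torch.backends.mps.is_available() are real PyTorch calls.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device string. Assumed preference order: CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable; report "cpu" if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps may be absent on older builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```

Keeping the decision in a function that takes plain booleans makes the fallback order easy to check without any GPU present.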

Highlighted Details

CA-1M Dataset: Offers class-agnostic 3D bounding boxes, per-frame ground-truth 3D boxes derived from rendering, registered laser scanner poses, and high-resolution (512x384) rendered depth maps.
CA-VQA Dataset: Provides diverse question-answering tasks including binary classification, cardinality estimation, 2D/3D grounding, and regression, leveraging multimodal inputs.
Visualization: Integrates with rerun for interactive visualization of data and predictions, supporting options like --viz-only and --viz-on-gt-points.
Custom Capture Support: Includes basic functionality to process RGB/Depth data captured using the NeRF Capture app.
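To make the idea of a class-agnostic 3D bounding box concrete, here is a small sketch that expands a box into its eight corner points. It assumes a box described by a center, a size, and a single rotation angle about the z axis; this parameterization, the choice of z as the rotation axis, and the function name box_corners are illustrative assumptions and are not taken from the CA-1M annotation format.

```python
import math
from itertools import product


def box_corners(center, size, yaw):
    """Return the 8 corners of a box rotated by `yaw` radians about the z axis.

    center: (cx, cy, cz) box center
    size:   (dx, dy, dz) full extents along the box's own axes
    yaw:    rotation about z (assumed axis, for illustration only)
    """
    cx, cy, cz = center
    dx, dy, dz = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    # Each corner is the center plus or minus half the extent along every axis.
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        x, y, z = sx * dx, sy * dy, sz * dz
        # Rotate the local offset in the xy-plane, then translate to the center.
        corners.append((cx + c * x - s * y, cy + s * x + c * y, cz + z))
    return corners


if __name__ == "__main__":
    for corner in box_corners((0.0, 0.0, 0.0), (2.0, 4.0, 6.0), math.pi / 2):
        print(tuple(round(v, 3) for v in corner))
```

Nothing in this representation carries a category label, which is what "class-agnostic" means here: the box describes where an object is and how large it is, not what it is.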

Maintenance & Community

The README does not specify details regarding maintainers, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The data is released under a CC BY-NC-ND license, which prohibits commercial use and derivative works. The sample code is under the Apple Sample Code License, and the models follow the Apple ML Research Model Terms of Use. Together, these licenses significantly restrict commercial deployment and redistribution.

Limitations & Caveats

The CC BY-NC-ND license on the dataset is the primary limitation for commercial applications. CA-1M is a subset of ARKitScenes, containing only captures that were successfully registered to a laser scanner. Support for custom device captures is described as "basic."

Health Check

Last Commit

7 months ago

Responsiveness

Inactive

Star History

5 stars in the last 30 days