OpenYOLO3D by aminebdj

Fast and accurate open-vocabulary 3D instance segmentation

Created 2 years ago

259 stars

Top 97.7% on SourcePulse

Project Summary

Summary

Open-YOLO 3D addresses the high computational cost and slow inference of existing open-vocabulary 3D instance segmentation methods. It proposes a fast and accurate alternative that relies only on 2D object detection in multi-view RGB images to assign open-vocabulary labels, which makes it practical for real-world applications. The project targets researchers and engineers in 3D computer vision, robotics, and augmented/virtual reality who need efficient and precise 3D scene understanding.

How It Works

The core idea of Open-YOLO 3D is to avoid computationally intensive 3D CLIP features and the multi-view aggregation of outputs from heavy 2D foundation models such as SAM. Instead, it generates class-agnostic 3D masks for the objects in a scene and uses a 2D object detector to associate those masks with text prompts. The design rests on the observation that projected class-agnostic 3D point cloud instances already carry instance information, so running SAM on them adds redundant computation. Dropping that step significantly reduces inference time while maintaining state-of-the-art accuracy.
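To make the mask-to-prompt association more concrete, here is a minimal NumPy sketch of one simple way to label class-agnostic 3D masks with a 2D detector: project each mask's points into every view and let each detected box vote for its class, weighted by the detection score and by the fraction of the mask's visible points that fall inside the box. This is a simplified illustration under stated assumptions, not Open-YOLO 3D's actual labeling procedure. The function names, the `views` dictionary layout, the pinhole intrinsics `K`, the 4x4 `world_to_cam` extrinsics, and the `(x1, y1, x2, y2, class_id, score)` detection format are all hypothetical, and occlusion is ignored.

```python
import numpy as np

# Hypothetical helpers for illustration only; not the project's API.

def project_points(points, K, world_to_cam):
    """Project (N, 3) world points into pixels. Returns (uv, valid)."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]   # camera-frame coordinates
    valid = cam[:, 2] > 1e-6                 # keep points in front of the camera
    pix = (K @ cam.T).T
    z = np.where(valid, pix[:, 2], 1.0)      # avoid dividing by zero for invalid points
    uv = pix[:, :2] / z[:, None]
    return uv, valid


def label_masks(points, masks, views, num_classes):
    """Assign one class id per class-agnostic 3D mask by voting over 2D boxes.

    Returns -1 for masks that receive no votes. Occlusion is ignored (assumption).
    """
    labels = []
    for mask in masks:
        votes = np.zeros(num_classes)
        for view in views:
            uv, valid = project_points(points[mask], view["K"], view["world_to_cam"])
            w, h = view["size"]
            in_img = valid & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
            if not in_img.any():
                continue
            uv = uv[in_img]
            for x1, y1, x2, y2, cls, score in view["detections"]:
                inside = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
                # Weight the vote by score and by the fraction of visible mask points in the box.
                votes[int(cls)] += score * inside.mean()
        labels.append(int(votes.argmax()) if votes.any() else -1)
    return labels


# Toy scene: one camera at the origin looking down +z, two masks, two detections.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
pts = np.array([[-0.2, 0.0, 1.0], [-0.1, 0.0, 1.0], [0.1, 0.0, 1.0], [0.2, 0.0, 1.0]])
masks = [np.array([True, True, False, False]), np.array([False, False, True, True])]
view = {"K": K, "world_to_cam": np.eye(4), "size": (100, 100),
        "detections": [(25, 40, 45, 60, 0, 0.9), (55, 40, 75, 60, 1, 0.8)]}
print(label_masks(pts, masks, [view], num_classes=2))  # [0, 1]
```

This sketch needs only point projection and box-containment tests per view, with no per-view segmentation model, which is the kind of saving the approach aims for.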

Quick Start & Requirements

Setup involves configuring a Conda environment and downloading model checkpoints, pre-computed class-agnostic masks, and ground truth masks, as detailed in separate installation and data preparation guides. Example Python code is provided for single-scene inference and visualization. Specific hardware requirements, such as GPUs, are implied but not explicitly stated in the README.

Highlighted Details

Achieves state-of-the-art performance on the ScanNet200 and Replica datasets.
Runs up to roughly 16x faster than prior leading methods.
Reports 24.7% mAP on the ScanNet200 validation set with an inference time of 22 seconds per scene.
Accepted as an oral presentation at ICLR 2025.

Maintenance & Community

The project is authored by researchers affiliated with MBZUAI, Technical University of Munich, Aalto University, Australian National University, and Linköping University. The code and accompanying paper were released on May 30, 2024. No community channels (e.g., Discord, Slack) or roadmap links are provided within the README.

Licensing & Compatibility

The README does not specify a software license. This omission prevents an assessment of compatibility for commercial use or closed-source linking, which is a critical factor for adoption decisions.

Limitations & Caveats

To reproduce the paper's reported metrics exactly, use the provided pre-computed masks. Results may deviate without them, likely because of stochastic elements in underlying models such as Mask3D (for example, furthest point sampling).
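Furthest point sampling is a common source of such run-to-run variation because it typically starts from a randomly chosen point. The minimal NumPy sketch below is a generic illustration of greedy FPS with a random start, not Mask3D's implementation; the function name and the use of a NumPy `Generator` for the start index are assumptions made for the example.

```python
import numpy as np

def furthest_point_sampling(points, k, rng):
    """Greedy furthest point sampling of k indices from an (N, 3) array.

    Generic illustration only (hypothetical helper), not Mask3D's implementation.
    """
    n = len(points)
    selected = [int(rng.integers(n))]  # random starting point: the stochastic step
    # Distance from every point to its nearest already-selected point.
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())       # point furthest from the current selection
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected)


cloud = np.random.default_rng(0).random((1000, 3))
a = furthest_point_sampling(cloud, 16, np.random.default_rng(1))
b = furthest_point_sampling(cloud, 16, np.random.default_rng(2))
c = furthest_point_sampling(cloud, 16, np.random.default_rng(1))
print(set(a.tolist()) == set(b.tolist()), np.array_equal(a, c))
```

With a different seed the start point, and usually the whole sample, changes, while reusing a seed reproduces the same indices in this CPU toy version. Real GPU pipelines may have further sources of nondeterminism, which is one reason the pre-computed masks are the safer route to matching the paper's numbers.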

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days