3D LLM for structured indoor modeling from point clouds
SpatialLM is a 3D large language model designed for structured indoor scene understanding from diverse 3D point cloud data. It targets researchers and developers in robotics and 3D computer vision, enabling the extraction of architectural elements and oriented object bounding boxes from unstructured point clouds generated by monocular video, RGBD, or LiDAR sensors.
How It Works
SpatialLM employs a multimodal architecture that integrates a point cloud encoder (SceneScript) with large language models (Llama or Qwen variants). This approach allows it to process raw 3D geometric data and translate it into structured semantic representations, such as walls, doors, windows, and semantically categorized objects with precise orientation. The advantage lies in its ability to handle noisy, real-world data from monocular sources, bridging the gap between raw geometry and high-level scene understanding without requiring specialized capture equipment.
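To make the idea of "semantically categorized objects with precise orientation" concrete, the sketch below models one oriented object box in Python. The class name, field names, and units are illustrative assumptions, not SpatialLM's actual output schema.

```python
import math
from dataclasses import dataclass


@dataclass
class OrientedBBox:
    """Illustrative record for a detected object (not SpatialLM's real schema)."""
    label: str                          # semantic category, e.g. "chair"
    center: tuple[float, float, float]  # position in the scene frame (assumed metres)
    size: tuple[float, float, float]    # width, depth, height
    yaw: float                          # rotation about the vertical axis, radians

    def corners_2d(self) -> list[tuple[float, float]]:
        """Footprint corners of the box projected onto the floor plane."""
        cx, cy, _ = self.center
        w, d, _ = self.size
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        return [
            (cx + c * dx - s * dy, cy + s * dx + c * dy)
            for dx, dy in ((w / 2, d / 2), (-w / 2, d / 2),
                           (-w / 2, -d / 2), (w / 2, -d / 2))
        ]


# Example: a chair rotated 90 degrees about the vertical axis.
chair = OrientedBBox("chair", (1.0, 2.0, 0.45), (0.5, 0.5, 0.9), math.pi / 2)
```

Walls, doors, and windows would carry analogous geometric fields (endpoints, height, thickness) alongside their semantic labels.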
Quick Start & Requirements
Install dependencies with poetry install, then build the sparse convolution backend with poe install-torchsparse; sparsehash and poetry must be available beforehand. Run inference with python inference.py --point_cloud <path> --output <path> --model_path <model_name>. To inspect results, convert the output with python visualize.py and view it with rerun.
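For scripted or batch runs, the documented CLI can be wrapped from Python. This is a minimal sketch that assumes only the three flags shown above; the wrapper itself is hypothetical and not part of the project.

```python
import subprocess


def build_inference_cmd(point_cloud: str, output: str, model_path: str) -> list[str]:
    # Mirrors the documented inference.py invocation.
    return [
        "python", "inference.py",
        "--point_cloud", point_cloud,
        "--output", output,
        "--model_path", model_path,
    ]


def run_inference(point_cloud: str, output: str, model_path: str) -> None:
    # Hypothetical convenience wrapper; assumes inference.py sits in the
    # current working directory and the environment was set up via poetry.
    subprocess.run(build_inference_cmd(point_cloud, output, model_path), check=True)
```

A loop over build_inference_cmd/run_inference makes it straightforward to process a directory of point clouds with the same model.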
Highlighted Details
Visualization is built on rerun.
Maintenance & Community
The project is developed by the ManyCore Research Team. Key dependencies include Llama3.2, Qwen2.5, Transformers, SceneScript, and TorchSparse.
Licensing & Compatibility
The SceneScript encoder, a core component, is licensed under CC-BY-NC-4.0, which prohibits commercial use of the models.
Limitations & Caveats
The project's test set is derived from monocular RGB videos, which are noted as more challenging than clean RGBD scans due to noise and occlusions.