SpatialLM by manycore-research

3D LLM for structured indoor modeling from point clouds

created 4 months ago
3,554 stars

Top 13.9% on sourcepulse

Project Summary

SpatialLM is a 3D large language model designed for structured indoor scene understanding from diverse 3D point cloud data. It targets researchers and developers in robotics and 3D computer vision, enabling the extraction of architectural elements and oriented object bounding boxes from unstructured point clouds generated by monocular video, RGBD, or LiDAR sensors.

How It Works

SpatialLM employs a multimodal architecture that integrates a point cloud encoder (SceneScript) with large language models (Llama or Qwen variants). This approach allows it to process raw 3D geometric data and translate it into structured semantic representations, such as walls, doors, windows, and semantically categorized objects with precise orientation. The advantage lies in its ability to handle noisy, real-world data from monocular sources, bridging the gap between raw geometry and high-level scene understanding without requiring specialized capture equipment.
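
The key idea is that the LLM emits scene structure as text that downstream tools parse back into typed geometry. The sketch below is a minimal, self-contained illustration of that round trip; the Wall/Bbox entity names, fields, and line grammar are invented for illustration and do not reproduce the project's actual output schema, which is defined in the repository.

    # Hypothetical sketch: parse an LLM-generated structured layout into
    # typed scene elements. The grammar and field names here are assumptions;
    # SpatialLM's real schema is defined in the repository code.
    from dataclasses import dataclass

    @dataclass
    class Wall:
        ax: float; ay: float; az: float   # start corner (meters)
        bx: float; by: float; bz: float   # end corner
        height: float
        thickness: float

    @dataclass
    class Bbox:
        category: str                     # e.g. "sofa", "table"
        x: float; y: float; z: float      # box center
        yaw: float                        # orientation about the z-axis
        sx: float; sy: float; sz: float   # box extents

    def parse_layout(text: str) -> list:
        """Parse lines like 'wall_0=Wall(0,0,0,4,0,0,2.6,0.1)'."""
        scene = []
        for line in text.strip().splitlines():
            _, expr = line.split("=", 1)
            kind, args = expr.rstrip(")").split("(", 1)
            fields = [f.strip() for f in args.split(",")]
            if kind == "Wall":
                scene.append(Wall(*map(float, fields)))
            elif kind == "Bbox":
                scene.append(Bbox(fields[0], *map(float, fields[1:])))
        return scene

    example = """\
    wall_0=Wall(0,0,0,4,0,0,2.6,0.1)
    bbox_0=Bbox(sofa,1.2,0.8,0.4,1.57,1.8,0.9,0.8)
    """
    print(parse_layout(example))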

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.11 and CUDA 12.4, install dependencies via poetry install, then run poe install-torchsparse (the full command sequence is sketched after this list).
  • Prerequisites: Python 3.11, PyTorch 2.4.1, CUDA 12.4, sparsehash, poetry.
  • Inference: Download example point clouds from HuggingFace. Run inference using python inference.py --point_cloud <path> --output <path> --model_path <model_name>.
  • Visualization: Convert the output with python visualize.py, then view it with rerun.
  • Links: HuggingFace Models, SpatialLM-Testset.
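
The steps above, collected into one command sequence. The conda/poetry/poe commands and the inference flags come from this summary; the CUDA package spec, the example file names, the model identifier, and the visualize.py flags are assumptions, so defer to the repository README for exact invocations.

    # Environment: Python 3.11 + CUDA 12.4 (one possible conda setup; the
    # cuda-toolkit channel spec is an assumption)
    conda create -n spatiallm python=3.11
    conda activate spatiallm
    conda install -y nvidia/label/cuda-12.4.0::cuda-toolkit conda-forge::sparsehash
    poetry install                 # project dependencies
    poe install-torchsparse        # builds the TorchSparse extension

    # Inference on an example point cloud (flags as documented above;
    # the .ply/.txt file names and model path are placeholders)
    python inference.py \
        --point_cloud scene0000_00.ply \
        --output scene0000_00.txt \
        --model_path manycore-research/SpatialLM-Llama-1B

    # Visualization with rerun (visualize.py flags are assumptions)
    python visualize.py --point_cloud scene0000_00.ply \
        --layout scene0000_00.txt --save scene0000_00.rrd
    rerun scene0000_00.rrd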

Highlighted Details

  • Achieves competitive benchmark results on the challenging SpatialLM-Testset, which consists of point clouds reconstructed from monocular RGB videos.
  • Supports two model variants: SpatialLM-Llama-1B and SpatialLM-Qwen-0.5B.
  • Provides scripts for inference, evaluation, and visualization using rerun.
  • Includes an example for processing custom videos using SLAM3R.

Maintenance & Community

The project is developed by the ManyCore Research Team. Key dependencies include Llama3.2, Qwen2.5, Transformers, SceneScript, and TorchSparse.

Licensing & Compatibility

  • SpatialLM-Llama-1B is under the Llama3.2 license.
  • SpatialLM-Qwen-0.5B is under the Apache 2.0 License.
  • SceneScript encoder is CC-BY-NC-4.0, which restricts commercial use.
  • TorchSparse is MIT.

Limitations & Caveats

The CC-BY-NC-4.0 license on the SceneScript encoder, a core component, prohibits commercial use of the models. The test set is reconstructed from monocular RGB videos, which are noisier and more prone to occlusion than clean RGBD scans, making the benchmark notably more challenging.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 15
  • Star History: 444 stars in the last 90 days
