SpatialRGPT by AnjieCheng

Grounded spatial reasoning for vision-language models

Created 1 year ago
273 stars

Top 94.5% on SourcePulse

Project Summary

SpatialRGPT brings grounded spatial reasoning to vision-language models (VLMs): it reasons over 2D and 3D spatial arrangements and accepts region proposals (boxes, masks), enabling VLMs to answer complex spatial queries. This benefits researchers and developers who need stronger scene-understanding capabilities.

How It Works

The model grounds spatial understanding by combining depth estimation (Depth-Anything) with high-quality segmentation (SAM-HQ). It accepts arbitrary region proposals, enabling detailed analysis of object relationships and spatial configurations within an image. The architecture builds on VILA's training methodology, extending a strong foundational VLM with specialized spatial reasoning skills.

Quick Start & Requirements

Installation uses two separate Conda environments: srgpt for training and srgpt-demo for the Gradio demo. Key dependencies:

  • Gradio and DeepSpeed.
  • Detectron2, which needs CUDA_HOME configured before installation.
  • Depth-Anything and SAM-HQ, each requiring a checkpoint download and a path export.

Once the environments are set up, the demo launches via python gradio_web_server_multi.py --model-path PATH_TO_CHECKPOINT. Training scripts are provided for several LLM backbones (e.g., Llama3 8B); training also requires downloading the Open Spatial Dataset and, depending on the workflow, OpenImagesV7. A condensed setup sketch follows.
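A minimal shell sketch of these steps, assuming Linux with Conda available; the Python version and the exported variable names (DEPTH_ANYTHING_CKPT, SAM_HQ_CKPT) are illustrative assumptions rather than the repository's exact names:

    # Two environments, since the demo's pydantic pins conflict with training.
    conda create -n srgpt python=3.10 -y         # Python version assumed
    conda create -n srgpt-demo python=3.10 -y

    # Download the Depth-Anything and SAM-HQ checkpoints, then export their
    # paths (variable names here are placeholders, not the repo's exact ones).
    export DEPTH_ANYTHING_CKPT=/path/to/depth_anything.pth
    export SAM_HQ_CKPT=/path/to/sam_hq.pth

    # Launch the Gradio demo from the demo environment.
    conda activate srgpt-demo
    python gradio_web_server_multi.py --model-path PATH_TO_CHECKPOINT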

Highlighted Details

  • Accepted to NeurIPS 2024.
  • Supports both 2D and 3D spatial reasoning.
  • Processes arbitrary region proposals (boxes, masks).
  • Leverages Depth-Anything for depth estimation and SAM-HQ for segmentation.
  • Compatible with VILA evaluation scripts and includes a dataset synthesis pipeline.

Maintenance & Community

The code, dataset, and benchmark were released on October 7, 2024. The project acknowledges contributions from several other repositories, including VILA, Omni3D, GLaMM, VQASynth, and ConceptGraphs. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The license type and compatibility notes for commercial use or closed-source linking are not specified in the provided README.

Limitations & Caveats

The Gradio demo environment has known pydantic version conflicts, making it incompatible with the training environment. The project notes that recent package updates may introduce bugs, and users are encouraged to report issues. Detectron2 installation may require manual configuration of the CUDA_HOME environment variable.
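For the Detectron2 issue specifically, pointing CUDA_HOME at the local CUDA toolkit before installing is usually sufficient; the toolkit path below is an example to adjust for your system:

    # Detectron2 compiles CUDA ops at install time and reads CUDA_HOME.
    export CUDA_HOME=/usr/local/cuda
    pip install 'git+https://github.com/facebookresearch/detectron2.git'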

Health Check

  • Last commit: 10 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 30 days
