SpatialRGPT by AnjieCheng

Grounded spatial reasoning for vision-language models

Created 1 year ago
273 stars

Top 94.5% on SourcePulse

Project Summary

SpatialRGPT brings grounded spatial reasoning to vision-language models (VLMs): it reasons over 2D and 3D spatial arrangements and accepts region proposals (boxes, masks), enabling VLMs to answer complex spatial queries. This benefits researchers and developers who need stronger scene-understanding capabilities.

How It Works

The model grounds spatial understanding by combining depth estimation (Depth-Anything) with high-quality segmentation (SAM-HQ). It accepts arbitrary region proposals, enabling detailed analysis of object relationships and spatial configurations within an image. The architecture builds on VILA's training methodology, extending a strong foundational VLM with specialized spatial reasoning skills.

Quick Start & Requirements

Installation uses two separate Conda environments: srgpt for training and srgpt-demo for the Gradio demo. Key dependencies:

  • Gradio and DeepSpeed.
  • Detectron2, which needs CUDA_HOME configured before installation.
  • Depth-Anything and SAM-HQ, each requiring a checkpoint download and a path export.

Once the environments are set up, the demo launches via python gradio_web_server_multi.py --model-path PATH_TO_CHECKPOINT. Training scripts are provided for several LLM backbones (e.g., Llama3 8B); training also requires downloading the Open Spatial Dataset and, depending on the workflow, OpenImagesV7. A condensed setup sketch follows.
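A minimal shell sketch of these steps, assuming Linux with Conda available; the Python version and the exported variable names (DEPTH_ANYTHING_CKPT, SAM_HQ_CKPT) are illustrative assumptions rather than the repository's exact names:

    # Two environments, since the demo's pydantic pins conflict with training.
    conda create -n srgpt python=3.10 -y         # Python version assumed
    conda create -n srgpt-demo python=3.10 -y

    # Download the Depth-Anything and SAM-HQ checkpoints, then export their
    # paths (variable names here are placeholders, not the repo's exact ones).
    export DEPTH_ANYTHING_CKPT=/path/to/depth_anything.pth
    export SAM_HQ_CKPT=/path/to/sam_hq.pth

    # Launch the Gradio demo from the demo environment.
    conda activate srgpt-demo
    python gradio_web_server_multi.py --model-path PATH_TO_CHECKPOINT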

Highlighted Details

  • Accepted to NeurIPS 2024.
  • Supports both 2D and 3D spatial reasoning.
  • Processes arbitrary region proposals (boxes, masks).
  • Leverages Depth-Anything for depth estimation and SAM-HQ for segmentation.
  • Compatible with VILA evaluation scripts and includes a dataset synthesis pipeline.

Maintenance & Community

The code, dataset, and benchmark were released on October 7, 2024. The project acknowledges contributions from several other repositories, including VILA, Omni3D, GLaMM, VQASynth, and ConceptGraphs. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The license type and compatibility notes for commercial use or closed-source linking are not specified in the provided README.

Limitations & Caveats

The Gradio demo environment has known pydantic version conflicts, making it incompatible with the training environment. The project notes that recent package updates may introduce bugs, and users are encouraged to report issues. Detectron2 installation may require manual configuration of the CUDA_HOME environment variable.
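For the Detectron2 issue specifically, pointing CUDA_HOME at the local CUDA toolkit before installing is usually sufficient; the toolkit path below is an example to adjust for your system:

    # Detectron2 compiles CUDA ops at install time and reads CUDA_HOME.
    export CUDA_HOME=/usr/local/cuda
    pip install 'git+https://github.com/facebookresearch/detectron2.git'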

Health Check

  • Last commit: 10 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 30 days
