AnjieCheng/SpatialRGPT: Grounded spatial reasoning for vision-language models
SpatialRGPT adds grounded spatial reasoning to vision-language models (VLMs): it reasons over 2D and 3D spatial arrangements and accepts region proposals (boxes or masks) as input, enabling the model to answer intricate spatial queries about a scene. It targets researchers and developers who need stronger scene-understanding capabilities.
How It Works
This model integrates depth estimation (via Depth-Anything) and advanced segmentation (SAM-HQ) to ground spatial understanding. It processes arbitrary region proposals, allowing for detailed analysis of object relationships and spatial configurations within images. The architecture builds upon VILA training methodologies, enhancing its foundational VLM capabilities with specialized spatial reasoning skills.
Quick Start & Requirements
Installation requires setting up separate Conda environments for training (srgpt) and the Gradio demo (srgpt-demo). Key dependencies include Gradio, DeepSpeed, Detectron2 (requiring CUDA_HOME configuration), Depth-Anything (requiring checkpoint download and path export), and SAM-HQ (requiring checkpoint download and path export). The demo can be launched via python gradio_web_server_multi.py --model-path PATH_TO_CHECKPOINT after environment setup. Training scripts are available for different LLM backbones (e.g., Llama3 8B). Users need to download the Open Spatial Dataset and potentially OpenImagesV7.
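A condensed sketch of those setup steps is shown below. The Python version, checkpoint filenames, and the environment-variable names for the Depth-Anything and SAM-HQ paths are placeholders; the exact values should be taken from the project's README.

# Demo environment (the project uses separate envs: srgpt for training, srgpt-demo for the demo)
conda create -n srgpt-demo python=3.10   # Python version shown here is illustrative
conda activate srgpt-demo

# Download the Depth-Anything and SAM-HQ checkpoints from their upstream repositories,
# then export their paths (variable names below are placeholders; use the ones in the README)
export DEPTH_ANYTHING_CKPT=/path/to/depth_anything.pth
export SAM_HQ_CKPT=/path/to/sam_hq.pth

# Launch the Gradio demo against a downloaded SpatialRGPT checkpoint
python gradio_web_server_multi.py --model-path PATH_TO_CHECKPOINT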
Maintenance & Community
The code, dataset, and benchmark were released on October 7, 2024; at the time of this summary the repository had seen no activity for roughly 10 months and is marked inactive. The project acknowledges contributions from several other repositories, including VILA, Omni3D, GLaMM, VQASynth, and ConceptGraphs. No community channels (e.g., Discord, Slack) or roadmap links are provided in the README.
Licensing & Compatibility
The license type and compatibility notes for commercial use or closed-source linking are not specified in the provided README.
Limitations & Caveats
The Gradio demo environment has known pydantic version conflicts, making it incompatible with the training environment. The project notes that recent package updates may introduce bugs, and users are encouraged to report issues. Detectron2 installation may require manual configuration of the CUDA_HOME environment variable.
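If the Detectron2 build fails, exporting CUDA_HOME before installing usually resolves it; the toolkit path below is an example and should match the locally installed CUDA version.

# Point CUDA_HOME at the local CUDA toolkit before installing Detectron2
export CUDA_HOME=/usr/local/cuda   # adjust to the installed toolkit location
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'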