3D scene understanding from video using vision-language models
Top 75.2% on sourcepulse
GPT4Scene enables understanding of 3D scenes from videos using vision-language models. It targets researchers and developers in computer vision and natural language processing, offering a novel approach to scene comprehension by integrating large language models with 3D scene data.
How It Works
GPT4Scene leverages the Qwen2-VL-7B-Instruct model, fine-tuned on a custom dataset for 3D scene understanding. The approach involves processing video frames and associated 3D scene information to generate descriptive text about the scene, facilitating tasks like visual question answering and scene description.
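As a concrete illustration, the sketch below runs video-conditioned inference with the base Qwen2-VL-7B-Instruct checkpoint through Hugging Face transformers. The video path, the prompt, and the use of the base checkpoint (rather than the project's fine-tuned weights) are assumptions for illustration, not the project's exact pipeline.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # packs image/video inputs for Qwen2-VL

# Load the base checkpoint; GPT4Scene fine-tunes this model, so the
# project's released weights would be substituted here.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# A single chat turn pairing an indoor-scene video with a question.
# "scene.mp4" and the prompt text are placeholders.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "scene.mp4"},
        {"type": "text", "text": "Describe the objects in this room and their spatial layout."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding the answer.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```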
Quick Start & Requirements
Install the package with pip install -e ".[torch,metrics]". Additional dependencies include qwen_vl_utils and flash-attn. The data can be downloaded with python download.py.
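If download.py wraps Hugging Face Hub downloads, a minimal equivalent could look like the following; the dataset repository ID is a hypothetical placeholder, not the project's actual dataset name.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo ID -- substitute the dataset link from the
# project's Hugging Face page.
snapshot_download(repo_id="your-org/gpt4scene-data", repo_type="dataset",
                  local_dir="data")

# The base model weights can be fetched the same way.
snapshot_download(repo_id="Qwen/Qwen2-VL-7B-Instruct",
                  local_dir="models/qwen2-vl")
```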
Highlighted Details
Inference runs via evaluate/infer.sh, and training via the provided bash scripts.
Maintenance & Community
The project is associated with researchers from The University of Hong Kong and Shanghai AI Laboratory. Links to relevant datasets and models are provided on Hugging Face.
Licensing & Compatibility
Licensed under the Apache-2.0 License. This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
The installation instructions note that the PyTorch download can fail, in which case PyTorch must be installed manually. For training, it is recommended to start with the GPU disabled until the tokenizer has been processed, then re-enable it.
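One common way to honor that caveat, assuming the training scripts respect the CUDA_VISIBLE_DEVICES environment variable, is to hide the GPUs for the first preprocessing run:

```python
import os

# Hide all GPUs so the first run performs tokenizer/dataset preprocessing
# on CPU; this must be set before torch or the training entrypoint is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# ...launch training as usual; once preprocessing is cached,
# remove this override and rerun with the GPUs visible.
```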