GPT4Scene-and-VLN-R1  by Qi-Zhangyang

3D scene understanding from video using vision-language models

created 6 months ago
387 stars

Top 75.2% on sourcepulse

GitHubView on GitHub
Project Summary

GPT4Scene enables understanding of 3D scenes from videos using vision-language models. It targets researchers and developers in computer vision and natural language processing, offering a novel approach to scene comprehension by integrating large language models with 3D scene data.

How It Works

GPT4Scene leverages the Qwen2-VL-7B-Instruct model, fine-tuned on a custom dataset for 3D scene understanding. The approach involves processing video frames and associated 3D scene information to generate descriptive text about the scene, facilitating tasks like visual question answering and scene description.

Quick Start & Requirements

  • Install via pip install -e ".[torch,metrics]".
  • Requires Python 3.10, PyTorch 2.5.0 with CUDA 12.1, qwen_vl_utils, and flash-attn.
  • Dataset and model weights can be downloaded using python download.py.
  • Official documentation and model weights are available on Huggingface.

Highlighted Details

  • Utilizes the Qwen2-VL-7B-Instruct model for multimodal understanding.
  • Fine-tuned weights for GPT4Scene are available.
  • Supports inference via evaluate/infer.sh and training via provided bash scripts.
  • Dataset includes annotations for 3D scene understanding tasks.

Maintenance & Community

The project is associated with researchers from The University of Hong Kong and Shanghai AI Laboratory. Links to relevant datasets and models are provided on Huggingface.

Licensing & Compatibility

Licensed under the Apache-2.0 License. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The installation instructions note potential PyTorch download errors, requiring manual installation. Training is recommended to start with GPU disabled initially until the tokenizer is processed.

Health Check
Last commit

4 weeks ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
5
Star History
133 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.