3D scene understanding from video using vision-language models
Top 75.2% on sourcepulse
GPT4Scene enables understanding of 3D scenes from videos using vision-language models. It targets researchers and developers in computer vision and natural language processing, offering a novel approach to scene comprehension by integrating large language models with 3D scene data.
How It Works
GPT4Scene leverages the Qwen2-VL-7B-Instruct model, fine-tuned on a custom dataset for 3D scene understanding. The approach involves processing video frames and associated 3D scene information to generate descriptive text about the scene, facilitating tasks like visual question answering and scene description.
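As a concrete illustration, the sketch below runs video-conditioned inference with the base Qwen2-VL-7B-Instruct checkpoint through Hugging Face transformers. The video path, the prompt, and the use of the base checkpoint (rather than the project's fine-tuned weights) are assumptions for illustration, not the project's exact pipeline.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # packs image/video inputs for Qwen2-VL

# Load the base checkpoint; GPT4Scene fine-tunes this model, so the
# project's released weights would be substituted here.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# A single chat turn pairing an indoor-scene video with a question.
# "scene.mp4" and the prompt text are placeholders.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "scene.mp4"},
        {"type": "text", "text": "Describe the objects in this room and their spatial layout."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding the answer.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```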
Quick Start & Requirements
Install the package with pip install -e ".[torch,metrics]". Additional dependencies include qwen_vl_utils and flash-attn. The data can be downloaded with python download.py.
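If download.py wraps Hugging Face Hub downloads, a minimal equivalent could look like the following; the dataset repository ID is a hypothetical placeholder, not the project's actual dataset name.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo ID -- substitute the dataset link from the
# project's Hugging Face page.
snapshot_download(repo_id="your-org/gpt4scene-data", repo_type="dataset",
                  local_dir="data")

# The base model weights can be fetched the same way.
snapshot_download(repo_id="Qwen/Qwen2-VL-7B-Instruct",
                  local_dir="models/qwen2-vl")
```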
Highlighted Details
Inference runs via evaluate/infer.sh, and training via the provided bash scripts.
Maintenance & Community
The project is associated with researchers from The University of Hong Kong and Shanghai AI Laboratory. Links to relevant datasets and models are provided on Hugging Face.
Licensing & Compatibility
Licensed under the Apache-2.0 License. This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
The installation instructions note that the PyTorch download can fail, in which case PyTorch must be installed manually. For training, it is recommended to start with the GPU disabled until the tokenizer has been processed, then re-enable it.
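One common way to honor that caveat, assuming the training scripts respect the CUDA_VISIBLE_DEVICES environment variable, is to hide the GPUs for the first preprocessing run:

```python
import os

# Hide all GPUs so the first run performs tokenizer/dataset preprocessing
# on CPU; this must be set before torch or the training entrypoint is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# ...launch training as usual; once preprocessing is cached,
# remove this override and rerun with the GPUs visible.
```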