3D-R1 by AIGeeksGroup

3D scene understanding model

Created 7 months ago

396 stars

Top 73.0% on SourcePulse

Project Summary

3D-R1 is a foundation model designed to enhance the reasoning capabilities of 3D Vision-Language Models (VLMs) for unified scene understanding. It addresses the limitations of current 3D VLMs in robust reasoning and generalization, offering a solution for researchers and practitioners working with 3D spatial data.

How It Works

3D-R1 combines three components to improve 3D scene understanding. First, it builds Scene-30K, a synthetic dataset with Chain-of-Thought (CoT) reasoning generated using Gemini 2.5 Pro. Second, it applies an RLHF-style reinforcement learning policy, Group Relative Policy Optimization (GRPO), guided by three reward functions: perception, semantic similarity, and format. Third, a dynamic view selection strategy adaptively chooses the most informative perspectives of a scene.
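To make the reward-guided training step more concrete, here is a minimal sketch of how three reward terms might be combined and turned into group-relative advantages, which is the normalization GRPO is generally known for. It is not the project's implementation: the function names, the equal reward weights, the `<think>`/`<answer>` tag convention, and the use of the population standard deviation are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def format_reward(response: str) -> float:
    """Return 1.0 if the response is '<think>...</think><answer>...</answer>'.

    The tag names are an assumed output convention, not taken from the source.
    """
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def total_reward(
    perception: float,
    similarity: float,
    fmt: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # hypothetical weights
) -> float:
    """Weighted sum of the perception, semantic similarity, and format rewards."""
    w_p, w_s, w_f = weights
    return w_p * perception + w_s * similarity + w_f * fmt


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three sampled responses with made-up perception and similarity scores.
    responses = [
        "<think>The chair is left of the table.</think><answer>chair</answer>",
        "chair",
        "<think>Unsure.</think><answer>sofa</answer>",
    ]
    perception = [0.8, 0.8, 0.1]
    similarity = [0.9, 0.9, 0.2]
    rewards = [
        total_reward(p, s, format_reward(r))
        for p, s, r in zip(perception, similarity, responses)
    ]
    print(group_relative_advantages(rewards))
```

Responses that score above their group's mean receive a positive advantage and are reinforced; the format term penalizes otherwise correct answers that ignore the expected output structure.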

Quick Start & Requirements

Installation: Requires h5py, scipy, cython, plyfile, trimesh, networkx, torch (2.0.1+cu118), google-generativeai, peft, transformers, accelerate, tqdm, and orjson, plus git-based installs of CLIP and Depth-Anything. PointNet++ and the accelerated GIoU operator must be built from source.
Prerequisites: The tested environment is CUDA 11.8 with Python 3.9.16. Data preparation involves downloading the ScanNetV2, ScanRefer, Nr3D, ScanQA, and 3D-LLM datasets; synthesizing Scene-30K is optional. Pre-trained Qwen2.5-VL-7B-Instruct weights are also required.
Resources: Setup involves data downloading and compilation steps. Training scripts for SFT and RL are provided.
Links: Paper, Website, Data, Models, HF Paper, YouTube Video.
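Because the dependency list is long, a quick pre-flight check can save time before building the compiled extensions. The sketch below reports which of the listed packages are not importable in the current environment. The helper name `missing_modules` is hypothetical, and the import names are assumptions derived from the package names above (for example, the `cython` package imports as `Cython`, and `google-generativeai` as `google.generativeai`).

```python
import importlib.util
import sys
from typing import Iterable, List

# Import names assumed from the package list above; adjust if the README differs.
REQUIRED_MODULES = [
    "h5py", "scipy", "Cython", "plyfile", "trimesh", "networkx", "torch",
    "google.generativeai", "peft", "transformers", "accelerate", "tqdm", "orjson",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. 'google') is absent or malformed.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    version = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {version} (the summary lists 3.9.16 as tested)")
    absent = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(absent) if absent else "none")
```

This only confirms that the modules can be found; it does not verify versions, the CUDA build of torch, or the source-built PointNet++ and GIoU extensions.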

Highlighted Details

Achieves an average improvement of 10% across various 3D scene benchmarks.
Introduces Scene-30K, a high-quality synthetic dataset with CoT reasoning.
Employs GRPO with perception, semantic similarity, and format rewards for enhanced reasoning.
Features a dynamic view selection strategy for adaptive perspective selection.
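As a rough illustration of what adaptive view selection can look like, the sketch below scores each candidate view as a weighted sum of per-view cues and keeps the top-k. The summary does not describe the project's actual scoring, so the cue values, the weights, and the function names `score_views` and `select_views` are hypothetical.

```python
from typing import List, Sequence


def score_views(cues: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Combine the cue values of each view into one score via a weighted sum."""
    return [sum(w * c for w, c in zip(weights, view_cues)) for view_cues in cues]


def select_views(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring views, best first."""
    if k <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Each row holds made-up cues for one rendered view, e.g. relevance and coverage.
    cues = [(0.2, 0.1), (0.9, 0.7), (0.5, 0.6), (0.1, 0.9)]
    weights = (0.7, 0.3)  # hypothetical; in a learned setup these would be trained
    scores = score_views(cues, weights)
    print(select_views(scores, k=2))
```

Keeping scoring and selection separate means the scoring function can be swapped for a learned model without changing how views are picked.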

Maintenance & Community

The project is associated with Ting Huang, Zeyu Zhang, and Hao Tang. Further community engagement details (like Discord/Slack) are not specified in the README.

Licensing & Compatibility

The README does not explicitly state the license type or compatibility for commercial use.

Limitations & Caveats

The project acknowledges a bounding box drift issue in visualizations, which is currently being addressed. A detailed visualization tutorial and a Hugging Face demo are planned but not yet released.

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Star History

7 stars in the last 30 days