LL3DA by Open3DA

Large Language 3D Assistant for visual and textual interactions in 3D environments

created 1 year ago
299 stars

Top 90.0% on sourcepulse

Project Summary

LL3DA is a Large Language 3D Assistant designed for omni-3D understanding, reasoning, and planning in complex 3D environments. It targets researchers and developers working with 3D vision-language models, offering direct point cloud input processing to overcome the limitations of 2D feature projection methods.

How It Works

LL3DA directly processes point cloud data, a permutation-invariant 3D representation, to comprehend and respond to textual instructions and visual prompts. This approach avoids the computational overhead and performance degradation associated with projecting 2D features into 3D space, enabling more accurate understanding and disambiguation in cluttered scenes.
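
To make "permutation-invariant" concrete, here is a minimal sketch of an order-independent reduction over a point set. It is not LL3DA's scene encoder: the function name `global_max_pool` and the sample coordinates are assumptions made up for illustration. It only shows that a symmetric operation such as a channel-wise max gives the same result however the points are ordered.

```python
import random


def global_max_pool(points):
    """Reduce an unordered set of (x, y, z) points to one 3-value feature.

    A channel-wise max is a symmetric function, so its result does not
    depend on the order in which the points are listed.
    """
    return tuple(max(p[axis] for p in points) for axis in range(3))


# A tiny hypothetical point cloud (values are illustrative only).
cloud = [(0.1, 2.0, -1.0), (1.5, 0.3, 0.7), (-0.4, 0.9, 3.2), (0.8, -1.1, 0.0)]

# Reorder the same points with a fixed seed so the run is repeatable.
shuffled = cloud[:]
random.Random(0).shuffle(shuffled)

feature_original = global_max_pool(cloud)
feature_shuffled = global_max_pool(shuffled)

# Both orderings yield the same pooled feature.
print(feature_original == feature_shuffled)
```

A real point cloud encoder applies learned per-point layers before a pooling step like this one; the pooling is what removes the dependence on point order.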

Quick Start & Requirements

  • Install: pointnet2 and an accelerated GIoU implementation must be compiled manually from source.
  • Dependencies: Python 3.8.16, CUDA 11.6, torch=1.13.1+cu116, transformers>=4.37.0, h5py, scipy, cython, plyfile, trimesh>=2.35.39,<2.35.40, networkx>=2.2,<2.3.
  • Data: Requires ScanNet V2 dataset, ScanRefer, Nr3D, ScanQA, and 3D-LLM datasets. Pre-processed ScanNet data is available.
  • Weights: BERT embeddings and pre-trained LLM weights (e.g., opt-1.3b) need to be downloaded.
  • Setup Time: Significant time required for data preparation and dependency compilation.
  • Links: Project Page, Arxiv Paper, YouTube, HuggingFace Demo (WIP).
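
For readers unfamiliar with the GIoU (generalized intersection over union) metric mentioned in the install step, the following is a plain-Python reference sketch for axis-aligned 3D boxes. It is an assumption-laden illustration, not the repository's compiled extension, which may handle more general box types: the function names `box_volume` and `giou_3d` and the `(x1, y1, z1, x2, y2, z2)` box layout are hypothetical, and both boxes are assumed to have positive volume.

```python
def box_volume(box):
    """Volume of an axis-aligned box (x1, y1, z1, x2, y2, z2); 0 if empty."""
    x1, y1, z1, x2, y2, z2 = box
    return max(x2 - x1, 0.0) * max(y2 - y1, 0.0) * max(z2 - z1, 0.0)


def giou_3d(a, b):
    """Generalized IoU of two axis-aligned 3D boxes with positive volume.

    GIoU = IoU - (hull - union) / hull, where hull is the smallest
    axis-aligned box enclosing both inputs. The value lies in (-1, 1].
    """
    # Overlap region: largest minimum corner, smallest maximum corner.
    inter = (max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
             min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]))
    inter_vol = box_volume(inter)
    union_vol = box_volume(a) + box_volume(b) - inter_vol

    # Smallest enclosing box: smallest minimum corner, largest maximum corner.
    hull = (min(a[0], b[0]), min(a[1], b[1]), min(a[2], b[2]),
            max(a[3], b[3]), max(a[4], b[4]), max(a[5], b[5]))
    hull_vol = box_volume(hull)

    iou = inter_vol / union_vol
    return iou - (hull_vol - union_vol) / hull_vol


unit = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
far = (2.0, 0.0, 0.0, 3.0, 1.0, 1.0)

print(giou_3d(unit, unit))  # identical boxes
print(giou_3d(unit, far))   # disjoint boxes give a negative score
```

Unlike plain IoU, which is zero for any pair of non-overlapping boxes, GIoU keeps decreasing as the boxes move apart, which is why it is commonly used as a box regression signal.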

Highlighted Details

  • Achieves state-of-the-art results on 3D Dense Captioning and 3D Question Answering benchmarks.
  • Supports various decoder-only LLMs including OPT, GPT-2, Llama-2, and Qwen.
  • Provides training and evaluation scripts for generalist models and task-specific fine-tuning (ScanQA, ScanRefer, Nr3D, OVDet).
  • Code released for training customized models.

Maintenance & Community

  • Code fully released in March 2024. Accepted to CVPR 2024.
  • Pre-trained weights available on HuggingFace.

Licensing & Compatibility

  • MIT License. Permits commercial use and closed-source linking.

Limitations & Caveats

  • The released version has minor differences from the paper's implementation; specific scripts are provided to reproduce reported results.
  • A local demo interface is still under development.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 14 stars in the last 90 days
