3D-LLM is a novel Large Language Model capable of processing 3D object and scene data, enabling a deeper understanding of spatial information. It targets researchers and developers working with 3D computer vision and multimodal AI, offering a foundation for advanced 3D-aware reasoning and generation tasks.
## How It Works
3D-LLM integrates 3D world representations into LLMs via a multimodal approach: it processes 3D data (point clouds, scene graphs) and converts it into a format the LLM can consume, likely through feature extraction and projection techniques similar to those in existing vision-language models. This allows the LLM to reason about spatial relationships, object properties, and scene semantics.
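The feature-extraction-and-projection idea described above can be sketched as a minimal PyTorch module. This is an illustrative sketch, not the repository's actual architecture: the module name, dimensions, and the learnable-query pooling (similar in spirit to the Q-Former in BLIP-2/LAVIS) are all assumptions.

```python
import torch
import torch.nn as nn

class Feature3DProjector(nn.Module):
    """Hypothetical sketch: pool per-point 3D features into a fixed number
    of tokens and project them into the LLM's embedding space."""

    def __init__(self, feat_dim=1408, llm_dim=4096, num_tokens=32):
        super().__init__()
        # Learnable queries attend over a variable-size point set,
        # producing a fixed-length token sequence for the LLM.
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, point_feats):
        # point_feats: (batch, num_points, feat_dim)
        b = point_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, point_feats, point_feats)  # cross-attention pooling
        return self.proj(pooled)  # (batch, num_tokens, llm_dim)

proj = Feature3DProjector()
tokens = proj(torch.randn(2, 1000, 1408))
print(tokens.shape)  # torch.Size([2, 32, 4096])
```

The resulting token sequence can then be concatenated with text-token embeddings before being fed to the language model, which is how comparable vision-language models inject non-text modalities.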
## Quick Start & Requirements
- Installation: Requires a `conda` environment and installation of `salesforce-lavis` and `positional_encodings`.
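A typical setup might look like the following; the environment name and Python version are illustrative, so check the repository's README for the exact commands and versions.

```shell
# Illustrative environment setup (names/versions are assumptions, not the repo's exact script)
conda create -n 3dllm python=3.8 -y
conda activate 3dllm
pip install salesforce-lavis positional_encodings
```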
- Prerequisites: Python 3.8, PyTorch, and potentially CUDA for GPU acceleration. Specific data downloads (Objaverse, ScanNet, HM3D) and pre-trained checkpoints are necessary.
- Resources: Significant disk space for datasets (~250GB for scene data) and computational resources for inference and fine-tuning.
- Links: Salesforce-LAVIS, DEMO.md
## Highlighted Details
- First LLM capable of taking 3D representations as input.
- Supports both object (Objaverse) and scene (ScanNet, HM3D) data.
- Released pre-training and fine-tuning checkpoints for various tasks (ScanQA, SQA3D, 3DMV-VQA).
- Detailed instructions for generating 3D features from raw data using tools like Blender, ChatCaptioner, Mask2Former, and Segment Anything.
## Maintenance & Community
- Project associated with UMass Embodied AGI.
- Published at NeurIPS 2023 (Spotlight).
- Acknowledgements list several key open-source projects, indicating community reliance and contribution.
## Licensing & Compatibility
- The repository itself does not explicitly state a license in its README. However, it heavily relies on and acknowledges `salesforce-lavis`, which is released under the permissive BSD 3-Clause license. Users should verify the licensing of all components and datasets before commercial use.
## Limitations & Caveats
- Some data generation steps and features are marked as "TODO" or "still cleaning," indicating ongoing development.
- Hugging Face auto-loading for checkpoints is also a TODO item.
- The data generation process is complex, involving multiple external tools and significant computational effort.