3D-LLM by UMass-Embodied-AGI

3D-LLM injects 3D data into large language models

created 2 years ago
1,123 stars

Top 34.8% on sourcepulse

Project Summary

3D-LLM is a large language model that takes 3D object and scene data as input, which lets it reason about spatial information. It targets researchers and developers working in 3D computer vision and multimodal AI, and serves as a foundation for 3D-aware reasoning and generation tasks.

How It Works

3D-LLM feeds 3D world representations into an LLM through a multimodal pipeline. It takes 3D data (point clouds, scene graphs) and converts it into a form the LLM can consume, likely through feature extraction and projection similar to existing vision-language models. The LLM can then reason about spatial relationships, object properties, and scene semantics.
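
The projection idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names (`project_points`, `build_llm_input`), the toy dimensions, and the random weights are all hypothetical, and a real model would learn the projection and run on tensors rather than Python lists.

```python
import random


def project_points(point_feats, weight, bias):
    """Linearly map each per-point feature vector to the LLM embedding width.

    `weight` holds one row per output dimension (hypothetical layout).
    """
    projected = []
    for feat in point_feats:
        row = [
            sum(f * w for f, w in zip(feat, w_row)) + b
            for w_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected


def build_llm_input(point_feats, text_embeds, weight, bias):
    """Prepend the projected 3D tokens to the text token embeddings."""
    return project_points(point_feats, weight, bias) + text_embeds


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
feat_dim, llm_dim = 4, 6
num_points, num_text_tokens = 5, 3

point_feats = [[rng.random() for _ in range(feat_dim)] for _ in range(num_points)]
text_embeds = [[rng.random() for _ in range(llm_dim)] for _ in range(num_text_tokens)]
weight = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(llm_dim)]
bias = [0.0] * llm_dim

sequence = build_llm_input(point_feats, text_embeds, weight, bias)
print(len(sequence), len(sequence[0]))  # 8 6
```

The result is a single sequence in which the first five entries stand in for 3D tokens and the last three are the untouched text embeddings, all sharing the LLM's embedding width.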

Quick Start & Requirements

  • Installation: Requires conda environment setup and installation of salesforce-lavis and positional_encodings.
  • Prerequisites: Python 3.8, PyTorch, and potentially CUDA for GPU acceleration. Specific data downloads (Objaverse, ScanNet, HM3D) and pre-trained checkpoints are necessary.
  • Resources: Significant disk space for datasets (~250GB for scene data) and computational resources for inference and fine-tuning.
  • Links: Salesforce-LAVIS, DEMO.md
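
Before downloading datasets, it can help to check the prerequisites listed above. The script below is a small sketch, not part of the repository: `check_environment` is a hypothetical helper, the import names `lavis` and `positional_encodings` are assumed to be what the two packages install, and the 250 GB threshold simply reuses the scene-data estimate from this summary.

```python
import importlib.util
import shutil
import sys

# Import names assumed for PyTorch, salesforce-lavis, and positional_encodings.
REQUIRED_MODULES = ("torch", "lavis", "positional_encodings")
SCENE_DATA_GB = 250  # rough scene-data footprint quoted in this summary


def check_environment(data_dir="."):
    """Return a dict mapping each prerequisite to True (met) or False (not met)."""
    free_bytes = shutil.disk_usage(data_dir).free
    report = {
        "python_3_8": sys.version_info[:2] == (3, 8),
        "free_disk_ok": free_bytes >= SCENE_DATA_GB * 1024**3,
    }
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'not met'}")
```

Pass the directory that will hold the datasets as `data_dir` so the disk check runs against the right volume.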

Highlighted Details

  • Presented by its authors as the first LLM to take 3D representations as input.
  • Supports both object (Objaverse) and scene (ScanNet, HM3D) data.
  • Pre-trained and fine-tuned checkpoints are released for several tasks (ScanQA, SQA3D, 3DMV_VQA).
  • Detailed instructions for generating 3D features from raw data using tools like Blender, ChatCaptioner, Mask2Former, and Segment Anything.

Maintenance & Community

  • Project associated with UMass Embodied AGI.
  • The accompanying paper is listed as a NeurIPS 2023 Spotlight.
  • The acknowledgements credit several open-source projects that the code builds on.

Licensing & Compatibility

  • The README does not explicitly state a license for the repository. It relies heavily on salesforce-lavis, which is released under the permissive BSD 3-Clause license. Users should verify the licensing of all components and datasets before use.

Limitations & Caveats

  • Some data generation steps and features are marked as "TODO" or "still cleaning," indicating ongoing development.
  • Hugging Face auto-loading for checkpoints is also a TODO item.
  • The data generation process is complex, involving multiple external tools and significant computational effort.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 56 stars in the last 90 days
