3D-LLM by UMass-Embodied-AGI

3D-LLM injects 3D data into large language models

Created 2 years ago
1,137 stars

Project Summary

3D-LLM is a novel Large Language Model capable of processing 3D object and scene data, enabling a deeper understanding of spatial information. It targets researchers and developers working with 3D computer vision and multimodal AI, offering a foundation for advanced 3D-aware reasoning and generation tasks.

How It Works

3D-LLM feeds 3D world representations into an LLM through a multimodal pipeline. It processes 3D data (point clouds, scene graphs) and converts it into a form the LLM can consume, likely through feature extraction and projection techniques similar to those of existing vision-language models. This lets the LLM reason about spatial relationships, object properties, and scene semantics.
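
The feature-extraction-and-projection idea can be illustrated with a minimal NumPy sketch. It is not the project's implementation: the function names, the sinusoidal position encoding, the dimensions, and the random weights standing in for a learned projection are all assumptions made for illustration. Each 3D point carries a feature vector, its coordinates are encoded and concatenated to that vector, and a linear layer maps the result to the width of the LLM's token embeddings.

```python
import numpy as np


def sinusoidal_encoding_3d(xyz: np.ndarray, num_freqs: int = 4) -> np.ndarray:
    """Encode (N, 3) coordinates as (N, 3 * 2 * num_freqs) sin/cos features."""
    freqs = 2.0 ** np.arange(num_freqs)               # (F,)
    angles = xyz[:, :, None] * freqs[None, None, :]   # (N, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 3, 2F)
    return enc.reshape(xyz.shape[0], -1)


def points_to_llm_tokens(xyz, feats, weight, bias, num_freqs=4):
    """Fuse per-point features with position encodings, then project them.

    The linear projection is a stand-in for whatever learned adapter maps
    3D features into the language model's embedding space (assumption).
    """
    pos = sinusoidal_encoding_3d(xyz, num_freqs)       # (N, 6F)
    fused = np.concatenate([feats, pos], axis=1)       # (N, D_feat + 6F)
    return fused @ weight + bias                       # (N, D_llm)


# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
num_points, feat_dim, llm_dim, num_freqs = 5, 16, 32, 4

xyz = rng.uniform(-1.0, 1.0, size=(num_points, 3))     # point coordinates
feats = rng.normal(size=(num_points, feat_dim))        # per-point features
in_dim = feat_dim + 3 * 2 * num_freqs
weight = rng.normal(scale=0.02, size=(in_dim, llm_dim))  # untrained weights
bias = np.zeros(llm_dim)

tokens = points_to_llm_tokens(xyz, feats, weight, bias, num_freqs)
print(tokens.shape)  # (5, 32): one LLM-width vector per 3D point
```

In a real system the resulting vectors would be passed to the language model alongside the text token embeddings, and the projection would be trained rather than randomly initialized.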

Quick Start & Requirements

  • Installation: Requires conda environment setup and installation of salesforce-lavis and positional_encodings.
  • Prerequisites: Python 3.8, PyTorch, and potentially CUDA for GPU acceleration. Specific data downloads (Objaverse, ScanNet, HM3D) and pre-trained checkpoints are necessary.
  • Resources: Significant disk space for datasets (~250GB for scene data) and computational resources for inference and fine-tuning.
  • Links: Salesforce-LAVIS, DEMO.md
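
Before downloading datasets, it may help to confirm that the environment matches the requirements above. The following is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper, the import names `lavis` and `positional_encodings` are assumed to correspond to the two packages listed, and the 250 GB threshold simply mirrors the approximate scene-data size quoted above.

```python
import importlib.util
import shutil
import sys

# Assumed import names for PyTorch, salesforce-lavis, and positional_encodings.
REQUIRED_MODULES = ("torch", "lavis", "positional_encodings")


def check_environment(data_dir=".", min_free_gb=250.0):
    """Report Python version, importable modules, and free disk space."""
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # find_spec returns None when a top-level module is not installed.
        "modules": {
            name: importlib.util.find_spec(name) is not None
            for name in REQUIRED_MODULES
        },
        "free_gb": free_gb,
        "enough_disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    report = check_environment()
    print(f"Python {report['python']} (the project lists 3.8)")
    for name, found in report["modules"].items():
        print(f"{name}: {'found' if found else 'missing'}")
    print(f"Free disk: {report['free_gb']:.0f} GB, enough: {report['enough_disk']}")
```

Pass the intended dataset directory as `data_dir` so that the disk check measures the volume where the data will actually be stored.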

Highlighted Details

  • Presented by its authors as the first LLM to take 3D representations as input.
  • Supports both object (Objaverse) and scene (ScanNet, HM3D) data.
  • Released pre-training and fine-tuning checkpoints for several tasks (ScanQA, SQA3D, 3DMV_VQA).
  • Detailed instructions for generating 3D features from raw data using tools like Blender, ChatCaptioner, Mask2Former, and Segment Anything.

Maintenance & Community

  • Project associated with UMass Embodied AGI.
  • The accompanying paper was a NeurIPS 2023 Spotlight.
  • Acknowledgements list several key open-source projects, indicating community reliance and contribution.

Licensing & Compatibility

  • The README does not explicitly state a license for the repository. It relies heavily on, and acknowledges, salesforce-lavis, which is released under the permissive BSD 3-Clause license. Users should verify the licensing of all components and datasets.

Limitations & Caveats

  • Some data generation steps and features are marked as "TODO" or "still cleaning," indicating ongoing development.
  • Hugging Face auto-loading for checkpoints is also a TODO item.
  • The data generation process is complex, involving multiple external tools and significant computational effort.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days
