3D-LLM is a novel Large Language Model capable of processing 3D object and scene data, enabling a deeper understanding of spatial information. It targets researchers and developers working with 3D computer vision and multimodal AI, offering a foundation for advanced 3D-aware reasoning and generation tasks.
## How It Works
3D-LLM integrates 3D world representations into LLMs via a multimodal approach: it processes 3D data (point clouds, scene graphs) and converts it into a format the LLM can consume, likely through feature extraction and projection techniques similar to those in existing vision-language models. This allows the LLM to reason about spatial relationships, object properties, and scene semantics.
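The feature-extraction-and-projection idea described above can be sketched as a minimal PyTorch module. This is an illustrative sketch, not the repository's actual architecture: the module name, dimensions, and the learnable-query pooling (similar in spirit to the Q-Former in BLIP-2/LAVIS) are all assumptions.

```python
import torch
import torch.nn as nn

class Feature3DProjector(nn.Module):
    """Hypothetical sketch: pool per-point 3D features into a fixed number
    of tokens and project them into the LLM's embedding space."""

    def __init__(self, feat_dim=1408, llm_dim=4096, num_tokens=32):
        super().__init__()
        # Learnable queries attend over a variable-size point set,
        # producing a fixed-length token sequence for the LLM.
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, point_feats):
        # point_feats: (batch, num_points, feat_dim)
        b = point_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, point_feats, point_feats)  # cross-attention pooling
        return self.proj(pooled)  # (batch, num_tokens, llm_dim)

proj = Feature3DProjector()
tokens = proj(torch.randn(2, 1000, 1408))
print(tokens.shape)  # torch.Size([2, 32, 4096])
```

The resulting token sequence can then be concatenated with text-token embeddings before being fed to the language model, which is how comparable vision-language models inject non-text modalities.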
## Quick Start & Requirements
- Installation: Requires a `conda` environment and installation of `salesforce-lavis` and `positional_encodings`.
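A typical setup might look like the following; the environment name and Python version are illustrative, so check the repository's README for the exact commands and versions.

```shell
# Illustrative environment setup (names/versions are assumptions, not the repo's exact script)
conda create -n 3dllm python=3.8 -y
conda activate 3dllm
pip install salesforce-lavis positional_encodings
```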
- Prerequisites: Python 3.8, PyTorch, and potentially CUDA for GPU acceleration. Specific data downloads (Objaverse, ScanNet, HM3D) and pre-trained checkpoints are necessary.
- Resources: Significant disk space for datasets (~250GB for scene data) and computational resources for inference and fine-tuning.
- Links: Salesforce-LAVIS, DEMO.md
## Highlighted Details
- First LLM capable of taking 3D representations as input.
- Supports both object (Objaverse) and scene (ScanNet, HM3D) data.
- Released pre-training and fine-tuning checkpoints for various tasks (ScanQA, SQA3D, 3DMV-VQA).
- Detailed instructions for generating 3D features from raw data using tools like Blender, ChatCaptioner, Mask2Former, and Segment Anything.
## Maintenance & Community
- Project associated with UMass Embodied AGI.
- Published at NeurIPS 2023 (Spotlight).
- Acknowledgements list several key open-source projects, indicating community reliance and contribution.
## Licensing & Compatibility
- The repository itself does not explicitly state a license in its README. However, it heavily relies on and acknowledges `salesforce-lavis`, which is released under the permissive BSD 3-Clause license. Users should verify the licensing of all components and datasets before commercial use.
## Limitations & Caveats
- Some data generation steps and features are marked as "TODO" or "still cleaning," indicating ongoing development.
- Hugging Face auto-loading for checkpoints is also a TODO item.
- The data generation process is complex, involving multiple external tools and significant computational effort.